linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Fuzzy hash stuff.. (was Re: 2.1.xxx makes Electric Fence 22x slower)
       [not found] ` <no.id>
@ 1998-08-26  0:03   ` Jamie Lokier
  1998-09-10  6:34   ` GPS Leap Second Scheduled! Jamie Lokier
                     ` (202 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Jamie Lokier @ 1998-08-26  0:03 UTC (permalink / raw)
  To: linux-kernel

"David S. Miller" <davem@dm.cobaltmicro.com> wrote:
>As promised here is my work in progress fuzzy hash VMA lookup stuff.

On Tue, Aug 25, 1998 at 10:47:26PM +1000, Keith Owens wrote:
> Lots of code with very few comments snipped.  Come on Davem, make it
> understandable for us mere mortals.  If it took two people to fix
> quirks and bugs and converge the algorithm, surely a few notes on how
> it works would not go astray.

Nope, I can't see how it works either.

BTW, a splay tree would also be as fast as what we have now in the
common case, without need for a special one entry cache.  The root of
the tree automatically acts as the cache.

-- Jamie

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.altern.org/andrebalsa/doc/lkml-faq.html

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: GPS Leap Second Scheduled!
       [not found] ` <no.id>
  1998-08-26  0:03   ` Fuzzy hash stuff.. (was Re: 2.1.xxx makes Electric Fence 22x slower) Jamie Lokier
@ 1998-09-10  6:34   ` Jamie Lokier
  1998-09-11  6:18     ` Michael Shields
  1998-12-11 14:16   ` Access to I/O-mapped / Memory-mapped resources Jamie Lokier
                     ` (201 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Jamie Lokier @ 1998-09-10  6:34 UTC (permalink / raw)
  To: linux-kernel

On Wed, Sep 09, 1998 at 09:35:59AM -0700, David Lang wrote:
> I am probably missing something, but can't you just ignore the leap second
> until you discover that the time is 1 sec off and then use the normal NTP
> procedure to get back to the exact time

Until the NTP procedure discovers and corrects this (a few minutes, plus
correction time), anything that expects synchronised time between
machines can go wrong.

Admittedly synchronisation isn't perfect anyway.

-- Jamie

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/faq.html

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: GPS Leap Second Scheduled!
  1998-09-10  6:34   ` GPS Leap Second Scheduled! Jamie Lokier
@ 1998-09-11  6:18     ` Michael Shields
  0 siblings, 0 replies; 662+ messages in thread
From: Michael Shields @ 1998-09-11  6:18 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

In article <19980910073422.A13283@tantalophile.demon.co.uk>,
Jamie Lokier <lkd@tantalophile.demon.co.uk> wrote:
> On Wed, Sep 09, 1998 at 09:35:59AM -0700, David Lang wrote:
> > I am probably missing something, but can't you just ignore the leap second
> > until you discover that the time is 1 sec off and then use the normal NTP
> > procedure to get back to the exact time
> 
> Until the NTP procedure discovers and corrects this (a few minutes, plus
> correction time), anything that expects synchronised time between
> machines can go wrong.

NTP has the capability to know in advance that a leap second is
scheduled and act upon that at the correct time.

Check your logs the next time a leap second happens; xntpd does it.
-- 
Shields, CrossLink.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/faq.html

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Access to I/O-mapped / Memory-mapped resources
       [not found] ` <no.id>
  1998-08-26  0:03   ` Fuzzy hash stuff.. (was Re: 2.1.xxx makes Electric Fence 22x slower) Jamie Lokier
  1998-09-10  6:34   ` GPS Leap Second Scheduled! Jamie Lokier
@ 1998-12-11 14:16   ` Jamie Lokier
  2000-07-28 22:10   ` RLIM_INFINITY inconsistency between archs Adam Sampson
                     ` (200 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Jamie Lokier @ 1998-12-11 14:16 UTC (permalink / raw)
  To: Linux Lists, linux-kernel

On Wed, Dec 09, 1998 at 03:19:10PM -0800, Linux Lists wrote:
> I have a question: is there any reference in regards to how / when to use
> the virt_to_phys, virt_to_bus, ioremap, etc. ... functions, other than 
> /usr/src/linux/Documentation/IO-mapping.txt ?!? I'd like to understand it
> better, but this text has not been enough (for me, of course).

Well, the whole address thing is a bit messy anyway; I don't expect a
perfect understanding of it is even possible...

But in case it helps, try the document below.

> If there is no other way, I'll try to re-read it 1000 times to see if my
> understanding increases 1000 times as well ... ;)

Do that :-)

	linux/Documentation/IO-mapping.txt
        ----------------------------------

...is a little out of date but basically right.  Give it a read.
Then read this addendum; it might clarify things a bit.

Types of address
================

*bus* is an address you pass to devices.  E.g., what you'd write to a
PCI bus-mastering DMA device for its target address.  To access a bus
address from kernel C code, known as memory-mapped I/O, you must use
ioremap() to convert it to an *ioremap* address.  From C code, these
should always be accessed through readl(), writel() etc. and not as
ordinary memory references.  See <asm/io.h>.

*phys* is a CPU address after MMU translation.  It only appears in page
tables and things related to page tables.  Even this is hidden to some
extent because pte_page(*pte) returns a *virt* address despite appearances.
See <asm/pgtable.h>.

*virt* is a kernel direct-mapped address.  These are addresses you can
read and write from C, that correspond to main memory.  E.g., on x86,
the *virt* address 0xc0001000 means the 4097th byte of main memory.  See
<asm/page.h>.

There are other kinds of address too:

*user* addresses (such as passed to read() and write()) are
none of the above, and should always by accessed through get_user(),
put_user() etc.  See <asm/uaccess.h>.

*static* addresses are the addresses of functions and variables that are
declared in source code.  Because kernel code and modules are allocated
in various ways, you can't assume much about these addresses, but you
can always read and write them from kernel C code.

*vmalloc* addresses (returned by vmalloc()).  These are kernel virtual
address, which you can read and write from kernel C code.  But you can't
pass them to any of the virt_to_XXX macros, because they're *not* *virt*
addresses!  See <linux/vmalloc.h>.

*fixmap* addresses (returned by fix_to_virt()) (which has a misleading
name).  These are like *vmalloc* addresses: you can't pass them to the
virt_to_XXX macros, so they're _not_ *virt* addresses.  It's not very
clear if you're supposed to use readl() and writel() to access these.
See <asm/fixmap.h>.

*ioremap* addresses are returned by ioremap(), which takes a *bus*
address.  These have some similarity to *vmalloc* addresses, but you can
only use readl(), writel() etc. to access the device memory referred to
here.  Unhelpfully, just reading and writing these directly does work on
some architectures, and most older device drivers still do this.

Memory map
==========

The actual memory map varies a lot between architectures.  But since
someone asked, I'll give a quick summary of one particular memory map.
This example is for an i386 architecture: a particular 64MB Pentium II
dual processor box with PCI and an ISA bridge.

Virtual map
...........

This is what C code sees in user mode:

0x00000000-0xbfffffff  User space virtual memory mappings.

This is what C code sees in kernel mode:

0x00000000-0xbfffffff User space virtual memory mappings (current->mm context).
0xc0000000-0xc3ffffff 64MB kernel view of all of main memory, uses 4MB pages.
0xc4000000-0xc47fffff 8MB unmapped hole.
0xc4800000-0xffffbfff Kernel virtual mappings for vmalloc() and ioremap().
0xffffc000-0xffffcfff Memory mapped local APIC registers.
0xffffd000-0xffffdfff Memory mapped IO-APIC registers.

The 64MB view is subdivided like this (it depends on the PC's details):

0xc0000000-0xc00003ff Zero page, reserved for BIOS.
0xc0000000-0xc009ffff Low memory (first 640k minus zero page).
0xc00a0000-0xc00fffff Low memory-mapped I/O (especially VGA adapter) and ROMs.
0xc0100000-0xc3ffffff High memory (remaining 63MB).

The 64MB view at 0xc0000000 (= PAGE_OFFSET) is directly addressable main
memory.  This is simply memory addresses with PAGE_OFFSET added.  This
contains the main kernel image, and memory allocated with kmalloc(),
get_free_page() and the slab allocator.  Cached disk pages, network
buffers etc. are all addressed in this space.

The 64MB view is the *virt* addresses described earlier.  "Virtual" here
simply refers to the PAGE_OFFSET translation, nothing more.

The vmalloc() mappings are a different way to see this memory, used only
when a large, contiguous address range needs to be allocated.  This is
used to hold loaded modules amongst other things.

The ioremap() mappings occupy the same address range as the vmalloc()
mappings, but are a view onto memory-mapped I/O space (MMIO).  Not all
devices are mapped with ioremap() -- those in the low memory-mapped area
aren't.  In theory you are supposed to use readl(), writel(),
memcpy_fromio() etc. to access memory-mapped I/O space, but many older
drivers fail to do this and work fine on the current x86 implementation.

Physical map
............

Physical addresses are the result of virtual address translation on
board the CPU.  C code (and assembly code) doesn't see these directly,
but they are used in page tables, which control the address translation.

There is some confusion in the kernel page table code about whether the
physical addresses passed around are *phys* addresses (also known as
*linear*), or *virt* address, which are a restricted subset of *phys*
with PAGE_OFFSET added.

When setting entries, *phys* tends to be used, but when reading entries
*virt* tends to be returned.  This sometimes loses information, so
breaking some device drivers.  Perhaps those drivers are broken by
design anyway.

Bus map
.......

This view is completely different to the virtual address view, and the
*virt* view which is a subset of virtual addresses.  Bus addresses can
overlap virtual addresses in an arbitrary way, 

The bus map is the view seen by peripheral devices, like video cards and
disk controllers.  Although different from the CPU's physical map (which
is a sort of private bus map for the CPU), the bus addresses tend to be
consistent between different devices in a single machine.

On a PC, the bus map is arranged by the system BIOS at boot time,
according to rules of Plug'n'Play and other rules.  For the PCI bus,
regions of prefetchable and non-prefetchable memory are mixed
arbitrarily: there's no particularly significant address where one kind
stops and another starts.  (Although your BIOS might make it appear so).
You don't have to worry about the differences, as long as your BIOS
configured everything properly.

`lspci -vb' will show the bus addresses of all PCI devices on a system.

Memory mapped ISA cards tend to have rather low addresses (in the first
megabyte), while PCI cards can be mapped to all sorts of addresses, high
and low depending on the BIOS.

Non-PC architectures have different rules.

On an i386 architecture, bus addresses and *phys* addresses are the
same.  This is convenient but it does tend to hide some problems.  On
other architectures, these two are often different.

Devices with *bus* addresses are supposed to be memory mapped using
ioremap(), and then accessed using readl(), writel() etc.  Because none
of this was necessary on the i386 with the 2.0.x kernels, and the other
platforms weren't very well supported then, many older device drivers
simply access device bus addresses as if they were memory.

This poses big problems with some non-i386 architectures, which require
readl() etc. for the drivers to work.  These days it also poses problems
with the i386, because in 2.1.x kernels the memory layout was changed to
make communication between user space and kernel space more efficient.
As a result, ioremap() is required to get a virtual address which you
can pass to readl(), writel() etc.

Note: there is an ironic twist.  The virtual address returned by
ioremap() is not a *virt* address, so you can't expect meaningful
results if you pass it to virt_to_bus() or virt_to_phys().

Another note: at least on a PC, you don't need ioremap() to access
devices in the first 1MB of the bus address range.  This includes most
ISA devices (but not video cards).

I/O port map
............

This is similar to the bus map, but refers to I/O ports that are
accessed by special I/O instructions from the CPU, if it is an i386
based architecture.  For some other architectures, the I/O ports are
actually quite similar to memory-mapped I/O but using different
addresses.  Although some buses support more than 64k I/O ports, the
i386 architecture does not so this address range is restricted to
0x0000-0xffff.

`cat /proc/ioports' shows all the I/O ports used by drivers currently
loaded on a system.  `lspci -v' shows all the I/O ports used by PCI
devices.

Use inb(), outb() etc. to access I/O ports.  This has been required ever
since the earliest versions of Linux, so all drivers that use I/O ports
get this right.  There is no equivalent to ioremap().

Translating virtual addresses
=============================

Some people try to look up page tables to convert a *user* address or
*vmalloc* address to a *bus* address or *virt* address.  This works for
some things, but breaks others.  It makes a number of assumptions that
are incorrect and won't work when you want to use the driver in a new
way one day, or on a new architecture.

This mess will be cleaned up sometime in version 2.3.  So if you want it
cleaned up, perhaps the best way is to help ensure 2.2 is ready for
release.  Hint :-)

Hope this helps,
-- Jamie

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: RLIM_INFINITY inconsistency between archs
       [not found] ` <no.id>
                     ` (2 preceding siblings ...)
  1998-12-11 14:16   ` Access to I/O-mapped / Memory-mapped resources Jamie Lokier
@ 2000-07-28 22:10   ` Adam Sampson
  2000-07-28 22:20   ` Adam Sampson
                     ` (199 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Adam Sampson @ 2000-07-28 22:10 UTC (permalink / raw)
  To: linux-kernel

On Thu, Jul 27, 2000 at 12:39:51AM -0700, Linus Torvalds wrote:
> Is there some documentation file that I've not updated and that people
> are slavishly following outdated information in? I don't read the
> documentation myself, so I'd never notice ;)

Yes; the glibc installation instructions.

-- 

Adam Sampson
azz@gnu.org

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: RLIM_INFINITY inconsistency between archs
       [not found] ` <no.id>
                     ` (3 preceding siblings ...)
  2000-07-28 22:10   ` RLIM_INFINITY inconsistency between archs Adam Sampson
@ 2000-07-28 22:20   ` Adam Sampson
  2000-07-29 13:23     ` Miquel van Smoorenburg
  2001-04-27 23:30   ` [patch] linux likes to kill bad inodes Andreas Dilger
                     ` (198 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Adam Sampson @ 2000-07-28 22:20 UTC (permalink / raw)
  To: linux-kernel

On Thu, Jul 27, 2000 at 07:03:57PM +0200, Jamie Lokier wrote:
> But instead, how about a script: /lib/modules/VERSION/compile-module.
> The script would know where to find the kernel headers.  That could be
> /lib/modules/include for distributions, and /my/kernel/tree/include for
> folks who used `make modules_install' recently.

I'll second that suggestion. This kind of thing works very well indeed for
projects like Apache.

-- 

Adam Sampson
azz@gnu.org

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: RLIM_INFINITY inconsistency between archs
  2000-07-28 22:20   ` Adam Sampson
@ 2000-07-29 13:23     ` Miquel van Smoorenburg
  0 siblings, 0 replies; 662+ messages in thread
From: Miquel van Smoorenburg @ 2000-07-29 13:23 UTC (permalink / raw)
  To: linux-kernel

In article <cistron.20000728232030.C8868@gnu.org>,
Adam Sampson  <azz@gnu.org> wrote:
>On Thu, Jul 27, 2000 at 07:03:57PM +0200, Jamie Lokier wrote:
>> But instead, how about a script: /lib/modules/VERSION/compile-module.
>> The script would know where to find the kernel headers.  That could be
>> /lib/modules/include for distributions, and /my/kernel/tree/include for
>> folks who used `make modules_install' recently.
>
>I'll second that suggestion. This kind of thing works very well indeed for
>projects like Apache.

It is indeed a very good idea. The script could just spit out the
CFLAGS used for kernel compilation like this:

#! /bin/sh
cat <<EOF
-D__KERNEL__ -I/usr/src/linux-2.2.15/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -pipe -fno-strength-reduce -m486 -malign-loops=2 -malign-jumps=2 -malign-functions=2 -DCPU=686 -DUTS_MACHINE='"i386"'
EOF

Then a module Makefile would be as simple as

# Set KVER manually if you want to compile against another kernel version
KVER=$(shell uname -r)
CFLAGS=$(shell /lib/modules/$(KVER)/kernel-config)

module.o: module.c module.h

I've tried this, it works.

Mike.
-- 
Cistron Certified Internetwork Expert #1. Think free speech; drink free beer.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [patch] linux likes to kill bad inodes
       [not found] ` <no.id>
                     ` (4 preceding siblings ...)
  2000-07-28 22:20   ` Adam Sampson
@ 2001-04-27 23:30   ` Andreas Dilger
  2001-06-26 22:24   ` Tracking down semaphore usage/leak Ken Brownfield
                     ` (197 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Andreas Dilger @ 2001-04-27 23:30 UTC (permalink / raw)
  To: torvalds; +Cc: Pavel Machek, Chris Mason, viro, kernel list, jack

I previously wrote:
> I will post a patch separately which handles a couple of cases where
> *_delete_inode() does not call clear_inode() in all cases.

OK, here it is.  The ext2_delete_inode() change isn't exactly a bug fix,
but rather a "performance" change.  No need to hold BLK to check status
or call clear_inode() (we call clear_inode() outside BLK in VFS if
delete_inode() method does not exist).

Cheers, Andreas
=======================================================================
diff -ru linux-2.4.4p1.orig/fs/ext2/inode.c linux/fs/ext2/inode.c
--- linux-2.4.4p1.orig/fs/ext2/inode.c	Tue Apr 10 16:44:49 2001
+++ linux/fs/ext2/inode.c	Fri Apr 27 13:51:15 2001
@@ -44,12 +47,12 @@
  */
 void ext2_delete_inode (struct inode * inode)
 {
-	lock_kernel();
-
 	if (is_bad_inode(inode) ||
 	    inode->i_ino == EXT2_ACL_IDX_INO ||
 	    inode->i_ino == EXT2_ACL_DATA_INO)
 		goto no_delete;
+
+	lock_kernel();
 	inode->u.ext2_i.i_dtime	= CURRENT_TIME;
 	mark_inode_dirty(inode);
 	ext2_update_inode(inode, IS_SYNC(inode));
@@ -59,9 +62,7 @@
 	ext2_free_inode (inode);
 
 	unlock_kernel();
 	return;
 no_delete:
-	unlock_kernel();
 	clear_inode(inode);	/* We must guarantee clearing of inode... */
 }
 
diff -ru linux-2.4.4p1.orig/fs/bfs/inode.c linux/fs/bfs/inode.c
--- linux-2.4.4p1.orig/fs/bfs/inode.c	Tue Apr 10 16:44:49 2001
+++ linux/fs/bfs/inode.c	Fri Apr 27 15:45:31 2001
@@ -145,7 +145,7 @@
 	if (is_bad_inode(inode) || inode->i_ino < BFS_ROOT_INO ||
 	    inode->i_ino > inode->i_sb->su_lasti) {
 		printf("invalid ino=%08lx\n", inode->i_ino);
-		return;
+		goto bad_inode;
 	}
 	
 	inode->i_size = 0;
@@ -155,8 +156,7 @@
 	bh = bread(dev, block, BFS_BSIZE);
 	if (!bh) {
 		printf("Unable to read inode %s:%08lx\n", bdevname(dev), ino);
-		unlock_kernel();
-		return;
+		goto bad_unlock;
 	}
 	off = (ino - BFS_ROOT_INO)%BFS_INODES_PER_BLOCK;
 	di = (struct bfs_inode *)bh->b_data + off;
@@ -178,7 +178,9 @@
 		s->su_lf_eblk = inode->iu_sblock - 1;
 		mark_buffer_dirty(s->su_sbh);
 	}
+bad_unlock:
 	unlock_kernel();
+bad_inode:
 	clear_inode(inode);
 }
 
diff -ru linux-2.4.4p1.orig/fs/ufs/ialloc.c linux/fs/ufs/ialloc.c
--- linux-2.4.4p1.orig/fs/ufs/ialloc.c	Thu Nov 16 14:18:26 2000
+++ linux/fs/ufs/ialloc.c	Fri Apr 27 15:53:26 2001
@@ -82,6 +82,7 @@
 	if (!((ino > 1) && (ino < (uspi->s_ncg * uspi->s_ipg )))) {
 		ufs_warning(sb, "ufs_free_inode", "reserved inode or nonexistent inode %u\n", ino);
 		unlock_super (sb);
+		clear_inode (inode);
 		return;
 	}
 	
@@ -90,6 +91,7 @@
 	ucpi = ufs_load_cylinder (sb, cg);
 	if (!ucpi) {
 		unlock_super (sb);
+		clear_inode (inode);
 		return;
 	}
 	ucg = ubh_get_ucg(UCPI_UBH);
-- 
Andreas Dilger                               TurboLabs filesystem development
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Tracking down semaphore usage/leak
       [not found] ` <no.id>
                     ` (5 preceding siblings ...)
  2001-04-27 23:30   ` [patch] linux likes to kill bad inodes Andreas Dilger
@ 2001-06-26 22:24   ` Ken Brownfield
  2001-07-23 20:57   ` user-mode port 0.44-2.4.7 Alan Cox
                     ` (196 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Ken Brownfield @ 2001-06-26 22:24 UTC (permalink / raw)
  To: linux-kernel

Urgh, learn something new everyday (ipcs, ipcrm).  My apologies; apropos
didn't catch it on my boxes. :-(
-- 
Ken.
brownfld@irridia.com

On Tue, Jun 26, 2001 at 02:09:16PM -0700, Ken Brownfield wrote:
| With RedHat's new Samba 2.0.10 RPM (the one to patch the latest 
| vulnerability) they seem to have sniffed enough glue to start using SysV 
| IPC semaphores which apparently leak until SEM??? are reached.  semget() 
| is returning "No space left on device", and disk/inodes/memory are all 
| fine.
| 
| Anyway, could someone give me a very quick rundown of the options for 
| tracking/force-freeing semaphores, or how to determine from proc, if 
| possible, what the current semaphore allocation status is?  Or did RH 
| slay a machine I really don't want to reboot?  I've restarted all 
| semaphore-using processes to no avail, but even so the SEM??? limits are 
| far above the normal needs of this machine.
| 
| Thanks much.  Searched the archives/Google/FAQ/semaphore docs; sorry if 
| it's been covered.  I'll summarize if folks want to hit me on or off the 
| list.
| --
| Ken.
| brownfld@irridia.com

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: user-mode port 0.44-2.4.7
       [not found] ` <no.id>
                     ` (6 preceding siblings ...)
  2001-06-26 22:24   ` Tracking down semaphore usage/leak Ken Brownfield
@ 2001-07-23 20:57   ` Alan Cox
  2001-07-23 21:14     ` Chris Friesen
  2001-07-24 17:51   ` patch for allowing msdos/vfat nfs exports Alan Cox
                     ` (195 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-23 20:57 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Linus Torvalds, Andrea Arcangeli, Jeff Dike,
	user-mode-linux-user, linux-kernel, Jan Hubicka

> Suppose I loop against xtime reaching a particular value.  While this is

xtime isnt used this way that I can see. jiffies however is. There are good
arguments for getting rid of most [ab]use of jiffies however. For one its
pretty important to scaling on both big mainframes and beowulf setups doing
heavy computation to reduce timer ticks

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: user-mode port 0.44-2.4.7
  2001-07-23 20:57   ` user-mode port 0.44-2.4.7 Alan Cox
@ 2001-07-23 21:14     ` Chris Friesen
  0 siblings, 0 replies; 662+ messages in thread
From: Chris Friesen @ 2001-07-23 21:14 UTC (permalink / raw)
  To: Alan Cox
  Cc: Friesen, Christopher [CAR:VS16:EXCH],
	Linus Torvalds, Andrea Arcangeli, Jeff Dike,
	user-mode-linux-user, linux-kernel, Jan Hubicka

Alan Cox wrote:
> 
> > Suppose I loop against xtime reaching a particular value.  While this is
> 
> xtime isnt used this way that I can see. jiffies however is. There are good
> arguments for getting rid of most [ab]use of jiffies however. For one its
> pretty important to scaling on both big mainframes and beowulf setups doing
> heavy computation to reduce timer ticks

jiffies is (as of 2.4.7 anyways) marked as volatile, so we're safe there.  My
point is this--should someone writing badly designed (but technically correct)
code be able to totally hose the system?

The only difference between volatile and normal is that if it is marked as
volatile it must be accessed every time rather than being pre-cached.  If we
never spin on accessing xtime, then the fact that we can't optimize it shouldn't
hurt. (Am I wrong here?  If I am then please explain because I'm missing
something...)  If someone ever *does* spin on xtime, then we really don't want
to optimize that access out of the loop, because doing so could cause nasty
problems.


-- 
Chris Friesen                    | MailStop: 043/33/F10  
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: patch for allowing msdos/vfat nfs exports
       [not found] ` <no.id>
                     ` (7 preceding siblings ...)
  2001-07-23 20:57   ` user-mode port 0.44-2.4.7 Alan Cox
@ 2001-07-24 17:51   ` Alan Cox
  2001-07-24 17:56   ` Externally transparent routing Alan Cox
                     ` (194 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-24 17:51 UTC (permalink / raw)
  To: Nathan Laredo; +Cc: linux-kernel

> I've been using it for half a day now and so far it hasn't done
> anything bad, but please be careful if you decide to test it and
> backup your data and after testing, be sure to compare your data
> to your backup.

Rename ?

> +	struct inode *inode = dentry->d_inode;
> +	unsigned int i_pos = MSDOS_I(inode)->i_location;

i_location is not a constant across renames or other operations, so you may
inadvertantly do I/O to completely the wrong file


The infrastructure looks great, I just don't think your handles are safe 

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Externally transparent routing
       [not found] ` <no.id>
                     ` (8 preceding siblings ...)
  2001-07-24 17:51   ` patch for allowing msdos/vfat nfs exports Alan Cox
@ 2001-07-24 17:56   ` Alan Cox
  2001-07-25  9:43     ` Jordi Verwer
  2001-07-25 19:12   ` user-mode port 0.44-2.4.7 Alan Cox
                     ` (193 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-24 17:56 UTC (permalink / raw)
  To: Jordi Verwer; +Cc: Linux Kernel Mailing List

> To prevent my NAT-box from showing up on traceroutes I'd like to let it
> route without decreasing the TTL. I was told that proxy arp also archieves

And what happens if you get a routing loop ?

A NAT box really does need to drop the TTL. Nothing stops you giving it a
more bizarre name, or indeed you can do what a few folks have found
excruciatingly funny to do to tracerouters which is to spoof totally bogus
icmp unreachables so they see crazy paths

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Externally transparent routing
  2001-07-24 17:56   ` Externally transparent routing Alan Cox
@ 2001-07-25  9:43     ` Jordi Verwer
  0 siblings, 0 replies; 662+ messages in thread
From: Jordi Verwer @ 2001-07-25  9:43 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List

> And what happens if you get a routing loop ?
Bad Things would happen, but I only have one router and since it's a NAT box
it isn't very likely to end up in a routing loop anyway.

> A NAT box really does need to drop the TTL. Nothing stops you giving it a
> more bizarre name, or indeed you can do what a few folks have found
> excruciatingly funny to do to tracerouters which is to spoof totally bogus
> icmp unreachables so they see crazy paths
What I wanted to do was be able to send my traceroutes to websites that
don't function properly, but since my NAT box is headless and I'd like to
avoid the hassle of SSH-ing to it, I do these traceroutes from one of my
internal machines. If I don't manually remove my NAT box from the list, the
braindead webmaster will allways blame my NAT box (which naturally is
innocent;)). But I suppose you do not want this to be possible. That is
understandable, but still BSD has a very clean implementation of
transrouting and I see no reason not to let Linux do this.

Jordi Verwer
P.S.: I adjust my computer's (which isn't mine btw, but belongs to my
"boss") date.
P.P.S.: Still not subscribed, so please CC any replies to me. Thank you.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: user-mode port 0.44-2.4.7
       [not found] ` <no.id>
                     ` (9 preceding siblings ...)
  2001-07-24 17:56   ` Externally transparent routing Alan Cox
@ 2001-07-25 19:12   ` Alan Cox
  2001-07-25 19:45     ` my patches won't compile under 2.4.7 Kirk Reiser
  2001-07-25 23:49   ` user-mode port 0.44-2.4.7 Alan Cox
                     ` (192 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-25 19:12 UTC (permalink / raw)
  To: James W. Lake; +Cc: linux-kernel

> Should head and tail be volatile in the definition, or should they be
> accessed with:
> int head = (volatile)myqueue.head;
> or with barrier() around the read/write?

The best way is to use barrier calls. It makes your assumptions about
ordering absolutely explicit. However you should still be careful - you
can't be sure that head will be read atomically or written atomically on
all processors eg if it was

	struct
	{
		unsigned char head;
		unsigned char tail;
		char buf[256];
	}

you would get some suprisingly unpleasant suprises on SMP Alpha. Currently
"int" is probably safe for all processors.

So unless this is a precision tuned fast path it is better to play safe with
this and use atomic_t or locking. The spinlock cost on an Athlon or a later
PIII is pretty good in most cases. Using the -ac prefetch stuff can make it
good in almost all cases, but thats probably a 2.5 thing for the generic
case.

Basically locks are getting cheaper on x86, the suprises are getting more
interesting on non-x86

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* my patches won't compile under 2.4.7
  2001-07-25 19:12   ` user-mode port 0.44-2.4.7 Alan Cox
@ 2001-07-25 19:45     ` Kirk Reiser
  2001-07-25 19:58       ` Alan Cox
  2001-07-31 21:54       ` Richard Gooch
  0 siblings, 2 replies; 662+ messages in thread
From: Kirk Reiser @ 2001-07-25 19:45 UTC (permalink / raw)
  To: linux-kernel

As of 2.4.7 my patches to the kernel won't compile.  It appears to be
something to do with devfs_fs_kernel.h being part of miscdevices.h.  I
have sifted through the code but have not been able to determine
exactly why they won't work any more.  Here is the error output from
my compile:

gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i586    -c -o speakup.o speakup.c
In file included from /usr/src/linux/include/linux/locks.h:8,
                 from /usr/src/linux/include/linux/devfs_fs_kernel.h:6,
                 from /usr/src/linux/include/linux/miscdevice.h:4,
                 from speakup.c:63:
/usr/src/linux/include/linux/pagemap.h:35: `currcons' undeclared here (not in a function)
/usr/src/linux/include/linux/pagemap.h:35: parse error before `.'
make[4]: *** [speakup.o] Error 1

I'm not sure even where to start trying to describe what I've looked
at and what I don't understand.  It appears that page_cache_alloc() is
now an inline function with an argument passed to it, where it used to
be a #define with no arguments.  I see that struct misc_device now has
a new member devfs_handle but the other drivers I've looked at rtc.c
haven't changed their structure members to take this into account.  It
seems nothing new is necessary because misc_register checks if it's
been set or not.  The two error lines don't look to me to have anything
to do with any of these things either currcons isn't used in any of
the misc_device structure or anything I can see which might end up
calling page_cache_alloc().  Can anyone give me any ideas what I
should check to hunt down exactly what's going on here?  It almost
looks like gcc is getting screwed up in it's parsing or something.

Any ideas will greatefully be accepted I'm lost!

  Kirk

-- 

Kirk Reiser				The Computer Braille Facility
e-mail: kirk@braille.uwo.ca		University of Western Ontario
phone: (519) 661-3061

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: my patches won't compile under 2.4.7
  2001-07-25 19:45     ` my patches won't compile under 2.4.7 Kirk Reiser
@ 2001-07-25 19:58       ` Alan Cox
  2001-07-25 20:10         ` Kirk Reiser
  2001-07-31 21:54       ` Richard Gooch
  1 sibling, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-25 19:58 UTC (permalink / raw)
  To: Kirk Reiser; +Cc: linux-kernel

> 
> As of 2.4.7 my patches to the kernel won't compile.  It appears to be
> something to do with devfs_fs_kernel.h being part of miscdevices.h.  I
> have sifted through the code but have not been able to determine
> exactly why they won't work any more.  Here is the error output from
> my compile:
> 
> gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i586    -c -o speakup.o speakup.c
> In file included from /usr/src/linux/include/linux/locks.h:8,
>                  from /usr/src/linux/include/linux/devfs_fs_kernel.h:6,
>                  from /usr/src/linux/include/linux/miscdevice.h:4,
>                  from speakup.c:63:
> /usr/src/linux/include/linux/pagemap.h:35: `currcons' undeclared here (not in a function)
> /usr/src/linux/include/linux/pagemap.h:35: parse error before `.'
> make[4]: *** [speakup.o] Error 1
> 
> I'm not sure even where to start trying to describe what I've looked
> at and what I don't understand.  It appears that page_cache_alloc() is
> now an inline function with an argument passed to it, where it used to
> be a #define with no arguments.  I see that struct misc_device now has
> a new member devfs_handle but the other drivers I've looked at rtc.c
> haven't changed their structure members to take this into account.  It
> seems nothing new is necessary because misc_register checks if it's
> been set or not.  The two error lines don't look to me to have anything
> to do with any of these things either currcons isn't used in any of
> the misc_device structure or anything I can see which might end up
> calling page_cache_alloc().  Can anyone give me any ideas what I
> should check to hunt down exactly what's going on here?  It almost
> looks like gcc is getting screwed up in it's parsing or something.
> 
> Any ideas will greatefully be accepted I'm lost!
> 
>   Kirk
> 
> -- 
> 
> Kirk Reiser				The Computer Braille Facility
> e-mail: kirk@braille.uwo.ca		University of Western Ontario
> phone: (519) 661-3061
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: my patches won't compile under 2.4.7
  2001-07-25 19:58       ` Alan Cox
@ 2001-07-25 20:10         ` Kirk Reiser
  0 siblings, 0 replies; 662+ messages in thread
From: Kirk Reiser @ 2001-07-25 20:10 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

Huh?  Did you actually write something below Alan? or are you just
making me feel insecure? 'grin'

  Kirk

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> > 
> > As of 2.4.7 my patches to the kernel won't compile.  It appears to be
> > something to do with devfs_fs_kernel.h being part of miscdevices.h.  I
> > have sifted through the code but have not been able to determine
> > exactly why they won't work any more.  Here is the error output from
> > my compile:
> > 
> > gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i586    -c -o speakup.o speakup.c
> > In file included from /usr/src/linux/include/linux/locks.h:8,
> >                  from /usr/src/linux/include/linux/devfs_fs_kernel.h:6,
> >                  from /usr/src/linux/include/linux/miscdevice.h:4,
> >                  from speakup.c:63:
> > /usr/src/linux/include/linux/pagemap.h:35: `currcons' undeclared here (not in a function)
> > /usr/src/linux/include/linux/pagemap.h:35: parse error before `.'
> > make[4]: *** [speakup.o] Error 1
> > 
> > I'm not sure even where to start trying to describe what I've looked
> > at and what I don't understand.  It appears that page_cache_alloc() is
> > now an inline function with an argument passed to it, where it used to
> > be a #define with no arguments.  I see that struct misc_device now has
> > a new member devfs_handle but the other drivers I've looked at rtc.c
> > haven't changed their structure members to take this into account.  It
> > seems nothing new is necessary because misc_register checks if it's
> > been set or not.  The two error lines don't look to me to have anything
> > to do with any of these things either currcons isn't used in any of
> > the misc_device structure or anything I can see which might end up
> > calling page_cache_alloc().  Can anyone give me any ideas what I
> > should check to hunt down exactly what's going on here?  It almost
> > looks like gcc is getting screwed up in it's parsing or something.
> > 
> > Any ideas will greatefully be accepted I'm lost!
> > 
> >   Kirk
> > 
> > -- 
> > 
> > Kirk Reiser				The Computer Braille Facility
> > e-mail: kirk@braille.uwo.ca		University of Western Ontario
> > phone: (519) 661-3061
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> 
> 

-- 

Kirk Reiser				The Computer Braille Facility
e-mail: kirk@braille.uwo.ca		University of Western Ontario
phone: (519) 661-3061

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: user-mode port 0.44-2.4.7
       [not found] ` <no.id>
                     ` (10 preceding siblings ...)
  2001-07-25 19:12   ` user-mode port 0.44-2.4.7 Alan Cox
@ 2001-07-25 23:49   ` Alan Cox
  2001-07-26 11:59   ` IGMP join/leave time variability Alan Cox
                     ` (191 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-25 23:49 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Linus Torvalds, linux-kernel

> > This is not a gcc issue. Even if gcc _were_ to generate bad code, the
> > global volatile _still_ wouldn't be the correct answer.
> 
> I think his worry is the pedantic reason that without the volatile gcc is
> allowed to write code that chokes and dies if xtime happens to change in a
> volatile manner.  The example given earlier was:

Make the volatility explicit where it is needed, either to express a barrier
with barrier() or an assignment as in

	foo = (volatile)xtime

This makes it clear where the barriers are and avoids unpleasant
optimisation hits elsewhere.

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* ext3-2.4-0.9.4
@ 2001-07-26  7:34 Andrew Morton
  2001-07-26 11:08 ` ext3-2.4-0.9.4 Matthias Andree
                   ` (2 more replies)
  0 siblings, 3 replies; 662+ messages in thread
From: Andrew Morton @ 2001-07-26  7:34 UTC (permalink / raw)
  To: lkml, ext3-users

An update to the ext3 filesystem for 2.4 kernels is available at

	http://www.uow.edu.au/~andrewm/linux/ext3/

The diffs are against linux-2.4.7 and linux-2.4.6-ac5.

The changelog is there.  One rarely-occurring but oopsable bug
was fixed and several quite significant performance enhancements
have been made.  These are in addition to the performance fixes
which went into 0.9.3.

Ted has put out a prelease of e2fsprogs-1.23 which supports
filesystem type `auto' in /etc/fstab, so it is now possible to
switch between ext3- and non-ext3-kernels without changing
any configuration.

It is recommended that users of earlier ext3 releases upgrade
to 0.9.4.

For people who are undertaking performance testing, it is perhaps
useful to point out that ext3 operates in one of three different
journalling modes, and that these modes have very different
functionality and very different performance characteristics.
Really, you need to test all three and balance the functionality
which each mode offers against the throughput which you obtain
in your application.


The modes are:

data=writeback

  This is classic metadata-only journalling.  File data is written
  back to the main fs lazily.  After a crash+recovery the fs's
  structural integrity is preserved, but the *contents* of files
  can and will contain old, stale data.  Potentially hundreds of
  megabytes of it.

  This is the fastest mode for normal filesystem applications.

data=ordered

  The fs ensures that file data is written into the main fs prior
  to committing its metadata.  Hence after a crash+recovery, your
  files will contain the correct data.

  This is the default operating mode and throughput is good. It
  adds about one second to a four minute kernel compile when
  compared with ext2.   Under heavier loads the difference
  becomes larger.

data=journal

  All data (as well as to metadata) is written to the journal
  before it is released to the main fs for writeback.
  
  This is a specialised mode - for normal fs usage you're better
  off using ordered data, which has the same benefits of not corrupting
  data after crash+recovery.  However for applications which require
  synchronous operation such as mail spools and synchronously exported
  NFS servers, this can be a performance win.  I have seen dbench
  figures in this mode (where the files were opened O_SYNC) running
  at ten times the throughput of ext2.  Not that this is the expected
  benefit for other applications!


Looking at the above issues, one may initially think that the
post-recovery data corruption is a serious issue with writeback mode,
and that there are big advantages to using journalled or ordered data.

However, even in these modes the affected files may be shorter-than-expected
after recovery, because the app hadn't finished writing them yet.  And
usually, a truncated file is just as useless as one which contains
garbage - it needs to be deleted.

It's not really as simple as that - for small (< a few hundred k) files,
it tends to be the case that either the whole file is intact after a crash,
or none of it is.  This is because the journalling mechanism starts a
new transaction every five seconds, and a typical open/write/close operation
usually fits entirely inside this window.

There is also a security issue to be considered: a recovered writeback-mode
filesystem will expose other people's old data to unintended recipients.


Hopefully this description will help people make their deployment choices.
If not, assistance is available on the ext3-users@redhat.com mailing list.

-

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26  7:34 ext3-2.4-0.9.4 Andrew Morton
@ 2001-07-26 11:08 ` Matthias Andree
  2001-07-26 11:42   ` ext3-2.4-0.9.4 Andrew Morton
  2001-07-27  9:32 ` Strange remount behaviour with ext3-2.4-0.9.4 Sean Hunter
  2001-07-30  6:37 ` ext3-2.4-0.9.4 Philipp Matthias Hahn
  2 siblings, 1 reply; 662+ messages in thread
From: Matthias Andree @ 2001-07-26 11:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, ext3-users

On Thu, 26 Jul 2001, Andrew Morton wrote:

> data=journal
> 
>   All data (as well as to metadata) is written to the journal
>   before it is released to the main fs for writeback.
>   
>   This is a specialised mode - for normal fs usage you're better
>   off using ordered data, which has the same benefits of not corrupting
>   data after crash+recovery.  However for applications which require
>   synchronous operation such as mail spools and synchronously exported
>   NFS servers, this can be a performance win.  I have seen dbench

In ordered and journal mode, are meta data operations, namely creating a
file, rename(), link(), unlink() "synchronous" in the sense that after
the call has returned, the effect of this call is never lost, i. e., if
link(2) has returned and the machine crashes immediately, will the next
recovery ALWAYS recover the link?

Or will ext3 still need chattr +S?

Does it still support chattr +S at all?

Synchronous meta data operations are crucial for mail transfer agents
such as Postfix or qmail. Postfix has up until now been setting
chattr +S /var/spool/postfix, making original (esp. soft-updating) BSD
file systems significantly faster for data (payload) writes in this
directory than ext2.

Note: I'm not on the ext3-users list. Please Cc: back replies.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 11:08 ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-26 11:42   ` Andrew Morton
  2001-07-26 12:30     ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 12:32     ` ext3-2.4-0.9.4 Chris Wedgwood
  0 siblings, 2 replies; 662+ messages in thread
From: Andrew Morton @ 2001-07-26 11:42 UTC (permalink / raw)
  To: Matthias Andree; +Cc: lkml, ext3-users

Matthias Andree wrote:
> 
> On Thu, 26 Jul 2001, Andrew Morton wrote:
> 
> > data=journal
> >
> >   All data (as well as to metadata) is written to the journal
> >   before it is released to the main fs for writeback.
> >
> >   This is a specialised mode - for normal fs usage you're better
> >   off using ordered data, which has the same benefits of not corrupting
> >   data after crash+recovery.  However for applications which require
> >   synchronous operation such as mail spools and synchronously exported
> >   NFS servers, this can be a performance win.  I have seen dbench
> 
> In ordered and journal mode, are meta data operations, namely creating a
> file, rename(), link(), unlink() "synchronous" in the sense that after
> the call has returned, the effect of this call is never lost, i. e., if
> link(2) has returned and the machine crashes immediately, will the next
> recovery ALWAYS recover the link?

No, they're not synchronous by default.  After recovery they
will either be wholly intact, or wholly absent.

> Or will ext3 still need chattr +S?

Yes, if the app doesn't support O_SYNC or fsync().  I believe
that MTA's *do* support those things.
 
> Does it still support chattr +S at all?

Yes.

> Synchronous meta data operations are crucial for mail transfer agents
> such as Postfix or qmail. Postfix has up until now been setting
> chattr +S /var/spool/postfix, making original (esp. soft-updating) BSD
> file systems significantly faster for data (payload) writes in this
> directory than ext2.

If postfix is capable of opening the files O_SYNC or of doing
fsync() on them then the `chattr +s' is no longer necessary - unlike
ext2, when the O_SYNC write() or the fsync() return, the directory
contents (as well as the inode, bitmaps, data, etc) will all be tight on
disk and will be restored after a crash.

This should speed things up considerably, especially with journalled-data
mode.  I need to test and characterise this some more to come up with some
quantitative results and configuration recommendations.


BTW, if you have more-than-modest throughput requirements, don't
even *think* of mounting the fs with `mount -o sync'. Our performance
in this mode is terrible :(

I have a hack somewhere which fixes this as much as it can be fixed, but
I didn't even bother committing it.  It's feasible, but tiresome.

A better solution is to fix some lock inversion problems in the core
kernel which prevent optimal implementation of data-journalling
filesystems.  I don't really expect this to occur medium-term or ever.

A middle-ground solution may be to add an fs-private `osync' mount
option, so all files are treated similarly to O_SYNC, which would
work well.

-

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: IGMP join/leave time variability
       [not found] ` <no.id>
                     ` (11 preceding siblings ...)
  2001-07-25 23:49   ` user-mode port 0.44-2.4.7 Alan Cox
@ 2001-07-26 11:59   ` Alan Cox
  2001-07-26 15:52   ` Validating Pointers Alan Cox
                     ` (190 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-26 11:59 UTC (permalink / raw)
  To: Nat Ersoz; +Cc: linux-kernel

> ASAP with respect to the user mode calls.
> 1. What would be the harm if I set IGMP_Initial_Report_Delay to something
> very small like 5 to 10 (jiffies)?  No need for net_random() I'de expect in
> that case?

Read the IGMP RFC documents they discuss in detail the cases where time
delays and randomness are needed and important. 

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 11:42   ` ext3-2.4-0.9.4 Andrew Morton
@ 2001-07-26 12:30     ` Matthias Andree
  2001-07-26 12:58       ` ext3-2.4-0.9.4 Rik van Riel
                         ` (2 more replies)
  2001-07-26 12:32     ` ext3-2.4-0.9.4 Chris Wedgwood
  1 sibling, 3 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-26 12:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthias Andree, lkml, ext3-users

On Thu, 26 Jul 2001, Andrew Morton wrote:

> > In ordered and journal mode, are meta data operations, namely creating a
> > file, rename(), link(), unlink() "synchronous" in the sense that after
> > the call has returned, the effect of this call is never lost, i. e., if
> > link(2) has returned and the machine crashes immediately, will the next
> > recovery ALWAYS recover the link?
> 
> No, they're not synchronous by default.  After recovery they
> will either be wholly intact, or wholly absent.
> 
> > Or will ext3 still need chattr +S?
> 
> Yes, if the app doesn't support O_SYNC or fsync().  I believe
> that MTA's *do* support those things.
>  
> > Does it still support chattr +S at all?
> 
> Yes.
> 
> > Synchronous meta data operations are crucial for mail transfer agents
> > such as Postfix or qmail. Postfix has up until now been setting
...
> A middle-ground solution may be to add an fs-private `osync' mount
> option, so all files are treated similarly to O_SYNC, which would
> work well.

You seem to be missing the point, because I wasn't verbose enough, so I
will try to rephrase this and explain. This may turn out to be a feature
request. :-}

Before going into detail, MTAs do know about fsync(). ext3 synching
relevant directory parts as part of fsync() is a great achievement.
Finally, more than five years after initial complaints, Linux is SLOWLY
getting somewhere for speeding up reliable MTA operation.

But that's the smaller piece. Common MTAs such as Postfix or qmail
rename or link files into place (their queues, the mail spool). With the
advent of journalling came the atomicity of rename operations. That's
also a great achievement.

However, the remaining problem is being synchronous with respect to open
(fixed for ext3 with your fsync() as I understand it), rename, link and
unlink. With ext2, and as you write it, with ext3 as well, there is
currently no way to tell when the link/rename has been committed to
disk, unless you set mount -o sync or chattr +S or call sync() (the
former is not an option because it's far too expensive).


The official statement by Dr. Wietse Venema (who wrote Postfix) is,
Postfix REQUIRES synchronous directory updates (open, rename, link,
unlink, in order of decreasing importance). Wietse refuses to wrap all
these calls for Linux.

Similar assumptions hold for qmail.


So, what would help the common MTA? osync wouldn't, MTAs know how to use
fsync().  dirsync or bsdstyle or however it's called, as chattr and
mount options, would help. This option should make all directory
operations (open/creat/fsync, rename, link, unlink, symlink, possibly
close) synchronous in respect to affected directory and meta data while
leaving application data (payload) operations asynchronous (applications
can then choose when to call fsync() to flush the data to disk).

A much better file system for an MTA might be ext3fs with
data=journalled and dirsync mount/chattr option. Would you deem it
possible to get such an option done before ext3fs 1.0.0?

I hope this makes the requirements of this particular group of
applications clear.

Thanks again to everyone involved with the ext3fs development.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 11:42   ` ext3-2.4-0.9.4 Andrew Morton
  2001-07-26 12:30     ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-26 12:32     ` Chris Wedgwood
  1 sibling, 0 replies; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-26 12:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthias Andree, lkml, ext3-users

On Thu, Jul 26, 2001 at 09:42:37PM +1000, Andrew Morton wrote:

    If postfix is capable of opening the files O_SYNC or of doing
    fsync() on them then the `chattr +s' is no longer necessary -
    unlike ext2, when the O_SYNC write() or the fsync() return, the
    directory contents (as well as the inode, bitmaps, data, etc) will
    all be tight on disk and will be restored after a crash.

    This should speed things up considerably, especially with
    journalled-data mode.  I need to test and characterise this some
    more to come up with some quantitative results and configuration
    recommendations.

Postfix does an fsync on file before closing them, it then does a
rename and expects once rename as returned, the renamed actually
occured --- even if the fs crashes.  It also expects if you fsync a
file, then it will appear in the parent directory with certainty and
not say /lost+found after fsck on reboot.

Without +s under ext2, you can loose file(s) in /lost+found because
open+write+fsync+close works and ensures the data is on disk, but the
parent directory doesn't get synced to disk, so it might get lost.




  --cw

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 12:30     ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-26 12:58       ` Rik van Riel
  2001-07-26 13:17         ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 14:09       ` ext3-2.4-0.9.4 Andrew Morton
  2001-07-26 15:51       ` ext3-2.4-0.9.4 Linus Torvalds
  2 siblings, 1 reply; 662+ messages in thread
From: Rik van Riel @ 2001-07-26 12:58 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Andrew Morton, lkml, ext3-users

On Thu, 26 Jul 2001, Matthias Andree wrote:

> So, what would help the common MTA?

Not relying on non-supported semantics to save your ass.

Rename() is atomic in the sense that you either see the
old name or the new name, but I don't know of systems
which guarantee atomicity across a system crash.

In fact, knowing how hard disks work mechanically, only
journaling filesystems could have an extention to make
this work.  Ie. this is NOT something you can rely on ;)

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 12:58       ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-26 13:17         ` Matthias Andree
  2001-07-26 13:23           ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-26 13:52           ` ext3-2.4-0.9.4 Alan Cox
  0 siblings, 2 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-26 13:17 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Matthias Andree, Andrew Morton, lkml, ext3-users

On Thu, 26 Jul 2001, Rik van Riel wrote:

> In fact, knowing how hard disks work mechanically, only
> journaling filesystems could have an extention to make
> this work.  Ie. this is NOT something you can rely on ;)

This is not about failing hard disks. It is about premature
acknowledgment of something which has not happened at that time.

Linux cannot possibly fix all incomplete protocols, specifications and
implementation, but it can fix its own behaviour.

Everything is about speed, and allowing the MTA to use a (weaker)
dirsync rather than allsync option would speed things up without
sacrificing reliability.

MTA reliability is NOT about failing disk drives. If it falls over, you
notice that. If files are in the wrong directory or not there at all,
you don't necessarily notice until someone complains.

Please don't get in the way of finally fixing things just because
someone might have a broken item that could endanger your data. I have a
huge magnet here...

The competition is there and it has names: BSD + ufs + softupdates,
Solaris + logging ufs. Read MTA mailing lists before obstructing.

Thanks.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 13:17         ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-26 13:23           ` Rik van Riel
  2001-07-26 13:58             ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 13:52           ` ext3-2.4-0.9.4 Alan Cox
  1 sibling, 1 reply; 662+ messages in thread
From: Rik van Riel @ 2001-07-26 13:23 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Andrew Morton, lkml, ext3-users

On Thu, 26 Jul 2001, Matthias Andree wrote:
> On Thu, 26 Jul 2001, Rik van Riel wrote:
>
> > In fact, knowing how hard disks work mechanically, only
> > journaling filesystems could have an extention to make
> > this work.  Ie. this is NOT something you can rely on ;)
>
> This is not about failing hard disks. It is about premature
> acknowledgment of something which has not happened at that time.

So you didn't read what I was writing.

Let me explain it to you slowly:

Disks.  Write.  One.  Write.  At.  A.  Time.

A rename often needs as many as 4 or 5 writes,
ergo, you CANNOT do a rename atomically without
journaling and transactions.

> The competition is there and it has names: BSD + ufs + softupdates,
> Solaris + logging ufs. Read MTA mailing lists before obstructing.

BSD + softupdates is physically incapable of doing what
you suggest it does.  This can be proven from the lack
of transactions and the way hard disks work physically.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 13:17         ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 13:23           ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-26 13:52           ` Alan Cox
  2001-07-26 13:55             ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-26 14:32             ` ext3-2.4-0.9.4 Matthias Andree
  1 sibling, 2 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-26 13:52 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Rik van Riel, Matthias Andree, Andrew Morton, lkml, ext3-users

> On Thu, 26 Jul 2001, Rik van Riel wrote:
> > In fact, knowing how hard disks work mechanically, only
> > journaling filesystems could have an extention to make
> > this work.  Ie. this is NOT something you can rely on ;)
> 
> This is not about failing hard disks. It is about premature
> acknowledgment of something which has not happened at that time.

Rik is right. It isnt just about premature notification - its about 
atomicity. At the point you are notified the data has been queued for disk
I/O. Even on traditional BSD ufs with synchronous metadata you still had
points where a crash left the rename partially complete and nothing but a
log or an atomic update system is going to fix that.

> The competition is there and it has names: BSD + ufs + softupdates,
> Solaris + logging ufs. Read MTA mailing lists before obstructing.

All of which are - not unsuprisingly - using a log. In fact Solaris logging
ufs and ext3 are very similar ideas - adding a log to an existing fs.

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 13:52           ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-26 13:55             ` Rik van Riel
  2001-07-26 14:12               ` ext3-2.4-0.9.4 Andrew Morton
  2001-07-26 14:45               ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 14:32             ` ext3-2.4-0.9.4 Matthias Andree
  1 sibling, 2 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-26 13:55 UTC (permalink / raw)
  To: Alan Cox; +Cc: Matthias Andree, Andrew Morton, lkml, ext3-users

On Thu, 26 Jul 2001, Alan Cox wrote:

> > The competition is there and it has names: BSD + ufs + softupdates,
> > Solaris + logging ufs. Read MTA mailing lists before obstructing.
>
> All of which are - not unsuprisingly - using a log. In fact
> Solaris logging ufs and ext3 are very similar ideas - adding a
> log to an existing fs.

Softupdates isn't using logging.  Furthermore, even
the journaling filesystems won't all guarantee that
the various parts of a rename() operation will all
be in the same transaction.

An MTA which relies on this is therefore Broken(tm).

cheers,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 13:23           ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-26 13:58             ` Matthias Andree
  0 siblings, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-26 13:58 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Matthias Andree, Andrew Morton, lkml, ext3-users

On Thu, 26 Jul 2001, Rik van Riel wrote:

> On Thu, 26 Jul 2001, Matthias Andree wrote:
> > On Thu, 26 Jul 2001, Rik van Riel wrote:
> >
> > > In fact, knowing how hard disks work mechanically, only
> > > journaling filesystems could have an extention to make
> > > this work.  Ie. this is NOT something you can rely on ;)
> >
> > This is not about failing hard disks. It is about premature
> > acknowledgment of something which has not happened at that time.
> 
> So you didn't read what I was writing.

Sorry.

> Let me explain it to you slowly:
> 
> Disks.  Write.  One.  Write.  At.  A.  Time.
> 
> A rename often needs as many as 4 or 5 writes,
> ergo, you CANNOT do a rename atomically without
> journaling and transactions.

You're missing the point, with this as the previous mail. The MTA is not
going to change from one unsupported/incompatible interface (that only
Linux suffers from) and if it did, it would still do the wrong thing.

MTAs often run multiple processes, and if these all open the same
directory and sync it while others have changes open that don't need a
sync at that time and will sync later, you're getting no further than
with chattr +S or mount -o sync.

It's not about atomicity itself, but about
first. write. all. required. blocks. for. a. certain. change.
physically. to. disc.   and. only. after. this. do. return. from.
rename, link, unlink. function. calls.

I'm aware of phase-tree concepts ("single block write switches from one
consistent state to another") and I'm aware that ext3fs and reiserfs do
feature atomic renames (after crash recovery).

That a drive might fall over or the power might fail before all writes
of a certain rename operation have completed is harmless UNLESS you lied
to someone that the operation was already complete (when it wasn't).

> > The competition is there and it has names: BSD + ufs + softupdates,
> > Solaris + logging ufs. Read MTA mailing lists before obstructing.
> 
> BSD + softupdates is physically incapable of doing what
> you suggest it does.  This can be proven from the lack
> of transactions and the way hard disks work physically.

You misunderstood me. I'm not talking about atomicity.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 12:30     ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 12:58       ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-26 14:09       ` Andrew Morton
  2001-07-26 15:07         ` RFC: chattr/lsattr +DS? was: ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 15:51       ` ext3-2.4-0.9.4 Linus Torvalds
  2 siblings, 1 reply; 662+ messages in thread
From: Andrew Morton @ 2001-07-26 14:09 UTC (permalink / raw)
  To: Matthias Andree; +Cc: lkml, ext3-users

Matthias Andree wrote:
>  
> A much better file system for an MTA might be ext3fs with
> data=journalled and dirsync mount/chattr option.

OK, I've taken a closer look at this.  ext3 has picked up some
cruft from ext2's sync handling which it does not need in the
least.

It will be fairly straightforward and a useful cleanup to
provide the following semantics for either synchronous
mounts or `chattr +S' directories:

* All metadata operations (rename, unlink, link, symlink, etc)
  will be synchronous.  So when the system call returns, the data
  is crash-proofed.

* All write()s will be synchronous.  So when the write() system
  call returns, the data written and all associated metadata
  will be crash-proofed.

  O_SYNC and fsync() will not be necessary - in fact they'll
  slow things down slightly by forcing an unnecessary and
  probably empty commit.

If you crash in the middle of a write, you may end up with a truncated
file on recovery.

This is in fact the behaviour right now, but the performance is
not good.

The performance problem at present is that large write()s have unnecessary
commits in the middle of them.  This is due to the abovementioned
cruft in ext3_get_block() and the things it calls.

> Would you deem it
> possible to get such an option done before ext3fs 1.0.0?

We'd prefer not - we're trying to stabilise things quite
sternly at present. However that doesn't prevent work
on 1.1.0 :)

Seems like a worthwhile thing to do - I'll cut a branch
and do this.  It'll take a couple of weeks - as usual, most
of the work is in development and use of test tools...
But I can't predict at this time when we'll merge it into
the mainline fs.

> I hope this makes the requirements of this particular group of
> applications clear.

Yes, it was useful - thanks.

-

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 13:55             ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-26 14:12               ` Andrew Morton
  2001-07-26 14:45               ` ext3-2.4-0.9.4 Matthias Andree
  1 sibling, 0 replies; 662+ messages in thread
From: Andrew Morton @ 2001-07-26 14:12 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Alan Cox, Matthias Andree, lkml, ext3-users

Rik van Riel wrote:
> 
> 
> Furthermore, even the journaling filesystems won't all guarantee that
> the various parts of a rename() operation will all be in the same
> transaction.

umm..  I'd certainly hope that they do guarantee this.

The only operations which can't trivially fit into a single
transaction are write() and truncate().
 
-

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 13:52           ` ext3-2.4-0.9.4 Alan Cox
  2001-07-26 13:55             ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-26 14:32             ` Matthias Andree
  2001-07-26 15:31               ` ext3-2.4-0.9.4 Daniel Phillips
  1 sibling, 1 reply; 662+ messages in thread
From: Matthias Andree @ 2001-07-26 14:32 UTC (permalink / raw)
  To: Alan Cox; +Cc: Matthias Andree, Rik van Riel, Andrew Morton, lkml, ext3-users

On Thu, 26 Jul 2001, Alan Cox wrote:

> Rik is right. It isnt just about premature notification - its about 
> atomicity. At the point you are notified the data has been queued for disk
> I/O. Even on traditional BSD ufs with synchronous metadata you still had
> points where a crash left the rename partially complete and nothing but a
> log or an atomic update system is going to fix that.

No. Atomic update systems and logs can by no means fix premature
acknowledgements:

Proof:

Assume the OS has a phase tree kind of thing or log that requires
just a single-block write for an atomic rename.

Assume an MTA calls rename(), and the OS by whatever means notifies it of
completion, but actually, the data is only queued, not written.

Assume The MTA receives the acknowledgement (e. g. rename call
returned), sends a "250 mail action complete" packet across the network.

Assume the machine sends the network packed, but not the queued disk
block and then crashes.

--> The single block is lost, the rename operation is lost, but the
operation had been acknowledged. Consequence: the mail is lost. q. e. d.

All this boils down to: 

1. The OS _MUST_ know when a write operation has been physically
committed to non-volatile storage.

2. The OS _MUST_ _NOT_ acknowledge the (assumedly synchronous operation)
any earlier. (This may well include switching off drive write
buffering.)

If the OS cannot fulfill these two basic requirements, I can save all
the log or FS atomicity efforts because they don't get me anywhere.

The problem is not that the operation can fail, the problem IS premature
acknowledgement. Even with atomic updates, as shown above.

Note, of course there is no premature acknowledgement for the
Linux-default asynchronous directory update. There IS for -o sync or
chattr +S -- and that's what MTAs to to guarantee data integrity, and
that's why I'm still suggesting dirsync or something to remedy the
negative data write performance of full-sync.

If the OS tell me "write completed" when it means "I queued your data
for writing", it is BROKEN.

That's my point.

And since the common POSIX OS lacks a dedicated notification feature for
e. g. rename, MTAs have no other choice than to rely on "has completed
when the syscall returns".

BTW, my Linux rename(2) man page doesn't document EIO condition, FreeBSD
4.3-STABLE and SUS v2 do.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 13:55             ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-26 14:12               ` ext3-2.4-0.9.4 Andrew Morton
@ 2001-07-26 14:45               ` Matthias Andree
  2001-07-26 15:02                 ` ext3-2.4-0.9.4 Christoph Hellwig
  2001-07-26 15:28                 ` ext3-2.4-0.9.4 Alan Cox
  1 sibling, 2 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-26 14:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Alan Cox, Matthias Andree, Andrew Morton, lkml, ext3-users

On Thu, 26 Jul 2001, Rik van Riel wrote:

> An MTA which relies on this is therefore Broken(tm).

MTAs rely on TRULY, ULTIMATELY AND DEFINITELY SYNCHRONOUS directory
updates, nothing else. And because they do so, and most systems have
them, and MTAs are portable, they choose chattr +S on Linux. And that's
a performance problem because it doesn't come for free, but also with
synchronous data updates, which are unnecessary because there is
fsync().

That's already the complete story about MTAs on Linux.

If Linux HAD a mode (it doesn't) to have just synchronous directory
updates, MTAs could stop using chattr +S and be faster.


MTAs do NOT care how the file system is internally managed, they only
rely on the rename operation having completed physically on disk before
the "my rename call has returned 0" event. They expect that with the
call returning the rename operation has completed ultimately, finally,
for good, definitely and the old file will not reappear after a crash.

(Note that the atomicity addressed in the man pages and Unix
specifications is a different one: it deals with the visibility of the
changes in the system, not with the functioning of the file system.)

That's why *BSD + softupdates is still recommended over Linux for pure
mail transfer agents by people.

This still implies the drive doesn't lie to the OS about the completion
of write requests: write cache == off.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 14:45               ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-26 15:02                 ` Christoph Hellwig
  2001-07-26 15:48                   ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 15:28                 ` ext3-2.4-0.9.4 Alan Cox
  1 sibling, 1 reply; 662+ messages in thread
From: Christoph Hellwig @ 2001-07-26 15:02 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Alan Cox, Andrew Morton, lkml, ext3-users, Rik van Riel

In article <20010726164516.R17244@emma1.emma.line.org> you wrote:
> On Thu, 26 Jul 2001, Rik van Riel wrote:
>
>> An MTA which relies on this is therefore Broken(tm).

> MTAs rely on TRULY, ULTIMATELY AND DEFINITELY SYNCHRONOUS directory
> updates, nothing else.

And thus they are broken, all caps don't make that less true.

> And because they do so, and most systems have them,

"and most systems have them"...

> MTAs do NOT care how the file system is internally managed, they only
> rely on the rename operation having completed physically on disk before
> the "my rename call has returned 0" event. They expect that with the
> call returning the rename operation has completed ultimately, finally,
> for good, definitely and the old file will not reappear after a crash.

So they rely on undocumented and non standadisized semantics of some
implementations.  I'd call this buggy.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* RFC: chattr/lsattr +DS? was: ext3-2.4-0.9.4
  2001-07-26 14:09       ` ext3-2.4-0.9.4 Andrew Morton
@ 2001-07-26 15:07         ` Matthias Andree
  2001-07-26 15:40           ` Andrew Morton
  0 siblings, 1 reply; 662+ messages in thread
From: Matthias Andree @ 2001-07-26 15:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthias Andree, lkml

On Fri, 27 Jul 2001, Andrew Morton wrote:

> > Would you deem it
> > possible to get such an option done before ext3fs 1.0.0?
> 
> We'd prefer not - we're trying to stabilise things quite
> sternly at present. However that doesn't prevent work
> on 1.1.0 :)
> 
> Seems like a worthwhile thing to do - I'll cut a branch
> and do this.  It'll take a couple of weeks - as usual, most
> of the work is in development and use of test tools...
> But I can't predict at this time when we'll merge it into
> the mainline fs.

So the summary of all this is, as I understand it: for ext3fs 1.0, treat
it with chattr +S and the like as if it was ext2fs, it may or may not be
faster with "mount -o data=journalled" and is well worthwhile for an MTA
to try, a weaker sync option may be introduced after ext3fs 1.0.

Sounds good.

I'm dropping the ext3-users mailing list for now since this is getting
more general.


However, since the ReiserFS team also showed interest in a similar
functionality, and they don't yet support chattr, would it be useful to
specify a "D" option for chattr already?

I have a suggestion: if D is set, but S isn't, no effect. If S is set,
but D is unset, treat S as in the past. If S is set, and D is set,
directory updates are synchronous like with S, but data updates are
asynchronous in spite of S.

This way, booting a kernel without chattr "D" flag support or mounting
the file system as ext2 would have it default to the safer
everything-synchronously mode.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 14:45               ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 15:02                 ` ext3-2.4-0.9.4 Christoph Hellwig
@ 2001-07-26 15:28                 ` Alan Cox
  2001-07-26 20:23                   ` ext3-2.4-0.9.4 Gérard Roudier
  1 sibling, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-26 15:28 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Rik van Riel, Alan Cox, Matthias Andree, Andrew Morton, lkml, ext3-users

> them, and MTAs are portable, they choose chattr +S on Linux. And that's
> a performance problem because it doesn't come for free, but also with
> synchronous data updates, which are unnecessary because there is
> fsync().

chattr +S and atomic updates hitting disk then returning to the app will
give the same performance. You can also fsync() the directory. 

> the "my rename call has returned 0" event. They expect that with the
> call returning the rename operation has completed ultimately, finally,
> for good, definitely and the old file will not reappear after a crash.

Actually the old file re-appearing after the crash is irrelevant. It will
have a previously logged message id. And if you are not doing message id
histories then you have replay races at the SMTP level anyway

> This still implies the drive doesn't lie to the OS about the completion
> of write requests: write cache == off.

Write cache off is not a feature available on many modern disks. You
already lost the battle before you started.

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 14:32             ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-26 15:31               ` Daniel Phillips
  2001-07-26 15:49                 ` ext3-2.4-0.9.4 Andrew Morton
  2001-07-26 15:58                 ` ext3-2.4-0.9.4 Matthias Andree
  0 siblings, 2 replies; 662+ messages in thread
From: Daniel Phillips @ 2001-07-26 15:31 UTC (permalink / raw)
  To: Matthias Andree, Alan Cox
  Cc: Matthias Andree, Rik van Riel, Andrew Morton, lkml, ext3-users

On Thursday 26 July 2001 16:32, Matthias Andree wrote:
> On Thu, 26 Jul 2001, Alan Cox wrote:
> > Rik is right. It isnt just about premature notification - its about
> > atomicity. At the point you are notified the data has been queued
> > for disk I/O. Even on traditional BSD ufs with synchronous metadata
> > you still had points where a crash left the rename partially
> > complete and nothing but a log or an atomic update system is going
> > to fix that.
>
> No. Atomic update systems and logs can by no means fix premature
> acknowledgements:
>
> Proof:
>
> Assume the OS has a phase tree kind of thing or log that requires
> just a single-block write for an atomic rename.
>
> Assume an MTA calls rename(), and the OS by whatever means notifies
> it of completion, but actually, the data is only queued, not written.
>
> Assume The MTA receives the acknowledgement (e. g. rename call
> returned), sends a "250 mail action complete" packet across the
> network.
>
> Assume the machine sends the network packed, but not the queued disk
> block and then crashes.
>
> --> The single block is lost, the rename operation is lost, but the
> operation had been acknowledged. Consequence: the mail is lost. q. e.
> d.
>
> All this boils down to:
>
> 1. The OS _MUST_ know when a write operation has been physically
> committed to non-volatile storage.

We're working on that, see the "[PATCH] 64 bit scsi read/write" thread 
on linux-fsdevel.  About half of it is devoted to investigating the 
detailed semantics of physical write completion.

> 2. The OS _MUST_ _NOT_ acknowledge the (assumedly synchronous
> operation) any earlier. (This may well include switching off drive
> write buffering.)

Yes, for now that's how you have to do it.

> If the OS cannot fulfill these two basic requirements, I can save all
> the log or FS atomicity efforts because they don't get me anywhere.
>
> The problem is not that the operation can fail, the problem IS
> premature acknowledgement. Even with atomic updates, as shown above.

Right now the interface for determining that the operation has actually 
completed is "sync".  Yes, that sucks but with journalling or atomic 
commit it's not nearly as expensive as you might think.  My early flush 
patch does nearly the equivalent of sync, 10 times a second and it 
actually improves performance (it does not attempt to do this under 
high load of course).

We *should* have something like sys_sync_dev(majorminor) or 
sys_sync_fs(mountpoint) (whatever that would look like).  For 
phase-tree the semantics are that the call doesn't return until the 
metaroot of the then-current "branching" tree is known to be safely on 
disk.  (Side note: it's ok to allow subsequent updates on the same 
filesystem to procede while an outstanding sync_dev is waiting for 
confirmation from the block layer, because these don't affect the 
filesystem state the sync_fs is waiting on.)

As I understand it, Ext2 allows much the same semantics.  While we do 
need to do something about exposing a more elegant interface, with Ext3 
you should be ok with +S and a "sync" just before you report to the 
world that the mail transaction is complete.  Ext3 does *not* leave a 
lot of dirty blocks hanging around in normal operation, so sync is not 
nearly as slow as it is with good old Ext2.

> Note, of course there is no premature acknowledgement for the
> Linux-default asynchronous directory update. There IS for -o sync or
> chattr +S -- and that's what MTAs to to guarantee data integrity, and
> that's why I'm still suggesting dirsync or something to remedy the
> negative data write performance of full-sync.
>
> If the OS tell me "write completed" when it means "I queued your data
> for writing", it is BROKEN.
>
> That's my point.
>
> And since the common POSIX OS lacks a dedicated notification feature
> for e. g. rename, MTAs have no other choice than to rely on "has
> completed when the syscall returns".
>
> BTW, my Linux rename(2) man page doesn't document EIO condition,
> FreeBSD 4.3-STABLE and SUS v2 do.

Sounds like a man page bug.

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: RFC: chattr/lsattr +DS? was: ext3-2.4-0.9.4
  2001-07-26 15:07         ` RFC: chattr/lsattr +DS? was: ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-26 15:40           ` Andrew Morton
  0 siblings, 0 replies; 662+ messages in thread
From: Andrew Morton @ 2001-07-26 15:40 UTC (permalink / raw)
  To: Matthias Andree; +Cc: lkml

Matthias Andree wrote:
> 
> On Fri, 27 Jul 2001, Andrew Morton wrote:
> 
> > > Would you deem it
> > > possible to get such an option done before ext3fs 1.0.0?
> >
> > We'd prefer not - we're trying to stabilise things quite
> > sternly at present. However that doesn't prevent work
> > on 1.1.0 :)
> >
> > Seems like a worthwhile thing to do - I'll cut a branch
> > and do this.  It'll take a couple of weeks - as usual, most
> > of the work is in development and use of test tools...
> > But I can't predict at this time when we'll merge it into
> > the mainline fs.
> 
> So the summary of all this is, as I understand it: for ext3fs 1.0, treat
> it with chattr +S and the like as if it was ext2fs, it may or may not be
> faster with "mount -o data=journalled" and is well worthwhile for an MTA
> to try, a weaker sync option may be introduced after ext3fs 1.0.
> 
> Sounds good.
> 
> I'm dropping the ext3-users mailing list for now since this is getting
> more general.
> 
> However, since the ReiserFS team also showed interest in a similar
> functionality, and they don't yet support chattr, would it be useful to
> specify a "D" option for chattr already?

chattr is an ext[23]-specific thing.  reiserfs could certainly
support a similar thing if they have a few bits spare in the
inode.

> I have a suggestion: if D is set, but S isn't, no effect. If S is set,
> but D is unset, treat S as in the past. If S is set, and D is set,
> directory updates are synchronous like with S, but data updates are
> asynchronous in spite of S.

I don't think this would be needed until really proven necessary - for
data, fsync() should work for all filesystems.

There would be one benefit in splitting sync from datasync,
and that is for applications which do not write() their
data in large enough chunks.

When I fix the get_block thing, O_SYNC, `chattr +S' and `mount
-o sync' will provide good, fast synchronous write()s - the
fs will run a commit at the end of the write().  That's just fine as long
as the app is writing its data in goodly chunks.  If it is is using 4k
or 8k chunks (eg: default stdio) then throughput will suffer.  That
would be rather silly of it though.

-

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 15:02                 ` ext3-2.4-0.9.4 Christoph Hellwig
@ 2001-07-26 15:48                   ` Matthias Andree
  2001-07-26 15:54                     ` ext3-2.4-0.9.4 Alan Cox
  2001-07-26 16:13                     ` ext3-2.4-0.9.4 Rik van Riel
  0 siblings, 2 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-26 15:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: lkml

Christoph Hellwig schrieb am Donnerstag, den 26. Juli 2001:

> > MTAs do NOT care how the file system is internally managed, they only
> > rely on the rename operation having completed physically on disk before
> > the "my rename call has returned 0" event. They expect that with the
> > call returning the rename operation has completed ultimately, finally,
> > for good, definitely and the old file will not reappear after a crash.
> 
> So they rely on undocumented and non standadisized semantics of some
> implementations.  I'd call this buggy.

If each in the set of "supported systems" document this behaviour for
themselves, there is no bug. I didn't check however for systems other
than FreeBSD 4.x and Linux. And "Linux support" forces these semantics
with chattr +S, at a high price.

Go tell your opinion to those people that refuse to wrap their
rename/link calls with open()/fsync() calls to the respective parents,
particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
don't possibly know all MTAs.

You will encounter these or similar questions/objections:

1. what systems apart from Linux need this kind of Pampers?

2. manual lookups of parent directories cause additional overhead better
avoided in performance critical systems.

You would not be the first one to tell them...

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 15:31               ` ext3-2.4-0.9.4 Daniel Phillips
@ 2001-07-26 15:49                 ` Andrew Morton
  2001-07-26 20:45                   ` ext3-2.4-0.9.4 Daniel Phillips
  2001-07-26 15:58                 ` ext3-2.4-0.9.4 Matthias Andree
  1 sibling, 1 reply; 662+ messages in thread
From: Andrew Morton @ 2001-07-26 15:49 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Matthias Andree, Alan Cox, Rik van Riel, lkml, ext3-users

Daniel Phillips wrote:
> 
> Ext3 does *not* leave a
> lot of dirty blocks hanging around in normal operation, so sync is not
> nearly as slow as it is with good old Ext2.

eek.

In fully-journalled data mode, we write everything to the journal
in a linear chunk, wait on it, write a commit block, wait on that
and then release all the just-journalled data into the main
filesystem for conventional bdflush/kupdate writeback in twenty
seconds time.

Calling anything which uses fsync_dev() would cause all that writeback
data to be written out and waited on, with the consequential seeking
storm.  Disastrous.

Note that fsync() is OK - in full data journalling mode nothing
is ever attached to i_dirty_buffers.

-

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 12:30     ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 12:58       ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-26 14:09       ` ext3-2.4-0.9.4 Andrew Morton
@ 2001-07-26 15:51       ` Linus Torvalds
  2001-07-31  0:21         ` ext3-2.4-0.9.4 Matti Aarnio
  2001-07-31  0:57         ` ext3-2.4-0.9.4 Matthias Andree
  2 siblings, 2 replies; 662+ messages in thread
From: Linus Torvalds @ 2001-07-26 15:51 UTC (permalink / raw)
  To: linux-kernel

In article <20010726143002.E17244@emma1.emma.line.org>,
Matthias Andree  <matthias.andree@stud.uni-dortmund.de> wrote:
>
>However, the remaining problem is being synchronous with respect to open
>(fixed for ext3 with your fsync() as I understand it), rename, link and
>unlink. With ext2, and as you write it, with ext3 as well, there is
>currently no way to tell when the link/rename has been committed to
>disk, unless you set mount -o sync or chattr +S or call sync() (the
>former is not an option because it's far too expensive).

Congratulations. You have been brainwashed by Dan Bernstein.

Use fsync() on the directory. 

Logical, isn't it?

		Linus

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Validating Pointers
       [not found] ` <no.id>
                     ` (12 preceding siblings ...)
  2001-07-26 11:59   ` IGMP join/leave time variability Alan Cox
@ 2001-07-26 15:52   ` Alan Cox
  2001-07-26 17:09     ` tpepper
  2001-07-26 17:51   ` IGMP join/leave time variability Alan Cox
                     ` (189 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-26 15:52 UTC (permalink / raw)
  To: Cress, Andrew R; +Cc: linux-kernel

> Is there a general (correct) kernel subroutine to validate a pointer
> received in a routine as input from the outside world?  Is access_ok() a
> good one to use?

access_ok may do minimal checks, or no checking at all. The only point at
which you can validate a user point is when you use copy*user and
get/put_user to access the data.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 15:48                   ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-26 15:54                     ` Alan Cox
  2001-07-26 16:18                       ` ext3-2.4-0.9.4 Linus Torvalds
  2001-07-26 16:13                     ` ext3-2.4-0.9.4 Rik van Riel
  1 sibling, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-26 15:54 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Christoph Hellwig, lkml

> Go tell your opinion to those people that refuse to wrap their
> rename/link calls with open()/fsync() calls to the respective parents,
> particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
> don't possibly know all MTAs.

I've pointed things out to Mr Bernstein before. His normal replies are not
helpful and generally vary between random ravings and threatening to sue
people who publish things on web pages he disagrees with.

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 15:31               ` ext3-2.4-0.9.4 Daniel Phillips
  2001-07-26 15:49                 ` ext3-2.4-0.9.4 Andrew Morton
@ 2001-07-26 15:58                 ` Matthias Andree
  1 sibling, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-26 15:58 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: lkml, ext3-users

On Thu, 26 Jul 2001, Daniel Phillips wrote:

> As I understand it, Ext2 allows much the same semantics.  While we do 
> need to do something about exposing a more elegant interface, with Ext3 
> you should be ok with +S and a "sync" just before you report to the 
> world that the mail transaction is complete.  Ext3 does *not* leave a 
> lot of dirty blocks hanging around in normal operation, so sync is not 
> nearly as slow as it is with good old Ext2.

That wasn't my impression, particularly not with data=journalling which
can drop data into the log. It's just: why sync the world if synching
directories does the job and relevant data is synched manually with
fsync()?

However, how big are chances that these interfaces will spread outside
of Linux? That's the crucial point for portable applications. If it's a
kernel <-> libc interface, OK, no problem, but if it's a user-space
interface, it might easily become a useless invention because no-one
uses it in real life. You don't support multiple interfaces in a
portable application because that's a maintenance disaster and often
causes reliability problems because on different platforms, code takes
different paths, so applications won't usually choose limited-use
interfaces (such as sendfile).

BTW, your Message-ID is unqualified == on a collision course in mail
duplicate killers.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 15:48                   ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-26 15:54                     ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-26 16:13                     ` Rik van Riel
  2001-07-26 16:46                       ` ext3-2.4-0.9.4 Alan Cox
  2001-07-26 17:26                       ` ext3-2.4-0.9.4 Matthias Andree
  1 sibling, 2 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-26 16:13 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Christoph Hellwig, lkml

On Thu, 26 Jul 2001, Matthias Andree wrote:
> Christoph Hellwig schrieb am Donnerstag, den 26. Juli 2001:
>
> > So they rely on undocumented and non standadisized semantics of some
> > implementations.  I'd call this buggy.
>
> If each in the set of "supported systems" document this
> behaviour for themselves, there is no bug.

The MTA depends on behaviour which is undefined. Now you
want to go blame the OS ?

> Go tell your opinion to those people that refuse to wrap their
> rename/link calls with open()/fsync() calls to the respective parents,
> particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
> don't possibly know all MTAs.

If you care about your email, probably you should either
teach these people about standards like POSIX or SuS
(and tell them to not rely on undefined behaviour) or
switch to an MTA which isn't broken in various ways ;)

cheers,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 15:54                     ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-26 16:18                       ` Linus Torvalds
  2001-07-26 16:44                         ` ext3-2.4-0.9.4 Alan Cox
                                           ` (2 more replies)
  0 siblings, 3 replies; 662+ messages in thread
From: Linus Torvalds @ 2001-07-26 16:18 UTC (permalink / raw)
  To: linux-kernel

In article <E15PnTJ-0003z0-00@the-village.bc.nu>,
Alan Cox  <alan@lxorguk.ukuu.org.uk> wrote:
>> Go tell your opinion to those people that refuse to wrap their
>> rename/link calls with open()/fsync() calls to the respective parents,
>> particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
>> don't possibly know all MTAs.
>
>I've pointed things out to Mr Bernstein before. His normal replies are not
>helpful and generally vary between random ravings and threatening to sue
>people who publish things on web pages he disagrees with.

Now, now, Alan. He has strong opinions, I'll agree, but I've never see
him threaten to _sue_.

Also, I think he eventually agreed on the logic of fsync() on the
directory, and we even had a bug report (quickly fixed) for reiserfs
because it got confused by it.

Of course, knowing Dan, I suspect the fsync() is accompanied by several
lines of derogatory comments about the need for it (not that I've
checked). 

Everybody tends to agree that synchronous IO is stupid and slow, but
some people are just so fixated with "That is how it has been done for
20 years..".

Logging filesystems together with explicit logging points (namely,
"fsync()") are very obviously a superior answer from a technical
standpoint, but that doesn't impact the emotional arguments ("but I want
things to stay the same!"). 

		Linus

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 16:18                       ` ext3-2.4-0.9.4 Linus Torvalds
@ 2001-07-26 16:44                         ` Alan Cox
  2001-07-26 16:54                         ` ext3-2.4-0.9.4 Larry McVoy
  2001-07-26 18:32                         ` ext3-2.4-0.9.4 Richard A Nelson
  2 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-26 16:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

> >I've pointed things out to Mr Bernstein before. His normal replies are not
> >helpful and generally vary between random ravings and threatening to sue
> >people who publish things on web pages he disagrees with.
> 
> Now, now, Alan. He has strong opinions, I'll agree, but I've never see
> him threaten to _sue_.

Ask Alexey about the end of the syncookie "debate"

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 16:13                     ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-26 16:46                       ` Alan Cox
  2001-07-26 17:26                       ` ext3-2.4-0.9.4 Matthias Andree
  1 sibling, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-26 16:46 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Matthias Andree, Christoph Hellwig, lkml

> If you care about your email, probably you should either
> teach these people about standards like POSIX or SuS
> (and tell them to not rely on undefined behaviour) or
> switch to an MTA which isn't broken in various ways ;)

POSIX and SuS are actually not helpful here. They don't cover how to force
namespace to disk, only data and metadata for the file. So you can portably
stick your data onto disk, portably be sure its on disk, but not portably be
sure the directory entries are on disk.

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 16:18                       ` ext3-2.4-0.9.4 Linus Torvalds
  2001-07-26 16:44                         ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-26 16:54                         ` Larry McVoy
  2001-07-26 17:15                           ` ext3-2.4-0.9.4 Andre Pang
                                             ` (2 more replies)
  2001-07-26 18:32                         ` ext3-2.4-0.9.4 Richard A Nelson
  2 siblings, 3 replies; 662+ messages in thread
From: Larry McVoy @ 2001-07-26 16:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, Jul 26, 2001 at 04:18:59PM +0000, Linus Torvalds wrote:
> In article <E15PnTJ-0003z0-00@the-village.bc.nu>,
> Alan Cox  <alan@lxorguk.ukuu.org.uk> wrote:
> >> Go tell your opinion to those people that refuse to wrap their
> >> rename/link calls with open()/fsync() calls to the respective parents,
> >> particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
> >> don't possibly know all MTAs.
> >
> >I've pointed things out to Mr Bernstein before. His normal replies are not
> >helpful and generally vary between random ravings and threatening to sue
> >people who publish things on web pages he disagrees with.
> 
> Now, now, Alan. He has strong opinions, I'll agree, but I've never see
> him threaten to _sue_.

In the for what it is worth department, I spent the day with Daniel after
the kernel summit meeting a while back, we talked file systems for about
6 or 7 hours.  While I'll plead guilty to getting mad at him (his ego
is up there with mine :-), I came away impressed with his knowledge.
I get the feeling that he thinks deeply about the problems he works on,
he's probably right a lot of the time, *and* as with many deep thinkers,
he has a problem communicating his ideas.

This is a common problem, and I'm not sure Daniel is fully aware of it.
One cannot expect other people to have done the same thinking and have
the same context, and when they do not, it is easy to get frustrated.
I think that some of Daniel's "ravings" are probably just frustration
that the other person "doesn't get it".

That doesn't mean that Daniel is the right hand of God or anything, I've
seen him do some stupid things but I've seen all of us do some stupid
things, so that doesn't mean much.  I think Daniel does way more smart
things than stupid things, and not all of us can claim that (sort of
like half of the drivers are below average, noone likes that idea either).

What I'm trying to say is that I think Daniel is one of the good guys,
even though his user interface could stand improvement (a common thing
amongst smart people) and it looks like it would be smart to figure out
how to work with him.

Just my opinion...
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Validating Pointers
  2001-07-26 15:52   ` Validating Pointers Alan Cox
@ 2001-07-26 17:09     ` tpepper
  2001-07-26 17:12       ` Alan Cox
  0 siblings, 1 reply; 662+ messages in thread
From: tpepper @ 2001-07-26 17:09 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

On Thu 26 Jul at 16:52:48 +0100 alan@lxorguk.ukuu.org.uk done said:
> access_ok may do minimal checks, or no checking at all. The only point at
> which you can validate a user point is when you use copy*user and
> get/put_user to access the data.

Should the i386 access_ok() fail when checking a copy to/from userspace
from/to a static in a driver module?  The __copy_to|from_user work fine
and copy_to|from_user fail, but I guess that doesn't mean access_ok()
is the culprit.  I don't know intel assembly and the platforms for
which I do get the assembly don't do much in access_ok() so there's no
comparing...but I'd have thought they'd be more concerned with the user
address location than the kernel one.

t.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Validating Pointers
  2001-07-26 17:09     ` tpepper
@ 2001-07-26 17:12       ` Alan Cox
  2001-07-27  3:19         ` tpepper
  0 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-26 17:12 UTC (permalink / raw)
  To: tpepper; +Cc: Alan Cox, linux-kernel

> Should the i386 access_ok() fail when checking a copy to/from userspace
> from/to a static in a driver module?  The __copy_to|from_user work fine
> and copy_to|from_user fail, but I guess that doesn't mean access_ok()
> is the culprit.  I don't know intel assembly and the platforms for
> which I do get the assembly don't do much in access_ok() so there's no
> comparing...but I'd have thought they'd be more concerned with the user
> address location than the kernel one.

You can't pass kernel address as if they were userspace. It might happen to
sometimes work on some architectures. Take a look at the set_fs() stuff

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 16:54                         ` ext3-2.4-0.9.4 Larry McVoy
@ 2001-07-26 17:15                           ` Andre Pang
  2001-07-26 17:58                             ` ext3-2.4-0.9.4 Hans Reiser
                                               ` (2 more replies)
  2001-07-26 17:41                           ` ext3-2.4-0.9.4 Larry McVoy
  2001-07-26 22:17                           ` ext3-2.4-0.9.4 Daniel Phillips
  2 siblings, 3 replies; 662+ messages in thread
From: Andre Pang @ 2001-07-26 17:15 UTC (permalink / raw)
  To: Larry McVoy, linux-kernel

On Thu, Jul 26, 2001 at 09:54:52AM -0700, Larry McVoy wrote:

> What I'm trying to say is that I think Daniel is one of the good guys,
> even though his user interface could stand improvement (a common thing
> amongst smart people) and it looks like it would be smart to figure out
> how to work with him.

there's a work-in-progress called ReiserSMTP[1] which rewrites
some bits of qmail so it works better with ReiserFS, although i
can imagine that it would improve things on Linux as a whole.

this is getting off-topic, but since the various parties involved
(Linux vs djb/Wietse/etc[2]) are probably never going to agree
on semantics, i'm wondering if it's possible to ask them to
write the software in such a way that it's possible to 'drop in'
your own functions relevant for sync'ing.  then the MTA writers
can go and use their traditional filesystem assumptions and
Linux users can produce very small patches to support the
correct behaviour under Linux.

it would be _nice_ if the ext3 guys would be more willing to
implement directory-syncing on link/rename/etc, though, even as
an option.  a 'mount -o dirsync' would be enough; no need for
chattr +D stuff.  Linux tends to have a bad name as a platform
as an MTA just because of all this, which is a shame.  it would be
nice if a fix is possible.  *nudge nudge Mr. Morton* :)

    [1] http://www.jedi.claranet.fr/reisersmtp.html

    [2] hey, this might be the first time they agree on
        anything!


-- 
#ozone/algorithm <ozone@algorithm.com.au>          - trust.in.love.to.save

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 16:13                     ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-26 16:46                       ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-26 17:26                       ` Matthias Andree
  1 sibling, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-26 17:26 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Matthias Andree, lkml

On Thu, 26 Jul 2001, Rik van Riel wrote:

> The MTA depends on behaviour which is undefined. Now you
> want to go blame the OS ?

No, the behaviour is defined on certain systems. Not sure if that
comprises all supported systems.

I'm not blaming anybody besides Linux which does not offer the "noasync"
(FreeBSD) compromise between sync and async. I don't see any reason why
this option cannot be there. Is it too expensive too implement? No-one
said so.

I cannot tell if and how the MTA authors checked all their supported OSs
how they handle metadata updates.

> If you care about your email, probably you should either
> teach these people about standards like POSIX or SuS
> (and tell them to not rely on undefined behaviour) or
> switch to an MTA which isn't broken in various ways ;)

Wee. And then, I tell the system to comply with that as well, don't I?
;)

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 16:54                         ` ext3-2.4-0.9.4 Larry McVoy
  2001-07-26 17:15                           ` ext3-2.4-0.9.4 Andre Pang
@ 2001-07-26 17:41                           ` Larry McVoy
  2001-07-26 22:17                           ` ext3-2.4-0.9.4 Daniel Phillips
  2 siblings, 0 replies; 662+ messages in thread
From: Larry McVoy @ 2001-07-26 17:41 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

Arrg, I take it all back, I'm taking about Daniel Phillips not Daniel
Bernstein.  I tend to agree with Alan about Mr Bernstein.

Thanks to Richard Gooch for pointing out that I'm asleep at the switch.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: IGMP join/leave time variability
       [not found] ` <no.id>
                     ` (13 preceding siblings ...)
  2001-07-26 15:52   ` Validating Pointers Alan Cox
@ 2001-07-26 17:51   ` Alan Cox
  2001-07-26 22:10   ` Proliant ML530, Megaraid 493 (Elite 1600), 2.4.7, timeout & abort Alan Cox
                     ` (188 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-26 17:51 UTC (permalink / raw)
  To: Torrey Hoffman; +Cc: 'Alan Cox', Nat Ersoz, linux-kernel, davem

> >From this, I infer that there should be _no_ initial delay on sending 
> the IGMP join.  In fact, a quick peek at the source confirms this: 
> (net/ipv4/igmp.c):
> 
> #define IGMP_Initial_Report_Delay               (1*HZ)
> 
> /* IGMP_Initial_Report_Delay is not from IGMP specs!
>  * IGMP specs require to report membership immediately after
>  * joining a group, but we delay the first report by a
>  * small interval. It seems more natural and still does not
>  * contradict to specs provided this delay is small enough.
>  */
> 
> But this "small interval" is actually very noticeable in our application.

I suspect the small interval for the first one should be 1 not 1*HZ. That
would keep a little bit of jitter which is good to avoid the multicast
receive/join group problem

[Lots of clients all running an app listening for multicast packets, one
 packet says 'do xyz on this group' and they all then send joins at the same
 instant]

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 17:15                           ` ext3-2.4-0.9.4 Andre Pang
@ 2001-07-26 17:58                             ` Hans Reiser
  2001-07-28 22:45                               ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-27  4:28                             ` ext3-2.4-0.9.4 Andrew Morton
  2001-07-27 16:24                             ` ext3-2.4-0.9.4 Lawrence Greenfield
  2 siblings, 1 reply; 662+ messages in thread
From: Hans Reiser @ 2001-07-26 17:58 UTC (permalink / raw)
  To: Andre Pang; +Cc: Larry McVoy, linux-kernel

Andre Pang wrote:
> 
> On Thu, Jul 26, 2001 at 09:54:52AM -0700, Larry McVoy wrote:
> 
> > What I'm trying to say is that I think Daniel is one of the good guys,
> > even though his user interface could stand improvement (a common thing
> > amongst smart people) and it looks like it would be smart to figure out
> > how to work with him.
> 
> there's a work-in-progress called ReiserSMTP[1] which rewrites
> some bits of qmail so it works better with ReiserFS, although i
> can imagine that it would improve things on Linux as a whole.

It stopped due to flakiness on the part of all parties including myself, the programmer, and the
sponsor, but it would be nice if a sponsor and programmer came along to restart it.

> 
> this is getting off-topic, but since the various parties involved
> (Linux vs djb/Wietse/etc[2]) are probably never going to agree
> on semantics, i'm wondering if it's possible to ask them to
> write the software in such a way that it's possible to 'drop in'
> your own functions relevant for sync'ing.  then the MTA writers
> can go and use their traditional filesystem assumptions and
> Linux users can produce very small patches to support the
> correct behaviour under Linux.
> 
> it would be _nice_ if the ext3 guys would be more willing to
> implement directory-syncing on link/rename/etc, though, even as
> an option.  a 'mount -o dirsync' would be enough; no need for
> chattr +D stuff.  Linux tends to have a bad name as a platform
> as an MTA just because of all this, which is a shame.  it would be
> nice if a fix is possible.  *nudge nudge Mr. Morton* :)
> 
>     [1] http://www.jedi.claranet.fr/reisersmtp.html
> 
>     [2] hey, this might be the first time they agree on
>         anything!
> 
> --
> #ozone/algorithm <ozone@algorithm.com.au>          - trust.in.love.to.save
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


No, Linus is right and the MTA guys are just wrong.  The mailers are the place to fix things, not
the kernel.  If the mailer guys want to depend on the kernel being stupidly designed, tough. 
Someone should fix their mailer code and then it would run faster on Linux than on any other
platform.

Hans

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 16:18                       ` ext3-2.4-0.9.4 Linus Torvalds
  2001-07-26 16:44                         ` ext3-2.4-0.9.4 Alan Cox
  2001-07-26 16:54                         ` ext3-2.4-0.9.4 Larry McVoy
@ 2001-07-26 18:32                         ` Richard A Nelson
  2001-07-26 19:37                           ` ext3-2.4-0.9.4 Linus Torvalds
  2001-07-26 20:55                           ` ext3-2.4-0.9.4 Anton Altaparmakov
  2 siblings, 2 replies; 662+ messages in thread
From: Richard A Nelson @ 2001-07-26 18:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, 26 Jul 2001, Linus Torvalds wrote:

> In article <E15PnTJ-0003z0-00@the-village.bc.nu>,
> Alan Cox  <alan@lxorguk.ukuu.org.uk> wrote:
> >> Go tell your opinion to those people that refuse to wrap their
> >> rename/link calls with open()/fsync() calls to the respective parents,
> >> particularly Daniel J. Bernstein, Wietse Z. Venema, among others. I
> >> don't possibly know all MTAs.

[snip]
> Also, I think he eventually agreed on the logic of fsync() on the
> directory, and we even had a bug report (quickly fixed) for reiserfs
> because it got confused by it.

In looking at the synchronous directory options, I'm unsure as to
the 'real' status wrt fsync() on a directory:
	1) Does fsync() of a directory work on most/all current FS?
	2) Does it work on 2.2.x as well as 2.4.x?
-- 
Rick Nelson
"... being a Linux user is sort of like living in a house inhabited
by a large family of carpenters and architects. Every morning when
you wake up, the house is a little different. Maybe there is a new
turret, or some walls have moved. Or perhaps someone has temporarily
removed the floor under your bed." - Unix for Dummies, 2nd Edition
	-- found in the .sig of Rob Riggs, rriggs@tesser.com


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 18:32                         ` ext3-2.4-0.9.4 Richard A Nelson
@ 2001-07-26 19:37                           ` Linus Torvalds
  2001-07-26 20:04                             ` ext3-2.4-0.9.4 Richard A Nelson
  2001-07-26 20:55                           ` ext3-2.4-0.9.4 Anton Altaparmakov
  1 sibling, 1 reply; 662+ messages in thread
From: Linus Torvalds @ 2001-07-26 19:37 UTC (permalink / raw)
  To: Richard A Nelson; +Cc: linux-kernel


On Thu, 26 Jul 2001, Richard A Nelson wrote:
>
> In looking at the synchronous directory options, I'm unsure as to
> the 'real' status wrt fsync() on a directory:
> 	1) Does fsync() of a directory work on most/all current FS?

Modulo bugs, yes.

Now, there's another issue, of course: if you have an important mail-spool
on some of the less tested filesystems, I would consider you crazy
regardless of fsync() working ;). I don't think anybody has ever verified
that fsync() (or much anything else wrt writing) does the right thing on
NTFS, for example.

> 	2) Does it work on 2.2.x as well as 2.4.x?

Yes. However, there may be performance issues. As with just about
anything, we didn't start optimizing things until it became a real issue,
and in some cases at least historically the filesystems fell back on just
doing a whole "fsync_dev()" if they had nothing better to do.

I think later 2.2.x kernels (ie the ones past the point where Alan took
over) probably have the fsync() optimizations at least for ext2.

		Linus


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 19:37                           ` ext3-2.4-0.9.4 Linus Torvalds
@ 2001-07-26 20:04                             ` Richard A Nelson
  0 siblings, 0 replies; 662+ messages in thread
From: Richard A Nelson @ 2001-07-26 20:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, 26 Jul 2001, Linus Torvalds wrote:

> > 	1) Does fsync() of a directory work on most/all current FS?
>
> Modulo bugs, yes.

Great, that was a big concern

> Now, there's another issue, of course: if you have an important mail-spool
> on some of the less tested filesystems, I would consider you crazy
> regardless of fsync() working ;). I don't think anybody has ever verified
> that fsync() (or much anything else wrt writing) does the right thing on
> NTFS, for example.

Caveat Emptor ;-)

> > 	2) Does it work on 2.2.x as well as 2.4.x?
>
> Yes. However, there may be performance issues. As with just about
> anything, we didn't start optimizing things until it became a real issue,
> and in some cases at least historically the filesystems fell back on just
> doing a whole "fsync_dev()" if they had nothing better to do.
>
> I think later 2.2.x kernels (ie the ones past the point where Alan took
> over) probably have the fsync() optimizations at least for ext2.

That should be recent enough - I push 2.2.19 for shm support and security
reasons anyway - though I see alot of folk on 2.2.16/17.

Are the optimizations more than writing out only changed blocks?
Has anyone any information on the performance differences between
optimized vs non-optimized?

Thanks, I'm feeling much better about getting this support added
-- 
Rick Nelson
Life'll kill ya                         -- Warren Zevon
Then you'll be dead                     -- Life'll kill ya


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 15:28                 ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-26 20:23                   ` Gérard Roudier
  0 siblings, 0 replies; 662+ messages in thread
From: Gérard Roudier @ 2001-07-26 20:23 UTC (permalink / raw)
  To: Alan Cox; +Cc: Matthias Andree, Rik van Riel, Andrew Morton, lkml, ext3-users



On Thu, 26 Jul 2001, Alan Cox wrote:

> > them, and MTAs are portable, they choose chattr +S on Linux. And that's
> > a performance problem because it doesn't come for free, but also with
> > synchronous data updates, which are unnecessary because there is
> > fsync().
>
> chattr +S and atomic updates hitting disk then returning to the app will
> give the same performance. You can also fsync() the directory.
>
> > the "my rename call has returned 0" event. They expect that with the
> > call returning the rename operation has completed ultimately, finally,
> > for good, definitely and the old file will not reappear after a crash.
>
> Actually the old file re-appearing after the crash is irrelevant. It will
> have a previously logged message id. And if you are not doing message id
> histories then you have replay races at the SMTP level anyway
>
> > This still implies the drive doesn't lie to the OS about the completion
> > of write requests: write cache == off.
>
> Write cache off is not a feature available on many modern disks. You
> already lost the battle before you started.

Losing the battle of brain-dead hardware is not a problem... :-)

SCSI hard disks are expected to follow the specifications. But, may be,
you are referring to IDE disks, only ...

With SCSI, you can enable write caching and also ask the device to signal
completion of actual write to the media by setting the FUA bit in the SCSI
command block (not available in WRITE(6), but available in WRITE(10)).

  Gérard.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 15:49                 ` ext3-2.4-0.9.4 Andrew Morton
@ 2001-07-26 20:45                   ` Daniel Phillips
  0 siblings, 0 replies; 662+ messages in thread
From: Daniel Phillips @ 2001-07-26 20:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthias Andree, Alan Cox, Rik van Riel, lkml, ext3-users

On Thursday 26 July 2001 17:49, Andrew Morton wrote:
> Daniel Phillips wrote:
> > Ext3 does *not* leave a
> > lot of dirty blocks hanging around in normal operation, so sync is
> > not nearly as slow as it is with good old Ext2.
>
> eek.
>
> In fully-journalled data mode, we write everything to the journal
> in a linear chunk, wait on it, write a commit block, wait on that
> and then release all the just-journalled data into the main
> filesystem for conventional bdflush/kupdate writeback in twenty
> seconds time.
>
> Calling anything which uses fsync_dev() would cause all that
> writeback data to be written out and waited on, with the
> consequential seeking storm.  Disastrous.

Whoops, ok, no, this is not particularly sync-friendly.  On the other
hand, I don't think your seek storm would be as bad as all that.  You
can still feed enough blocks to the elevator to give it something to
chew on.  On the third hand, since you are still using the generic
flushing machinery I can see you'd have quite a lot of work to do to
control the flushing accurately in this way.

> Note that fsync() is OK - in full data journalling mode nothing
> is ever attached to i_dirty_buffers.

Somewhere in there is a beautiful optimization trying to get out...

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 18:32                         ` ext3-2.4-0.9.4 Richard A Nelson
  2001-07-26 19:37                           ` ext3-2.4-0.9.4 Linus Torvalds
@ 2001-07-26 20:55                           ` Anton Altaparmakov
  1 sibling, 0 replies; 662+ messages in thread
From: Anton Altaparmakov @ 2001-07-26 20:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Richard A Nelson, linux-kernel

At 20:37 26/07/2001, Linus Torvalds wrote:
>On Thu, 26 Jul 2001, Richard A Nelson wrote:
> > In looking at the synchronous directory options, I'm unsure as to
> > the 'real' status wrt fsync() on a directory:
> >       1) Does fsync() of a directory work on most/all current FS?
>
>Modulo bugs, yes.
>
>Now, there's another issue, of course: if you have an important mail-spool
>on some of the less tested filesystems, I would consider you crazy
>regardless of fsync() working ;). I don't think anybody has ever verified
>that fsync() (or much anything else wrt writing) does the right thing on
>NTFS, for example.

NTFS doesn't even have an fsync() operation defined so calling fsync() 
system call won't do anything at all. A quick look at 
fs/buffer.c::sys_fsync() shows it will return -EINVAL straight away.

But considering the fsync, even if present may well trash the file or the 
whole partition's data, it's just as well it doesn't happen...

Anton


-- 
   "Nothing succeeds like success." - Alexandre Dumas
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Proliant ML530, Megaraid 493 (Elite 1600), 2.4.7, timeout & abort
       [not found] ` <no.id>
                     ` (14 preceding siblings ...)
  2001-07-26 17:51   ` IGMP join/leave time variability Alan Cox
@ 2001-07-26 22:10   ` Alan Cox
  2001-07-26 22:20   ` Support for serial console on legacy free machines Alan Cox
                     ` (187 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-26 22:10 UTC (permalink / raw)
  To: C. R. Oldham; +Cc: linux-kernel

> the Elite 1600) in them. I am able to get them to boot with 2.2.19, but
> not 2.4.7.   Under 2.4.7 the relevant message is

Ok first thing - does 2.4.6 work. If so then  there is a bug or firmware
incompatibility with the firmware set you have and the newer driver.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 16:54                         ` ext3-2.4-0.9.4 Larry McVoy
  2001-07-26 17:15                           ` ext3-2.4-0.9.4 Andre Pang
  2001-07-26 17:41                           ` ext3-2.4-0.9.4 Larry McVoy
@ 2001-07-26 22:17                           ` Daniel Phillips
  2 siblings, 0 replies; 662+ messages in thread
From: Daniel Phillips @ 2001-07-26 22:17 UTC (permalink / raw)
  To: Larry McVoy, Linus Torvalds; +Cc: linux-kernel

On Thursday 26 July 2001 18:54, Larry McVoy wrote:
> On Thu, Jul 26, 2001 at 04:18:59PM +0000, Linus Torvalds wrote:
> > In article <E15PnTJ-0003z0-00@the-village.bc.nu>,
> >
> > Alan Cox  <alan@lxorguk.ukuu.org.uk> wrote:
> > >> Go tell your opinion to those people that refuse to wrap their
> > >> rename/link calls with open()/fsync() calls to the respective
> > >> parents, particularly Daniel J. Bernstein, Wietse Z. Venema,
> > >> among others. I don't possibly know all MTAs.
> > >
> > >I've pointed things out to Mr Bernstein before. His normal replies
> > > are not helpful and generally vary between random ravings and
> > > threatening to sue people who publish things on web pages he
> > > disagrees with.
> >
> > Now, now, Alan. He has strong opinions, I'll agree, but I've never
> > see him threaten to _sue_.
>
> In the for what it is worth department, I spent the day with Daniel
> after the kernel summit meeting a while back, we talked file systems
> for about 6 or 7 hours.  While I'll plead guilty to getting mad at
> him (his ego is up there with mine :-), I came away impressed with
> his knowledge. I get the feeling that he thinks deeply about the
> problems he works on, he's probably right a lot of the time, *and* as
> with many deep thinkers, he has a problem communicating his ideas.
>
> This is a common problem, and I'm not sure Daniel is fully aware of
> it. One cannot expect other people to have done the same thinking and
> have the same context, and when they do not, it is easy to get
> frustrated. I think that some of Daniel's "ravings" are probably just
> frustration that the other person "doesn't get it".
>
> That doesn't mean that Daniel is the right hand of God or anything,
> I've seen him do some stupid things but I've seen all of us do some
> stupid things, so that doesn't mean much.  I think Daniel does way
> more smart things than stupid things, and not all of us can claim
> that (sort of like half of the drivers are below average, noone likes
> that idea either).
>
> What I'm trying to say is that I think Daniel is one of the good
> guys, even though his user interface could stand improvement (a
> common thing amongst smart people) and it looks like it would be
> smart to figure out how to work with him.
>
> Just my opinion...

Heh, very interesting, but you seem to have created a collage of two
different Daniels ;-)

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Support for serial console on legacy free machines
       [not found] ` <no.id>
                     ` (15 preceding siblings ...)
  2001-07-26 22:10   ` Proliant ML530, Megaraid 493 (Elite 1600), 2.4.7, timeout & abort Alan Cox
@ 2001-07-26 22:20   ` Alan Cox
  2001-07-30 17:47     ` Khalid Aziz
  2001-07-27  9:27   ` IGMP join/leave time variability David S. Miller
                     ` (186 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-26 22:20 UTC (permalink / raw)
  To: Khalid Aziz; +Cc: LKML

> console is "Serial Port Console Redirection" (SPCR) table. This table
> gives me almost all the information I need to initialize and use a
> serial console. The bummer is this table was designed by Microsoft and
> Microsoft owns the copyright on it. Microsoft primarily designed this
> table for use by Whistler. Their copyright may cause potential problems
> with using it in Linux. This makes me reluctant to use this table. I

Such as ?

If its a table that microsoft added to ACPI and its well thought out I don't
see a big problem technically. There are a collection of BIOS services we
use that were microsoft originated

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Validating Pointers
  2001-07-26 17:12       ` Alan Cox
@ 2001-07-27  3:19         ` tpepper
  2001-07-27  9:47           ` Alan Cox
  0 siblings, 1 reply; 662+ messages in thread
From: tpepper @ 2001-07-27  3:19 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

On Thu 26 Jul at 18:12:57 +0100 alan@lxorguk.ukuu.org.uk done said:
> 
> You can't pass kernel address as if they were userspace. It might happen to
> sometimes work on some architectures. Take a look at the set_fs() stuff

Am I?  I though I was doing a pretty plain user<->kernel copy:

	copy_to_user(user_addr, kernel_addr, size);
		and
	copy_from_user(kernel_addr, user_addr, size);

Are you saying that static and dynamically allocated kernel variables end up
in different segments (kernel_ds and user_ds) and the copy is only expected to
succeed if the to and from addresses are in the same segment?

Tim

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 17:15                           ` ext3-2.4-0.9.4 Andre Pang
  2001-07-26 17:58                             ` ext3-2.4-0.9.4 Hans Reiser
@ 2001-07-27  4:28                             ` Andrew Morton
  2001-08-01 15:51                               ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2001-07-27 16:24                             ` ext3-2.4-0.9.4 Lawrence Greenfield
  2 siblings, 1 reply; 662+ messages in thread
From: Andrew Morton @ 2001-07-27  4:28 UTC (permalink / raw)
  To: Andre Pang; +Cc: linux-kernel

Andre Pang wrote:
> 
> it would be _nice_ if the ext3 guys would be more willing to
> implement directory-syncing on link/rename/etc, though, even as
> an option.  a 'mount -o dirsync' would be enough; no need for
> chattr +D stuff.  Linux tends to have a bad name as a platform
> as an MTA just because of all this, which is a shame.  it would be
> nice if a fix is possible.  *nudge nudge Mr. Morton* :)

Perhaps I didn't understand the requirement.

I believe that `dirsync' would provide synchronous metadata
operations (ie: the metadata is crashproofed on-disk when
the syscall returns), but non-sync data.  Correct?

Whereas `mount -o sync' or `chattr +S' would provide synchronous
metadata operations PLUS synchronous data, so when write()
returns, the data which was written is crashproofed.

Is that your understanding of the difference?

If so, then with `dirsync', the application would have to
open the file O_SYNC (which would make the whole thing pointless!)
or it would run fsync() when it had finished writing the file.

So what it boils down to is that dirsync will improve the
efficiency of applications which do a bunch of small writes
and then an fsync.

If, however, the application is capable of doing a nice big
write() (setvbuf!) then really, the two things will be pretty
much the same.

Wait and see how the benchmarks turn out, yes?


One problem at present is that an application could be in the
middle of a nice big write(), but another thread comes up and
does a synchronous creat().  That will force a commit right in the middle
of the write().  It would be better (I think) if the write's transaction
were allowed to run to completion and the creat() caller blocks until
the write() finishes - this way the write(), the creat() and anything
else which happened during the write() would all be written out in a
single compound transaction.

Alas, we cannot run a transaction handle for more than a single
page in write() because of locking inversion problems with i_sem
and the lock_page outside ->writepage().  i_sem is trivial to fix,
but writepage is not.  It has not really proven to be a problem
yet, but it would be nice to be able to _guarantee_ that writes
up to a particular size (100k, say) were 100% atomic.

-

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: IGMP join/leave time variability
       [not found] ` <no.id>
                     ` (16 preceding siblings ...)
  2001-07-26 22:20   ` Support for serial console on legacy free machines Alan Cox
@ 2001-07-27  9:27   ` David S. Miller
  2001-07-27  9:54   ` 2.4.7 + VIA Pro266 + 2xUltraTx2 lockups Alan Cox
                     ` (185 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: David S. Miller @ 2001-07-27  9:27 UTC (permalink / raw)
  To: Alan Cox; +Cc: Torrey Hoffman, Nat Ersoz, linux-kernel


Alan Cox writes:
 > > But this "small interval" is actually very noticeable in our application.
 > 
 > I suspect the small interval for the first one should be 1 not 1*HZ. That
 > would keep a little bit of jitter which is good to avoid the multicast
 > receive/join group problem

I've changed it to "1" in my sources.  Thanks.

Later,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Strange remount behaviour with ext3-2.4-0.9.4
  2001-07-26  7:34 ext3-2.4-0.9.4 Andrew Morton
  2001-07-26 11:08 ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-27  9:32 ` Sean Hunter
  2001-07-27 10:24   ` Andrew Morton
  2001-07-27 20:39   ` Michal Jaegermann
  2001-07-30  6:37 ` ext3-2.4-0.9.4 Philipp Matthias Hahn
  2 siblings, 2 replies; 662+ messages in thread
From: Sean Hunter @ 2001-07-27  9:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Following the announcement on lkml, I have started using ext3 on one of my
servers.  Since the server in question is a farily security-sensitive box, my
/usr partition is mounted read only except when I remount rw to install
packages.

I converted this partition to run ext3 with the mount options
"nodev,ro,data=writeback,defaults" figuring that when I need to install new
packages etc, that I could just mount rw as before and that metadata-only
journalling would be ok for this partition as it really sees very little write
activity.

When I try to remount it r/w I get a log message saying:
Jul 27 09:54:29 henry kernel: EXT3-fs: cannot change data mode on remount

...even if I give the full mount option list with the remount instruction.

I can, however, remount it as ext2 read-write, but when I try to remount as
ext3 (even read only) I get the same problem.

Wierdly, "mount" lists it as being still an ext3 partition even though it has
been remounted as ext2.  I can't umount /usr because kjournald is currently
listed as using the partition.

The box in question is more-or-less RedHat 7.1, with ext3-2.4-0.9.4, kernel
2.4.7 and with the following relevant package versions:

mount-2.11g-4
util-linux-2.11f-3
e2fsprogs-1.22-2

...all from rawhide rpms.

Sean

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Validating Pointers
  2001-07-27  3:19         ` tpepper
@ 2001-07-27  9:47           ` Alan Cox
  0 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27  9:47 UTC (permalink / raw)
  To: tpepper; +Cc: Alan Cox, linux-kernel

> 	copy_to_user(user_addr, kernel_addr, size);
> 		and
> 	copy_from_user(kernel_addr, user_addr, size);
> 
> Are you saying that static and dynamically allocated kernel variables end up
> in different segments (kernel_ds and user_ds) and the copy is only expected to
> succeed if the to and from addresses are in the same segment?

user and kernel address spaces are seperate. On S/390 and M68K for example
they occupy the same values for both. Long long ago this was done via
segments on x86 (we dont use segments now) and thus the functions to do 
what you want are still called set_fs/get_fs/get_ds

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 2.4.7 + VIA Pro266 + 2xUltraTx2 lockups
       [not found] ` <no.id>
                     ` (17 preceding siblings ...)
  2001-07-27  9:27   ` IGMP join/leave time variability David S. Miller
@ 2001-07-27  9:54   ` Alan Cox
  2001-07-28  4:03     ` Robin Humble
  2001-07-27 10:00   ` Hard disk problem: Alan Cox
                     ` (184 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27  9:54 UTC (permalink / raw)
  To: Robin Humble; +Cc: linux-kernel

> So the system is stable when driving a single Tx2 card, or on a BX,
> but just not two Tx2's together on the pro266 board :-/ So it's
> perhaps (I'm guessing here :) a non-trivial Tx2 driver bug or maybe a
> VIA Pro266 problem?

Firstly please try 2.4.6-ac5 as that has the proper VIA workaround for their
bridge bugs. Its useful to rule out the very conservative approach the older
kernels use to avoid the disk corruption problem they had

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Hard disk problem:
       [not found] ` <no.id>
                     ` (18 preceding siblings ...)
  2001-07-27  9:54   ` 2.4.7 + VIA Pro266 + 2xUltraTx2 lockups Alan Cox
@ 2001-07-27 10:00   ` Alan Cox
  2001-07-27 15:22     ` Steve Underwood
  2001-07-27 13:27   ` ReiserFS / 2.4.6 / Data Corruption Alan Cox
                     ` (183 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27 10:00 UTC (permalink / raw)
  To: Miloslaw Smyk; +Cc: Mike A. Harris, Linux Kernel mailing list

> >  Model=IBM-DTLA-307030, FwRev=TX4OA50C, SerialNo=YKDYKGF1437
> 
> Ah, one of these excellent Hungarian DTLA drives? :) AFAIK, the entire batch
> was broken, although there are people who insist that there was no single
> working hard drive leaving that factory! I personally have seen 7 out of 7
> failing...

I have a large collection of these drives and none of them are problematic,
while the maxtors seem a little less reliable

> Take it back to where you bought it and demand a replacement for something
> NOT bearing "MADE IN HUNGARY" sign.

Of course the writer of this is Polish and the drives are Hungarian ..

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Strange remount behaviour with ext3-2.4-0.9.4
  2001-07-27  9:32 ` Strange remount behaviour with ext3-2.4-0.9.4 Sean Hunter
@ 2001-07-27 10:24   ` Andrew Morton
  2001-07-27 12:15     ` Sean Hunter
  2001-07-27 20:39   ` Michal Jaegermann
  1 sibling, 1 reply; 662+ messages in thread
From: Andrew Morton @ 2001-07-27 10:24 UTC (permalink / raw)
  To: Sean Hunter; +Cc: linux-kernel

Sean Hunter wrote:
> 
> Following the announcement on lkml, I have started using ext3 on one of my
> servers.  Since the server in question is a farily security-sensitive box, my
> /usr partition is mounted read only except when I remount rw to install
> packages.
> 
> I converted this partition to run ext3 with the mount options
> "nodev,ro,data=writeback,defaults" figuring that when I need to install new
> packages etc, that I could just mount rw as before and that metadata-only
> journalling would be ok for this partition as it really sees very little write
> activity.
> 
> When I try to remount it r/w I get a log message saying:
> Jul 27 09:54:29 henry kernel: EXT3-fs: cannot change data mode on remount
> 
> ...even if I give the full mount option list with the remount instruction.

hmm..  The mount option handling there is a bit bogus.

What we *should* do on remount is check that the requested
journalling mode is equal to the current mode.  ext3 won't
allow you to change the journalling mode on-the-fly.

So...  you will have to omit the `data=xxx' portion of the
mount options when remounting.  It's being invisibly added
by /bin/mount.

/bin/mount tries to be smart.  If, for example you have

	/dev/hdf12 /mnt/hdf12 ext3 noauto,ro,data=writeback 1

in /etc/fstab and then type

	mount /dev/hdf12 -o remount,rw

then /bin/mount runs off and looks up the fstab entry and
inserts the mount options.  However if you instead type

	mount /dev/hdf12 /mnt/hdf12 -o remount,rw          (1)

then /bin/mount does *not* look up the fstab entry, and
the remount succeeds.

ho-hum.  For the while you'll have to fiddle with the mount
usage to get things working right.   Equation (1) above will
work fine.  Or apply the appended patch.

> I can, however, remount it as ext2 read-write, but when I try to remount as
> ext3 (even read only) I get the same problem.

You can't switch between ext2 and ext3 with a remount - unmount
is needed.

> Wierdly, "mount" lists it as being still an ext3 partition even though it has
> been remounted as ext2.  I can't umount /usr because kjournald is currently
> listed as using the partition.

That sounds very weird.  Could you please describe the steps
you took to create this state?

Sometimes /etc/mtab gets out of sync - especially for the
root fs.  It's more reliable to look in /proc/mounts



Here's the fix for the data= handling on remount:



Index: fs/ext3/super.c
===================================================================
RCS file: /cvsroot/gkernel/ext3/fs/ext3/super.c,v
retrieving revision 1.31
diff -u -r1.31 super.c
--- fs/ext3/super.c	2001/07/19 14:43:08	1.31
+++ fs/ext3/super.c	2001/07/27 10:14:48
@@ -513,12 +513,6 @@
 
 			if (want_value(value, "data"))
 				return 0;
-			if (is_remount) {
-				printk ("EXT3-fs: cannot change data mode "
-						"on remount\n");
-				return 0;
-			}
-
 			if (!strcmp (value, "journal"))
 				data_opt = EXT3_MOUNT_JOURNAL_DATA;
 			else if (!strcmp (value, "ordered"))
@@ -529,9 +523,18 @@
 				printk ("EXT3-fs: Invalid data option: %s\n",
 					value);
 				return 0;
+			}
+			if (is_remount) {
+				if ((*mount_options & EXT3_MOUNT_DATA_FLAGS) !=
+							data_opt) {
+					printk("EXT3-fs: cannot change data "
+						"mode on remount\n");
+					return 0;
+				}
+			} else {
+				*mount_options &= ~EXT3_MOUNT_DATA_FLAGS;
+				*mount_options |= data_opt;
 			}
-			*mount_options &= ~EXT3_MOUNT_DATA_FLAGS;
-			*mount_options |= data_opt;
 		} else {
 			printk ("EXT3-fs: Unrecognized mount option %s\n",
 					this_char);

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Strange remount behaviour with ext3-2.4-0.9.4
  2001-07-27 10:24   ` Andrew Morton
@ 2001-07-27 12:15     ` Sean Hunter
  0 siblings, 0 replies; 662+ messages in thread
From: Sean Hunter @ 2001-07-27 12:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, Jul 27, 2001 at 08:24:14PM +1000, Andrew Morton wrote:
> Sean Hunter wrote:
> > 
> > Following the announcement on lkml, I have started using ext3 on one of my
> > servers.  Since the server in question is a farily security-sensitive box, my
> > /usr partition is mounted read only except when I remount rw to install
> > packages.
> > 
> > I converted this partition to run ext3 with the mount options
> > "nodev,ro,data=writeback,defaults" figuring that when I need to install new
> > packages etc, that I could just mount rw as before and that metadata-only
> > journalling would be ok for this partition as it really sees very little write
> > activity.
> > 
> > When I try to remount it r/w I get a log message saying:
> > Jul 27 09:54:29 henry kernel: EXT3-fs: cannot change data mode on remount
> > 
> > ...even if I give the full mount option list with the remount instruction.
> 
> hmm..  The mount option handling there is a bit bogus.
> 
> What we *should* do on remount is check that the requested
> journalling mode is equal to the current mode.  ext3 won't
> allow you to change the journalling mode on-the-fly.

Indeed.

> 
> So...  you will have to omit the `data=xxx' portion of the
> mount options when remounting.  It's being invisibly added
> by /bin/mount.

Thought so.  I tried both ways just to be sure.

> /bin/mount tries to be smart.  If, for example you have
> 
> 	/dev/hdf12 /mnt/hdf12 ext3 noauto,ro,data=writeback 1
> 
> in /etc/fstab and then type
> 
> 	mount /dev/hdf12 -o remount,rw
> 
> then /bin/mount runs off and looks up the fstab entry and
> inserts the mount options.  However if you instead type
> 
> 	mount /dev/hdf12 /mnt/hdf12 -o remount,rw          (1)
> 
> then /bin/mount does *not* look up the fstab entry, and
> the remount succeeds.

Interesting, and (almost) 100% true

sean@henry:~$  sudo mount /dev/sda8 /usr -oro,nodev,data=writeback,remount
mount: you must specify the filesystem type
sean@henry:~$ sudo mount /dev/sda8 /usr -oro,nodev,data=writeback,remount -text3
mount: /usr not mounted already, or bad option
sean@henry:~$ sudo mount /dev/sda8 /usr -oro,nodev,remount -text3
sean@henry:~$ mount
/dev/sdb6 on / type ext3 (rw)
none on /proc type proc (rw)
/dev/sda1 on /boot type ext2 (ro,nosuid,nodev)
/dev/sdc6 on /home type ext3 (rw,nosuid,nodev,data=ordered)
/dev/sda8 on /usr type ext3 (ro,nodev)
/dev/sda5 on /var type ext3 (rw,nosuid,nodev,sync,data=journal)
none on /dev/pts type devpts (rw,gid=5,mode=620)

It succeeds as long as I don't specify the journal type.

> 
> ho-hum.  For the while you'll have to fiddle with the mount
> usage to get things working right.   Equation (1) above will
> work fine.  Or apply the appended patch.
> 
> > I can, however, remount it as ext2 read-write, but when I try to remount as
> > ext3 (even read only) I get the same problem.
> 
> You can't switch between ext2 and ext3 with a remount - unmount
> is needed.

Wierd.  This certainly looked to all the world as though it worked for me.  Thus:

sean@henry:~$ sudo mount /dev/sda8 /usr -oro,nodev,remount -text2

...doesn't give me an error, but:

sean@henry:~$ mount 
/dev/sdb6 on / type ext3 (rw)
none on /proc type proc (rw)
/dev/sda1 on /boot type ext2 (ro,nosuid,nodev)
/dev/sdc6 on /home type ext3 (rw,nosuid,nodev,data=ordered)
/dev/sda8 on /usr type ext3 (ro,nodev)
                       ^^^^
/dev/sda5 on /var type ext3 (rw,nosuid,nodev,sync,data=journal)
none on /dev/pts type devpts (rw,gid=5,mode=620)


> > Wierdly, "mount" lists it as being still an ext3 partition even though it has
> > been remounted as ext2.  I can't umount /usr because kjournald is currently
> > listed as using the partition.
> 
> That sounds very weird.  Could you please describe the steps
> you took to create this state?

See above.

> Sometimes /etc/mtab gets out of sync - especially for the
> root fs.  It's more reliable to look in /proc/mounts

sean@henry:~$ cat /proc/mounts
/dev/root / ext3 rw 0 0
/proc /proc proc rw 0 0
/dev/sda1 /boot ext2 ro,nosuid,nodev 0 0
/dev/sdc6 /home ext3 rw,nosuid,nodev 0 0
/dev/sda8 /usr ext3 ro,nodev 0 0
/dev/sda5 /var ext3 rw,nosuid,nodev,sync 0 0
none /dev/pts devpts rw 0 0

sean@henry:~$ cat /etc/mtab   
/dev/sdb6 / ext3 rw 0 0
none /proc proc rw 0 0
/dev/sda1 /boot ext2 ro,nosuid,nodev 0 0
/dev/sdc6 /home ext3 rw,nosuid,nodev,data=ordered 0 0
/dev/sda8 /usr ext3 ro,nodev 0 0
/dev/sda5 /var ext3 rw,nosuid,nodev,sync,data=journal 0 0
none /dev/pts devpts rw,gid=5,mode=620 0 0

Its almost as if mount is just silently ignoring the "-t" option when I specify
ext2.

> 
> 
> Here's the fix for the data= handling on remount:

I'll try this when its safe to reboot the box.

Thanks very much for your help.

Sean 

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
       [not found] ` <no.id>
                     ` (19 preceding siblings ...)
  2001-07-27 10:00   ` Hard disk problem: Alan Cox
@ 2001-07-27 13:27   ` Alan Cox
  2001-07-27 13:38     ` bvermeul
                       ` (2 more replies)
  2001-07-27 14:21   ` Alan Cox
                     ` (182 subsequent siblings)
  203 siblings, 3 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27 13:27 UTC (permalink / raw)
  To: Hans Reiser; +Cc: bvermeul, Erik Mouw, Steve Kieu, Sam Thompson, kernel

> > and when that hangs the kernel it will also screw up all files touched
> > just before it in a edit-make-install-try cycle. Which can be rather
> > annoying, because you can start all over again (this effect randomly
> > distributes the last touched sectors to the last touched files. Very nice
> > effect, but not something I expect from a journalled filesystem).
> > 
> Do you think it is reasonable to ask that a filesystem be designed to
> work well with bad drivers?

Its certainly a good idea. But it sounds to me like he is describing the
normal effect of metadata only logging. 

Putting a sync just before the insmod when developing new drivers is a good
idea btw


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 13:27   ` ReiserFS / 2.4.6 / Data Corruption Alan Cox
@ 2001-07-27 13:38     ` bvermeul
  2001-07-27 13:39       ` Alan Cox
  2001-07-27 14:16     ` Philip R. Auld
  2001-07-27 14:23     ` Hans Reiser
  2 siblings, 1 reply; 662+ messages in thread
From: bvermeul @ 2001-07-27 13:38 UTC (permalink / raw)
  To: Alan Cox; +Cc: Hans Reiser, Erik Mouw, Steve Kieu, Sam Thompson, kernel

On Fri, 27 Jul 2001, Alan Cox wrote:

> > > and when that hangs the kernel it will also screw up all files touched
> > > just before it in a edit-make-install-try cycle. Which can be rather
> > > annoying, because you can start all over again (this effect randomly
> > > distributes the last touched sectors to the last touched files. Very nice
> > > effect, but not something I expect from a journalled filesystem).
> > >
> > Do you think it is reasonable to ask that a filesystem be designed to
> > work well with bad drivers?
>
> Its certainly a good idea. But it sounds to me like he is describing the
> normal effect of metadata only logging.
>
> Putting a sync just before the insmod when developing new drivers is a good
> idea btw

I've been doing that most of the time. But I sometimes forget that.
But as I said, it's not something I expected from a journalled filesystem.

Regards,

Bas Vermeulen

-- 
"God, root, what is difference?"
	-- Pitr, User Friendly

"God is more forgiving."
	-- Dave Aronson


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 13:38     ` bvermeul
@ 2001-07-27 13:39       ` Alan Cox
  2001-07-27 13:47         ` bvermeul
                           ` (2 more replies)
  0 siblings, 3 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27 13:39 UTC (permalink / raw)
  To: bvermeul
  Cc: Alan Cox, Hans Reiser, Erik Mouw, Steve Kieu, Sam Thompson, kernel

> > Putting a sync just before the insmod when developing new drivers is a good
> > idea btw
> 
> I've been doing that most of the time. But I sometimes forget that.
> But as I said, it's not something I expected from a journalled filesystem.

You misunderstand journalling then

A journalling file system can offer different levels of guarantee. With 
metadata only journalling you don't take any real performance hit but your
file system is always consistent on reboot (consistent as in fsck would pass
it) but it makes no guarantee that data blocks got written.

Full data journalling will give you what you expect but at a performance hit
for many applications.

Alan


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 13:39       ` Alan Cox
@ 2001-07-27 13:47         ` bvermeul
  2001-07-27 13:49           ` Alan Cox
  2001-07-28 14:16         ` Matthew Gardiner
  2001-08-08 18:42         ` Stephen C. Tweedie
  2 siblings, 1 reply; 662+ messages in thread
From: bvermeul @ 2001-07-27 13:47 UTC (permalink / raw)
  To: Alan Cox; +Cc: Hans Reiser, Erik Mouw, Steve Kieu, Sam Thompson, kernel

On Fri, 27 Jul 2001, Alan Cox wrote:

> > > Putting a sync just before the insmod when developing new drivers is a good
> > > idea btw
> >
> > I've been doing that most of the time. But I sometimes forget that.
> > But as I said, it's not something I expected from a journalled filesystem.
>
> You misunderstand journalling then

Yup, I guess I did.

> A journalling file system can offer different levels of guarantee. With
> metadata only journalling you don't take any real performance hit but your
> file system is always consistent on reboot (consistent as in fsck would pass
> it) but it makes no guarantee that data blocks got written.

I allways thought that it could/would roll back the changes that weren't
consistent. But I stand corrected. Thanks... :)

> Full data journalling will give you what you expect but at a performance hit
> for many applications.

Do any of the other journalled filesystems for linux do this? If not, I
guess I'll go back to ext2.

Bas Vermeulen

-- 
"God, root, what is difference?"
	-- Pitr, User Friendly

"God is more forgiving."
	-- Dave Aronson


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 13:47         ` bvermeul
@ 2001-07-27 13:49           ` Alan Cox
  0 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27 13:49 UTC (permalink / raw)
  To: bvermeul
  Cc: Alan Cox, Hans Reiser, Erik Mouw, Steve Kieu, Sam Thompson, kernel

> > Full data journalling will give you what you expect but at a performance hit
> > for many applications.
> 
> Do any of the other journalled filesystems for linux do this? If not, I
> guess I'll go back to ext2.

ext3 can do full data journalling, I dont know if reiserfs has an option for
it or not

Alan


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 13:27   ` ReiserFS / 2.4.6 / Data Corruption Alan Cox
  2001-07-27 13:38     ` bvermeul
@ 2001-07-27 14:16     ` Philip R. Auld
  2001-07-27 14:38       ` Jordan
  2001-07-27 14:51       ` Hans Reiser
  2001-07-27 14:23     ` Hans Reiser
  2 siblings, 2 replies; 662+ messages in thread
From: Philip R. Auld @ 2001-07-27 14:16 UTC (permalink / raw)
  To: Alan Cox; +Cc: kernel

Alan Cox wrote:
> 
> Its certainly a good idea. But it sounds to me like he is describing the
> normal effect of metadata only logging.
> 

Which brings up something I have been struggling with lately:

Linux (using both ext2 and reiserfs) can show garbage data blocks at the end of
files after a crash. With reiserfs this is clearly due to metadata only logging
and happens say 3 out of 5 times. With ext2 the frequency is about 1 in 5 times,
and more often that not it is simply zeroed data. Sometimes it is old data
though. 


This is something that is not present in other unix filesystems as far as I can
tell. If linux wants to be used in enterprise sites we can't allow 
old data blocks to be read. And ideally shouldn't allow zero blocks to be seen
either, but this is somewhat less serious.

I cannot reproduce this in ufs on either freebsd or solaris8.

I have not tested it with xfs and jfs for linux yet (and don't have any native
systems at hand.)

I believe vxfs to have a mechanism to prevent this despite metadata only
logging.

reiserfs with full data logging enabled of course does not show this behavior
(and works really well if you are willing to take the performance hit).

The basic test I use is to run this perl script for a while (to make sure at
least somehting has had a chance to get written out) and then power-cycle the
machine. When it comes back a simple tail logfile will show the problem. I also
run bonnie before hand to fill the disk with a known pattern so its easier to
spot.

linux is 2.2.16 and 2.4.2 from redhat 7.1. reiserfs is 3.5.33 and was tested
only on 2.2.16.


#!/usr/bin/perl
use Fcntl;
$count = 0;
while (1) {
#sysopen(FH, "/scratch/logfile", O_RDWR|O_APPEND|O_CREAT|O_SYNC)
sysopen(FH, "/scratch/logfile", O_RDWR|O_APPEND|O_CREAT)
        or die "Couldn't open file $path : $!\n";
print FH "Log file line ", $count , " yadda  yadda  yadda  yadda  yadda  yadda 
yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda 
yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda 
yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda 
yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda 
yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda 
yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda 
yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda 
yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda 
yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda \n" ;
close (FH);
#print $count , "\n";
$count++;
}


------------------------------------------------------
Philip R. Auld, Ph.D.                  Technical Staff 
Egenera Corp.                        pauld@egenera.com
165 Forest St, Marlboro, MA 01752        (508)786-9444

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
       [not found] ` <no.id>
                     ` (20 preceding siblings ...)
  2001-07-27 13:27   ` ReiserFS / 2.4.6 / Data Corruption Alan Cox
@ 2001-07-27 14:21   ` Alan Cox
  2001-07-28 14:18     ` Matthew Gardiner
  2001-07-27 15:06   ` Alan Cox
                     ` (181 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27 14:21 UTC (permalink / raw)
  To: Philip R. Auld; +Cc: Alan Cox, kernel

> This is something that is not present in other unix filesystems as far as I can
> tell. If linux wants to be used in enterprise sites we can't allow 
> old data blocks to be read. And ideally shouldn't allow zero blocks to be seen
> either, but this is somewhat less serious.

> I cannot reproduce this in ufs on either freebsd or solaris8.

It can happen on UFS. What normally happens on UFS is that you get an old
file attached to a new filename when the file is deleted and the inode
reused.

Basically it can happen on any no data logging fs (with a few exceptions for
other clever algorithms like tree-phase)

If you write the metadata block first (UFS) then there is a risk of getting
someone elses data appended to the end of a file (eg length updated before
data blocks). If you write data first there is a risk of writing the data
and never committing the removal of the block from previous files.

FreeBSD softupdates probably make it very hard to trigger and they are a
very nice approach

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 13:27   ` ReiserFS / 2.4.6 / Data Corruption Alan Cox
  2001-07-27 13:38     ` bvermeul
  2001-07-27 14:16     ` Philip R. Auld
@ 2001-07-27 14:23     ` Hans Reiser
  2 siblings, 0 replies; 662+ messages in thread
From: Hans Reiser @ 2001-07-27 14:23 UTC (permalink / raw)
  To: Alan Cox; +Cc: bvermeul, Erik Mouw, Steve Kieu, Sam Thompson, kernel, ramon

Alan Cox wrote:
> 
> > > and when that hangs the kernel it will also screw up all files touched
> > > just before it in a edit-make-install-try cycle. Which can be rather
> > > annoying, because you can start all over again (this effect randomly
> > > distributes the last touched sectors to the last touched files. Very nice
> > > effect, but not something I expect from a journalled filesystem).
> > >
> > Do you think it is reasonable to ask that a filesystem be designed to
> > work well with bad drivers?
> 
> Its certainly a good idea. 
I think it is a terrible idea.... at least as a general expectation to meet, there may be specifics
where things can be done though.... like journaling....

> But it sounds to me like he is describing the
> normal effect of metadata only logging.

Ah, right you are.  Now I understand him.  Well, data-journaling that doesn't cost a whole lot of
performance awaits reiser4, and reiser4 is at least a year away, we are doing seminars and
pseudo-coding now.

> 
> Putting a sync just before the insmod when developing new drivers is a good
> idea btw

This makes a lot of sense to me.  Good suggestion.  It should go into our FAQ.  Dad, please put it
there.

Q: I like to dynamically load buggy drivers into the kernel because that is what kernel developers
like me do for fun, how can I better avoid data corruption when doing this and using ReiserFS?

A: Do sync before insmod.  (Alan Cox's good suggestion.)

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 14:16     ` Philip R. Auld
@ 2001-07-27 14:38       ` Jordan
  2001-07-27 14:51       ` Hans Reiser
  1 sibling, 0 replies; 662+ messages in thread
From: Jordan @ 2001-07-27 14:38 UTC (permalink / raw)
  To: Philip R. Auld; +Cc: Alan Cox, kernel

"Philip R. Auld" wrote:
> 
> Alan Cox wrote:
> >
> > Its certainly a good idea. But it sounds to me like he is describing the
> > normal effect of metadata only logging.
> >
> 
> Which brings up something I have been struggling with lately:
> 
> Linux (using both ext2 and reiserfs) can show garbage data blocks at the end of
> files after a crash. With reiserfs this is clearly due to metadata only logging
> and happens say 3 out of 5 times. With ext2 the frequency is about 1 in 5 times,
> and more often that not it is simply zeroed data. Sometimes it is old data
> though.
> 
> This is something that is not present in other unix filesystems as far as I can
> tell. If linux wants to be used in enterprise sites we can't allow
> old data blocks to be read. And ideally shouldn't allow zero blocks to be seen
> either, but this is somewhat less serious.
> 
> I cannot reproduce this in ufs on either freebsd or solaris8.
> 
> I have not tested it with xfs and jfs for linux yet (and don't have any native
> systems at hand.)
> 
> I believe vxfs to have a mechanism to prevent this despite metadata only
> logging.
> 
> reiserfs with full data logging enabled of course does not show this behavior
> (and works really well if you are willing to take the performance hit).
> 
> The basic test I use is to run this perl script for a while (to make sure at
> least somehting has had a chance to get written out) and then power-cycle the
> machine. When it comes back a simple tail logfile will show the problem. I also
> run bonnie before hand to fill the disk with a known pattern so its easier to
> spot.
> 
> linux is 2.2.16 and 2.4.2 from redhat 7.1. reiserfs is 3.5.33 and was tested
> only on 2.2.16.
> 
> #!/usr/bin/perl
> use Fcntl;
> $count = 0;
> while (1) {
> #sysopen(FH, "/scratch/logfile", O_RDWR|O_APPEND|O_CREAT|O_SYNC)
> sysopen(FH, "/scratch/logfile", O_RDWR|O_APPEND|O_CREAT)
>         or die "Couldn't open file $path : $!\n";
> print FH "Log file line ", $count , " yadda  yadda  yadda  yadda  yadda  yadda
> yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda
> yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda
> yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda
> yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda
> yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda
> yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda
> yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda
> yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda
> yadda  yadda  yadda  yadda  yadda  yadda  yadda  yadda \n" ;
> close (FH);
> #print $count , "\n";
> $count++;
> }
> 
> ------------------------------------------------------
> Philip R. Auld, Ph.D.                  Technical Staff
> Egenera Corp.                        pauld@egenera.com
> 165 Forest St, Marlboro, MA 01752        (508)786-9444
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

I didn't know that there was a way to enable full data journaling using
reiserfs.  I was under the impression that with the latest round of the
unlink patch to go with 2.4.7 that reiserfs was basically in ordered
journaling mode instead of writeback (I believe that is the name), if I
am wrong or if there really is a way to enable full data journaling
please let me know.  Thanks.

Jordan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 14:16     ` Philip R. Auld
  2001-07-27 14:38       ` Jordan
@ 2001-07-27 14:51       ` Hans Reiser
  2001-07-27 15:12         ` Philip R. Auld
  1 sibling, 1 reply; 662+ messages in thread
From: Hans Reiser @ 2001-07-27 14:51 UTC (permalink / raw)
  To: Philip R. Auld; +Cc: Alan Cox, kernel, Chris Mason, Gryaznova E.

"Philip R. Auld" wrote:
 
> reiserfs with full data logging enabled of course does not show this behavior
> (and works really well if you are willing to take the performance hit).

Hmmm, I didn't realize this had made off our wish list and into the code.:)
We should benchmark the cost to performance.

Hans

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
       [not found] ` <no.id>
                     ` (21 preceding siblings ...)
  2001-07-27 14:21   ` Alan Cox
@ 2001-07-27 15:06   ` Alan Cox
  2001-07-27 15:26     ` Joshua Schmidlkofer
                       ` (2 more replies)
  2001-07-27 15:51   ` Alan Cox
                     ` (180 subsequent siblings)
  203 siblings, 3 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27 15:06 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Joshua Schmidlkofer, kernel

> Don't use RedHat with ReiserFS, they screw things up so many ways.....
> For instance, they compile it with the wrong options set, their boot scripts are wrong, they just
> shovel software onto the CD.

Sorry Hans you can rant all you like but you know you are wrong on most
of that. RH did weeks of stress testing on multiple systems up to 8Gb 8 way
and didn't ship until we stopped seeing corruption problems with the mm/fs
code. 

That test suite caught bugs in kernel revisions other vendors shipped
blindly to their customers without fixing.

That is hardly shovelling software onto the CD.

> Actually, I am curious as to exactly how they manage to make ReiserFS boot longer than ext2.  Do
> they run fsck or what?

No. The only thing I can think of that might slow it is that we build with
the reiserfs paranoia/sanity checks on. Thats because at the time 7.1 was
done the kernel list was awash with reiserfs bug reports and Chris Mason
tail recursion bug patch of the week.

That might be something to check to get a fair comparison

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 14:51       ` Hans Reiser
@ 2001-07-27 15:12         ` Philip R. Auld
  0 siblings, 0 replies; 662+ messages in thread
From: Philip R. Auld @ 2001-07-27 15:12 UTC (permalink / raw)
  To: Hans Reiser; +Cc: reiserfs-list, kernel, Chris Mason

Hans Reiser wrote:
> 
> "Philip R. Auld" wrote:
> 
> > reiserfs with full data logging enabled of course does not show this behavior
> > (and works really well if you are willing to take the performance hit).
> 
> Hmmm, I didn't realize this had made off our wish list and into the code.:)
> We should benchmark the cost to performance.
> 
> Hans

Ooops, hope I'm not getting Chris in trouble ;)

This is reiserfs 3.5.33, with a few changes from Chris to enable full logging, 
and from me to make it a mount option. 

We are in a situation where we need the safety more than the speed so it was
necessary.


Here is a simple comparison using bonnie:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
pblade 1 (reiserfs defaults)
         1000 13048 98.9 21609 27.4  6599 10.7 11066 72.3 16483  8.4 1011.2  5.3 
	 1000 12771 96.7 21058 25.9  5536  9.0 10430 67.5 17347  8.4 1065.2  6.7 
	 1000 13034 98.6 19746 21.6  7026 11.6  9884 64.4 14838  7.2 1106.0  9.7
         1000 13091 99.3 19483 28.9  7586 12.3 10520 68.4 14685  6.9  900.9  6.3
pblade 2 (ext2 defaults)
         1000 14373 99.9 14940  8.8  7494 11.1 10093 65.3 22213  9.3 1028.3 
6.4      
	 1000 14305 99.6 16129  9.4  7768 11.9  9629 62.2 26108 10.8 1135.8  7.7    
	 1000 14400 99.9 16769  9.8  7397 11.2  9805 63.4 21820  9.1 1139.8  5.7
	 1000 14361 100. 17089 10.4  7768 11.5  9924 64.1 24154  9.8 1112.9  7.2
pblade 3 (log all data)
	 1000  5932 47.6  7244 12.5  4708  9.7 13909 90.5 17051  8.1 894.5  6.5
	 1000  5839 46.9  7229 12.5  4604  9.9 13437 87.9 19852  9.7 724.3  4.7
	 1000  5853 47.0  7176 12.3  4611  9.8 13995 91.1 18838  8.7 908.0  5.7
	 1000  5604 45.1  7106 12.2  4627  9.5 13628 88.6 15248  6.9 882.9  6.6
pblade 6 ( log new data )
	 1000  5556 49.0  7057 11.9  7714 12.6 11559 92.8 18075  8.8 1264.3  7.3    
	 1000  5631 49.8  7307 12.3  7945 13.0 11558 93.0 18859  9.0 1230.7  8.0
	 1000  5610 49.6  7337 12.5  6620 11.0 11821 95.0 16484  7.5 1236.8  9.3
	 1000  5592 49.4  7070 12.1  7422 12.0 11575 92.9 16198  7.3 1236.6  4.9


I suugest we move this to reiserfs-list for more discussion if needed :)


Cheers,

Phil


------------------------------------------------------
Philip R. Auld, Ph.D.                  Technical Staff 
Egenera Corp.                        pauld@egenera.com
165 Forest St, Marlboro, MA 01752        (508)786-9444

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Hard disk problem:
  2001-07-27 10:00   ` Hard disk problem: Alan Cox
@ 2001-07-27 15:22     ` Steve Underwood
  2001-07-27 19:18       ` Bill Pringlemeir
  0 siblings, 1 reply; 662+ messages in thread
From: Steve Underwood @ 2001-07-27 15:22 UTC (permalink / raw)
  To: linux-kernel

Alan Cox wrote:
> 
> > >  Model=IBM-DTLA-307030, FwRev=TX4OA50C, SerialNo=YKDYKGF1437
> >
> > Ah, one of these excellent Hungarian DTLA drives? :) AFAIK, the entire batch
> > was broken, although there are people who insist that there was no single
> > working hard drive leaving that factory! I personally have seen 7 out of 7
> > failing...
> 
> I have a large collection of these drives and none of them are problematic,
> while the maxtors seem a little less reliable
> 
> > Take it back to where you bought it and demand a replacement for something
> > NOT bearing "MADE IN HUNGARY" sign.
> 
> Of course the writer of this is Polish and the drives are Hungarian ..
> 
But he is right. Practically all the "Made in Hungary" ones develop bad
sectors after a few months. The "Made in Phillipinnes" ones do not.
Strangely, I am Hong Kong and almost all the GXP75s we got here were
made in Hungary - go figure! They were so bad the dealers finally
wouldn't stock them. If your experience has been different, think
yourself lucky.

Regards,
Steve

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 15:06   ` Alan Cox
@ 2001-07-27 15:26     ` Joshua Schmidlkofer
  2001-07-27 15:46       ` Hans Reiser
  2001-07-27 15:31     ` Hans Reiser
  2001-07-27 20:46     ` Lehmann 
  2 siblings, 1 reply; 662+ messages in thread
From: Joshua Schmidlkofer @ 2001-07-27 15:26 UTC (permalink / raw)
  To: linux-kernel

On Friday 27 July 2001 09:06 am, Alan Cox wrote:
> > Don't use RedHat with ReiserFS, they screw things up so many ways.....
> > For instance, they compile it with the wrong options set, their boot
> > scripts are wrong, they just shovel software onto the CD.
>
> Sorry Hans you can rant all you like but you know you are wrong on most
> of that. RH did weeks of stress testing on multiple systems up to 8Gb 8 way
> and didn't ship until we stopped seeing corruption problems with the mm/fs
> code.
>
> That test suite caught bugs in kernel revisions other vendors shipped
> blindly to their customers without fixing.
>
> That is hardly shovelling software onto the CD.
>
> > Actually, I am curious as to exactly how they manage to make ReiserFS
> > boot longer than ext2.  Do they run fsck or what?
>
> No. The only thing I can think of that might slow it is that we build with
> the reiserfs paranoia/sanity checks on. Thats because at the time 7.1 was
> done the kernel list was awash with reiserfs bug reports and Chris Mason
> tail recursion bug patch of the week.
>
> That might be something to check to get a fair comparison

   I feel that things are actually progressing above my level of perception 
here, however, I would like to mention that since my Redhat 4.x days i have 
feared vendor kernels, and I never use them, for better or worse.   

    Also, maybe I screwed my own system - I don't think so, but maybe.  I 
prefer to stick with Linus's kernels, and sometimes, depending on the 
changlog -ac kernels.  As far as the kernel & init scirpts are concerned, I 
axed any fsck'ing entries for reiserfs.   [I assume that they were 
unnessecary.]  I used kgcc [w/Rh7.1] to compile kernels, until recently.  And 
I stayed current with the lkml, and the namesys page watching for obvious 
updates that I needed. 

    The slowness [seemed] actually [to be] the process of starting & stopping 
daemons.  Almost like there was some sort of stigma about reading shell 
scripts.  All the binaries executed with appropriate haste.

   As far as shoveling code.   Sometimes the options used to compile packages 
leaves me with a large bit of wonder.  Strange and seemingly heinous changes 
to the various utilities, etc.   But, I have never had a cause to fault them 
based on this. [Except that I have never found the magic that causes all the 
SRPMS to be [re]buildable.]

  So to sort it, I don't feel that being a moron caused to boot slow - unless 
there is some wierd filehandling problem in bash2, or something that causes 
severe slow-down when sourcing shell scripts.  ????   However, Hans, I do 
beleive you about Suse, and if I wasn't a cheap bastard I would probably buy 
a copy.  

thanks for all the response, and I am sorry if this does not belong here.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 15:06   ` Alan Cox
  2001-07-27 15:26     ` Joshua Schmidlkofer
@ 2001-07-27 15:31     ` Hans Reiser
  2001-07-27 16:25       ` Kip Macy
  2001-07-27 20:46     ` Lehmann 
  2 siblings, 1 reply; 662+ messages in thread
From: Hans Reiser @ 2001-07-27 15:31 UTC (permalink / raw)
  To: Alan Cox; +Cc: Joshua Schmidlkofer, kernel

Alan Cox wrote:
> 
> > Don't use RedHat with ReiserFS, they screw things up so many ways.....
> > For instance, they compile it with the wrong options set, their boot scripts are wrong, they just
> > shovel software onto the CD.
> 
> Sorry Hans you can rant all you like but you know you are wrong on most
> of that. RH did weeks of stress testing on multiple systems up to 8Gb 8 way
> and didn't ship until we stopped seeing corruption problems with the mm/fs
> code.
> 
> That test suite caught bugs in kernel revisions other vendors shipped
> blindly to their customers without fixing.
> 
> That is hardly shovelling software onto the CD.
> 
> > Actually, I am curious as to exactly how they manage to make ReiserFS boot longer than ext2.  Do
> > they run fsck or what?
> 
> No. The only thing I can think of that might slow it is that we build with
> the reiserfs paranoia/sanity checks on. Thats because at the time 7.1 was

Yes, that option should never be on for an end user not having a bug that he wants a more detailed
bug report on.  It just makes us look slow compared to ext2.

2.4.2 was not a stable kernel for any FS, not just for ReiserFS.

2.4.4 was the earliest kernel that should have been called 2.4.0, and sad to say, I bet we won't hit
a really stable kernel for another couple of versions.

I understand the marketing pressure on distributions to ship using 2.4.x as soon as 2.4.0 was
available, and that pressure should never have been generated upon them by making an unstable kernel
be named 2.4.0.

It won't surpise me if you agree with me on the kernel naming though, and if so it is pointless for
me to complain to you about it. 

> done the kernel list was awash with reiserfs bug reports and Chris Mason
> tail recursion bug patch of the week.
> 
> That might be something to check to get a fair comparison
> 
> Alan

I don't think that even with CONFIG_REISERFS_CHECK on, journal replay can take as long as fsck on
ext2.  reiserfsck though, if that was on, oh, could even RedHat be that desperate to make us look
bad to users as to run reiserfsck at every boot?

I surely hope not, and I'd like to hear that this user just had something individually wrong with
his configuration.

Hans

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 15:26     ` Joshua Schmidlkofer
@ 2001-07-27 15:46       ` Hans Reiser
  2001-07-27 17:46         ` Christoph Rohland
                           ` (2 more replies)
  0 siblings, 3 replies; 662+ messages in thread
From: Hans Reiser @ 2001-07-27 15:46 UTC (permalink / raw)
  To: Joshua Schmidlkofer; +Cc: linux-kernel

Well, I am afraid this is much too vague for me to have any understanding of what went wrong on your
system.

Maybe somebody else who is using both ReiserFS and RedHat's boot scripts can comment on whether
things are slow for them and if so, where they get slow.

With this lack of specificity is entirely possible that things went slow for coincidental reasons
unrelated to ReiserFS (waiting for network stuff to timeout, etc.)

Hans

Joshua Schmidlkofer wrote:
> 
> On Friday 27 July 2001 09:06 am, Alan Cox wrote:
> > > Don't use RedHat with ReiserFS, they screw things up so many ways.....
> > > For instance, they compile it with the wrong options set, their boot
> > > scripts are wrong, they just shovel software onto the CD.
> >
> > Sorry Hans you can rant all you like but you know you are wrong on most
> > of that. RH did weeks of stress testing on multiple systems up to 8Gb 8 way
> > and didn't ship until we stopped seeing corruption problems with the mm/fs
> > code.
> >
> > That test suite caught bugs in kernel revisions other vendors shipped
> > blindly to their customers without fixing.
> >
> > That is hardly shovelling software onto the CD.
> >
> > > Actually, I am curious as to exactly how they manage to make ReiserFS
> > > boot longer than ext2.  Do they run fsck or what?
> >
> > No. The only thing I can think of that might slow it is that we build with
> > the reiserfs paranoia/sanity checks on. Thats because at the time 7.1 was
> > done the kernel list was awash with reiserfs bug reports and Chris Mason
> > tail recursion bug patch of the week.
> >
> > That might be something to check to get a fair comparison
> 
>    I feel that things are actually progressing above my level of perception
> here, however, I would like to mention that since my Redhat 4.x days i have
> feared vendor kernels, and I never use them, for better or worse.
> 
>     Also, maybe I screwed my own system - I don't think so, but maybe.  I
> prefer to stick with Linus's kernels, and sometimes, depending on the
> changlog -ac kernels.  As far as the kernel & init scirpts are concerned, I
> axed any fsck'ing entries for reiserfs.   [I assume that they were
> unnessecary.]  I used kgcc [w/Rh7.1] to compile kernels, until recently.  And
> I stayed current with the lkml, and the namesys page watching for obvious
> updates that I needed.
> 
>     The slowness [seemed] actually [to be] the process of starting & stopping
> daemons.  Almost like there was some sort of stigma about reading shell
> scripts.  All the binaries executed with appropriate haste.
> 
>    As far as shoveling code.   Sometimes the options used to compile packages
> leaves me with a large bit of wonder.  Strange and seemingly heinous changes
> to the various utilities, etc.   But, I have never had a cause to fault them
> based on this. [Except that I have never found the magic that causes all the
> SRPMS to be [re]buildable.]
> 
>   So to sort it, I don't feel that being a moron caused to boot slow - unless
> there is some wierd filehandling problem in bash2, or something that causes
> severe slow-down when sourcing shell scripts.  ????   However, Hans, I do
> beleive you about Suse, and if I wasn't a cheap bastard I would probably buy
> a copy.
> 
> thanks for all the response, and I am sorry if this does not belong here.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
       [not found] ` <no.id>
                     ` (22 preceding siblings ...)
  2001-07-27 15:06   ` Alan Cox
@ 2001-07-27 15:51   ` Alan Cox
  2001-07-27 16:41     ` Hans Reiser
  2001-07-27 16:50   ` ext3-2.4-0.9.4 Alan Cox
                     ` (179 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27 15:51 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Alan Cox, Joshua Schmidlkofer, kernel

> > No. The only thing I can think of that might slow it is that we build with
> > the reiserfs paranoia/sanity checks on. Thats because at the time 7.1 was
> 
> Yes, that option should never be on for an end user not having a bug that he wants a more detailed
> bug report on.  It just makes us look slow compared to ext2.

Maybe its old fashioned but we'd rather any inconsistency in the file system
behaviour was made obvious to the end user. Enterprise customers object to
losing data.

> 2.4.2 was not a stable kernel for any FS, not just for ReiserFS.

The RH 2.4.2 derived kernel isnt 2.4.2 by any stretch of the imagination. 
Vanilla 2.4.2 wouldnt pass a test suite.

> I don't think that even with CONFIG_REISERFS_CHECK on, journal replay can take as long as fsck on
> ext2.  reiserfsck though, if that was on, oh, could even RedHat be that desperate to make us look
> bad to users as to run reiserfsck at every boot?

Hans, if you stopped considering every report that your file system wasn't
the best in the world as either a conspiracy theory or someone elses fault
you'd have a much better product

Nobody needs conspiracies to not use reiserfs as their core fs, and until
things like big endian support are cleanly resolved that isnt likely to
change.

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 17:15                           ` ext3-2.4-0.9.4 Andre Pang
  2001-07-26 17:58                             ` ext3-2.4-0.9.4 Hans Reiser
  2001-07-27  4:28                             ` ext3-2.4-0.9.4 Andrew Morton
@ 2001-07-27 16:24                             ` Lawrence Greenfield
  2001-07-27 16:57                               ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-27 17:16                               ` ext3-2.4-0.9.4 Bill Rugolsky Jr.
  2 siblings, 2 replies; 662+ messages in thread
From: Lawrence Greenfield @ 2001-07-27 16:24 UTC (permalink / raw)
  To: linux-kernel

Hi,

I'm one of those icky application programmers attempting to make
reliable software across different versions of Unix.

We need to get data to disk portably, quickly, and reliably.

I love it when I see things like:  "No, Linus is right and the MTA
guys are just wrong."

This sort of attitude is just ridiculous.  Unix had a defined set of
semantics.  This might have been stupid semantics, but it had them.
Then journalling filesystems, softupdates, and Linux async updates
came along and destroyed those semantics, preventing those of us who
want to write reliable applications using the filesystem from doing
so.  At least Oracle doesn't change the definition of COMMIT.

When I contacted the Linux JFS team about the semantics of link(), I
was told that there is _no way_ of forcing a link() to disk.  Not an
fsync() on the file, not an fsync() on the directory, just _not
possible_.

Great.

Then we come to ext2.  "Oh, just call fsync() on the directory and
you'll be fine."  Well, wait, a second, if ext2 isn't ordering the
metadata writes, a crash at the wrong time (whether or not I've called
fsync()) may lose directory entries---even directory entries unrelated
to the files I'm doing operations on!  Greeeeat.

Thus why all reasonably paranoid MTAs and other mail programs say "use
chattr +S on ext2"---we need ordered metadata writes.

Ok, journalled filesystems are better.  At least crashes aren't going
to affect random files on disk.  But since link() and the like don't
force a commit, we need some way---some reasonably portable way---of
getting that on disk.  On softupdates, calling fsync() on a file
forces all directory entries pointing to that file to disk.  This is
pretty reasonable.  1 fsync() call.

Why do we all cringe when we're told to call fsync() on the directory?
Several reasons:
. not needed on any other variety of Unix
. two fsync() calls force two different syncronization points: the
  application is forcing ordering on the OS that may not be needed.
  (Thus performance doesn't "fly" when you need multiple fsyncs.)
. directory may have other modifications going on that we're not
  interested in

You want to help performance?  Give us an fsync() that works on
multiple file descriptors at once, or an async fsync() call.  Don't
make us fight the OS on getting data to disk.

Larry


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 15:31     ` Hans Reiser
@ 2001-07-27 16:25       ` Kip Macy
  2001-07-27 17:29         ` Ville Herva
  0 siblings, 1 reply; 662+ messages in thread
From: Kip Macy @ 2001-07-27 16:25 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Alan Cox, Joshua Schmidlkofer, kernel





> Alan Cox wrote:
> > 
> > > Don't use RedHat with ReiserFS, they screw things up so many ways.....
> > > For instance, they compile it with the wrong options set, their boot scripts are wrong, they just
> > > shovel software onto the CD.
> > 
> > Sorry Hans you can rant all you like but you know you are wrong on most
> > of that. RH did weeks of stress testing on multiple systems up to 8Gb 8 way
> > and didn't ship until we stopped seeing corruption problems with the mm/fs
> > code.

Sorry Alan, but even though I am sure Redhat did lots of stress testing,
Redhat 7.1 was not a particularly solid release. I got oopses in the
eepro100 driver even though lots of other people use it, and the netapp
simulator which works just fine on 2.2.16 does not work on it. When I ran
strace on the simulator while it was zeroing some files it turned out that
sys_write was failing with ENOMEM (on a machine with 1GB of RAM that was 
not doing anything else).


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 15:51   ` Alan Cox
@ 2001-07-27 16:41     ` Hans Reiser
  0 siblings, 0 replies; 662+ messages in thread
From: Hans Reiser @ 2001-07-27 16:41 UTC (permalink / raw)
  To: Alan Cox; +Cc: Joshua Schmidlkofer, kernel, Nikita Danilov, Jeff Mahoney

Alan Cox wrote:

> Nobody needs conspiracies to not use reiserfs as their core fs, and until
> things like big endian support are cleanly resolved that isnt likely to
> change.
> 
> Alan
big endian support is resolved, there is a working patch for it by Jeff Mahoney, it passes all of
our tests, but it is a feature not a bug fix, and not something for a supposedly stabilizing kernel.

Nikita, you were supposed to send the big endian support and some other stuff in to Alan for testing
in the ac series, what is the status of patches that are supposed to be going to Alan?

Hans

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
       [not found] ` <no.id>
                     ` (23 preceding siblings ...)
  2001-07-27 15:51   ` Alan Cox
@ 2001-07-27 16:50   ` Alan Cox
  2001-07-27 17:41     ` ext3-2.4-0.9.4 Lawrence Greenfield
  2001-07-28 16:46     ` ext3-2.4-0.9.4 Patrick J. LoPresti
  2001-07-27 16:55   ` ReiserFS / 2.4.6 / Data Corruption Alan Cox
                     ` (178 subsequent siblings)
  203 siblings, 2 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27 16:50 UTC (permalink / raw)
  To: Lawrence Greenfield; +Cc: linux-kernel

> This sort of attitude is just ridiculous.  Unix had a defined set of
> semantics.  This might have been stupid semantics, but it had them.

The unix defined semantics are very simple and very clear. They btw
dont contain the guarantees that certain email system authors think they do
and they never have.

rename() itself is new as of 4BSD, rather than ever being in true unix.
True unix did the right thing. It said 'this problem is hard, this problem
is application specific, do it at application level'.

> When I contacted the Linux JFS team about the semantics of link(), I
> was told that there is _no way_ of forcing a link() to disk.  Not an
> fsync() on the file, not an fsync() on the directory, just _not
> possible_.

I would expect an fsync of the directory to do that. It does on other
Linux file systems so it violates the least suprise bit. Right now JFS
isnt a standard file system on Linux however, and they have much left to do.
I suspect its something to ask them about.

> Thus why all reasonably paranoid MTAs and other mail programs say "use
> chattr +S on ext2"---we need ordered metadata writes.

And then your IDE disk gets you anyway. Also if you write metadata first 
then you risk delivering email to the wrong person instead. 

> You want to help performance?  Give us an fsync() that works on
> multiple file descriptors at once, or an async fsync() call.  Don't
> make us fight the OS on getting data to disk.

And what pray does an asynchronous fsync do. It seems to be a nop to me.

Doing reliabile transactions on disk is a hard problem. That is why oracle
and friends have spent many man years of research on this kind of problem. 
Current unix mailers do the smoke mirrors and prayer bit to reduce the
probability a little that is all, regardless of fs and os.

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
       [not found] ` <no.id>
                     ` (24 preceding siblings ...)
  2001-07-27 16:50   ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-27 16:55   ` Alan Cox
  2001-07-27 17:45   ` ext3-2.4-0.9.4 Alan Cox
                     ` (177 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27 16:55 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Alan Cox, Joshua Schmidlkofer, kernel, Nikita Danilov, Jeff Mahoney

> > Alan
> big endian support is resolved, there is a working patch for it by Jeff Mahoney, it passes all of
> our tests, but it is a feature not a bug fix, and not something for a supposedly stabilizing kernel.
> 
> Nikita, you were supposed to send the big endian support and some other stuff in to Alan for testing
> in the ac series, what is the status of patches that are supposed to be going to Alan?

I suspect its a bug fix to S/390, ppc and sparc users 8)

I'd be happy to test run it in -ac


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-27 16:24                             ` ext3-2.4-0.9.4 Lawrence Greenfield
@ 2001-07-27 16:57                               ` Rik van Riel
  2001-07-28 23:15                                 ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-27 17:16                               ` ext3-2.4-0.9.4 Bill Rugolsky Jr.
  1 sibling, 1 reply; 662+ messages in thread
From: Rik van Riel @ 2001-07-27 16:57 UTC (permalink / raw)
  To: Lawrence Greenfield; +Cc: linux-kernel

On Fri, 27 Jul 2001, Lawrence Greenfield wrote:

> I'm one of those icky application programmers attempting to make
> reliable software across different versions of Unix.
>
> We need to get data to disk portably, quickly, and reliably.
>
> I love it when I see things like:  "No, Linus is right and the MTA
> guys are just wrong."
>
> This sort of attitude is just ridiculous.  Unix had a defined set of
> semantics.  This might have been stupid semantics, but it had them.

The stuff you people seem to insist on, however, most
definately isn't part of the defined set of semantics.

If you believe otherwise, feel free to point out the
relevant sections in POSIX / SuS / ...

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-27 16:24                             ` ext3-2.4-0.9.4 Lawrence Greenfield
  2001-07-27 16:57                               ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-27 17:16                               ` Bill Rugolsky Jr.
  1 sibling, 0 replies; 662+ messages in thread
From: Bill Rugolsky Jr. @ 2001-07-27 17:16 UTC (permalink / raw)
  To: Lawrence Greenfield; +Cc: linux-kernel

On Fri, Jul 27, 2001 at 12:24:56PM -0400, Lawrence Greenfield wrote:
> I love it when I see things like:  "No, Linus is right and the MTA
> guys are just wrong."
> 
> This sort of attitude is just ridiculous.  Unix had a defined set of
> semantics.  This might have been stupid semantics, but it had them.
> Then journalling filesystems, softupdates, and Linux async updates
> came along and destroyed those semantics, preventing those of us who
> want to write reliable applications using the filesystem from doing
> so.  At least Oracle doesn't change the definition of COMMIT.

First off, would you care to quote chapter and verse of these
"defined semantics" ?   Do you mean the BSD source?

Traditional FFS/UFS achieves "safety" at a terrible cost to
performance.  I can barely stand the wait to untar XFree86 on Solaris8
on a PII-333, even with UFS logging -- I'd rather use my Pentium 166
laptop running Linux!  ext2 solved this performance issue many years
ago by recognizing that the FFS metadata scheme was not really safe
either; instead the intelligence was put into e2fsck, and where
necessary, the applications.  (Do I hear faint echoes of the
"lint" v. "cc" design criterion ... ?)

The infrastructure is now in place to solve these problems in ext3,
without imposing a least-common-denominator approach that degrades
overall system performance.  In these instances "Linus is right" when
he notes that (1) the proposed immediate solution does not really solve
the problem, and (2) once in there, developers will rely on its precise
semantics, making them difficult to get right later on, and providing
no incentive to do so.  In many such instances "undefined" behavior is
the best intermediate solution.

As one can see from the "gkernel-commit" traffic, Andrew Morton has
not only taken away useful information from this thread, he's already
halfway to a solution, in just a day, because  Matthias Andree took
the time to describe the functional requirements instead of just
whining that "it's not like BSD."
 
> Thus why all reasonably paranoid MTAs and other mail programs say "use
> chattr +S on ext2"---we need ordered metadata writes.

And that's precisely the type of thing we want -- unused features should
not impact the rest of the system.
 
Regards,

   Bill Rugolsky


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 16:25       ` Kip Macy
@ 2001-07-27 17:29         ` Ville Herva
  2001-07-27 17:40           ` Alan Cox
  0 siblings, 1 reply; 662+ messages in thread
From: Ville Herva @ 2001-07-27 17:29 UTC (permalink / raw)
  To: Kip Macy; +Cc: Alan Cox, kernel

On Fri, Jul 27, 2001 at 09:25:23AM -0700, you [Kip Macy] claimed:
>  
> sys_write was failing with ENOMEM (on a machine with 1GB of RAM that was 
> not doing anything else).

I second that.

256M memory, no swap at the time.

After fresh boot to the default RH71 kernel (2.4.2-2 or whatever it is) on
console (no X running):

> diff -Naur /usr/src/linux.rh-default /usr/src/linux-2.4.4 > diff
zsh: killed diff

> dmesg | tail
kernel: out of memory, killed process n (xfs)
kernel: out of memory, killed process n (diff)

Phew.


-- v --

v@iki.fi

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 17:29         ` Ville Herva
@ 2001-07-27 17:40           ` Alan Cox
  2001-07-27 17:43             ` Ville Herva
  0 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27 17:40 UTC (permalink / raw)
  To: Ville Herva; +Cc: Kip Macy, Alan Cox, kernel

> After fresh boot to the default RH71 kernel (2.4.2-2 or whatever it is) on
> console (no X running):
> 
> > diff -Naur /usr/src/linux.rh-default /usr/src/linux-2.4.4 > diff
> zsh: killed diff
> 
> > dmesg | tail
> kernel: out of memory, killed process n (xfs)
> kernel: out of memory, killed process n (diff)
> 
> Phew.

No argument on that one. I'm still seeing it in vanilla 2.4.6 as well but
2.4.7 is looking a lot better. 

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-27 16:50   ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-27 17:41     ` Lawrence Greenfield
  2001-07-27 21:16       ` ext3-2.4-0.9.4 Daniel Phillips
  2001-07-28 16:46     ` ext3-2.4-0.9.4 Patrick J. LoPresti
  1 sibling, 1 reply; 662+ messages in thread
From: Lawrence Greenfield @ 2001-07-27 17:41 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

   Date: Fri, 27 Jul 2001 17:50:29 +0100 (BST)
   Cc: linux-kernel@vger.kernel.org
   From: Alan Cox <alan@lxorguk.ukuu.org.uk>

[...]
   > Thus why all reasonably paranoid MTAs and other mail programs say "use
   > chattr +S on ext2"---we need ordered metadata writes.

   And then your IDE disk gets you anyway. Also if you write metadata first 
   then you risk delivering email to the wrong person instead. 

These are tangential issues.  Not everybody uses IDE disks.  I'm not
asking for things that are impossible.  Just because sometimes the
hardware screws you isn't a good reason for not trying to do the right
thing.

The application can avoid the wrong file problem by zeroing out data
before releasing it to the OS to reallocate.

   > You want to help performance?  Give us an fsync() that works on
   > multiple file descriptors at once, or an async fsync() call.  Don't
   > make us fight the OS on getting data to disk.

   And what pray does an asynchronous fsync do. It seems to be a nop to me.

An async fsync allows me to issue multiple fsyncs and then wait for
all of them to complete, hopefully in the same framework that I would
do async I/O (but that's an argument for another day).

   Doing reliabile transactions on disk is a hard problem. That is why oracle
   and friends have spent many man years of research on this kind of problem. 
   Current unix mailers do the smoke mirrors and prayer bit to reduce the
   probability a little that is all, regardless of fs and os.

Isn't the point of the operating system to try to make it as easy as
possible to do these things correctly?

Otherwise you force anyone who wants to write a reliable application
(be it e-mail or not) to go to Oracle and one wonders why fsync() is
even implemented.

Larry


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 17:40           ` Alan Cox
@ 2001-07-27 17:43             ` Ville Herva
  0 siblings, 0 replies; 662+ messages in thread
From: Ville Herva @ 2001-07-27 17:43 UTC (permalink / raw)
  To: Alan Cox; +Cc: Kip Macy, kernel

On Fri, Jul 27, 2001 at 06:40:32PM +0100, you [Alan Cox] claimed:
> > After fresh boot to the default RH71 kernel (2.4.2-2 or whatever it is) on
> > console (no X running):
> > 
> > > diff -Naur /usr/src/linux.rh-default /usr/src/linux-2.4.4 > diff
> > zsh: killed diff
> > 
> > > dmesg | tail
> > kernel: out of memory, killed process n (xfs)
> > kernel: out of memory, killed process n (diff)
> > 
> > Phew.
> 
> No argument on that one. I'm still seeing it in vanilla 2.4.6 as well but
> 2.4.7 is looking a lot better. 

I wasn't able to easily reproduce that on 2.4.4ac5 (that I upgraded into
almost immediately). It may be that the OOM rambo wasn't fully sane on that
one either, but at least it seemed to handle the silly "someone filled the
cache - gee, we must be oom" case rather better...


-- v --

v@iki.fi

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
       [not found] ` <no.id>
                     ` (25 preceding siblings ...)
  2001-07-27 16:55   ` ReiserFS / 2.4.6 / Data Corruption Alan Cox
@ 2001-07-27 17:45   ` Alan Cox
  2001-07-27 17:52   ` ext3-2.4-0.9.4 Alan Cox
                     ` (176 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27 17:45 UTC (permalink / raw)
  To: Lawrence Greenfield; +Cc: Alan Cox, linux-kernel

> 
> "Paul G. Allen" <pgallen@randomlogic.com> writes:
>  
> > Do the newer kernel releases support the 760 MP chipset? Will they
> > anytime soon? (If not I will see if I can put it in myself.)
> 
> There is better support in 2.4.7 (especially IDE) but there is not complete
> support.  
> 
> I don't know of anyone planning on finishing up any the pieces so feel free.
> 
> Eric
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 15:46       ` Hans Reiser
@ 2001-07-27 17:46         ` Christoph Rohland
  2001-07-27 18:02           ` Hans Reiser
  2001-07-27 18:10         ` Dustin Byford
  2001-07-28 16:10         ` Henning P. Schmiedehausen
  2 siblings, 1 reply; 662+ messages in thread
From: Christoph Rohland @ 2001-07-27 17:46 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Joshua Schmidlkofer, linux-kernel

Hi Hans,

On Fri, 27 Jul 2001, Hans Reiser wrote:
> Maybe somebody else who is using both ReiserFS and RedHat's boot
> scripts can comment on whether things are slow for them and if so,
> where they get slow.

At least not if it's not the root disk. I have a RH71 box with a 19GB
reiserfs partition and it's booting fast and fine.

Greetings
		Christoph



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
       [not found] ` <no.id>
                     ` (26 preceding siblings ...)
  2001-07-27 17:45   ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-27 17:52   ` Alan Cox
  2001-07-27 19:31   ` Linux 2.4.7-ac1 PNP Oops on shutdown Alan Cox
                     ` (175 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27 17:52 UTC (permalink / raw)
  To: Lawrence Greenfield; +Cc: Alan Cox, linux-kernel

> These are tangential issues.  Not everybody uses IDE disks.  I'm not
> asking for things that are impossible.  Just because sometimes the

Actually if I remember rightly the problem is mathematically insoluble

> The application can avoid the wrong file problem by zeroing out data
> before releasing it to the OS to reallocate.

When you zero out the data what order do you want those writes in relative
to the rename

> An async fsync allows me to issue multiple fsyncs and then wait for
> all of them to complete, hopefully in the same framework that I would
> do async I/O (but that's an argument for another day).

Ok.. right that makes more sense. So you actually want 'begin_fsync' and
'wait_fsync_all' type stuff

>    Doing reliabile transactions on disk is a hard problem. That is why oracle
>    and friends have spent many man years of research on this kind of problem. 
>    Current unix mailers do the smoke mirrors and prayer bit to reduce the
>    probability a little that is all, regardless of fs and os.
> 
> Isn't the point of the operating system to try to make it as easy as
> possible to do these things correctly?

The OS doesnt have enough information. To do transactions you must know the
entire material that corresponds to the transaction and bound it. That isnt
something the kernel has the knowledge about.

The job of the OS is to make the simple things easy, and the hard possible.
Not to burden the simple with the cost of the hard. That why the chattr +S
is such a nice solution in many ways

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 17:46         ` Christoph Rohland
@ 2001-07-27 18:02           ` Hans Reiser
  0 siblings, 0 replies; 662+ messages in thread
From: Hans Reiser @ 2001-07-27 18:02 UTC (permalink / raw)
  To: Christoph Rohland; +Cc: Joshua Schmidlkofer, linux-kernel

Christoph Rohland wrote:
> 
> Hi Hans,
> 
> On Fri, 27 Jul 2001, Hans Reiser wrote:
> > Maybe somebody else who is using both ReiserFS and RedHat's boot
> > scripts can comment on whether things are slow for them and if so,
> > where they get slow.
> 
> At least not if it's not the root disk. I have a RH71 box with a 19GB
> reiserfs partition and it's booting fast and fine.
> 
> Greetings
>                 Christoph
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

Ok, well then I conclude that it was a user misdiagnosis as to what his booting problem was of some
unknowable form.

Apologies to RedHat.

Hans

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 15:46       ` Hans Reiser
  2001-07-27 17:46         ` Christoph Rohland
@ 2001-07-27 18:10         ` Dustin Byford
  2001-07-27 19:20           ` Hans Reiser
  2001-07-28 16:10         ` Henning P. Schmiedehausen
  2 siblings, 1 reply; 662+ messages in thread
From: Dustin Byford @ 2001-07-27 18:10 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Joshua Schmidlkofer, linux-kernel

Hans Reiser wrote:

> Maybe somebody else who is using both ReiserFS and RedHat's boot scripts can comment on whether
> things are slow for them and if so, where they get slow.


For what it's worth I just configured a RedHat 7.1 box with reiserfs on 
all partitions except /boot using this update disk 
ftp://139.82.28.40/pub/update-rh71reiser-v1.img from 
http://cambuca.ldhs.cetuc.puc-rio.br/.

Upgraded all of redhat's packages, note there is a SysVinit update and a 
gcc update.

Compiled a 2.4.7-pre kernel and the latest reiserfsprogs.

Mounted /boot read only to eliminate the chance of an fsck required on 
that partition.

I have been running reiserfs on a mail server with about 60k accounts 
(30k really active) for about 6 months. I haven't experienced any 
problems with the filesystems. The one I just configured is its intended 
replacment. After a few days of testing with bonnie, some perl scripts I 
wrote, and a few pullings of the power cord I think it's almost ready 
for production. An upgrade to 2.4.7 and some more testing will tell.

If you pull the plug on this machine it takes around 40 seconds to get 
back to the login prompt, (p3-600 60G ide drive). Including the act of 
pulling the power cord, bios delays, lilo delays, and the rest of the 
redhat boot sequence.

So, in my experience I've had very few problems with reiserfs and 
redhat. That said, the slightest hint of data corruption under normal 
(continuous power, no failing hardware) operation and I'll probably be 
evaluating other filesystems. There are sometimes as many as 500,000 
files on this filesystem, reiserfs seems to do a good job under my 
conditions.

				--Dustin

Also, one purely cosmetic patch to rc.sysinit if you want:
--- rc.sysinit.orig Fri Jul 27 13:06:58 2001
+++ rc.sysinit Fri Jul 27 13:38:25 2001
@@ -211,7 +211,8 @@

_RUN_QUOTACHECK=0
ROOTFSTYPE=`grep " / " /proc/mounts | awk '{ print $3 }'`
-if [ -z "$fastboot" -a "$ROOTFSTYPE" != "nfs" ]; then
+if [ -z "$fastboot" -a "$ROOTFSTYPE" != "nfs" \
+ -a "$ROOTFSTYPE" != "reiserfs" ]; then

STRING=$"Checking root filesystem"
echo $STRING


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Hard disk problem:
  2001-07-27 15:22     ` Steve Underwood
@ 2001-07-27 19:18       ` Bill Pringlemeir
  0 siblings, 0 replies; 662+ messages in thread
From: Bill Pringlemeir @ 2001-07-27 19:18 UTC (permalink / raw)
  To: linux-kernel

>>>>> "Steve" == Steve Underwood <steveu@coppice.org> writes:

 Steve> But he is right. Practically all the "Made in Hungary" ones
 Steve> develop bad sectors after a few months. The "Made in
 Steve> Phillipinnes" ones do not.  Strangely, I am Hong Kong and
 Steve> almost all the GXP75s we got here were made in Hungary - go
 Steve> figure! They were so bad the dealers finally wouldn't stock
 Steve> them. If your experience has been different, think yourself
 Steve> lucky.

I have an IBM drive made in Hungary.  It get `fiery hot'!  I kept
moving it until I had it in a place with good thermal contact to the
case.  Then these drive errors went away.  At first I had a Linux
install on that drive.  Later it crashed, I fixed it, deleted the OS
on my HDA that I was no longer using and moved Linux there.  Now I
only keep MP3s, tmp, and swap (a 2nd one) on the IBM drive.

Sometimes when I close the case, I still get errors.  So it may be a
case of overheating.  You could try to change the position of the drive
to see if it fixes things.  Mine was made in Hungary.  And in case (ha ha)
I am talking crap,

[bpringle@localhost bpringle]$ dmesg | grep ^hdd:
hdd: IBM-DTTA-351010, ATA DISK drive
hdd: 19807200 sectors (10141 MB) w/466KiB Cache, CHS=19650/16/63, UDMA(33)

 Model=IBM-DTTA-351010, FwRev=T56OA73A, SerialNo=WF0WFFD7387
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=34
 BuffType=3(DualPortCache), BuffSize=466kB, MaxMultSect=16, MultSect=off
 DblWordIO=no, maxPIO=2(fast), DMA=yes, maxDMA=2(fast)
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=19807200
 WARNING 3293136 ORPHANED SECTORS :: KERNEL REPORTING ERROR
 tDMA={min:120,rec:120}, DMA modes: sword0 sword1 sword2 mword0 mword1 mword2
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, PIO modes: mode3 mode4
 UDMA modes: mode0 mode1 *mode2
 Drive Supports : ATA/ATAPI-4 T13 1153D revision 17 : ATA-1 ATA-2 ATA-3 ATA-4

fwiw,
Bill Pringlemeir.












^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 18:10         ` Dustin Byford
@ 2001-07-27 19:20           ` Hans Reiser
  0 siblings, 0 replies; 662+ messages in thread
From: Hans Reiser @ 2001-07-27 19:20 UTC (permalink / raw)
  To: Dustin Byford; +Cc: Joshua Schmidlkofer, linux-kernel, Edward Shushkin

Dustin Byford wrote:

> Also, one purely cosmetic patch to rc.sysinit if you want:
> --- rc.sysinit.orig Fri Jul 27 13:06:58 2001
> +++ rc.sysinit Fri Jul 27 13:38:25 2001
> @@ -211,7 +211,8 @@
> 
> _RUN_QUOTACHECK=0
> ROOTFSTYPE=`grep " / " /proc/mounts | awk '{ print $3 }'`
> -if [ -z "$fastboot" -a "$ROOTFSTYPE" != "nfs" ]; then
> +if [ -z "$fastboot" -a "$ROOTFSTYPE" != "nfs" \
> + -a "$ROOTFSTYPE" != "reiserfs" ]; then
> 
> STRING=$"Checking root filesystem"
> echo $STRING


Yes, this patch is much needed.  Edward, put it on our website in an appropriate location.

Hans

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Linux 2.4.7-ac1 PNP Oops on shutdown
       [not found] ` <no.id>
                     ` (27 preceding siblings ...)
  2001-07-27 17:52   ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-27 19:31   ` Alan Cox
  2001-07-27 20:19   ` VIA KT133A / athlon / MMX Alan Cox
                     ` (174 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27 19:31 UTC (permalink / raw)
  To: Udo A. Steinberg; +Cc: Alan Cox, Linux Kernel

> 2.4.7-ac1 oopses reproduceably during every shutdown. As far as I can tell,
> 2.4.6-ac5 didn't exhibit this behaviour.

>From the trace that looks what I would expect

> >>EIP; c0112b5d <complete+1d/a0>   <=====
> Trace; c011792d <complete_and_exit+d/20>
> Trace; c01dde51 <pnp_dock_thread+d1/e0>
> Trace; c01054c8 <kernel_thread+28/40>
> Code;  c0112b5d <complete+1d/a0>
> 00000000 <_EIP>:
> Code;  c0112b5d <complete+1d/a0>   <=====
>    0:   8b 03                     mov    (%ebx),%eax   <=====

Its oopsing in the complete_and_exit changes killing the PnP docking thread.

A quick look over the code and I have to admit I don't see why that happened
I'll ponder it later


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
       [not found] ` <no.id>
                     ` (28 preceding siblings ...)
  2001-07-27 19:31   ` Linux 2.4.7-ac1 PNP Oops on shutdown Alan Cox
@ 2001-07-27 20:19   ` Alan Cox
  2001-07-27 20:37     ` Chris Wedgwood
  2001-07-27 21:24   ` ReiserFS / 2.4.6 / Data Corruption Alan Cox
                     ` (173 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27 20:19 UTC (permalink / raw)
  To: PEIFFER Pierre; +Cc: linux-kernel

> have not found clear answer on the different threads about this topic.
> As I understand, this problem does not exist on every athlon but only on
> some which work with the VIA KT133 chipset ? Right ?

Its heavily tied to certain motherboards. Some people found a better PSU
fixed it, others that altering memory settings helped. And in many cases,
taking it back and buying a different vendors board worked.

> 	Anyway, feel free to ask me more information if needed and please,
> CC'ed me personally the answers/comments because I'm not subscribed to
> the LKML.

http://www.linuxhardware.org/article.pl?sid=01/06/06/1821202&mode=thread

gives a good feel for the current state of play

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-27 20:19   ` VIA KT133A / athlon / MMX Alan Cox
@ 2001-07-27 20:37     ` Chris Wedgwood
  2001-07-27 20:40       ` Alan Cox
  2001-07-28 17:29       ` PEIFFER Pierre
  0 siblings, 2 replies; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-27 20:37 UTC (permalink / raw)
  To: Alan Cox; +Cc: PEIFFER Pierre, linux-kernel

On Fri, Jul 27, 2001 at 09:19:21PM +0100, Alan Cox wrote:

    Its heavily tied to certain motherboards. Some people found a
    better PSU fixed it, others that altering memory settings
    helped. And in many cases, taking it back and buying a different
    vendors board worked.

Does anyone know *why* stuff breaks? surely VIA do as they have a fix
for (some, all?) cases of breakage?

My guess is its some kind of timing or near-miss on a signal edge, and
the bios changes relax things so you don't miss whatever it was you
missed before.



  --cw

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Strange remount behaviour with ext3-2.4-0.9.4
  2001-07-27  9:32 ` Strange remount behaviour with ext3-2.4-0.9.4 Sean Hunter
  2001-07-27 10:24   ` Andrew Morton
@ 2001-07-27 20:39   ` Michal Jaegermann
  2001-07-27 20:46     ` Alan Cox
  1 sibling, 1 reply; 662+ messages in thread
From: Michal Jaegermann @ 2001-07-27 20:39 UTC (permalink / raw)
  To: linux-kernel

On Fri, Jul 27, 2001 at 10:32:21AM +0100, Sean Hunter wrote:
> Following the announcement on lkml, I have started using ext3 on one of my
> servers.  Since the server in question is a farily security-sensitive box, my
> /usr partition is mounted read only except when I remount rw to install
> packages.

Regardless of possible weirdness in a "smart" behaviour of 'mount' what
one exactly buys running a journaling file system on a _read only_
partition?  fsck times will be the same (unless you crashed when
installing new software :-).

  Michal

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-27 20:37     ` Chris Wedgwood
@ 2001-07-27 20:40       ` Alan Cox
       [not found]         ` <3B61E5BC.5780E1E@randomlogic.com>
                           ` (3 more replies)
  2001-07-28 17:29       ` PEIFFER Pierre
  1 sibling, 4 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-27 20:40 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Alan Cox, PEIFFER Pierre, linux-kernel

> On Fri, Jul 27, 2001 at 09:19:21PM +0100, Alan Cox wrote:
>     Its heavily tied to certain motherboards. Some people found a
>     better PSU fixed it, others that altering memory settings
>     helped. And in many cases, taking it back and buying a different
>     vendors board worked.
> 
> Does anyone know *why* stuff breaks? surely VIA do as they have a fix
> for (some, all?) cases of breakage?

At the moment the big problem is I don't have enough reliable info to
see patterns that I can give to VIA for study. VIAs fixes for board problems
are for the fifo problem normally seen with the 686B and SB Live but
sometimes in other cases.

(and it seems also we have a few via + promise weirdnesses on all sorts of
 boards not yet explained)

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 15:06   ` Alan Cox
  2001-07-27 15:26     ` Joshua Schmidlkofer
  2001-07-27 15:31     ` Hans Reiser
@ 2001-07-27 20:46     ` Lehmann 
  2001-07-27 21:13       ` Hans Reiser
  2 siblings, 1 reply; 662+ messages in thread
From: Lehmann  @ 2001-07-27 20:46 UTC (permalink / raw)
  To: Alan Cox; +Cc: Hans Reiser, Joshua Schmidlkofer, kernel

On Fri, Jul 27, 2001 at 04:06:16PM +0100, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > Don't use RedHat with ReiserFS, they screw things up so many ways.....
> > For instance, they compile it with the wrong options set, their boot scripts are wrong, they just
> > shovel software onto the CD.
> 
> Sorry Hans you can rant all you like but you know you are wrong on most
> of that. RH did weeks of stress testing on multiple systems up to 8Gb 8 way
> and didn't ship until we stopped seeing corruption problems with the mm/fs
> code. 

You might be well advised looking at reality (visit a few other projects)
and you'll see that redhat, indeed, has a very bad reputation. Wether it's
gimp, gtk, perl, wine, dosemu or any other project, the basic reaction is:
oh, you have gt problems under redhat? you compile it yourself and most
probably your problems will go away (gtk+ even had this message in their
install script).

> That test suite caught bugs in kernel revisions other vendors shipped
> blindly to their customers without fixing.

they might have a very good testsuite, but that means nothing: redhat
so frequently takes snapshots of undebugged alpha versions of software
(higher version numbers) that no matter of testing will suffice to ever
make this work.

the might be doing well for the kernel, but that only gets you so far.

> That is hardly shovelling software onto the CD.

Right, that's shovelling the latest alpha versions of software onto CD.

> > Actually, I am curious as to exactly how they manage to make ReiserFS boot longer than ext2.  Do
> > they run fsck or what?
> No. The only thing I can think of that might slow it is that we build with
> the reiserfs paranoia/sanity checks on.

That's a pretty dumb thing. Maybe one should have asked the develoers
before doing this (they never do). Redhat somehow manages pretty well to
show reiserfs in a bad light ;)

However, ext2 is much faster on mount time with -onocheck (instantaneous);
and for all current harddisk sizes ext2 is somewhat to much slower on
mount. And yes, the redhat init system (just like suse's or most others,
of course) is sooo slow that improving the init system will have a much
larger effect than the ext2/reiserfs differences.

(So trying to improve this in the kernel would be the wrong place to
start).

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Strange remount behaviour with ext3-2.4-0.9.4
  2001-07-27 20:39   ` Michal Jaegermann
@ 2001-07-27 20:46     ` Alan Cox
  2001-07-27 21:08       ` Chris Wedgwood
  0 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27 20:46 UTC (permalink / raw)
  To: Michal Jaegermann; +Cc: linux-kernel

> Regardless of possible weirdness in a "smart" behaviour of 'mount' what
> one exactly buys running a journaling file system on a _read only_
> partition?  fsck times will be the same (unless you crashed when
> installing new software :-).

Several things:

1.	The simple case of remounting an fs read-only is easy, since no
	writes means no journal

2.	The software suspend case is horrible. Right now mixing a
	journalling fs and swsuspend tends to cause disk corruption because
	journalling fs's write to disk when told to mount read only

3.	Failed drives. Here the journalling mount overrides the read only
	request and the machine locks up preventing data recovery except
	by copying the whole 80Gb disk image to another disk

	Been there, it sucks

4.	Snapshots. Making read only snapshots can be very useful, and there
	you want the replay of the log to be into the page cache but not
	written back to physical media until its marked read-write

Alan


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Strange remount behaviour with ext3-2.4-0.9.4
  2001-07-27 20:46     ` Alan Cox
@ 2001-07-27 21:08       ` Chris Wedgwood
  2001-07-27 21:23         ` Alan Cox
  2001-07-28 14:37         ` Kai Henningsen
  0 siblings, 2 replies; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-27 21:08 UTC (permalink / raw)
  To: Alan Cox; +Cc: Michal Jaegermann, linux-kernel

On Fri, Jul 27, 2001 at 09:46:57PM +0100, Alan Cox wrote:

    2.	The software suspend case is horrible. Right now mixing a
    	journalling fs and swsuspend tends to cause disk corruption because
    	journalling fs's write to disk when told to mount read only

this is hard to fix... the fs needs to replay things to make things
consistent, and in many cases doing an 'in-memory' replay isn't an
option (ie. remember which stuff needs to replayed and read from the
journal instead of disk when required to do so)

    4.	Snapshots. Making read only snapshots can be very useful, and there
    	you want the replay of the log to be into the page cache but not
    	written back to physical media until its marked read-write

R/O snapshots are best done in the fs if possible, al la
WAFL. Something like that for resierfs or TUX2 would rule so much (you
more-or-less need need a tree-based fs and reference counting for all
the magic bits).  In fact, doing it as the fs layer means you could
have r/w snapshots with COW semantics.



  --cw

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 20:46     ` Lehmann 
@ 2001-07-27 21:13       ` Hans Reiser
  0 siblings, 0 replies; 662+ messages in thread
From: Hans Reiser @ 2001-07-27 21:13 UTC (permalink / raw)
  To: A. Lehmann; +Cc: Alan Cox, Joshua Schmidlkofer, kernel

"pcg( Marc)"@goof(A.).(Lehmann )com wrote:

> > No. The only thing I can think of that might slow it is that we build with
> > the reiserfs paranoia/sanity checks on.
> 
> That's a pretty dumb thing. Maybe one should have asked the develoers
> before doing this (they never do). Redhat somehow manages pretty well to
> show reiserfs in a bad light ;)

Let us be a bit more precise here.  If you click on the help button when deciding whether to select
that option it tells you not to do it.  What can you say about a distro that doesn't read the help
buttons for the kernel options when configuring the kernel?  Shovelware?

Hans

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-27 17:41     ` ext3-2.4-0.9.4 Lawrence Greenfield
@ 2001-07-27 21:16       ` Daniel Phillips
  0 siblings, 0 replies; 662+ messages in thread
From: Daniel Phillips @ 2001-07-27 21:16 UTC (permalink / raw)
  To: Lawrence Greenfield, Alan Cox; +Cc: linux-kernel

On Friday 27 July 2001 19:41, Lawrence Greenfield wrote:
> From: Alan Cox <alan@lxorguk.ukuu.org.uk>
> > Lawrence Greenfield wrote:
> > > You want to help performance?  Give us an fsync() that works on
> > > multiple file descriptors at once, or an async fsync() call. 
> > > Don't make us fight the OS on getting data to disk.
> >
> > And what pray does an asynchronous fsync do. It seems to be a nop
> to me.
>
> An async fsync allows me to issue multiple fsyncs and then wait for
> all of them to complete, hopefully in the same framework that I would
> do async I/O (but that's an argument for another day).

I'll say.  While it's truly desirable, all known filesystems are *far* 
from being able to do that.  An efficient, reliable fsync would do the 
trick for you, or even an efficient sync.  And somewhere in Andrew 
Morton's bag of tricks is something to fix you up too, read his 
comments carefully.

Looking forward, a sanely defined filesystem transaction interface 
from userland would give the best possible combination of performance 
and reliability.[1]  Since we now have four filesystems (five if you 
count JFFS) that could implement such a transaction interface, now is 
the time to figure out what it would look like.  That would include 
accomodating the needs of MTA developers.  It would be Linux-specific 
for sure.  It would also be progress.  If it turned out to be the 
fastest way to run a mailer we'd see it migrate to other nixes soon 
enough.

>    Doing reliabile transactions on disk is a hard problem. That is
> why oracle and friends have spent many man years of research on this
> kind of problem.

Tell me about it ;-)

> Current unix mailers do the smoke mirrors and prayer
> bit to reduce the probability a little that is all, regardless of fs
> and os.
>
> Isn't the point of the operating system to try to make it as easy as
> possible to do these things correctly?

   begin_transaction (filesystem_handle);
   <send the mail>;
   if (!end_transaction (filesystem_handle))
	<confirm sent>;

Something like that.[2]  Caveat: this is blue-sky stuff, it is not 
going to solve your problem today.  Andrew Morton and Hans Reiser are 
working on solving the problem today by giving you at least one mode 
where rename is synchronous, or at least giving you a fast fsync.

I'm with those who think that a little short-term pain is worth it if 
the final result is superior.

> Otherwise you force anyone who wants to write a reliable application
> (be it e-mail or not) to go to Oracle and one wonders why fsync() is
> even implemented.

[1] Al Viro pointed out that such a transaction interface could open up 
new possibilities for DOS attacks, something that has to be anticipated 
in the design.

[2] I see Alan suggested essentially the same thing in another branch 
of this thread.  Then by the "million flies" theorum...

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Strange remount behaviour with ext3-2.4-0.9.4
  2001-07-27 21:08       ` Chris Wedgwood
@ 2001-07-27 21:23         ` Alan Cox
  2001-07-27 21:27           ` Chris Wedgwood
  2001-07-28 14:37         ` Kai Henningsen
  1 sibling, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27 21:23 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Alan Cox, Michal Jaegermann, linux-kernel

> more-or-less need need a tree-based fs and reference counting for all
> the magic bits).  In fact, doing it as the fs layer means you could
> have r/w snapshots with COW semantics.

You dont want r/w snapshots for archiving. An archive of a previous date is
worthless if it can't be absolutely utterly and definitively read only.

It is hard to do well, but its an important item. One possiiblity is to do
it by replaying the log to a r/w snapshot (in ram) over a r/o snapshot (on
stable media)

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
       [not found] ` <no.id>
                     ` (29 preceding siblings ...)
  2001-07-27 20:19   ` VIA KT133A / athlon / MMX Alan Cox
@ 2001-07-27 21:24   ` Alan Cox
  2001-07-27 21:47     ` Hans Reiser
  2001-07-27 22:10   ` Alan Cox
                     ` (172 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27 21:24 UTC (permalink / raw)
  To: Hans Reiser; +Cc: A. Lehmann, Alan Cox, Joshua Schmidlkofer, kernel

> Let us be a bit more precise here.  If you click on the help button when deciding whether to select
> that option it tells you not to do it.  What can you say about a distro that doesn't read the help
> buttons for the kernel options when configuring the kernel?  Shovelware?

The alternative was to disable it. Because at the time we had lots of good
evidence it didnt work reliably. Evidence backed up by the pile of later
Chris Mason patches.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Strange remount behaviour with ext3-2.4-0.9.4
  2001-07-27 21:23         ` Alan Cox
@ 2001-07-27 21:27           ` Chris Wedgwood
  0 siblings, 0 replies; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-27 21:27 UTC (permalink / raw)
  To: Alan Cox; +Cc: Michal Jaegermann, linux-kernel

On Fri, Jul 27, 2001 at 10:23:44PM +0100, Alan Cox wrote:

    You dont want r/w snapshots for archiving. An archive of a
    previous date is worthless if it can't be absolutely utterly and
    definitively read only.

sure, for archiving you don't, but for other purposes you might

RO is easier and what most people want, this is all WAFL gives right now

RW has it's uses too, especially if you can clone /foo/bar to
/foo/blem and such like, a cheaper more elegant way of cp -Rupdl I
guess

    It is hard to do well, but its an important item. One possiiblity
    is to do it by replaying the log to a r/w snapshot (in ram) over a
    r/o snapshot (on stable media)

you can probably get away without the need for replay... just build
and in-memory extent list of blocks to would otherwise have been
rewritten and the journal offsets, before you read a block, you check
to see if you need to get from journal first

obviously you need to make sure you get the last insatce of each block
in the journal should there be more than one



  --cw

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 21:24   ` ReiserFS / 2.4.6 / Data Corruption Alan Cox
@ 2001-07-27 21:47     ` Hans Reiser
  0 siblings, 0 replies; 662+ messages in thread
From: Hans Reiser @ 2001-07-27 21:47 UTC (permalink / raw)
  To: Alan Cox; +Cc: A. Lehmann, Joshua Schmidlkofer, kernel

Alan Cox wrote:
> 
> > Let us be a bit more precise here.  If you click on the help button when deciding whether to select
> > that option it tells you not to do it.  What can you say about a distro that doesn't read the help
> > buttons for the kernel options when configuring the kernel?  Shovelware?
> 
> The alternative was to disable it. Because at the time we had lots of good
> evidence it didnt work reliably. Evidence backed up by the pile of later
> Chris Mason patches.
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

Better to disable it than to cripple it.

By the way, how about considering the use of tests before redhat coders put stuff in the linux
kernel?  You know, if VFS changes actually got tested before users encountered things like Viro
breaking ReiserFS in 2.4.5, it would be nice.

At Namesys, like all normal software shops, we actually run a test suite before shipping code
externally.  We usually try to require that it be tested by at least one person in addition to the
code author.

It would catch things like your gcc problems.  Test suites don't catch everything, but they are
considered the responsible thing to do at most places.

Hans

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
       [not found] ` <no.id>
                     ` (30 preceding siblings ...)
  2001-07-27 21:24   ` ReiserFS / 2.4.6 / Data Corruption Alan Cox
@ 2001-07-27 22:10   ` Alan Cox
  2001-07-28  7:36     ` Hans Reiser
  2001-07-27 23:46   ` Linx Kernel Source tree and metrics Alan Cox
                     ` (171 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27 22:10 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Alan Cox, A. Lehmann, Joshua Schmidlkofer, kernel

> By the way, how about considering the use of tests before redhat coders put stuff in the linux
> kernel?  You know, if VFS changes actually got tested before users encountered things like Viro
> breaking ReiserFS in 2.4.5, it would be nice.
> At Namesys, like all normal software shops, we actually run a test suite before shipping code
> externally.  We usually try to require that it be tested by at least one person in addition to the
> code author.

*PLONK*

No doubt if Namesys ran test suites all the tail merging bug fiasco and the
directory/tree balance races wouldnt have happened.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
       [not found]         ` <3B61E5BC.5780E1E@randomlogic.com>
@ 2001-07-27 22:12           ` Paul G. Allen
  0 siblings, 0 replies; 662+ messages in thread
From: Paul G. Allen @ 2001-07-27 22:12 UTC (permalink / raw)
  To: linux-kernel

I meant to send this to the list, but sent it straight to Alan instead.

PGA

"Paul G. Allen" wrote:
> 
> Alan Cox wrote:
> >
> 
> [SNIP]
> >
> > (and it seems also we have a few via + promise weirdnesses on all sorts of
> >  boards not yet explained)
> 
> I happen to have one of these boards. I was rather upset with it because it would lock Linux several times a day, especially while playing games. This is part
> of what drove me to purchase the K7 Thunder I now have and put the Asus A7V133 on the shelf.
> 
> Is there anything I can do that might help track down the problem(s)? I still have the board. In fact, it is a complete system less the SB Live! and GeForce 3
> that I relocated to my K7 Thunder, and it's running a Duron 750. (I also have a second system with a SB Live! and Athlon 1.2, but I'd have to beg my wife for
> its use. ;)
> 
> PGA
> 


-- 
Paul G. Allen
UNIX Admin II/Programmer
Akamai Technologies, Inc.
www.akamai.com
Work: (858)909-3630
Cell: (858)395-5043

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Linx Kernel Source tree and metrics
       [not found] ` <no.id>
                     ` (31 preceding siblings ...)
  2001-07-27 22:10   ` Alan Cox
@ 2001-07-27 23:46   ` Alan Cox
  2001-07-28  0:20     ` Paul G. Allen
  2001-07-28 19:08   ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Alan Cox
                     ` (170 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-27 23:46 UTC (permalink / raw)
  To: Paul G. Allen
  Cc: kplug-list, Linux kernel developer's mailing list, kplug-lpsg

> If this happens, I'll update it to the latest source (whatever happens to be available at that time). If it doesn't, I'll update it anyway and just bite the
> bullet and upload the data to a server with more bandwidth.

bzip2 -9 is your friend for repetetive data 8)

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-27 20:40       ` Alan Cox
       [not found]         ` <3B61E5BC.5780E1E@randomlogic.com>
@ 2001-07-28  0:04         ` Kurt Garloff
  2001-07-28  0:23         ` David Lang
  2001-07-29  4:03         ` Gav
  3 siblings, 0 replies; 662+ messages in thread
From: Kurt Garloff @ 2001-07-28  0:04 UTC (permalink / raw)
  To: Alan Cox; +Cc: Chris Wedgwood, PEIFFER Pierre, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2832 bytes --]

Hi Alan,

as I stumbled across the K7 KT133 MMX problem, let me report my failure.
MSI K7 Turbo Ver. 3 (BIOS 2.8). K7 1.2GHz. 256MB SDRam, tested fine by
memtes86 2.7.

The thing would Oops or just hand at random places at the boot process if
compiled with K7 optimization.

On Fri, Jul 27, 2001 at 09:40:09PM +0100, Alan Cox wrote:
> > On Fri, Jul 27, 2001 at 09:19:21PM +0100, Alan Cox wrote:
> >     Its heavily tied to certain motherboards. Some people found a
> >     better PSU fixed it, 

PSU = Power supply? 300W should be fine IMHO.

> >     others that altering memory settings helped.

Did not. Board allows you to set CL 3 which won't help (and I guess the SPDs
read out 3 anyway) and to turn off some PCI features which does not
help either.
It does also allow you to increase mainboard speed and multiplier but not
decrease!

> >     And in many cases, taking it back and buying a different
> >     vendors board worked.

The best option most probably.

> > Does anyone know *why* stuff breaks? surely VIA do as they have a fix
> > for (some, all?) cases of breakage?
> 
> At the moment the big problem is I don't have enough reliable info to
> see patterns that I can give to VIA for study.

Well, I did some testing, like reordering the MMX instructions, only using 4
instead of 8 registers, ... to no avail.
It all came down to replacing movntq with movq and the thing magically works.
Looks like the writes just get lost otherwise. (Maybe the sfence is just not
effective? But that would be a CPU bug, not a mainboard one.)

> VIAs fixes for board problems
> are for the fifo problem normally seen with the 686B and SB Live but
> sometimes in other cases.

It also has the 686b southbridge bug, but I believe the workaound works.

> (and it seems also we have a few via + promise weirdnesses on all sorts of
>  boards not yet explained)

No Promise involved here, fortunately.
garloff@gum09:~ $ /sbin/lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133] (rev 03)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 40)
00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
00:07.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
00:07.3 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
00:07.4 Host bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40)
00:07.5 Multimedia audio controller: VIA Technologies, Inc. AC97 Audio Controller (rev 50)
[...]

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, DE                                SCSI, Security

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Linx Kernel Source tree and metrics
  2001-07-27 23:46   ` Linx Kernel Source tree and metrics Alan Cox
@ 2001-07-28  0:20     ` Paul G. Allen
  2001-07-28  1:33       ` Paul G. Allen
  0 siblings, 1 reply; 662+ messages in thread
From: Paul G. Allen @ 2001-07-28  0:20 UTC (permalink / raw)
  Cc: kplug-list, Linux kernel developer's mailing list, kplug-lpsg

Alan Cox wrote:
> 
> > If this happens, I'll update it to the latest source (whatever happens to be available at that time). If it doesn't, I'll update it anyway and just bite the
> > bullet and upload the data to a server with more bandwidth.
> 
> bzip2 -9 is your friend for repetetive data 8)

Isn't that the truth, especially for this much text (I bet it'll compress real nice :)

PGA

-- 
Paul G. Allen
UNIX Admin II/Programmer
Akamai Technologies, Inc.
www.akamai.com
Work: (858)909-3630
Cell: (858)395-5043

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-27 20:40       ` Alan Cox
       [not found]         ` <3B61E5BC.5780E1E@randomlogic.com>
  2001-07-28  0:04         ` Kurt Garloff
@ 2001-07-28  0:23         ` David Lang
  2001-07-28 11:11           ` Kurt Garloff
  2001-07-28 12:47           ` Alan Cox
  2001-07-29  4:03         ` Gav
  3 siblings, 2 replies; 662+ messages in thread
From: David Lang @ 2001-07-28  0:23 UTC (permalink / raw)
  To: Alan Cox; +Cc: cw, ppeiffer, linux-kernel

I have a 1u box at my des that has two MSI boards in it with 1.2G athlons.
at the moment they are both running 2.4.5 (athlon optimized), one box has
no problems at all while the other dies (no video, no keyboard, etc)
within an hour of being booted.

systems have no sound enabled, 512MB ram, 20G ata100 drives. D-Link quad
fast ethernet cards.

if you have any patch you would like me to test on these boxes let me know
(I am arranging to ship this one and three others like it that each have
one working and one failing system in them back to the factory to get the
MLB swapped out on the one that is failing.

David Lang


 On Fri, 27 Jul 2001, Alan Cox wrote:

> Date: Fri, 27 Jul 2001 21:40:09 +0100 (BST)
> From: Alan Cox <alan@lxorguk.ukuu.org.uk>
> To: cw@f00f.org
> Cc: alan@lxorguk.ukuu.org.uk, ppeiffer@free.fr, linux-kernel@vger.kernel.org
> Subject: Re: VIA KT133A / athlon / MMX
>
> > On Fri, Jul 27, 2001 at 09:19:21PM +0100, Alan Cox wrote:
> >     Its heavily tied to certain motherboards. Some people found a
> >     better PSU fixed it, others that altering memory settings
> >     helped. And in many cases, taking it back and buying a different
> >     vendors board worked.
> >
> > Does anyone know *why* stuff breaks? surely VIA do as they have a fix
> > for (some, all?) cases of breakage?
>
> At the moment the big problem is I don't have enough reliable info to
> see patterns that I can give to VIA for study. VIAs fixes for board problems
> are for the fifo problem normally seen with the 686B and SB Live but
> sometimes in other cases.
>
> (and it seems also we have a few via + promise weirdnesses on all sorts of
>  boards not yet explained)
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Linx Kernel Source tree and metrics
  2001-07-28  0:20     ` Paul G. Allen
@ 2001-07-28  1:33       ` Paul G. Allen
  0 siblings, 0 replies; 662+ messages in thread
From: Paul G. Allen @ 2001-07-28  1:33 UTC (permalink / raw)
  To: kplug-list, Linux kernel developer's mailing list, kplug-lpsg

Please, no more wgets on my poor limited bandwidth (256Kbit uplink) web server. Next week I will have a fatter pipe and you can D/L the whole dir if you want.
(Though it would be better if you let me compress it and put it on a ftp server).

Thank you for your support. ;)

PGA

-- 
Paul G. Allen
UNIX Admin II/Programmer
Akamai Technologies, Inc.
www.akamai.com
Work: (858)909-3630
Cell: (858)395-5043

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 2.4.7 + VIA Pro266 + 2xUltraTx2 lockups
  2001-07-27  9:54   ` 2.4.7 + VIA Pro266 + 2xUltraTx2 lockups Alan Cox
@ 2001-07-28  4:03     ` Robin Humble
  0 siblings, 0 replies; 662+ messages in thread
From: Robin Humble @ 2001-07-28  4:03 UTC (permalink / raw)
  To: linux-kernel


Alan Cox wrote:
>Robin Humble wrote:
>> So the system is stable when driving a single Tx2 card, or on a BX,
>> but just not two Tx2's together on the pro266 board :-/ So it's
>> perhaps (I'm guessing here :) a non-trivial Tx2 driver bug or maybe a
>> VIA Pro266 problem?
>
>Firstly please try 2.4.6-ac5 as that has the proper VIA workaround for their
>bridge bugs. Its useful to rule out the very conservative approach the older
>kernels use to avoid the disk corruption problem they had

Ok. That locked up in the same way unfortunately :-/
Also a 2.4.8-pre1-xfs that I just tried...
I tried the "noapic" option as suggested in another email and that
didn't change anything either.

We've moved all the disks and controllers to a BX m/b machine for now, but
if there's anything else you want us to be guinea pigs for them we'll be
happy to try it out on the VIA Pro266 machine.
One other odd thing is that I have yet to make the CUV266 board see any
devices on its built-in secondary IDE controller. I have no idea why that
could be... The BIOS just doesn't detect them. Might that be a related
problem? Perhaps it's a faulty motherboard? Seems unlikely.

Please CC me on any replies as I'm not subscribed... ta...

cheers,
robin

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 22:10   ` Alan Cox
@ 2001-07-28  7:36     ` Hans Reiser
  2001-07-28 14:08       ` Chris Mason
  0 siblings, 1 reply; 662+ messages in thread
From: Hans Reiser @ 2001-07-28  7:36 UTC (permalink / raw)
  To: Alan Cox; +Cc: A. Lehmann, Joshua Schmidlkofer, kernel

Alan Cox wrote:
> 
> > By the way, how about considering the use of tests before redhat coders put stuff in the linux
> > kernel?  You know, if VFS changes actually got tested before users encountered things like Viro
> > breaking ReiserFS in 2.4.5, it would be nice.
> > At Namesys, like all normal software shops, we actually run a test suite before shipping code
> > externally.  We usually try to require that it be tested by at least one person in addition to the
> > code author.
> 
> *PLONK*
> 
> No doubt if Namesys ran test suites all the tail merging bug fiasco and the
> directory/tree balance races wouldnt have happened.
Our test suites need much improvement, but we do have them and use them.  Can you say the same?

Hans

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-28  0:23         ` David Lang
@ 2001-07-28 11:11           ` Kurt Garloff
  2001-07-28 11:49             ` Victor Julien
  2001-07-29  0:37             ` J. Dow
  2001-07-28 12:47           ` Alan Cox
  1 sibling, 2 replies; 662+ messages in thread
From: Kurt Garloff @ 2001-07-28 11:11 UTC (permalink / raw)
  To: David Lang
  Cc: Alan Cox, cw, ppeiffer, linux-kernel, Arjan van de Ven, Chris Brady


[-- Attachment #1.1: Type: text/plain, Size: 1327 bytes --]

Hi,

On Fri, Jul 27, 2001 at 05:23:07PM -0700, David Lang wrote:
> I have a 1u box at my des that has two MSI boards in it with 1.2G athlons.
> at the moment they are both running 2.4.5 (athlon optimized), one box has
> no problems at all while the other dies (no video, no keyboard, etc)
> within an hour of being booted.

Somebody told he had the same MoBo already replaced a couple of times ...

> if you have any patch you would like me to test on these boxes let me know

Well, no kernel patches.
But some program which does the K7 optmizied copies and zeroing in userspace.
(Attached)

Strange enough it succeeds on the machine that fails to boot a K7 optimized
kernel. 
So I'm puzzled now. 
Seems we can trigger problems in kernelspace that we can't have in userspace?
Some problem with non-serialization if an interrupt occurs or something
esoteric like this?

> (I am arranging to ship this one and three others like it that each have
> one working and one failing system in them back to the factory to get the
> MLB swapped out on the one that is failing.

Good luck!
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, DE                                SCSI, Security

[-- Attachment #1.2: test_movntq.c --]
[-- Type: text/plain, Size: 4336 bytes --]

/* test_movntq.c 
 * Program that tests the K7 optimized routines for copying 
 * and zeroing pages (which fail on some MoBos in the kernel).
 * gcc -O2 -Wall -g -fomit-frame-pointer -o test_movntq test_movntq.c
 * and run on AMD K7!
 * (c) Kurt Garloff <garloff@suse.de>, 2001-07-28, GNU GPL
 */

#include <stdio.h>
#include <unistd.h>
#include <malloc.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define NR_TESTS 4096

void * fpu_ctx;

double c;
void trigger_fpu ()
{

	double a = 4.3;
	double b = rand()/ (float)RAND_MAX;
	c = a/b;
}

void movntq_copy_page0 (void* to, void* from)
{
	//void *d0, *d1;
	//printf ("%p <- %p\n", to, from);
	asm volatile (
		      "\n\t   prefetch (%0)"
		      "\n\t   prefetch 64(%0)"
		      "\n\t   prefetch 128(%0)"
		      "\n\t   prefetch 192(%0)"
		      "\n\t   fxsave (%3)"
		      "\n\t   prefetch 256(%0)"
		      "\n\t   movl %2, %%ecx"
		      "\n\t   fnclex"
		      "\n\t1: prefetch 320(%0)"
		      "\n\t   movq (%0),%%mm0"
		      "\n\t   movntq %%mm0,(%1)"
		      "\n\t   movq 8(%0),%%mm1"
		      "\n\t   movntq %%mm1,8(%1)"
		      "\n\t   movq 16(%0),%%mm2"
		      "\n\t   movntq %%mm2,16(%1)"
		      "\n\t   movq 24(%0),%%mm3"
		      "\n\t   movntq %%mm3,24(%1)"
		      "\n\t   movq 32(%0),%%mm4"
		      "\n\t   movntq %%mm4,32(%1)"
		      "\n\t   movq 40(%0),%%mm5"
		      "\n\t   movntq %%mm5,40(%1)"
		      "\n\t   movq 48(%0),%%mm6"
		      "\n\t   movntq %%mm6,48(%1)"
		      "\n\t   movq 56(%0),%%mm7"
		      "\n\t   movntq %%mm7,56(%1)"
		      /*"\n\t   sfence"*/
		      "\n\t   addl $64,%0"
		      "\n\t   addl $64,%1"
		      "\n\t   loop 1b"
		      "\n\t   movl $5, %%ecx"
		      "\n\t2: movq (%0),%%mm0"
		      "\n\t   movntq %%mm0,(%1)"
		      "\n\t   movq 8(%0),%%mm1"
		      "\n\t   movntq %%mm1,8(%1)"
		      "\n\t   movq 16(%0),%%mm2"
		      "\n\t   movntq %%mm2,16(%1)"
		      "\n\t   movq 24(%0),%%mm3"
		      "\n\t   movntq %%mm3,24(%1)"
		      "\n\t   movq 32(%0),%%mm4"
		      "\n\t   movntq %%mm4,32(%1)"
		      "\n\t   movq 40(%0),%%mm5"
		      "\n\t   movntq %%mm5,40(%1)"
		      "\n\t   movq 48(%0),%%mm6"
		      "\n\t   movntq %%mm6,48(%1)"
		      "\n\t   movq 56(%0),%%mm7"
		      "\n\t   movntq %%mm7,56(%1)"
		      "\n\t   addl $64,%0"
		      "\n\t   addl $64,%1"
		      "\n\t   loop 2b"
		      "\n\t   sfence"
		      "\n\t   fxrstor (%3) \n"
		      :
		      : "r" (from), "r" (to), "i" (PAGE_SIZE/64 - 5), "r" (fpu_ctx)
		      : "memory", "ecx" );
};


void movntq_zero_page0 (void* to)
{
	//void *d0;
	//printf ("%p <- 0\n", to);
	asm volatile (
		      "\n\t   fxsave (%2)"
		      "\n\t   movl %1, %%ecx"
		      "\n\t   fnclex"
		      "\n\t   pxor %%mm0, %%mm0"
		      "\n\t1: "
		      "\n\t   movntq %%mm0,(%0)"
		      "\n\t   movntq %%mm0,8(%0)"
		      "\n\t   movntq %%mm0,16(%0)"
		      "\n\t   movntq %%mm0,24(%0)"
		      "\n\t   movntq %%mm0,32(%0)"
		      "\n\t   movntq %%mm0,40(%0)"
		      "\n\t   movntq %%mm0,48(%0)"
		      "\n\t   movntq %%mm0,56(%0)"
		      /*"\n\t   sfence"*/
		      "\n\t   addl $64,%0"
		      "\n\t   loop 1b"
		      "\n\t   sfence"
		      "\n\t   fxrstor (%2) \n"
		      :
		      : "r" (to), "i" (PAGE_SIZE/64), "r" (fpu_ctx)
		      : "memory", "ecx");
}


void alloc_fpu_ctx ()
{
	fpu_ctx = (void*) memalign (256, 1024);
}

void fill_rand_page (void* mem)
{
	int* ptr = (int*) mem;
	do {
		*ptr = rand();
	} while (( (char*)(++ptr) - (char*)mem) < PAGE_SIZE);
}

void* memzero (void* mem, size_t ln)
{
	int* ptr = (int*)mem;
	int i = ln / sizeof(int);
	while (i--)
		if (*ptr++ != 0) return (void*)ptr;
	return 0;
}

int main ()
{
	void *b1, *b2, *b3; void* err; int i;
	srand (5);
	alloc_fpu_ctx ();
	trigger_fpu ();
	b3 = b1 = (void*) memalign (PAGE_SIZE, (NR_TESTS+1)*PAGE_SIZE);
	fill_rand_page (b1);
	for (i = 0; i < NR_TESTS; i++) {
		b2 = (void*) ((char*)b3 + PAGE_SIZE);
		movntq_copy_page0 (b2, b3);
		if (memcmp (b3, b2, PAGE_SIZE)) {
			printf ("Error (%i)!\n", i);
			exit (1);
		}
		movntq_zero_page0 (b3);
		if ((err = memzero (b3, PAGE_SIZE))) {
			printf ("Error! (%i) %p\n", i, err);
			exit (2);
		}
		b3 = b2;
	}
	free (b1);
	free (fpu_ctx);
	return 0;
}
			      

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-28 11:11           ` Kurt Garloff
@ 2001-07-28 11:49             ` Victor Julien
  2001-07-29  0:37             ` J. Dow
  1 sibling, 0 replies; 662+ messages in thread
From: Victor Julien @ 2001-07-28 11:49 UTC (permalink / raw)
  To: linux-kernel

Do these problems also affect Durons? I have a MSI K7T Turbo-R with Via
KT133A and I have nog problems at all. I even run my Duron 600 at 866(!)
(6,5 * 133) for several months now. Now I wonder if I could get problems
when i upgrade to a tbird at 1,4 ghz. I have a 300 PSU.

Victor


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-28 17:29       ` PEIFFER Pierre
@ 2001-07-28 12:21         ` Kurt Garloff
  2001-07-28 22:00           ` PEIFFER Pierre
  0 siblings, 1 reply; 662+ messages in thread
From: Kurt Garloff @ 2001-07-28 12:21 UTC (permalink / raw)
  To: PEIFFER Pierre; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 713 bytes --]

On Sat, Jul 28, 2001 at 01:29:04PM -0400, PEIFFER Pierre wrote:
> FYI, according to the user's manual, enabling this option "set the north
> bridge chipset timing parameters more aggressively providing higher
> system performance" (Default value is 'disable'). I can't say more about
> what it does exactly.

A lspci -vxxx of your northbridge with adn without the BIOS option will
reveal more.

Regards,
-- 
Kurt Garloff                   <kurt@garloff.de>         [Eindhoven, NL]
Physics: Plasma simulations  <K.Garloff@Phys.TUE.NL>  [TU Eindhoven, NL]
Linux: SCSI, Security          <garloff@suse.de>    [SuSE Nuernberg, DE]
 (See mail header or public key servers for PGP2 and GPG public keys.)

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-28  0:23         ` David Lang
  2001-07-28 11:11           ` Kurt Garloff
@ 2001-07-28 12:47           ` Alan Cox
  2001-07-31 19:53             ` David Lang
  1 sibling, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-28 12:47 UTC (permalink / raw)
  To: David Lang; +Cc: Alan Cox, cw, ppeiffer, linux-kernel

> I have a 1u box at my des that has two MSI boards in it with 1.2G athlons.
> at the moment they are both running 2.4.5 (athlon optimized), one box has
> no problems at all while the other dies (no video, no keyboard, etc)
> within an hour of being booted.

Same bios, same bios settings ?

lspci -vxx on both show the same settings ?


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-28  7:36     ` Hans Reiser
@ 2001-07-28 14:08       ` Chris Mason
  0 siblings, 0 replies; 662+ messages in thread
From: Chris Mason @ 2001-07-28 14:08 UTC (permalink / raw)
  To: Hans Reiser, Alan Cox; +Cc: A. Lehmann, Joshua Schmidlkofer, kernel



On Saturday, July 28, 2001 11:36:33 AM +0400 Hans Reiser <reiser@namesys.com>
wrote:

> Alan Cox wrote:
>> 
>> No doubt if Namesys ran test suites all the tail merging bug fiasco and the
>> directory/tree balance races wouldnt have happened.
> Our test suites need much improvement, but we do have them and use them.
> Can you say the same?

He's already described some of the testing they do.  I would suggest there
are better ways to use l-k bandwidth than picking a fight with redhat,
especially on topics that have already been beaten to death.  

Alan, thanks for helping to test the reiserfs patches we've been sending to
in the ac tree, we do appreciate it.

-chris


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 13:39       ` Alan Cox
  2001-07-27 13:47         ` bvermeul
@ 2001-07-28 14:16         ` Matthew Gardiner
  2001-08-08 18:42         ` Stephen C. Tweedie
  2 siblings, 0 replies; 662+ messages in thread
From: Matthew Gardiner @ 2001-07-28 14:16 UTC (permalink / raw)
  To: Alan Cox, bvermeul
  Cc: Alan Cox, Hans Reiser, Erik Mouw, Steve Kieu, Sam Thompson, kernel

On Saturday 28 July 2001 01:39, Alan Cox wrote:
> > > Putting a sync just before the insmod when developing new drivers is a
> > > good idea btw
> >
> > I've been doing that most of the time. But I sometimes forget that.
> > But as I said, it's not something I expected from a journalled
> > filesystem.
>
> You misunderstand journalling then
>
> A journalling file system can offer different levels of guarantee. With
> metadata only journalling you don't take any real performance hit but your
> file system is always consistent on reboot (consistent as in fsck would
> pass it) but it makes no guarantee that data blocks got written.
>
> Full data journalling will give you what you expect but at a performance
> hit for many applications.
>
> Alan

Just in regards to full journalling, will/is there an option in ReiserFS to 
allow it? Personally, I would much rather have full journalling, and a little 
more of a performance hit for security and reliability, than great 
performance and a higher level of risk.

Matthew Gardiner
-- 
WARNING:

This email was written on an OS using the viral 'GPL' as its license.

Please check with Bill Gates before continuing to read this email/posting.

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 14:21   ` Alan Cox
@ 2001-07-28 14:18     ` Matthew Gardiner
  2001-07-28 16:25       ` Alan Cox
                         ` (2 more replies)
  0 siblings, 3 replies; 662+ messages in thread
From: Matthew Gardiner @ 2001-07-28 14:18 UTC (permalink / raw)
  To: Alan Cox, Philip R. Auld; +Cc: Alan Cox, kernel

I've noticed that in the menuconfig there is support for the Vertias 
Journalling File System. Has there been any push for that to be a "bootable" 
filesystem so it can be used for Linux?

Matthew Gardiner
-- 
WARNING:

This email was written on an OS using the viral 'GPL' as its license.

Please check with Bill Gates before continuing to read this email/posting.

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Strange remount behaviour with ext3-2.4-0.9.4
  2001-07-27 21:08       ` Chris Wedgwood
  2001-07-27 21:23         ` Alan Cox
@ 2001-07-28 14:37         ` Kai Henningsen
  1 sibling, 0 replies; 662+ messages in thread
From: Kai Henningsen @ 2001-07-28 14:37 UTC (permalink / raw)
  To: linux-kernel

alan@lxorguk.ukuu.org.uk (Alan Cox)  wrote on 27.07.01 in <E15QF5E-0006ZL-00@the-village.bc.nu>:

> > more-or-less need need a tree-based fs and reference counting for all
> > the magic bits).  In fact, doing it as the fs layer means you could
> > have r/w snapshots with COW semantics.
>
> You dont want r/w snapshots for archiving.

Not for archiving, but when you want to run something and then throw it  
away again, for example. You could do that by just holding onto a ro  
snapshot and then replacing the rw tree with it later, but by having two  
rw trees you don't need to stop your regular operations.

For this to really be useful, you'd want it as an inheritable per-process  
thing, similar to aviro's namespace thing.

MfG Kai

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 15:46       ` Hans Reiser
  2001-07-27 17:46         ` Christoph Rohland
  2001-07-27 18:10         ` Dustin Byford
@ 2001-07-28 16:10         ` Henning P. Schmiedehausen
  2 siblings, 0 replies; 662+ messages in thread
From: Henning P. Schmiedehausen @ 2001-07-28 16:10 UTC (permalink / raw)
  To: linux-kernel

Hans Reiser <reiser@namesys.com> writes:

> Well, I am afraid this is much too vague for me to have any
> understanding of what went wrong on your system.

But you were able on this vagueness of accusing Redhat to "just shovel
software on a CD". Why? Because they didn't give you money unlike some
other vendors, e.g. SuSE?

The thing that really pisses me off about ReiserFS from time to time
is not the "FS" part...

	Regards
		Henning

-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen       -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH     hps@intermeta.de

Am Schwabachgrund 22  Fon.: 09131 / 50654-0   info@intermeta.de
D-91054 Buckenhof     Fax.: 09131 / 50654-20   

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-28 14:18     ` Matthew Gardiner
@ 2001-07-28 16:25       ` Alan Cox
  2001-07-28 16:27         ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Jeff Garzik
                           ` (2 more replies)
  2001-07-28 16:43       ` missing symbols in 2.4.7-ac2 Thomas Kotzian
  2001-07-29 11:16       ` ReiserFS / 2.4.6 / Data Corruption Christoph Hellwig
  2 siblings, 3 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-28 16:25 UTC (permalink / raw)
  To: Matthew Gardiner; +Cc: Alan Cox, Philip R. Auld, kernel

> I've noticed that in the menuconfig there is support for the Vertias 
> Journalling File System. Has there been any push for that to be a "bootable" 
> filesystem so it can be used for Linux?

The Linux freevxfs module is read only currently. Veritas apparently will be
releasing the genuine article for Linux but binary only with all the mess
that entails

^ permalink raw reply	[flat|nested] 662+ messages in thread

* binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
  2001-07-28 16:25       ` Alan Cox
@ 2001-07-28 16:27         ` Jeff Garzik
  2001-07-28 18:22           ` Andreas Dilger
  2001-07-28 19:02           ` Rik van Riel
  2001-07-28 17:44         ` Richard Gooch
  2001-07-29 10:15         ` ReiserFS / 2.4.6 / Data Corruption Matthew Gardiner
  2 siblings, 2 replies; 662+ messages in thread
From: Jeff Garzik @ 2001-07-28 16:27 UTC (permalink / raw)
  To: Alan Cox; +Cc: Matthew Gardiner, Philip R. Auld, kernel

Alan Cox wrote:
> The Linux freevxfs module is read only currently. Veritas apparently will be
> releasing the genuine article for Linux but binary only with all the mess
> that entails

Isn't that a violation of the GPL, to release binary modules?

-- 
Jeff Garzik      | "Mind if I drive?" -Sam
Building 1024    | "Not if you don't mind me clawing at the dash
MandrakeSoft     |  and shrieking like a cheerleader." -Max

^ permalink raw reply	[flat|nested] 662+ messages in thread

* missing symbols in 2.4.7-ac2
  2001-07-28 14:18     ` Matthew Gardiner
  2001-07-28 16:25       ` Alan Cox
@ 2001-07-28 16:43       ` Thomas Kotzian
  2001-07-29  1:53         ` Andrew Morton
  2001-07-29 11:16       ` ReiserFS / 2.4.6 / Data Corruption Christoph Hellwig
  2 siblings, 1 reply; 662+ messages in thread
From: Thomas Kotzian @ 2001-07-28 16:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alan Cox

when compiling with highmem = 4GB
problem in 3c59x - module:
unresolved symbol nr_free_highpages ...

ThomasK.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-27 16:50   ` ext3-2.4-0.9.4 Alan Cox
  2001-07-27 17:41     ` ext3-2.4-0.9.4 Lawrence Greenfield
@ 2001-07-28 16:46     ` Patrick J. LoPresti
  2001-07-28 19:03       ` ext3-2.4-0.9.4 Alan Cox
  2001-07-30 21:03       ` rename() (was Re: ext3-2.4-0.9.4) Anthony DeBoer
  1 sibling, 2 replies; 662+ messages in thread
From: Patrick J. LoPresti @ 2001-07-28 16:46 UTC (permalink / raw)
  To: linux-kernel, alan

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> Also if you write metadata first then you risk delivering email to
> the wrong person instead.

The MTAs do this:

    Open temp file
    Write to temp file
    fsync() temp file
    rename() temp file into mail spool
    indicate success to remote MTA

As long as rename() does not return until the metadata are committed,
this should be a reliable delivery mechanism.  After a crash, you
might end up with the temp file still there, or with the file having a
link count of two (temp file and spool file).  But you can clean up
all of this at boot time; if the temp file is gone and the spool file
is present, then the transaction was completed.

(Yes, you might not have returned the success code to the remote MTA,
but that just means you might do a double delivery.  That is an
acceptable failure mode; corrupting, losing, or misdirecting mail is
not.)

How does this scheme "risk delivering mail to the wrong person
instead"?

If you have metadata journalling, all you need for this algorithm to
work is to have rename() write to the journal before returning.  Is
this true for any of the current journalling file systems on Linux?

 - Pat

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-27 20:37     ` Chris Wedgwood
  2001-07-27 20:40       ` Alan Cox
@ 2001-07-28 17:29       ` PEIFFER Pierre
  2001-07-28 12:21         ` Kurt Garloff
  1 sibling, 1 reply; 662+ messages in thread
From: PEIFFER Pierre @ 2001-07-28 17:29 UTC (permalink / raw)
  To: linux-kernel

Chris Wedgwood a écrit :
> 
> On Fri, Jul 27, 2001 at 09:19:21PM +0100, Alan Cox wrote:
> 
>     Its heavily tied to certain motherboards. Some people found a
>     better PSU fixed it, others that altering memory settings
>     helped. And in many cases, taking it back and buying a different
>     vendors board worked.
> 
> My guess is its some kind of timing or near-miss on a signal edge, and
> the bios changes relax things so you don't miss whatever it was you
> missed before.
> 

Ok, after reading that, I've tried to see if my BIOS setting changes
were implicated or not. And I've found a winner:
Disabling option "Enhance Chip Performance" makes kernel K7-mmx routines
work fine. Enabling it causes the kernel crash at boot time... (And I
haved it enable)

FYI, according to the user's manual, enabling this option "set the north
bridge chipset timing parameters more aggressively providing higher
system performance" (Default value is 'disable'). I can't say more about
what it does exactly.

I don't know if this will help you to locate the problem, but at least,
Abit's users will be warned...

Thanks for your help !

	Pierre

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
  2001-07-28 16:25       ` Alan Cox
  2001-07-28 16:27         ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Jeff Garzik
@ 2001-07-28 17:44         ` Richard Gooch
  2001-07-29 10:15         ` ReiserFS / 2.4.6 / Data Corruption Matthew Gardiner
  2 siblings, 0 replies; 662+ messages in thread
From: Richard Gooch @ 2001-07-28 17:44 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Alan Cox, Matthew Gardiner, Philip R. Auld, kernel

Jeff Garzik writes:
> Alan Cox wrote:
> > The Linux freevxfs module is read only currently. Veritas apparently will be
> > releasing the genuine article for Linux but binary only with all the mess
> > that entails
> 
> Isn't that a violation of the GPL, to release binary modules?

Linus said it's OK. I know Alan doesn't agree, but that's life :-)
The king penguin has spoken.

I don't see the need to be bloody-minded on this issue. If a vendor
wants to go through the pain of tracking kernel drift and having to
compile modules for many different versions, then let them. Given how
much trouble it is, why bother them with legal threats?

The right answer for vendors who want to ship binary modules is to
ship an Open Source interface layer which shields the vendor from
kernel drift (since users will be able to build the interface layer if
they need to, without waiting for the vendor).
I guess that would also shield them from unhelpful legal threats.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
  2001-07-28 16:27         ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Jeff Garzik
@ 2001-07-28 18:22           ` Andreas Dilger
  2001-07-28 19:02           ` Rik van Riel
  1 sibling, 0 replies; 662+ messages in thread
From: Andreas Dilger @ 2001-07-28 18:22 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Alan Cox, Matthew Gardiner, Philip R. Auld, kernel

Jeff Garzik writes:
> Alan Cox wrote:
> > The Linux freevxfs module is read only currently. Veritas apparently will be
> > releasing the genuine article for Linux but binary only with all the mess
> > that entails
> 
> Isn't that a violation of the GPL, to release binary modules?

Noooooo....  Not this thread again.

Cheers, Andreas
-- 
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
  2001-07-28 16:27         ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Jeff Garzik
  2001-07-28 18:22           ` Andreas Dilger
@ 2001-07-28 19:02           ` Rik van Riel
  1 sibling, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-28 19:02 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Alan Cox, Matthew Gardiner, Philip R. Auld, kernel

On Sat, 28 Jul 2001, Jeff Garzik wrote:
> Alan Cox wrote:
> > The Linux freevxfs module is read only currently. Veritas apparently will be
> > releasing the genuine article for Linux but binary only with all the mess
> > that entails
>
> Isn't that a violation of the GPL, to release binary modules?

Binary modules using only the interfaces exported in /proc/ksyms
are, under certain readings of the GPL, no less "infected" by the
GPL than binary programs making system calls.

This means binary only modules are ok, as long as they don't need
changes in the kernel to work.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-28 16:46     ` ext3-2.4-0.9.4 Patrick J. LoPresti
@ 2001-07-28 19:03       ` Alan Cox
  2001-07-29  1:53         ` ext3-2.4-0.9.4 Chris Wedgwood
  2001-07-29  1:59         ` ext3-2.4-0.9.4 Andrew Morton
  2001-07-30 21:03       ` rename() (was Re: ext3-2.4-0.9.4) Anthony DeBoer
  1 sibling, 2 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-28 19:03 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel, alan

> How does this scheme "risk delivering mail to the wrong person
> instead"?

With the fsync it looks ok for most cases. It depends on the actions of
a rename touching only one disk block - which of course it doesn't do. Even
so with the fsync on a sane fs I cant see that problem occuring

> If you have metadata journalling, all you need for this algorithm to
> work is to have rename() write to the journal before returning.  Is
> this true for any of the current journalling file systems on Linux?

Ext3 I believe so, Reiserfs I would assume so but Hans can answer
definitively

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
       [not found] ` <no.id>
                     ` (32 preceding siblings ...)
  2001-07-27 23:46   ` Linx Kernel Source tree and metrics Alan Cox
@ 2001-07-28 19:08   ` Alan Cox
  2001-07-29 10:24     ` Matthew Gardiner
  2001-07-29  0:38   ` make rpm Alan Cox
                     ` (169 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-07-28 19:08 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Jeff Garzik, Alan Cox, Matthew Gardiner, Philip R. Auld, kernel

> The right answer for vendors who want to ship binary modules is to
> ship an Open Source interface layer which shields the vendor from
> kernel drift (since users will be able to build the interface layer if
> they need to, without waiting for the vendor).

As people have seen from vmware and from the ever growing piles of 
nvidia crashes the truth about binary modules in general even with glue is
pain and suffering.

Veritas have some good Linux people though, and while I'm sad they won't
open source the core of veritas they do at least appear to have the
knowledgebase to do a good job

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-28 12:21         ` Kurt Garloff
@ 2001-07-28 22:00           ` PEIFFER Pierre
  2001-07-29 20:28             ` Kurt Garloff
  0 siblings, 1 reply; 662+ messages in thread
From: PEIFFER Pierre @ 2001-07-28 22:00 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 437 bytes --]

Kurt Garloff a écrit :
> 
> A lspci -vxxx of your northbridge with adn without the BIOS option will
> reveal more.

In attached files are the result. I've only kept the (what I suppose to
be) northbridge info.
This doesn't tell me anything...

Note: both has been done after booting on  Mandrake-kernel 2.4.3 which
come with Mandrake distribution (i.e. with lot of patches and
options...) I don't know the impact on the result...

Pierre

[-- Attachment #2: lspci_opt_disable.txt --]
[-- Type: text/plain, Size: 1142 bytes --]

00:00.0 Host bridge: VIA Technologies, Inc.: Unknown device 0305 (rev 03)
	Subsystem: ABIT Computer Corp.: Unknown device a401
	Flags: bus master, medium devsel, latency 0
	Memory at d8000000 (32-bit, prefetchable) [size=64M]
	Capabilities: [a0] AGP version 2.0
	Capabilities: [c0] Power Management version 2
00: 06 11 05 03 06 00 10 a2 03 00 00 06 00 00 00 00
10: 08 00 00 d8 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 7b 14 01 a4
30: 00 00 00 00 a0 00 00 00 00 00 00 00 00 00 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 17 a3 eb b4 02 00 10 10 c0 00 08 10 10 10 10 10
60: 03 aa 02 20 e6 d6 d6 c6 51 28 43 0d 08 3f 00 00
70: d4 88 cc 0c 0e 81 62 00 01 b4 19 02 00 00 00 00
80: 0f 40 00 00 c0 00 00 00 02 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 32 00 00
a0: 02 c0 20 00 17 02 00 1f 00 00 00 00 2b 12 14 00
b0: 49 da 00 60 31 ff 80 05 67 00 00 00 00 00 00 00
c0: 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 03 03 00 22 00 00 00 00 00 00 00


[-- Attachment #3: lspci_opt_enable.txt --]
[-- Type: text/plain, Size: 1142 bytes --]

00:00.0 Host bridge: VIA Technologies, Inc.: Unknown device 0305 (rev 03)
	Subsystem: ABIT Computer Corp.: Unknown device a401
	Flags: bus master, medium devsel, latency 8
	Memory at d8000000 (32-bit, prefetchable) [size=64M]
	Capabilities: [a0] AGP version 2.0
	Capabilities: [c0] Power Management version 2
00: 06 11 05 03 06 00 10 a2 03 00 00 06 00 08 00 00
10: 08 00 00 d8 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 7b 14 01 a4
30: 00 00 00 00 a0 00 00 00 00 00 00 00 00 00 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 17 a3 eb b4 43 89 10 10 c0 00 08 10 10 10 10 10
60: 03 aa 02 20 e6 d6 d6 c6 45 28 43 0f 08 3f 00 00
70: d4 88 cc 0c 0e 81 62 00 01 b4 19 02 00 00 00 00
80: 0f 40 00 00 c0 00 00 00 02 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 32 00 00
a0: 02 c0 20 00 17 02 00 1f 00 00 00 00 2f 12 14 00
b0: 49 da 88 60 31 ff 80 05 67 00 00 00 00 00 00 00
c0: 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 03 03 00 22 00 00 00 00 00 91 06


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 17:58                             ` ext3-2.4-0.9.4 Hans Reiser
@ 2001-07-28 22:45                               ` Matthias Andree
  2001-07-28 23:50                                 ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-29 13:42                                 ` ext3-2.4-0.9.4 Hans Reiser
  0 siblings, 2 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-28 22:45 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Andre Pang, Larry McVoy, linux-kernel

On Thu, 26 Jul 2001, Hans Reiser wrote:

> No, Linus is right and the MTA guys are just wrong.  The mailers are
> the place to fix things, not the kernel.  If the mailer guys want to
> depend on the kernel being stupidly designed, tough.  Someone should
> fix their mailer code and then it would run faster on Linux than on
> any other platform.

Well, some systems are even documented that way, so there's nothing with
"depend on the kernel being stupidly designed", but "depend on what
mount(8) says".

MTA authors don't play games, they also write that their software relies
on this behaviour, as laid out.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-27 16:57                               ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-28 23:15                                 ` Matthias Andree
  2001-07-28 23:47                                   ` ext3-2.4-0.9.4 Rik van Riel
  0 siblings, 1 reply; 662+ messages in thread
From: Matthias Andree @ 2001-07-28 23:15 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Lawrence Greenfield, linux-kernel

On Fri, 27 Jul 2001, Rik van Riel wrote:

> The stuff you people seem to insist on, however, most
> definately isn't part of the defined set of semantics.

And even if it's "inherited wisdom", you cannot simply tell those people
"don't rely on that" if - as claimed - you can't even force a link() to
disk.

> If you believe otherwise, feel free to point out the
> relevant sections in POSIX / SuS / ...

The standard is only useful if it specifies how to get data safely on
disk - it is quite explicit for fsync(), but you evidently cannot
fsync() a link().

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-28 23:15                                 ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-28 23:47                                   ` Rik van Riel
  2001-07-29  0:08                                     ` ext3-2.4-0.9.4 Matthias Andree
  0 siblings, 1 reply; 662+ messages in thread
From: Rik van Riel @ 2001-07-28 23:47 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Lawrence Greenfield, linux-kernel

On Sun, 29 Jul 2001, Matthias Andree wrote:

> The standard is only useful if it specifies how to get data safely on
> disk - it is quite explicit for fsync(), but you evidently cannot
> fsync() a link().

As Linus said, fsync() on the directory.

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-28 22:45                               ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-28 23:50                                 ` Rik van Riel
  2001-07-29 13:42                                 ` ext3-2.4-0.9.4 Hans Reiser
  1 sibling, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-28 23:50 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Hans Reiser, Andre Pang, Larry McVoy, linux-kernel

On Sun, 29 Jul 2001, Matthias Andree wrote:
> On Thu, 26 Jul 2001, Hans Reiser wrote:
>
> > No, Linus is right and the MTA guys are just wrong.  The mailers are
> > the place to fix things, not the kernel.  If the mailer guys want to
> > depend on the kernel being stupidly designed, tough.  Someone should
> > fix their mailer code and then it would run faster on Linux than on
> > any other platform.
>
> Well, some systems are even documented that way, so there's nothing
> with "depend on the kernel being stupidly designed", but "depend on
> what mount(8) says".

The key word here is "some systems".

> MTA authors don't play games, they also write that their software
> relies on this behaviour, as laid out.

"MTA authors don't play games" ?!?!

I wonder how that explains things like QMQP or the
next-to-useless bounce messages generated by Notes ;)

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-28 23:47                                   ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-29  0:08                                     ` Matthias Andree
  2001-07-29  2:51                                       ` ext3-2.4-0.9.4 Mike Touloumtzis
  2001-07-29 14:00                                       ` ext3-2.4-0.9.4 Rik van Riel
  0 siblings, 2 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-29  0:08 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Matthias Andree, Lawrence Greenfield, linux-kernel

On Sat, 28 Jul 2001, Rik van Riel wrote:

> > The standard is only useful if it specifies how to get data safely on
> > disk - it is quite explicit for fsync(), but you evidently cannot
> > fsync() a link().
> 
> As Linus said, fsync() on the directory.

Relying on that to work on other operating systems is no better than
demanding synchronous meta data writes: relying on undocumented
behaviour.

If we spake about Linux-specific applications, that'd be okay, but we
speak about portable applications, and the diversity is bigger than
useful. Speed is not the only problem the OS has to solve.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-28 11:11           ` Kurt Garloff
  2001-07-28 11:49             ` Victor Julien
@ 2001-07-29  0:37             ` J. Dow
  1 sibling, 0 replies; 662+ messages in thread
From: J. Dow @ 2001-07-29  0:37 UTC (permalink / raw)
  To: Kurt Garloff, David Lang
  Cc: Alan Cox, cw, ppeiffer, linux-kernel, Arjan van de Ven, Chris Brady

From: "Kurt Garloff" <garloff@suse.de>

On Fri, Jul 27, 2001 at 05:23:07PM -0700, David Lang wrote:
> I have a 1u box at my des that has two MSI boards in it with 1.2G athlons.
> at the moment they are both running 2.4.5 (athlon optimized), one box has
> no problems at all while the other dies (no video, no keyboard, etc)
> within an hour of being booted.

Somebody told he had the same MoBo already replaced a couple of times ...

Kurt, et al, I have been following this VIA vs Linux thing for some time
now. (My "big machine" is an Athlon based system. So it interests me.)
Comments have been made about the size of power supply needed to keep these
systems happy with 400 watts coming up in discussions frequently. But if you
pause to think on it a few minutes you begin to wonder about this concept.
The RAM runs at about 3.3 volts. The CPU core runs at about 1.7v (in my case.)
So both of these are running off of power supplies on the motherboards that
take the 5 volts down to something reasonable. If the problem is inadequate
power supply AND it is more of a problem with some motherboards than others
I look for the volts. Where are the losses which could cause this. One source
is the connector from the power supply to the motherboard. (This was a chronic
problem with A2000s, for example.) I don't see newer style connectors that
have less contact resistance on any systems. That is probably a factor. Since
the problem is greater with some boards than others I suspect that the
auxilliary power sipplies on the motherboards are better for some boards than
for others. Somebody with hardware access to a sufficient variety of mother-
boards should survey this. Do they all use exactly the same power supply parts?

Another issue is the speed of these systems. And the Athlon problem seems to
peak when driving the various buses at their peaks. RF crosstalk is an issue
that a lot of digital designers claim to understand when they design (and
model) their circuits. Now, I built my first circuit analysis program back
in about 1975. Results of that work fly on GPS satellites today. Since it was
MY program I used for design I was acutely aware of its deficiencies as well
as the modeling deficiencies. At some point in the analysis you cut a corner
or two in order to make the calculations tractable. You do not manage to get
all the "strays" into the models. What I ams saying is that board layout is
another area where problems may exist.

These are not thigs software settings in the VIA chips can cure. On another
mailinglist catering to developers for very exotic video cards some problems
with the latest INTEL based motherboards are appearing. (DigiSuite:E and its
kith and kin drive the PCI bus very hard.) I suspect we have a situation not
properly anticipated in modeling the backplanes on all these boards. And until
the designers can wrap their minds around the entire problem the software
solution may simply be, in the words of an old philospher, "Slow down! You move
too fast." At the same time someone with suitable test equipment needs to look
for voltage glitches out of the motherboard regulators and we need to develop
software tools for "forcing" the suspected crosstalk and ideally characterising
it with regards to data passing on the bus at the time of the bad data transfer.
The software based fixes seem to be shooting at black cats in a coal mine
without a flashlight or IR goggles.

{^_^}    Joanne Dow, jdow@earthlink.net



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: make rpm
       [not found] ` <no.id>
                     ` (33 preceding siblings ...)
  2001-07-28 19:08   ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Alan Cox
@ 2001-07-29  0:38   ` Alan Cox
  2001-07-29  7:05   ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Richard Gooch
                     ` (168 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-29  0:38 UTC (permalink / raw)
  To: Horst von Brand; +Cc: Alan Cox, linux-kernel

> Alan Cox <alan@lxorguk.ukuu.org.uk> said:
> > I've been meaning to do this one for a while and I now have it working so
> > that with my current -ac kernel working tree I can type
> > 
> > 	make rpm
> > 
> > and out puts kernel-2.4.7ac3-1.i386.rpm
> 
> Great idea!
> 
> Just the bunch of "echo this or that" is ugly as sin... why not a
> kernel.spec template that gets its version &c substituted by sed(1) or
> something?

Well for one because its easier to hack on at the moment. I still need to
finish up packing the right pieces, and also checking if the user
has an /sbin/installkernel and also if they are not using GRUB need to then
rerun lilo

Alan


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: missing symbols in 2.4.7-ac2
  2001-07-28 16:43       ` missing symbols in 2.4.7-ac2 Thomas Kotzian
@ 2001-07-29  1:53         ` Andrew Morton
  2001-07-29 10:21           ` Hugh Dickins
  0 siblings, 1 reply; 662+ messages in thread
From: Andrew Morton @ 2001-07-29  1:53 UTC (permalink / raw)
  To: Thomas Kotzian; +Cc: linux-kernel, Alan Cox

Thomas Kotzian wrote:
> 
> when compiling with highmem = 4GB
> problem in 3c59x - module:
> unresolved symbol nr_free_highpages ...
> 

Ah.  Sorry.

Alan, is it OK to export this symbol?


--- linux-2.4.7-ac1/kernel/ksyms.c	Sun Jul 29 11:43:01 2001
+++ ac/kernel/ksyms.c	Sun Jul 29 11:43:05 2001
@@ -122,6 +122,7 @@ EXPORT_SYMBOL(kmap_high);
 EXPORT_SYMBOL(kunmap_high);
 EXPORT_SYMBOL(highmem_start_page);
 EXPORT_SYMBOL(create_bounce);
+EXPORT_SYMBOL(nr_free_highpages);
 #endif
 
 /* filesystem internal functions */

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-28 19:03       ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-29  1:53         ` Chris Wedgwood
  2001-07-30  0:32           ` ext3-2.4-0.9.4 Chris Mason
  2001-07-29  1:59         ` ext3-2.4-0.9.4 Andrew Morton
  1 sibling, 1 reply; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-29  1:53 UTC (permalink / raw)
  To: Alan Cox; +Cc: Patrick J. LoPresti, linux-kernel, Chris Mason

On Sat, Jul 28, 2001 at 08:03:37PM +0100, Alan Cox wrote:

    Ext3 I believe so, Reiserfs I would assume so but Hans can answer
    definitively

Reiserfs does not, nor are creates or unlink operations synchronous.

For MTAs it just happens to work: if you fsync the way transactions
are written means the metadata for the dirtectories is written as part
of the transaction --- but I think this is a quirk and not by design?

Chris?




  --cw

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-28 19:03       ` ext3-2.4-0.9.4 Alan Cox
  2001-07-29  1:53         ` ext3-2.4-0.9.4 Chris Wedgwood
@ 2001-07-29  1:59         ` Andrew Morton
  1 sibling, 0 replies; 662+ messages in thread
From: Andrew Morton @ 2001-07-29  1:59 UTC (permalink / raw)
  To: Alan Cox; +Cc: Patrick J. LoPresti, linux-kernel

Alan Cox wrote:
> 
>...
> > If you have metadata journalling, all you need for this algorithm to
> > work is to have rename() write to the journal before returning.  Is
> > this true for any of the current journalling file systems on Linux?
> 
> Ext3 I believe so, Reiserfs I would assume so but Hans can answer
> definitively

For ext3: this is true if something forces a commit.  Apart from data in
`-o data=writeback' mode, a commit syncs the entire filesystem.
Things which force a commit include:

- completing a write() on an O_SYNC file.
- Performing any metadata operation on a `chattr +S' object
- Performing any metadata operation on an object on a `mount -o sync'
  filesystem.

In `data=journal' or `data=ordered' mode, any of these things will
commit everything to non-volatile storage.

-

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-29  0:08                                     ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-29  2:51                                       ` Mike Touloumtzis
  2001-07-29  9:28                                         ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-29 14:00                                       ` ext3-2.4-0.9.4 Rik van Riel
  1 sibling, 1 reply; 662+ messages in thread
From: Mike Touloumtzis @ 2001-07-29  2:51 UTC (permalink / raw)
  To: linux-kernel

On Sun, Jul 29, 2001 at 02:08:12AM +0200, Matthias Andree wrote:
> On Sat, 28 Jul 2001, Rik van Riel wrote:
> 
> > As Linus said, fsync() on the directory.
> 
> Relying on that to work on other operating systems is no better than
> demanding synchronous meta data writes: relying on undocumented
> behaviour.

You are blurring the boundaries between "undocumented behavior" and
"OS-specific behavior".  fsync() on a directory to sync metadata is a
defined (according to my copy of fsync(2)), Linux-specific behavior.
It is also very reasonable IMHO and in keeping with the traditional
Unix notion of directories as lists of files.

I argue that using defined Linux behavior to implement what you want
on Linux systems _is_ better than relying on undocumented behavior,
and I think most people would agree.  If you don't do this you have
not really ported the software to Linux; you instead have some
standards compliant software that "kinda usually works on Linux".
You could argue that no one should localize their software to
different versions of Unix, but you would be by far in the minority.

http://www.google.com/search?q=autoconf

Writing portable Unix software has always meant some degree
of system-specific accomodation.  It's a bummer but it's life;
otherwise Unix wouldn't evolve.

miket

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-27 20:40       ` Alan Cox
                           ` (2 preceding siblings ...)
  2001-07-28  0:23         ` David Lang
@ 2001-07-29  4:03         ` Gav
  2001-07-29 16:10           ` Mike Frisch
  2001-07-30  7:15           ` Steffen Persvold
  3 siblings, 2 replies; 662+ messages in thread
From: Gav @ 2001-07-29  4:03 UTC (permalink / raw)
  To: linux-kernel

On Friday 27 July 2001 20:40, Alan Cox wrote:

> > On Fri, Jul 27, 2001 at 09:19:21PM +0100, Alan Cox wrote:
> >     Its heavily tied to certain motherboards. Some people found a
> >     better PSU fixed it, others that altering memory settings
> >     helped. And in many cases, taking it back and buying a different
> >     vendors board worked.
> >
> > Does anyone know *why* stuff breaks? surely VIA do as they have a fix
> > for (some, all?) cases of breakage?
>
> At the moment the big problem is I don't have enough reliable info to
> see patterns that I can give to VIA for study. VIAs fixes for board
> problems are for the fifo problem normally seen with the 686B and SB Live
> but sometimes in other cases.
>
> (and it seems also we have a few via + promise weirdnesses on all sorts of
>  boards not yet explained)

Just FYI, I've been running 2.4.7-pre6 for a few weeks on a Abit-KT7-a 
(hpt370) that uses the KT133/VIA chipset, with a 1.33Ghz Athlon and the 
kernel compiled for an Athlon. 

The machine is now rock solid. I've given it the usual tests, k7burn for 5 
hours, cp'ing 30G+ across drives a few times etc, and all is good.

The broken sound (crackle/pop) with my SB128PCI (same problem as SBLive) 
still didn't go away though, but enabling PCI DRAM PREFETCH on the VT8363 
Bus-PCI Bridge does cure it. This took me a while to find as I can't set this 
in my bios, but powertweak came to the rescue.

While DRAM Prefetch is supposed to be an option to increase performance, my 
sound is totally unusable without this set. I've heard numerous people 
explain the same problem and it would be interesting to find out if this 
cures their sound troubles too. If this is the case, is this something that 
belongs in quirks, or is it too hardware specific? and would enabling this by 
default hurt anything anyway? Or is this just masking the underlaying problem 
?

-- Regards, Gavin Baker


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
       [not found] ` <no.id>
                     ` (34 preceding siblings ...)
  2001-07-29  0:38   ` make rpm Alan Cox
@ 2001-07-29  7:05   ` Richard Gooch
  2001-07-29 10:00     ` Chris Wedgwood
  2001-08-02  0:20   ` 2.4.2 ext2fs corruption status Alan Cox
                     ` (167 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Richard Gooch @ 2001-07-29  7:05 UTC (permalink / raw)
  To: Alan Cox; +Cc: Jeff Garzik, Matthew Gardiner, Philip R. Auld, kernel

Alan Cox writes:
> > The right answer for vendors who want to ship binary modules is to
> > ship an Open Source interface layer which shields the vendor from
> > kernel drift (since users will be able to build the interface layer if
> > they need to, without waiting for the vendor).
> 
> As people have seen from vmware and from the ever growing piles of
> nvidia crashes the truth about binary modules in general even with
> glue is pain and suffering.

Sure. If you load a binary module (shim layer or not), you don't get
community support. So vendors are digging their own shitpile by
shipping binary-only drivers. I just don't see the need to shove them
in the back while they do it.

Besides, if someone can make a lot of money shipping binary drivers,
then they can afford the support costs, so it may well be a viable
revenue model for them (at the very least, programmers need to eat
too).

> Veritas have some good Linux people though, and while I'm sad they
> won't open source the core of veritas they do at least appear to
> have the knowledgebase to do a good job

Yeah, I'd rather see all source open. But that's an ideal world. In
the meantime, many people want $$$. One of the great things about
Linux is that it is open and allows different funding models. The
success of Linux is due to the openness, not some cool technological
feature.

Open Source pushes the "innovation envelope". Eventually, the "core"
(what's now the basic OS) which isn't worth selling grows outwards,
consuming areas where it used to be profitable to sell software. So it
forces companies to innovate or die, leading to a dynamic industry.
That is good for both Society and Industry (as seen by the respective
idealogical poles).

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-29  2:51                                       ` ext3-2.4-0.9.4 Mike Touloumtzis
@ 2001-07-29  9:28                                         ` Matthias Andree
  2001-07-29 14:16                                           ` ext3-2.4-0.9.4 Rik van Riel
                                                             ` (2 more replies)
  0 siblings, 3 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-29  9:28 UTC (permalink / raw)
  To: linux-kernel

On Sat, 28 Jul 2001, Mike Touloumtzis wrote:

> You are blurring the boundaries between "undocumented behavior" and
> "OS-specific behavior".  fsync() on a directory to sync metadata is a
> defined (according to my copy of fsync(2)), Linux-specific behavior.
> It is also very reasonable IMHO and in keeping with the traditional
> Unix notion of directories as lists of files.

No-one claims that fsync() the directory is a bad interface - it's
non-portable however. Actually, chattr +S is well-documented - it just
doesn't work on ReiserFS or Minix for now, and it may be unnecessarily
slow on ext2.

As pointed out more than once, "synchronous meta data" is documented e.
g.  for FreeBSD, so in at least these two cases, the box relies on
documented behaviour.

> http://www.google.com/search?q=autoconf
> 
> Writing portable Unix software has always meant some degree
> of system-specific accomodation.  It's a bummer but it's life;
> otherwise Unix wouldn't evolve.

How can autoconf figure if you need to fsync() the directory? Apart from
that, which Unix MTA uses autoconf?

Remember, the whole discussion is about getting rid of the need for
chattr +S and offering the admin the chance to mount or flag a directory
for synchronous meta data updates.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
  2001-07-29  7:05   ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Richard Gooch
@ 2001-07-29 10:00     ` Chris Wedgwood
  2001-07-31 15:18       ` Florian Weimer
  0 siblings, 1 reply; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-29 10:00 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Alan Cox, Jeff Garzik, Matthew Gardiner, Philip R. Auld, kernel

On Sun, Jul 29, 2001 at 03:05:06AM -0400, Richard Gooch wrote:

    Yeah, I'd rather see all source open. But that's an ideal world. In
    the meantime, many people want $$$. One of the great things about
    Linux is that it is open and allows different funding models. The
    success of Linux is due to the openness, not some cool technological
    feature.

People all need to appreciate sometimes vendors cannot released open
source drivers even if they wanted too.  Sometimes vendors have the
ability to released binary only drivers which are derived in part from
source-code which they license --- but cannot share.

This is also the case for various SCSI cards and such like, firmware
is provided binary-only because the source for the firmware isn't
something that can be distributed.



  --cw

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-28 16:25       ` Alan Cox
  2001-07-28 16:27         ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Jeff Garzik
  2001-07-28 17:44         ` Richard Gooch
@ 2001-07-29 10:15         ` Matthew Gardiner
  2001-07-29 11:10           ` Chris Wedgwood
  2 siblings, 1 reply; 662+ messages in thread
From: Matthew Gardiner @ 2001-07-29 10:15 UTC (permalink / raw)
  To: Alan Cox, Matthew Gardiner; +Cc: Alan Cox, Philip R. Auld, kernel

On Sunday 29 July 2001 04:25, Alan Cox wrote:
> > I've noticed that in the menuconfig there is support for the Vertias
> > Journalling File System. Has there been any push for that to be a
> > "bootable" filesystem so it can be used for Linux?
>
> The Linux freevxfs module is read only currently. Veritas apparently will
> be releasing the genuine article for Linux but binary only with all the
> mess that entails

tsk tsk tsk. A bit disappointing that Vertias has taken that approach. 
However, even still, reiserFS is pretty awsome. Extremely fast and space 
efficient, esp on a 60gig drive ;)

Matthew Gardiner
-- 
WARNING:

This email was written on an OS using the viral 'GPL' as its license.

Please check with Bill Gates before continuing to read this email/posting.

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: missing symbols in 2.4.7-ac2
  2001-07-29  1:53         ` Andrew Morton
@ 2001-07-29 10:21           ` Hugh Dickins
  2001-07-29 10:48             ` Andrew Morton
  0 siblings, 1 reply; 662+ messages in thread
From: Hugh Dickins @ 2001-07-29 10:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Thomas Kotzian, linux-kernel, Alan Cox

On Sun, 29 Jul 2001, Andrew Morton wrote:
> Thomas Kotzian wrote:
> > when compiling with highmem = 4GB
> > problem in 3c59x - module:
> > unresolved symbol nr_free_highpages ...
> 
> Ah.  Sorry.
> Alan, is it OK to export this symbol?

Laconic version: "Probably not: si_meminfo() is your friend".

Verbose version:
I don't think you really want nr_free_highpages(), that's transient
info - it won't usually fall so low as 0 if there is highmem, but do
you want to rely on that?  And nr_free_highpages() is CONFIG_HIGHMEM
only, so you'd need #ifdef CONFIG_HIGHMEM around its call in 3c59x.c.

But si_meminfo() is already exported: I think sysinfo.totalhigh is
what you want to check; if I'm wrong, and you really are interested
in whether there are currently free highpages, sysinfo.freehigh
gives you that too without needing a new export.

(I think there probably will be a need for new interfaces
to export more per-zone memory info, but not for this.)

Hugh

--- linux-2.4.7-ac2/drivers/net/3c59x.c	Sat Jul 28 07:12:03 2001
+++ linux/drivers/net/3c59x.c	Sun Jul 29 10:53:31 2001
@@ -1299,8 +1299,11 @@
 	/* The 3c59x-specific entries in the device structure. */
 	dev->open = vortex_open;
 	if (vp->full_bus_master_tx) {
+		struct sysinfo sysinfo;
+
 		dev->hard_start_xmit = boomerang_start_xmit;
-		if (nr_free_highpages() == 0) {
+		si_meminfo(&sysinfo);
+		if (sysinfo.totalhigh == 0) {
 			/* Actually, it still should work with iommu. */
 			dev->features |= NETIF_F_SG;
 		}


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
  2001-07-28 19:08   ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Alan Cox
@ 2001-07-29 10:24     ` Matthew Gardiner
  2001-07-29 11:07       ` Chris Wedgwood
  2001-07-31 15:19       ` Florian Weimer
  0 siblings, 2 replies; 662+ messages in thread
From: Matthew Gardiner @ 2001-07-29 10:24 UTC (permalink / raw)
  To: Alan Cox, Richard Gooch
  Cc: Jeff Garzik, Alan Cox, Matthew Gardiner, Philip R. Auld, kernel

On Sunday 29 July 2001 07:08, Alan Cox wrote:
> > The right answer for vendors who want to ship binary modules is to
> > ship an Open Source interface layer which shields the vendor from
> > kernel drift (since users will be able to build the interface layer if
> > they need to, without waiting for the vendor).
>
> As people have seen from vmware and from the ever growing piles of
> nvidia crashes the truth about binary modules in general even with glue is
> pain and suffering.
>
> Veritas have some good Linux people though, and while I'm sad they won't
> open source the core of veritas they do at least appear to have the
> knowledgebase to do a good job

1. With the file system, why not charge for commercial use?
2. Regards to hardware manufacturers, what have the got to lose from 
publishing the specs? nothing.

Matthew Gardiner
-- 
WARNING:

This email was written on an OS using the viral 'GPL' as its license.

Please check with Bill Gates before continuing to read this email/posting.

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: missing symbols in 2.4.7-ac2
  2001-07-29 10:21           ` Hugh Dickins
@ 2001-07-29 10:48             ` Andrew Morton
  0 siblings, 0 replies; 662+ messages in thread
From: Andrew Morton @ 2001-07-29 10:48 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Thomas Kotzian, linux-kernel, Alan Cox

Hugh Dickins wrote:
> 
> On Sun, 29 Jul 2001, Andrew Morton wrote:
> > Thomas Kotzian wrote:
> > > when compiling with highmem = 4GB
> > > problem in 3c59x - module:
> > > unresolved symbol nr_free_highpages ...
> >
> > Ah.  Sorry.
> > Alan, is it OK to export this symbol?
> 
> Laconic version: "Probably not: si_meminfo() is your friend".

:)

> Verbose version:
> I don't think you really want nr_free_highpages(), that's transient
> info - it won't usually fall so low as 0 if there is highmem, but do
> you want to rely on that?

Prefer not to.  We want to know "does the system have any highmem
pages".  I didn't know about sysinfo.totalhigh, so I used
nr_free_highpages(), which answers the question "does the system
have any free high pages right now".

It's good enough - if we get it wrong (system was very low on memory
when the driver was initialised) the driver will work - it just won't
perform zerocopy optimisations.

>  And nr_free_highpages() is CONFIG_HIGHMEM
> only, so you'd need #ifdef CONFIG_HIGHMEM around its call in 3c59x.c.

That's OK actually - nr_free_highpages() evaluates to constant zero if
CONFIG_HIGHMEM isn't defined.


--- linux-2.4.7-ac2/drivers/net/3c59x.c Sat Jul 28 07:12:03 2001
+++ linux/drivers/net/3c59x.c   Sun Jul 29 10:53:31 2001
@@ -1299,8 +1299,11 @@
        /* The 3c59x-specific entries in the device structure. */
        dev->open = vortex_open;
        if (vp->full_bus_master_tx) {
+               struct sysinfo sysinfo;
+
                dev->hard_start_xmit = boomerang_start_xmit;
-               if (nr_free_highpages() == 0) {
+               si_meminfo(&sysinfo);
+               if (sysinfo.totalhigh == 0) {
                        /* Actually, it still should work with iommu. */
                        dev->features |= NETIF_F_SG;
                }

Much preferable!  Thanks.

I've checked all the architectures.  Looks fine, works OK.  Alan, please
apply this one.

-

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
  2001-07-29 10:24     ` Matthew Gardiner
@ 2001-07-29 11:07       ` Chris Wedgwood
  2001-07-31 15:19       ` Florian Weimer
  1 sibling, 0 replies; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-29 11:07 UTC (permalink / raw)
  To: Matthew Gardiner
  Cc: Alan Cox, Richard Gooch, Jeff Garzik, Philip R. Auld, kernel

On Sun, Jul 29, 2001 at 10:24:11PM +1200, Matthew Gardiner wrote:

    1. With the file system, why not charge for commercial use?

Maybe they will... but it's not something they could do under the GPL.

    2. Regards to hardware manufacturers, what have the got to lose from
       publishing the specs? nothing.

Many manufactures will claim otherwise... for some hardware products,
the useful life-cycles is only six months, if you can't make money
within that period of time, the product never will --- so there are
arguments for keeping things vague for just a little while.

Also, some hardware vendors cannot release specifications because they
don't own all the IP here either (see my earlier comments) or are part
of some kind of cartel/consortium which is overrun by labatomized
lawyers, the DVD people for example.




  --cw

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-29 10:15         ` ReiserFS / 2.4.6 / Data Corruption Matthew Gardiner
@ 2001-07-29 11:10           ` Chris Wedgwood
  2001-07-29 14:28             ` Luigi Genoni
  0 siblings, 1 reply; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-29 11:10 UTC (permalink / raw)
  To: Matthew Gardiner; +Cc: Alan Cox, Philip R. Auld, kernel

On Sun, Jul 29, 2001 at 10:15:03PM +1200, Matthew Gardiner wrote:

    tsk tsk tsk. A bit disappointing that Vertias has taken that approach. 
    However, even still, reiserFS is pretty awsome. Extremely fast and space 
    efficient, esp on a 60gig drive ;)

Why "tsk tsk tsk" ?  If reiserfs suits you, use it --- you need never
go near VXFS.

Personally, even though I use reiserfs, I am of the opinion that XFS,
and VXFS and both superior, especially when you include volume
management.  Time will show whether or not these very well designed
file-systems are suitable under Linux though, as reiserfs has a
considerable head start.



  --cw

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-28 14:18     ` Matthew Gardiner
  2001-07-28 16:25       ` Alan Cox
  2001-07-28 16:43       ` missing symbols in 2.4.7-ac2 Thomas Kotzian
@ 2001-07-29 11:16       ` Christoph Hellwig
  2 siblings, 0 replies; 662+ messages in thread
From: Christoph Hellwig @ 2001-07-29 11:16 UTC (permalink / raw)
  To: Matthew Gardiner; +Cc: kernel

In article <01072902183404.02683@kiwiunixman.nodomain.nowhere> you wrote:
> I've noticed that in the menuconfig there is support for the Vertias 
> Journalling File System. Has there been any push for that to be a "bootable" 
> filesystem so it can be used for Linux?

I don't see any reason wht it shoudn't be bootable, I just haven't tested it
yet.  If you want to try it, please follow the below steps:

1) Get one of these CD-ROM readonly distribution
2) Copy it over NFS to a UnixWare (or any other x86 System with VxFS)
3) Make a VxFS system big enough for the distribution
4) Copy the Distribution on the VxFS filesystem

And now the difficult part:

5) Adjust the ondisk dev_t to match Linux's major/minor split instead
   of SVR4's.  This can either be done by creating (bogus) SVR4 device
   nodes that are valid Linux ones when read by Linux or by doing this
   with fsdb after they were created.

If you have success with this sppropeach please drop me a mail - I'll add
it to the freevxfs docs then.

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-28 22:45                               ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-28 23:50                                 ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-29 13:42                                 ` Hans Reiser
  1 sibling, 0 replies; 662+ messages in thread
From: Hans Reiser @ 2001-07-29 13:42 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Andre Pang, Larry McVoy, linux-kernel

Matthias Andree wrote:
> 
> On Thu, 26 Jul 2001, Hans Reiser wrote:
> 
> > No, Linus is right and the MTA guys are just wrong.  The mailers are
> > the place to fix things, not the kernel.  If the mailer guys want to
> > depend on the kernel being stupidly designed, tough.  Someone should
> > fix their mailer code and then it would run faster on Linux than on
> > any other platform.
> 
> Well, some systems are even documented that way, so there's nothing with
> "depend on the kernel being stupidly designed", but "depend on what
> mount(8) says".
> 
> MTA authors don't play games, they also write that their software relies
> on this behaviour, as laid out.
> 
> --
> Matthias Andree
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
Documenting their code won't make it fast or well designed.

Hans

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-29  0:08                                     ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-29  2:51                                       ` ext3-2.4-0.9.4 Mike Touloumtzis
@ 2001-07-29 14:00                                       ` Rik van Riel
  1 sibling, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-29 14:00 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Lawrence Greenfield, linux-kernel

On Sun, 29 Jul 2001, Matthias Andree wrote:
> On Sat, 28 Jul 2001, Rik van Riel wrote:
>
> > > The standard is only useful if it specifies how to get data safely on
> > > disk - it is quite explicit for fsync(), but you evidently cannot
> > > fsync() a link().
> >
> > As Linus said, fsync() on the directory.
>
> Relying on that to work on other operating systems is no better than
> demanding synchronous meta data writes: relying on undocumented
> behaviour.
>
> If we spake about Linux-specific applications, that'd be okay, but we
> speak about portable applications, and the diversity is bigger than
> useful. Speed is not the only problem the OS has to solve.

I guess many MTAs have a small libc inside of them exactly
in order to handle things like this without fouling up the
core code too much.

Time to make your favorite MTA use link_slowly()  ;)

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-29  9:28                                         ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-29 14:16                                           ` Rik van Riel
  2001-07-29 23:19                                           ` ext3-2.4-0.9.4 Mike Touloumtzis
  2001-07-30 14:41                                           ` ext3-2.4-0.9.4 Ketil Froyn
  2 siblings, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-29 14:16 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

On Sun, 29 Jul 2001, Matthias Andree wrote:

> How can autoconf figure if you need to fsync() the directory? Apart
> from that, which Unix MTA uses autoconf?

Zmailer uses autoconf, Exim also has some nice
tool to make itself build for the right OS using
the right interfaces.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-29 11:10           ` Chris Wedgwood
@ 2001-07-29 14:28             ` Luigi Genoni
  0 siblings, 0 replies; 662+ messages in thread
From: Luigi Genoni @ 2001-07-29 14:28 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Matthew Gardiner, Alan Cox, Philip R. Auld, kernel



On Sun, 29 Jul 2001, Chris Wedgwood wrote:

> On Sun, Jul 29, 2001 at 10:15:03PM +1200, Matthew Gardiner wrote:
>
>     tsk tsk tsk. A bit disappointing that Vertias has taken that approach.
>     However, even still, reiserFS is pretty awsome. Extremely fast and space
>     efficient, esp on a 60gig drive ;)
>
> Why "tsk tsk tsk" ?  If reiserfs suits you, use it --- you need never
> go near VXFS.
It depends, for example if you have to manage a farm (let's say 800
systems) with many Unixes
around, where solaris is the 70% of your installed basis, then
veritas (mainly the VM) could be a solution to keep an uniform
environment. That is a good thing if your sysadmin staff is composed also
by people without a real high skill.
>
> Personally, even though I use reiserfs, I am of the opinion that XFS,
> and VXFS and both superior, especially when you include volume
> management.
a journaling filesystem and a volume manager are two complementary
and usefull things, but anyway are  different things.
While i do agree that Linux LVM is still not complitelly usable in a
production environment, (but anyway ELVM from IBM is somehow immmature),
and some details of its design are not completely, how can I say...,
suitable for future HW developments, I found reiserFS tecnology to be
really interesting. On a technological point of view reiserFS is much
more advanced in front of any other journaled FS around.

I still have to see vxfs with Linux, but i saw it under solaris and HP-UX
(i think I used all journaled aroung, jfs, xfs, reiserFS, ext3, vxfs, gfs,
on all unixes i could), seeing it to too much slow on high end scsi HW,
and XFS on my origin 2000 (8 processor) sometimes takes one CPU just to
manage journaling under heavy I/O. Under Linux xfs is maybe better that
under Irix (!!!???), but its tecnology was thinked for other kind of HW,
and an experienced sysadmin can "feel" this.
> Time will show whether or not these very well designed
> file-systems are suitable under Linux though, as reiserfs has a
> considerable head start.
Yes, time will show. reiserFS can have a wonderfull future, better than
ext3 if it will be mature before ext3, worse if after. But for Linux
jfs and xfs are interesting right now, just untill native journaled will
be ready, then i would bet everyone will stay with reiserFS or ext3, not
considering any other solution.

Luigi



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-29  4:03         ` Gav
@ 2001-07-29 16:10           ` Mike Frisch
  2001-07-30  7:15           ` Steffen Persvold
  1 sibling, 0 replies; 662+ messages in thread
From: Mike Frisch @ 2001-07-29 16:10 UTC (permalink / raw)
  To: linux-kernel

On Sun, Jul 29, 2001 at 04:03:29AM +0000, Gav wrote:
> The machine is now rock solid. I've given it the usual tests, k7burn for 5 
> hours, cp'ing 30G+ across drives a few times etc, and all is good.

Sorry to jump in here, but where can I get "k7burn"?  I've searched on
google.com for it and cannot find any reference.  I am running 2.4.7-ac2
(with Athlon optimizations) with an AMD T-Bird 1.2GHz on an ASUS A7A266
and it appears quite stable.  I would like to see how it fares with this
burn-in program you speak of.

Thanks,

Mike.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-28 22:00           ` PEIFFER Pierre
@ 2001-07-29 20:28             ` Kurt Garloff
  2001-07-30  6:04               ` Daniela Engert
  0 siblings, 1 reply; 662+ messages in thread
From: Kurt Garloff @ 2001-07-29 20:28 UTC (permalink / raw)
  To: PEIFFER Pierre; +Cc: linux-kernel, Bart Hartgers

[-- Attachment #1: Type: text/plain, Size: 3265 bytes --]

Hi Pierre,

thanks for your info!

On Sat, Jul 28, 2001 at 06:00:11PM -0400, PEIFFER Pierre wrote:
> Kurt Garloff a écrit :
> In attached files are the result. I've only kept the (what I suppose to
> be) northbridge info.
> This doesn't tell me anything...

Me neither. I was hoping that only a bit differs. Unfortunately that's not
the case, so I need to have a look in the datasheet.
But those are not publically available :-(
Anybody having them?

> Note: both has been done after booting on  Mandrake-kernel 2.4.3 which
> come with Mandrake distribution (i.e. with lot of patches and
> options...) I don't know the impact on the result...

With a newer lspci you would have seen that 1106:0305 is VT8363/8365
[KT133/KM133].

I removed everything except for the differences. Underlined. Anybody able to
decode? Otherwise trying out all of them can get boring. (Well, I'd start
with 0x68, followed by 0x6b and 0xac  ...)

Working:

> 00:00.0 Host bridge: VIA Technologies, Inc.: Unknown device 0305 (rev 03)
> 	Subsystem: ABIT Computer Corp.: Unknown device a401
> 	Flags: bus master, medium devsel, latency 0
                                                  ^ This looks wrong to me.
> 	Memory at d8000000 (32-bit, prefetchable) [size=64M]
> 	Capabilities: [a0] AGP version 2.0
> 	Capabilities: [c0] Power Management version 2
> 00: 06 11 05 03 06 00 10 a2 03 00 00 06 00 00 00 00
                                              ^ Latency.
> 50: 17 a3 eb b4 02 00 10 10 c0 00 08 10 10 10 10 10
                  ^^ ^^
> 60: 03 aa 02 20 e6 d6 d6 c6 51 28 43 0d 08 3f 00 00
                              ^^       ^^
> a0: 02 c0 20 00 17 02 00 1f 00 00 00 00 2b 12 14 00
                                          ^^
> b0: 49 da 00 60 31 ff 80 05 67 00 00 00 00 00 00 00
            ^^
> f0: 00 00 00 00 00 03 03 00 22 00 00 00 00 00 00 00
                                                ^^ ^^

Buggy: (Own, buggy settings in parens)

> 00:00.0 Host bridge: VIA Technologies, Inc.: Unknown device 0305 (rev 03)
> 	Subsystem: ABIT Computer Corp.: Unknown device a401
> 	Flags: bus master, medium devsel, latency 8
                                                  ^ That's more reasonable.
> 	Memory at d8000000 (32-bit, prefetchable) [size=64M]
> 	Capabilities: [a0] AGP version 2.0
> 	Capabilities: [c0] Power Management version 2
> 00: 06 11 05 03 06 00 10 a2 03 00 00 06 00 08 00 00
                                              ^ Latency
> 50: 17 a3 eb b4 43 89 10 10 c0 00 08 10 10 10 10 10
                  ^^ ^^					(47 8d here)
> 60: 03 aa 02 20 e6 d6 d6 c6 45 28 43 0f 08 3f 00 00
                              ^^       ^^		(41 .. 21 here)
> a0: 02 c0 20 00 17 02 00 1f 00 00 00 00 2f 12 14 00
                                          ^^		(6b here)
> b0: 49 da 88 60 31 ff 80 05 67 00 00 00 00 00 00 00
            ^^						(22 here)
> f0: 00 00 00 00 00 03 03 00 22 00 00 00 00 00 91 06
                                                ^^ ^^	(00 00 here)

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, DE                                SCSI, Security

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-29  9:28                                         ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-29 14:16                                           ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-29 23:19                                           ` Mike Touloumtzis
  2001-07-30 14:41                                           ` ext3-2.4-0.9.4 Ketil Froyn
  2 siblings, 0 replies; 662+ messages in thread
From: Mike Touloumtzis @ 2001-07-29 23:19 UTC (permalink / raw)
  To: linux-kernel

On Sun, Jul 29, 2001 at 11:28:10AM +0200, Matthias Andree wrote:
> 
> How can autoconf figure if you need to fsync() the directory? Apart from
> that, which Unix MTA uses autoconf?

My point was not that they should be using autoconf;
I don't know if they are or not.  My point was that
they should use existing published interfaces that are
reasonable, rather than push for guarantees that impose
new requirements on filesystems.  And even without
autoconf it's not hard to figure out what system you're
running on.

    rename(tmpfile, spoolfile);
#ifdef __linux___
    fsync(tmpdir);
    fsync(spooldir);
#endif
    /* transaction is complete */

> 
> Remember, the whole discussion is about getting rid of the need for
> chattr +S and offering the admin the chance to mount or flag a directory
> for synchronous meta data updates.

Right; and I'm arguing that the way to get rid of the need
for chattr +S is to incorporate directory fsync() in the
MTAs, not to cram more features into the filesystems.

Problem: MTA needs to know when rename() has been forced
to disk.

Solution 1: MTA authors use fsync(dirfd) on Linux.

Analysis: This is not the most portable solution, but it
should work on any FS that supports Linux semantics.  You
can't expect such semantics on FAT and other filesystems
that are just supported for compatibility reasons.  But you
could, say, switch filesystems for performance reasons, and
not have your MTA start mysteriously failing, because you
are using the official, documented API to do what you want
to do (at the very least you would be in a much stronger
position when pushing a bug fix :-).

Solution 2: Linux semantics are changed so that rename()
returns only when the data hits the disk.  All filesystems
are expected to implement this change.

Analysis: This sucks.  It precludes some filesystem design
choices, prevents users from making a speed/reliability
tradeoff, and makes each filesystem more complex.

Solution 3: Some filesystems implement synchronous
directory updates for renames, using filesystem-specific
feature flags, chattr, etc.

Analysis: I wouldn't want to try to dictate anything to
the FS authors, but this solution seems inferior to me.
Each filesystem would have to implement such a flag to
become "MTA compatible".  Why add a complex feature to the
filesystem when it can already be accessed via a userspace
API?  It will be more complex for administrators too --
they will have to know which filesystems implement the
synchronous directory metadata.

There are lots of filesystems out there.  Why not use
an interface they should all support rather than ask for
per-filesystem, filesystem-specific improvements?

miket

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-29  1:53         ` ext3-2.4-0.9.4 Chris Wedgwood
@ 2001-07-30  0:32           ` Chris Mason
  2001-07-30 13:49             ` ext3-2.4-0.9.4 Patrick J. LoPresti
  0 siblings, 1 reply; 662+ messages in thread
From: Chris Mason @ 2001-07-30  0:32 UTC (permalink / raw)
  To: Chris Wedgwood, Alan Cox; +Cc: Patrick J. LoPresti, linux-kernel



On Sunday, July 29, 2001 01:53:48 PM +1200 Chris Wedgwood <cw@f00f.org>
wrote:

> On Sat, Jul 28, 2001 at 08:03:37PM +0100, Alan Cox wrote:
> 
>     Ext3 I believe so, Reiserfs I would assume so but Hans can answer
>     definitively
> 
> Reiserfs does not, nor are creates or unlink operations synchronous.
> 
> For MTAs it just happens to work: if you fsync the way transactions
> are written means the metadata for the dirtectories is written as part
> of the transaction --- but I think this is a quirk and not by design?
> 
> Chris?

Correct, in the current 2.4.x code, its a quirk.  fsync(any object) ==
fsync(all pending metadata, including renames).

There is a transcation tracking patch floating around out there that makes
reiserfs fsync/O_SYNC much faster by only committing the last transaction a
given file/dir was involved in.  I had sent this to alan just after 2.4.7
came out, but it looks like I need to resend.

Anyway, during a rename, this patch updates the inode transaction tracking
stuff so an fsync on the file should also commit the directory changes.
But, that isn't something I really intend to advertise much, since the
accepted linux way is fsync(dir).

-chris


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-29 20:28             ` Kurt Garloff
@ 2001-07-30  6:04               ` Daniela Engert
  2001-07-30 13:44                 ` Kurt Garloff
  0 siblings, 1 reply; 662+ messages in thread
From: Daniela Engert @ 2001-07-30  6:04 UTC (permalink / raw)
  To: linux-kernel

Hi Kurt!

On Sun, 29 Jul 2001 22:28:30 +0200, Kurt Garloff wrote:

>Me neither. I was hoping that only a bit differs. Unfortunately that's not
>the case, so I need to have a look in the datasheet.
>But those are not publically available :-(
>Anybody having them?

Try to get a clue yourself from the WPCREDIT KT133 plugin (see below,
stripped down to the differing registers). Some differences look
suspicious to me...

>Working:

>> 00: 06 11 05 03 06 00 10 a2 03 00 00 06 00 00 00 00
>                                              ^ Latency.
>> 50: 17 a3 eb b4 02 00 10 10 c0 00 08 10 10 10 10 10
>                  ^^ ^^
>> 60: 03 aa 02 20 e6 d6 d6 c6 51 28 43 0d 08 3f 00 00
>                              ^^       ^^
>> a0: 02 c0 20 00 17 02 00 1f 00 00 00 00 2b 12 14 00
>                                          ^^
>> b0: 49 da 00 60 31 ff 80 05 67 00 00 00 00 00 00 00
>            ^^
>> f0: 00 00 00 00 00 03 03 00 22 00 00 00 00 00 00 00
                                                ^^ ^^

>Buggy: (Own, buggy settings in parens)

>> 00: 06 11 05 03 06 00 10 a2 03 00 00 06 00 08 00 00
>                                              ^ Latency
>> 50: 17 a3 eb b4 43 89 10 10 c0 00 08 10 10 10 10 10
>                  ^^ ^^				(47 8d here)
>> 60: 03 aa 02 20 e6 d6 d6 c6 45 28 43 0f 08 3f 00 00
>                              ^^       ^^		(41 .. 21 here)
>> a0: 02 c0 20 00 17 02 00 1f 00 00 00 00 2f 12 14 00
>                                          ^^		(6b here)
>> b0: 49 da 88 60 31 ff 80 05 67 00 00 00 00 00 00 00
>            ^^						(22 here)
>> f0: 00 00 00 00 00 03 03 00 22 00 00 00 00 00 91 06
>                                                ^^ ^^	(00 00 here)

PCR(PCI Configration Registers) Editor / WPCREDIT for WIN32
Copyright (c) 2000  H.Oda!

[COMMENT]=for HWup ng. members (Kx) edited by Guruad tnx to author
H.Oda!
[MODEL]=VT8363 (KT133)
[VID]=1106:VIA
[DID]=0305:Host to PCI Bridge

[54:7]=SDRAM Self-Refresh	0=disable   1=enable
[54:6]=Probe Next Tag State T1	0=disable   1=enable
[54:5]=High Priority DRAM Req.	0=disable   1=enable
[54:4]=Continuous DRAM Request	0=disable   1=enable
[54:3]=DRAM Speculative Read	0=disable   1=enable
[54:2]=PCI Master Pipeline Req. 0=disable   1=enable
[54:1]=PCI-to-CPU / CPU-to-PCI	0=disable   1=enable
[54:0]=Fast Write-to-Read	0=disable   1=enable

[55:0]=S2K Compensation CPU Halt0=disable   1=enable

[68:7]=SDRAM Open Page Control	0=precharge  1=remain act
[68:6]=Bank Page Control	0=same bank  1=different
[68:5]=(Reserved)
[68:4]=DRAM Data Latch Delay	0=Latch     1=Delay latch
[68:3]=EDO Test Mode		0=disable   1=enable
[68:2]=Burst Refresh(4 times)	0=disable   1=enable
[68:1]=System Frequency Divider 00= 66 MHz  01=100 MHz
[68:0]=10=133 MHz  11=Reserved

[6B:7]=Arbitration Parking Pol. 00=bus owner 01=CPU side
[6B:6]=10=AGP side  11=Reserved
[6B:5]=Fast Read to Write t-a	0=disable   1=enable
[6B:4]=(Reserved)
[6B:3]=MD Bus Second Level	0=Normal slew 1=More
[6B:2]=CAS Bus Second Level	0=Normal slew 1=More
[6B:1]=Virtual Channel-DRAM	0=disable   1=enable
[6B:0]=Multi-Page Open		0=disable   1=enable

[AC:7]=(Reserved)
[AC:6]=AGP Read Synchronization 0=disable   1=enable
[AC:5]=AGP Read Snoop DRAM P-W-B0=disable   1=enable
[AC:4]=GREQ# Priority		0=disable   1=enable
[AC:3]=2X Rate Supported	0=not	    1=supported
[AC:2]=LPR In-Order Access	0=not	    1=executed
[AC:1]=AGP Arbitration Parking	0=disable   1=enable
[AC:0]=AGP-PCI Master/CPU-PCI TC0=2T or 3T  1=1T

[B2:7]=GD/GBE/GDS, SBA/SBS Ctrl
[B2:6]=(Reserved)
[B2:5]=(Reserved)
[B2:4]=GD[31-16] Staggered Delay0=none	    1=1ns
[B2:3]=(Reserved)
[B2:2]=(Reserved)
[B2:1]=AGP Voltage		0=1.5V	    1=3.3V
[B2:0]=GDS Output Delay 	0=none	    1=0.4ns

[FE]=Back-Door Device ID
[FF]=Back-Door Device ID

'54'=BIU Control 00 RW
'55'=Debug (Do Not Program)
'68'=DRAM Control 00 RW
'6B'=DRAM Arbitration Control 01 RW
'AC'=AGP Control 00 RW
'B2'=AGP Pad Drive / Delay Control 00 RW
'FE..FF' Back-Door Device ID 0000 RW

Ciao,
  Dani

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Daniela Engert, systems engineer at MEDAV GmbH
Gräfenberger Str. 34, 91080 Uttenreuth, Germany
Phone ++49-9131-583-348, Fax ++49-9131-583-11



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26  7:34 ext3-2.4-0.9.4 Andrew Morton
  2001-07-26 11:08 ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-27  9:32 ` Strange remount behaviour with ext3-2.4-0.9.4 Sean Hunter
@ 2001-07-30  6:37 ` Philipp Matthias Hahn
  2001-08-02 13:58   ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2 siblings, 1 reply; 662+ messages in thread
From: Philipp Matthias Hahn @ 2001-07-30  6:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml, ext3-users

On Thu, 26 Jul 2001, Andrew Morton wrote:

> An update to the ext3 filesystem for 2.4 kernels is available at
>
> 	http://www.uow.edu.au/~andrewm/linux/ext3/
I'm using ext3-0.9.4 with linux-2.4.7 / 2.4.8-pre1 and get some hangs on
my dual P2-350:
>From time to time I will have multiple CRON-Daemons in D-state and login
hangs when logging in. It even happens during boot before my MTA is
started.

I have a single ext3 partition which is exported by kernel-nfs-server.

As soon as I do an Alt-SysRq-S forced sync the hang goes away and
everything works normal.

If you need further information send me an eMail. SGIs kdb is already
compiled in so if we need it ...

BYtE
Philipp
-- 
  / /  (_)__  __ ____  __ Philipp Hahn
 / /__/ / _ \/ // /\ \/ /
/____/_/_//_/\_,_/ /_/\_\ pmhahn@titan.lahn.de


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-29  4:03         ` Gav
  2001-07-29 16:10           ` Mike Frisch
@ 2001-07-30  7:15           ` Steffen Persvold
  2001-07-30 10:17             ` Maciej Zenczykowski
  2001-07-30 13:59             ` Gav
  1 sibling, 2 replies; 662+ messages in thread
From: Steffen Persvold @ 2001-07-30  7:15 UTC (permalink / raw)
  To: Gav; +Cc: linux-kernel

Gav wrote:
> Just FYI, I've been running 2.4.7-pre6 for a few weeks on a Abit-KT7-a
> (hpt370) that uses the KT133/VIA chipset, with a 1.33Ghz Athlon and the
> kernel compiled for an Athlon.
> 
> The machine is now rock solid. I've given it the usual tests, k7burn for 5
> hours, cp'ing 30G+ across drives a few times etc, and all is good.
> 
> The broken sound (crackle/pop) with my SB128PCI (same problem as SBLive)
> still didn't go away though, but enabling PCI DRAM PREFETCH on the VT8363
> Bus-PCI Bridge does cure it. This took me a while to find as I can't set this
> in my bios, but powertweak came to the rescue.
> 
> While DRAM Prefetch is supposed to be an option to increase performance, my
> sound is totally unusable without this set. I've heard numerous people
> explain the same problem and it would be interesting to find out if this
> cures their sound troubles too. If this is the case, is this something that
> belongs in quirks, or is it too hardware specific? and would enabling this by
> default hurt anything anyway? Or is this just masking the underlaying problem
> ?

Hmm, I think "DRAM Prefetch" is the one you _don't_ want to turn on, because (and correct
me if i'm wrong) it's causing all the problems with the IDE controller (data trashing).

Regards,
-- 
  Steffen Persvold               Systems Engineer
  Email : mailto:sp@scali.no     Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50      Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51      N-0621 Oslo, Norway

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-30  7:15           ` Steffen Persvold
@ 2001-07-30 10:17             ` Maciej Zenczykowski
  2001-07-30 14:35               ` Luigi Genoni
  2001-07-30 13:59             ` Gav
  1 sibling, 1 reply; 662+ messages in thread
From: Maciej Zenczykowski @ 2001-07-30 10:17 UTC (permalink / raw)
  To: Steffen Persvold; +Cc: Gav, linux-kernel

> Hmm, I think "DRAM Prefetch" is the one you _don't_ want to turn on, because (and correct
> me if i'm wrong) it's causing all the problems with the IDE controller (data trashing).

I think it was IDE Prefetch that should be off (I had this problem on a
AMD 486DX4-133 with Award Bios, turning it on trashed the boot record in
minutes (and many other sectors on the disk too).

Anyone here care to give a link to that program to enable DRAM Prefetch?
My sister has a Duron 750w with VIA motherboard and music and sound pop on
any graphics changes, maybe this is it?

Regards,

Maciej Zenczykowski


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-30  6:04               ` Daniela Engert
@ 2001-07-30 13:44                 ` Kurt Garloff
  2001-07-30 14:15                   ` Michael
  2001-07-30 16:47                   ` Daniela Engert
  0 siblings, 2 replies; 662+ messages in thread
From: Kurt Garloff @ 2001-07-30 13:44 UTC (permalink / raw)
  To: Daniela Engert; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1430 bytes --]

Hi Daniela,

On Mon, Jul 30, 2001 at 08:04:54AM +0200, Daniela Engert wrote:
> On Sun, 29 Jul 2001 22:28:30 +0200, Kurt Garloff wrote:
> 
> >Me neither. I was hoping that only a bit differs. Unfortunately that's not
> >the case, so I need to have a look in the datasheet.
> >But those are not publically available :-(
> >Anybody having them?
> 
> Try to get a clue yourself from the WPCREDIT KT133 plugin (see below,
> stripped down to the differing registers). Some differences look
> suspicious to me...

Hey thanks!

> [54:6]=Probe Next Tag State T1	0=disable   1=enable

Main suspect. (Should be 0)

> [54:0]=Fast Write-to-Read	0=disable   1=enable

Third candidate. (Should be 0)

> [68:4]=DRAM Data Latch Delay	0=Latch     1=Delay latch

Second candidate (Should be 1)

> [68:2]=Burst Refresh(4 times)	0=disable   1=enable

Fourth candidate (Should be 0?)

> [6B:5]=Fast Read to Write t-a	0=disable   1=enable

Should this one match 54:0 (third candidate)?

> [6B:1]=Virtual Channel-DRAM	0=disable   1=enable

Strange, why does this one differ between the configs.

OK, I'll come up with a kernel patches (driver/pci/quirks ...)
for people to test.

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, DE                                SCSI, Security

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30  0:32           ` ext3-2.4-0.9.4 Chris Mason
@ 2001-07-30 13:49             ` Patrick J. LoPresti
  2001-07-30 13:55               ` ext3-2.4-0.9.4 Alan Cox
  2001-07-30 16:22               ` ext3-2.4-0.9.4 Rik van Riel
  0 siblings, 2 replies; 662+ messages in thread
From: Patrick J. LoPresti @ 2001-07-30 13:49 UTC (permalink / raw)
  To: Chris Mason; +Cc: Chris Wedgwood, Alan Cox, linux-kernel

Chris Mason <mason@suse.com> writes:

> Correct, in the current 2.4.x code, its a quirk.  fsync(any object) ==
> fsync(all pending metadata, including renames).

This does not help.  The MTAs are doing fsync() on the temporary file
and then using the *subsequent* rename() as the committing operation.

> Anyway, during a rename, this patch updates the inode transaction
> tracking stuff so an fsync on the file should also commit the
> directory changes.  But, that isn't something I really intend to
> advertise much, since the accepted linux way is fsync(dir).

It would be nice to have an option (on either the directory or the
mountpoint) to cause all metadata updates to commit to the journal
without causing all operations to be fully synchronous.  This would
provide compatibility with BSD-centric code without taking the
performance hit of synchronous data.  Heck, just having link() and
rename() perform a commit would be good enough for almost all
applications.

 - Pat

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 13:49             ` ext3-2.4-0.9.4 Patrick J. LoPresti
@ 2001-07-30 13:55               ` Alan Cox
  2001-07-30 14:38                 ` ext3-2.4-0.9.4 Patrick J. LoPresti
  2001-07-31  1:29                 ` ext3-2.4-0.9.4 Andrew McNamara
  2001-07-30 16:22               ` ext3-2.4-0.9.4 Rik van Riel
  1 sibling, 2 replies; 662+ messages in thread
From: Alan Cox @ 2001-07-30 13:55 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: Chris Mason, Chris Wedgwood, Alan Cox, linux-kernel

> Chris Mason <mason@suse.com> writes:
> 
> > Correct, in the current 2.4.x code, its a quirk.  fsync(any object) ==
> > fsync(all pending metadata, including renames).
> 
> This does not help.  The MTAs are doing fsync() on the temporary file
> and then using the *subsequent* rename() as the committing operation.

Which is quaint, because as we've pointed out repeatedly to you rename
is not an atomic operation. Even on a simple BSD or ext2 style fs it can
be two directory block writes,  metadata block writes, a bitmap write
and a cylinder group write.

> It would be nice to have an option (on either the directory or the
> mountpoint) to cause all metadata updates to commit to the journal
> without causing all operations to be fully synchronous.  This would

You mean fsync() on the directory. 

Alan


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-30  7:15           ` Steffen Persvold
  2001-07-30 10:17             ` Maciej Zenczykowski
@ 2001-07-30 13:59             ` Gav
  1 sibling, 0 replies; 662+ messages in thread
From: Gav @ 2001-07-30 13:59 UTC (permalink / raw)
  To: linux-kernel

On Monday 30 July 2001 07:15, Steffen Persvold wrote:

> > While DRAM Prefetch is supposed to be an option to increase performance,
> > my sound is totally unusable without this set. I've heard numerous people
> > explain the same problem and it would be interesting to find out if this
> > cures their sound troubles too. If this is the case, is this something
> > that belongs in quirks, or is it too hardware specific? and would
> > enabling this by default hurt anything anyway? Or is this just masking
> > the underlaying problem ?
>
> Hmm, I think "DRAM Prefetch" is the one you _don't_ want to turn on,
> because (and correct me if i'm wrong) it's causing all the problems with
> the IDE controller (data trashing).

Obviously I can only comment on my own hardware but the machine has been used 
constantly since Thu Jul 12, its now Jul 30 and I havent had a single IDE 
related problem. 

As a hobby, i use the machine for DigitalVideo, and regularly grab 20-30GB 
from my capture card, then process it, which means the IDE bus gets a lot of 
use and seems to be an ideal situation for the data trashing problems to rear 
their ugly heads (no pun intended) but, as i said, I haven't seen any here.

DRAM Prefetch makes my sound usuable, as the VIA fixups for the SB cards do 
not work here, and for (at least) two other people who i have had email 
correspondancy with.

-- Regards, Gavin Baker


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-30 13:44                 ` Kurt Garloff
@ 2001-07-30 14:15                   ` Michael
  2001-07-30 15:46                     ` Kurt Garloff
  2001-07-30 16:47                   ` Daniela Engert
  1 sibling, 1 reply; 662+ messages in thread
From: Michael @ 2001-07-30 14:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: Kurt Garloff, Daniela Engert

> > On Sun, 29 Jul 2001 22:28:30 +0200, Kurt Garloff wrote:
> > [54:6]=Probe Next Tag State T1	0=disable   1=enable
> 
> Main suspect. (Should be 0)

That's set in my stable kt133a system.
 
> > [54:0]=Fast Write-to-Read	0=disable   1=enable
> 
> Third candidate. (Should be 0)

as is this one.
 
> > [68:2]=Burst Refresh(4 times)	0=disable   1=enable
> 
> Fourth candidate (Should be 0?)

I set this one yesterday to see if it would trigger the problem, it
didn't :o/ Same with a few differences between my system and 0x6b, which
didn't either.

Out of curiosity, where are you getting the 'should be 0/1' details from?
-- 
Michael.
 

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-30 10:17             ` Maciej Zenczykowski
@ 2001-07-30 14:35               ` Luigi Genoni
  0 siblings, 0 replies; 662+ messages in thread
From: Luigi Genoni @ 2001-07-30 14:35 UTC (permalink / raw)
  To: Maciej Zenczykowski; +Cc: Steffen Persvold, Gav, linux-kernel

I have this bios setting enabled, and no problems at all on two of my
athlons with VIA KT133A, kernel 2.4.7.

>From this full discussion comes out a big confusion.

For what I saw, many  VIA KT133A do work well, many other
give problems to their sysadmins. but the chipset is almost the same,
and the processors are quite the same (they are all athlon, I read no
bug reports about duron).

this is enought for me to get confused.

My production systems do use scsi disks, and i can understand
they donot have troubles (adaptec 2940, 2980 29160....).
But also the ones with IDE disks are working quite well (some using
ata33, others ata100).
I NEVER used DDRAM, just normal SDRAM 133 Mhz.

So I was thinking to FSB. All my systems with ide disks have 200MhzFSB,
(while my latest production systems do have 266 MhzFSB). Maybe a 266
MhzFSB is just too mutch stress
for some via chipset. But i see no clear logic when problems do appear,
or any big difference with systems that are rock solid.

lets' try to make a point to see a logic for those instabilities...

which kind of hardware bug is this, if the same chipset can work or not
depending  if you are lucky? or a full stock of chipset is buggy and
with a certain HW configuration you will see the bug or
what?

Luigi

On Mon, 30 Jul 2001, Maciej Zenczykowski wrote:

> > Hmm, I think "DRAM Prefetch" is the one you _don't_ want to turn on, because (and correct
> > me if i'm wrong) it's causing all the problems with the IDE controller (data trashing).
>
> I think it was IDE Prefetch that should be off (I had this problem on a
> AMD 486DX4-133 with Award Bios, turning it on trashed the boot record in
> minutes (and many other sectors on the disk too).
>
> Anyone here care to give a link to that program to enable DRAM Prefetch?
> My sister has a Duron 750w with VIA motherboard and music and sound pop on
> any graphics changes, maybe this is it?
>
> Regards,
>
> Maciej Zenczykowski
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 13:55               ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-30 14:38                 ` Patrick J. LoPresti
  2001-07-30 16:27                   ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-31  1:29                 ` ext3-2.4-0.9.4 Andrew McNamara
  1 sibling, 1 reply; 662+ messages in thread
From: Patrick J. LoPresti @ 2001-07-30 14:38 UTC (permalink / raw)
  To: Alan Cox; +Cc: Chris Mason, Chris Wedgwood, linux-kernel

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> > Chris Mason <mason@suse.com> writes:
> > 
> > > Correct, in the current 2.4.x code, its a quirk.  fsync(any object) ==
> > > fsync(all pending metadata, including renames).
> > 
> > This does not help.  The MTAs are doing fsync() on the temporary file
> > and then using the *subsequent* rename() as the committing operation.
> 
> Which is quaint, because as we've pointed out repeatedly to you rename
> is not an atomic operation. Even on a simple BSD or ext2 style fs it can
> be two directory block writes,  metadata block writes, a bitmap write
> and a cylinder group write.

But not on a journalling filesystem.  I assume that a journal "commit"
is atomic.  If it is not, then fsync() on the directory does not solve
the problem either.

Put another way, I am suggesting a mount-time or directory option to
effectively cause rename() and link() to automatically be followed by
an fsync() of the containing directory.  (Actually, from this
perspective, maybe you could fix the MTA in user space with LD_PRELOAD
hackery or somesuch.  Hm...)

> > It would be nice to have an option (on either the directory or the
> > mountpoint) to cause all metadata updates to commit to the journal
> > without causing all operations to be fully synchronous.  This would
> 
> You mean fsync() on the directory. 

In other words, "Get the MTA authors to change their code."  That is a
nice little war, but it is fought at the expense of users who just
want to use the code provided by their vendor and have it work.

The situation is this:

  The relevant standards (POSIX, SuS, etc.) provide no way to perform
  reliable transactions on a file system.

  BSD provides one solution, which is synchronous metatdata.  (I am
  assuming modern BSDs already deal with the multiple-disk-block
  problem to make these transactions properly atomic.  Is this
  assumption false?)

  Linux provides a different solution, which is fsync() on the
  directory.

  All MTAs, and other apps besides, currently use the BSD solution for
  reliable transactions.

Is it really so absurd to ask Linux to provide efficient support of
the BSD semantics as an option?

 - Pat

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-29  9:28                                         ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-29 14:16                                           ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-29 23:19                                           ` ext3-2.4-0.9.4 Mike Touloumtzis
@ 2001-07-30 14:41                                           ` Ketil Froyn
  2 siblings, 0 replies; 662+ messages in thread
From: Ketil Froyn @ 2001-07-30 14:41 UTC (permalink / raw)
  To: linux-kernel

On Sun, 29 Jul 2001, Matthias Andree wrote:

> On Sat, 28 Jul 2001, Mike Touloumtzis wrote:
>
> > You are blurring the boundaries between "undocumented behavior" and
> > "OS-specific behavior".  fsync() on a directory to sync metadata is a
> > defined (according to my copy of fsync(2)), Linux-specific behavior.
> > It is also very reasonable IMHO and in keeping with the traditional
> > Unix notion of directories as lists of files.

> > http://www.google.com/search?q=autoconf
> >
> > Writing portable Unix software has always meant some degree
> > of system-specific accomodation.  It's a bummer but it's life;
> > otherwise Unix wouldn't evolve.
>
> How can autoconf figure if you need to fsync() the directory?

Simple! Grep the fsync(2) manpage ;)

Ketil the joker


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-30 14:15                   ` Michael
@ 2001-07-30 15:46                     ` Kurt Garloff
  2001-07-30 18:43                       ` Kurt Garloff
  0 siblings, 1 reply; 662+ messages in thread
From: Kurt Garloff @ 2001-07-30 15:46 UTC (permalink / raw)
  To: Michael; +Cc: Linux kernel list, Daniela Engert

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

Hi Michael,

thanks for your comments!

On Mon, Jul 30, 2001 at 03:15:38PM +0100, Michael wrote:
> > > On Sun, 29 Jul 2001 22:28:30 +0200, Kurt Garloff wrote:
> > > [54:6]=Probe Next Tag State T1	0=disable   1=enable
> > 
> > Main suspect. (Should be 0)
> 
> That's set in my stable kt133a system.

But did you experience problems at all with your kernel when compiled for
K7? Note that most (if not all) systems seem to work stable with K6 or PPro
optimized kernels.

> > > [54:0]=Fast Write-to-Read	0=disable   1=enable
> > 
> > Third candidate. (Should be 0)
> 
> as is this one.
>  
> > > [68:2]=Burst Refresh(4 times)	0=disable   1=enable
> > 
> > Fourth candidate (Should be 0?)
> 
> I set this one yesterday to see if it would trigger the problem, it
> didn't :o/ Same with a few differences between my system and 0x6b, which
> didn't either.
> 
> Out of curiosity, where are you getting the 'should be 0/1' details from?

Comparing the lspci -vxxx output of working and non-working systems.

You did no comment on the second candidate:

> [68:4]=DRAM Data Latch Delay  0=Latch     1=Delay latch

Second candidate (Should be 1)

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, DE                                SCSI, Security

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 13:49             ` ext3-2.4-0.9.4 Patrick J. LoPresti
  2001-07-30 13:55               ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-30 16:22               ` Rik van Riel
  2001-07-30 16:46                 ` ext3-2.4-0.9.4 Patrick J. LoPresti
                                   ` (2 more replies)
  1 sibling, 3 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-30 16:22 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: Chris Mason, Chris Wedgwood, Alan Cox, linux-kernel

On 30 Jul 2001, Patrick J. LoPresti wrote:

> performance hit of synchronous data.  Heck, just having link() and
> rename() perform a commit would be good enough for almost all
> applications.

It would be "good enough" for some applications,
but it would be absolutely disastrous for most
applications I run (ie. moving source code around).

Exactly what is wrong with doing fsync() on the
directory ?

Why do you want us to turn link() and rename()
into link_slowly() and rename_slowly() ?

Why can't you use a simple wrapper function to
do this for you ?

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 14:38                 ` ext3-2.4-0.9.4 Patrick J. LoPresti
@ 2001-07-30 16:27                   ` Rik van Riel
  0 siblings, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-30 16:27 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: Alan Cox, Chris Mason, Chris Wedgwood, linux-kernel

On 30 Jul 2001, Patrick J. LoPresti wrote:

>   The relevant standards (POSIX, SuS, etc.) provide no way to perform
>   reliable transactions on a file system.
>
>   BSD provides one solution, which is synchronous metatdata.  (I am
>   assuming modern BSDs already deal with the multiple-disk-block
>   problem to make these transactions properly atomic.  Is this
>   assumption false?)
>
>   Linux provides a different solution, which is fsync() on the
>   directory.
>
>   All MTAs, and other apps besides, currently use the BSD solution for
>   reliable transactions.
>
> Is it really so absurd to ask Linux to provide efficient support of
> the BSD semantics as an option?

Yes. You could fix this issue in userland very easily,
it might even work with an LD_PRELOAD ...

Besides BSD softupdates and the various journaling
filesystems which are in use on other Unixen also
don't provide the 4.3BSD solution any more ...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 16:22               ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-30 16:46                 ` Patrick J. LoPresti
  2001-07-30 17:03                   ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-30 17:11                 ` ext3-2.4-0.9.4 Lawrence Greenfield
  2001-07-31  0:16                 ` ext3-2.4-0.9.4 Matthias Andree
  2 siblings, 1 reply; 662+ messages in thread
From: Patrick J. LoPresti @ 2001-07-30 16:46 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Chris Mason, Chris Wedgwood, Alan Cox, linux-kernel

Rik van Riel <riel@conectiva.com.br> writes:

> Exactly what is wrong with doing fsync() on the
> directory ?

Nothing, except that it requires source code changes to every
application which expects BSD semantics for these operations.
Anecdotal evidence suggests at least the MTA authors are resistant to
making such changes.

> Why do you want us to turn link() and rename()
> into link_slowly() and rename_slowly() ?

I don't by default, only as an option.  You know, just like "chattr
-S" or "mount -o sync" means do_everything_slowly().

> Why can't you use a simple wrapper function to
> do this for you ?

It would not be all that simple; it would have to parse the arguments
to figure out the containing directories, open() a file descriptor on
each, and fsync() them.  Not impossible, but it does introduce several
those additional system calls as performance hits and points of
failure, not to mention possible race conditions.

Still, I suppose you could do this well enough in the C library.  You
might even want it to be the default when "__USE_BSD" is defined or
something.

But it still seems simpler to me just to make it an option in the file
system.

In your next message, you say:

> Besides BSD softupdates and the various journaling
> filesystems which are in use on other Unixen also
> don't provide the 4.3BSD solution any more ...

This surprises me if it is true; do you have a reference?  And what
mechanism *do* the modern BSDs provide to commit metadata changes to
disk?

 - Pat

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-30 13:44                 ` Kurt Garloff
  2001-07-30 14:15                   ` Michael
@ 2001-07-30 16:47                   ` Daniela Engert
  1 sibling, 0 replies; 662+ messages in thread
From: Daniela Engert @ 2001-07-30 16:47 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: linux-kernel

Hi Kurt!

On Mon, 30 Jul 2001 15:44:58 +0200, Kurt Garloff wrote:

Just for reference: these are the values taken from my main machine
(Epox EP8KTA2, VIA KT133) with the latest BIOS:

>> [54:6]=Probe Next Tag State T1	0=disable   1=enable
>Main suspect. (Should be 0)

Set to 1 here.

>> [54:0]=Fast Write-to-Read	0=disable   1=enable
>Third candidate. (Should be 0)

Set to 1 here.

>> [68:4]=DRAM Data Latch Delay	0=Latch     1=Delay latch
>Second candidate (Should be 1)

Set to 1 here.

>> [68:2]=Burst Refresh(4 times)	0=disable   1=enable
>Fourth candidate (Should be 0?)

Set to 0 here.

>> [6B:5]=Fast Read to Write t-a	0=disable   1=enable
>Should this one match 54:0 (third candidate)?

Set to 0 here.

>> [6B:1]=Virtual Channel-DRAM	0=disable   1=enable
>Strange, why does this one differ between the configs.

Set to 0 here.

Unfortunately, this machine doesn't run Linux...

Ciao,
  Dani



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 16:46                 ` ext3-2.4-0.9.4 Patrick J. LoPresti
@ 2001-07-30 17:03                   ` Rik van Riel
  2001-07-31  0:28                     ` ext3-2.4-0.9.4 Matthias Andree
  0 siblings, 1 reply; 662+ messages in thread
From: Rik van Riel @ 2001-07-30 17:03 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: Chris Mason, Chris Wedgwood, Alan Cox, linux-kernel

On 30 Jul 2001, Patrick J. LoPresti wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
>
> > Exactly what is wrong with doing fsync() on the
> > directory ?
>
> Nothing, except that it requires source code changes to every
> application which expects BSD semantics for these operations.
> Anecdotal evidence suggests at least the MTA authors are resistant to
> making such changes.

You may need to make them anyway for Digital's AdvFS,
IRIX XFS, IBM JFS, Veritas' VXFS and BSD softupdates.

Lets face it, FFS is no longer the only available
filesystem. Don't expect FFS semantics from other
filesystems.

> > Why can't you use a simple wrapper function to
> > do this for you ?
>
> It would not be all that simple; it would have to parse the
> arguments to figure out the containing directories, open() a
> file descriptor on each, and fsync() them.

Hmmm, then maybe we'd just want some flag to fsync()
telling the kernel to also sync the parent directory
of the file and do whatever it needs to do to get the
rename() or link() committed ?

> But it still seems simpler to me just to make it an option in
> the file system.

It's always simpler when it's not YOU who has to
implement it ;)

cheers,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 16:22               ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-30 16:46                 ` ext3-2.4-0.9.4 Patrick J. LoPresti
@ 2001-07-30 17:11                 ` Lawrence Greenfield
  2001-07-30 17:25                   ` ext3-2.4-0.9.4 Rik van Riel
                                     ` (2 more replies)
  2001-07-31  0:16                 ` ext3-2.4-0.9.4 Matthias Andree
  2 siblings, 3 replies; 662+ messages in thread
From: Lawrence Greenfield @ 2001-07-30 17:11 UTC (permalink / raw)
  To: Rik van Riel, Patrick J. LoPresti
  Cc: linux-kernel, Alan Cox, Chris Wedgwood, Chris Mason

   From: "Patrick J. LoPresti" <patl@cag.lcs.mit.edu>
   Date: 	30 Jul 2001 12:46:13 -0400

   > Besides BSD softupdates and the various journaling
   > filesystems which are in use on other Unixen also
   > don't provide the 4.3BSD solution any more ...

   This surprises me if it is true; do you have a reference?  And what
   mechanism *do* the modern BSDs provide to commit metadata changes to
   disk?

BSD softupdates allows you to call fsync() on the file, and this will
sync the directories all the way up to the root if necessary.

Thus BSD fsync() actually guarantees that when it returns, the file
(and all of it's filenames) will survive a reboot.

Sendmail does:
fd = open(tmp)
write(fd)
fsync(fd)
rename(tmp, final)
fsync(fd)

Cyrus IMAP does:
fd = open(tmp)
write(fd)
fsync(fd)
link(tmp, final1)
link(tmp, final2)
link(tmp, final3)
fsync(fd)
close(fd)
unlink(tmp)

The idea that Linux fsync() doesn't actually make the file survive
reboots is pretty ridiculous.

Larry



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 17:11                 ` ext3-2.4-0.9.4 Lawrence Greenfield
@ 2001-07-30 17:25                   ` Rik van Riel
  2001-07-30 17:38                     ` ext3-2.4-0.9.4 Chris Wedgwood
                                       ` (2 more replies)
  2001-07-31  0:22                   ` ext3-2.4-0.9.4 Matthias Andree
  2001-08-03 17:24                   ` ext3-2.4-0.9.4 Jan Harkes
  2 siblings, 3 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-30 17:25 UTC (permalink / raw)
  To: Lawrence Greenfield
  Cc: Patrick J. LoPresti, linux-kernel, Alan Cox, Chris Wedgwood, Chris Mason

On Mon, 30 Jul 2001, Lawrence Greenfield wrote:
>    From: "Patrick J. LoPresti" <patl@cag.lcs.mit.edu>
>    Date: 	30 Jul 2001 12:46:13 -0400
>
>    > Besides BSD softupdates and the various journaling
>    > filesystems which are in use on other Unixen also
>    > don't provide the 4.3BSD solution any more ...
>
>    This surprises me if it is true; do you have a reference?  And what
>    mechanism *do* the modern BSDs provide to commit metadata changes to
>    disk?
>
> BSD softupdates allows you to call fsync() on the file, and this will
> sync the directories all the way up to the root if necessary.
>
> Thus BSD fsync() actually guarantees that when it returns, the file
> (and all of it's filenames) will survive a reboot.

Note that this is very different from the "link() should be
synchronous()" mantra we've been hearing over the last days.

These fsync() semantics make lots of sense to me, I'm all
for it.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 17:25                   ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-30 17:38                     ` Chris Wedgwood
  2001-07-30 17:49                     ` ext3-2.4-0.9.4 Lawrence Greenfield
  2001-07-31  0:25                     ` ext3-2.4-0.9.4 Matthias Andree
  2 siblings, 0 replies; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-30 17:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Lawrence Greenfield, Patrick J. LoPresti, linux-kernel, Alan Cox,
	Chris Mason

On Mon, Jul 30, 2001 at 02:25:51PM -0300, Rik van Riel wrote:

    Note that this is very different from the "link() should be
    synchronous()" mantra we've been hearing over the last days.
    
    These fsync() semantics make lots of sense to me, I'm all
    for it.

And what if the file has hundreds or thousands of links? How do we
cleanly keep track of all those?



  --cw


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Support for serial console on legacy free machines
  2001-07-26 22:20   ` Support for serial console on legacy free machines Alan Cox
@ 2001-07-30 17:47     ` Khalid Aziz
  0 siblings, 0 replies; 662+ messages in thread
From: Khalid Aziz @ 2001-07-30 17:47 UTC (permalink / raw)
  To: Alan Cox; +Cc: LKML

Alan Cox wrote:
> 
> > console is "Serial Port Console Redirection" (SPCR) table. This table
> > gives me almost all the information I need to initialize and use a
> > serial console. The bummer is this table was designed by Microsoft and
> > Microsoft owns the copyright on it. Microsoft primarily designed this
> > table for use by Whistler. Their copyright may cause potential problems
> > with using it in Linux. This makes me reluctant to use this table. I
> 
> Such as ?
> 
> If its a table that microsoft added to ACPI and its well thought out I don't
> see a big problem technically. There are a collection of BIOS services we
> use that were microsoft originated

I can not say this table is part of ACPI 2.0. ACPI 2.0 Spec document
lists SPCR in the DESCRIPTION_HEADER signatures but calls it Microsoft
Serial Port Console Redirection Table and refers to the URL on Microsoft
web site. If you go to this URL, you see the Microsoft copyright and
terms of use license. The same applies to DBGP (Debug Port Table).

-- 
Khalid

====================================================================
Khalid Aziz                              Linux Systems Operation R&D
(970)898-9214                                        Hewlett-Packard
khalid@fc.hp.com                                    Fort Collins, CO

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 17:25                   ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-30 17:38                     ` ext3-2.4-0.9.4 Chris Wedgwood
@ 2001-07-30 17:49                     ` Lawrence Greenfield
  2001-07-30 17:59                       ` ext3-2.4-0.9.4 Chris Mason
  2001-07-31  0:25                     ` ext3-2.4-0.9.4 Matthias Andree
  2 siblings, 1 reply; 662+ messages in thread
From: Lawrence Greenfield @ 2001-07-30 17:49 UTC (permalink / raw)
  To: Rik van Riel, Chris Wedgwood
  Cc: Chris Mason, Alan Cox, linux-kernel, Patrick J. LoPresti

   Date: Tue, 31 Jul 2001 05:38:13 +1200
   From: Chris Wedgwood <cw@f00f.org>

   On Mon, Jul 30, 2001 at 02:25:51PM -0300, Rik van Riel wrote:

       Note that this is very different from the "link() should be
       synchronous()" mantra we've been hearing over the last days.

       These fsync() semantics make lots of sense to me, I'm all
       for it.

   And what if the file has hundreds or thousands of links? How do we
   cleanly keep track of all those?

You don't have to keep track of all of them, just the uncommitted
ones.  I could imagine the filesystem forcing periodic commits on
pathological files (those with thousands of links) to limit the number
of pending directory operations per file.

While the softupdates paper doesn't appear to directly address this
concern, clearly their implementation has to deal with it in some way.

Larry


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 17:49                     ` ext3-2.4-0.9.4 Lawrence Greenfield
@ 2001-07-30 17:59                       ` Chris Mason
  2001-07-30 21:39                         ` ext3-2.4-0.9.4 Chris Wedgwood
  0 siblings, 1 reply; 662+ messages in thread
From: Chris Mason @ 2001-07-30 17:59 UTC (permalink / raw)
  To: Lawrence Greenfield, Rik van Riel, Chris Wedgwood
  Cc: Alan Cox, linux-kernel, Patrick J. LoPresti



On Monday, July 30, 2001 01:49:12 PM -0400 Lawrence Greenfield
<leg+@andrew.cmu.edu> wrote:

>    Date: Tue, 31 Jul 2001 05:38:13 +1200
>    From: Chris Wedgwood <cw@f00f.org>
> 
>    On Mon, Jul 30, 2001 at 02:25:51PM -0300, Rik van Riel wrote:
> 
>        Note that this is very different from the "link() should be
>        synchronous()" mantra we've been hearing over the last days.
> 
>        These fsync() semantics make lots of sense to me, I'm all
>        for it.
> 
>    And what if the file has hundreds or thousands of links? How do we
>    cleanly keep track of all those?
> 
> You don't have to keep track of all of them, just the uncommitted
> ones. 

Well, the idea is to get it done in the VFS layer.  reiserfs, ext3, and
probably the other journaled filesystems could keep track of the last
transacation and inode was involved with, making the softupdate style
fsync(file) to commit a rename easy.

But, ext2 and the normal filesystems don't have it quite so good.

-chris



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-30 15:46                     ` Kurt Garloff
@ 2001-07-30 18:43                       ` Kurt Garloff
  2001-07-30 20:44                         ` Gerbrand van der Zouw
  0 siblings, 1 reply; 662+ messages in thread
From: Kurt Garloff @ 2001-07-30 18:43 UTC (permalink / raw)
  To: Michael, Linux kernel list, Daniela Engert; +Cc: Alan Cox


[-- Attachment #1.1: Type: text/plain, Size: 689 bytes --]

Hi,

OK, patches for different bits are attached.
The patch does modify up to 4 bits, which is more than I would like to do in
the end. But you can easily disable some parts of it, if the full patch
proves to solve your trouble.
Please test!

It seemed to solved the trouble here on first sight (booting went further
then normal) but in the end did not turn out to solve the trouble here.
(Here means: MSI K7T Turbo (Ver.3) with AMD K7 1.2GHz.)

Regards,
-- 
Kurt Garloff  <garloff@suse.de>                          Eindhoven, NL
GPG key: See mail header, key servers         Linux kernel development
SuSE GmbH, Nuernberg, DE                                SCSI, Security

[-- Attachment #1.2: 247-viakt133.diff --]
[-- Type: text/plain, Size: 2250 bytes --]

--- linux-2.4.7.compile/drivers/pci/quirks.c.orig	Tue Jul 24 16:50:41 2001
+++ linux-2.4.7.compile/drivers/pci/quirks.c	Mon Jul 30 20:21:56 2001
@@ -160,6 +160,49 @@
 }
 
 /*
+ * KT133a will fsck up under some circumstances if Burst Refresh (4 times)
+ * is enabled or if data latch delay is disabled
+ * and we use the fast streaming K7 optimized zero_page
+ * and copy_page routines from arch/i386/lib/mmx.c 
+ * -- garloff@suse.de, 2001-07-30
+ */
+static void __init quirk_via_noburstrefresh(struct pci_dev *dev)
+{
+	u8 dram_ctrl;
+	pci_read_config_byte(dev, 0x68, &dram_ctrl);
+	if (dram_ctrl & 0x04 || !(dram_ctrl & 0x10)) {
+		if (dram_ctrl & 0x04)
+	  		printk(KERN_INFO "VIA KT133a: Disabling burst refresh.\n");
+		dram_ctrl &= ~0x04;
+		if (!(dram_ctrl & 0x10))
+	  		printk(KERN_INFO "VIA KT133a: Enabling data latch delay.\n");
+		dram_ctrl |= 0x10;
+		pci_write_config_byte(dev, 0x68, dram_ctrl);
+	}
+}
+
+/*
+ * KT133a will fsck up under some circumstances if Probe Next Tag State 
+ * T1 is set to 1 and we use the fast streaming K7 optimized zero_page
+ * and copy_page routines from arch/i386/lib/mmx.c 
+ * -- garloff@suse.de, 2001-07-30
+ */
+static void __init quirk_via_noprobenexttag(struct pci_dev *dev)
+{
+	u8 biu_ctrl;
+	pci_read_config_byte(dev, 0x54, &biu_ctrl);
+	if (biu_ctrl & 0x40 || biu_ctrl & 0x01) {
+		if (biu_ctrl & 0x40)
+			printk(KERN_INFO "VIA KT133a: Disabling probe next tag state T1.\n");
+		if (biu_ctrl & 0x01)
+			printk(KERN_INFO "VIA KT133a: Disabling fast write-to-read.\n");
+		biu_ctrl &= ~0x41;
+		pci_write_config_byte(dev, 0x54, biu_ctrl);
+	}
+}
+
+
+/*
  *	Natoma has some interesting boundary conditions with Zoran stuff
  *	at least
  */
@@ -452,6 +495,9 @@
 	{ PCI_FIXUP_FINAL,	PCI_VENDOR_ID_VIA,	PCI_DEVICE_ID_VIA_82C586_2,	quirk_via_irqpic },
 	{ PCI_FIXUP_FINAL,	PCI_VENDOR_ID_VIA,	PCI_DEVICE_ID_VIA_82C686_5,	quirk_via_irqpic },
 	{ PCI_FIXUP_FINAL,	PCI_VENDOR_ID_VIA,	PCI_DEVICE_ID_VIA_82C686_6,	quirk_via_irqpic },
+
+	{ PCI_FIXUP_FINAL,	PCI_VENDOR_ID_VIA,	PCI_DEVICE_ID_VIA_8363_0,	quirk_via_noburstrefresh },
+	{ PCI_FIXUP_FINAL,	PCI_VENDOR_ID_VIA,	PCI_DEVICE_ID_VIA_8363_0,	quirk_via_noprobenexttag },
 
 	{ 0 }
 };

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-30 18:43                       ` Kurt Garloff
@ 2001-07-30 20:44                         ` Gerbrand van der Zouw
  0 siblings, 0 replies; 662+ messages in thread
From: Gerbrand van der Zouw @ 2001-07-30 20:44 UTC (permalink / raw)
  To: Kurt Garloff; +Cc: linux-kernel

Hi,

Kurt Garloff wrote:

 > It seemed to solved the trouble here on first sight (booting went further
 > then normal) but in the end did not turn out to solve the trouble here.
 > (Here means: MSI K7T Turbo (Ver.3) with AMD K7 1.2GHz.)

from your lspci output I seem to have exactly the same system as you 
have. I tried your patch (247-viakt133.diff) and came up with the same 
result here: it seemed to come further than last time with only 
2.4.6ac5, but then it crashed anyway. If you know of any BIOS parameters 
  that might help for this mobo, please let me know. I could not 
identify a parameter that does the same as the "DRAM Prefetch" for Abit 
mobos.

Regards,

Gerbrand van der Zouw



^ permalink raw reply	[flat|nested] 662+ messages in thread

* rename() (was Re: ext3-2.4-0.9.4)
  2001-07-28 16:46     ` ext3-2.4-0.9.4 Patrick J. LoPresti
  2001-07-28 19:03       ` ext3-2.4-0.9.4 Alan Cox
@ 2001-07-30 21:03       ` Anthony DeBoer
  1 sibling, 0 replies; 662+ messages in thread
From: Anthony DeBoer @ 2001-07-30 21:03 UTC (permalink / raw)
  To: linux-kernel

Patrick J. LoPresti <patl@cag.lcs.mit.edu> wrote:
>The MTAs do this:
>
>    Open temp file
>    Write to temp file
>    fsync() temp file
>    rename() temp file into mail spool
>    indicate success to remote MTA

Don't forget the unlink() temp file just before or after that last step.

>As long as rename() does not return until the metadata are committed,
>this should be a reliable delivery mechanism.  ...

As I understand it, rename() was originally invented for tasks like
installing a new /bin/sh with guarantees that another process running
at the same time would not fail to find a shell, and that if the system
fell over during the install you'd still have a shell on reboot.

See http://www.qef.com/ftp/rename.ps for an interesting history from
someone who was there at the time.  It's undated, but probably a decade
old.

It's my considered opinion that rename() _should_ fsync the target
directory before returning, and between that and the fsync() call on
the file itself (an install program should do the same call sequence as
above) you get the guarantee that the file is intact before you unlink
the temp version and return success.  OTOH, link() and unlink() are not
in the business of providing guarantees like that, and should not sync.

-- 
Anthony de Boer, curator, Anthony's Home for Aged Computing Machinery
<adb@leftmind.net>

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 17:59                       ` ext3-2.4-0.9.4 Chris Mason
@ 2001-07-30 21:39                         ` Chris Wedgwood
  0 siblings, 0 replies; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-30 21:39 UTC (permalink / raw)
  To: Chris Mason
  Cc: Lawrence Greenfield, Rik van Riel, Alan Cox, linux-kernel,
	Patrick J. LoPresti

On Mon, Jul 30, 2001 at 01:59:04PM -0400, Chris Mason wrote:

    Well, the idea is to get it done in the VFS layer.  reiserfs, ext3, and
    probably the other journaled filesystems could keep track of the last
    transacation and inode was involved with, making the softupdate style
    fsync(file) to commit a rename easy.

But, right now, the VFS layer doesn't know about magic attributes
(such as ext2/3 +S).  The VFS would have to be taught about these and
some other things to support both asynchronous and synchronous
metadata updates (and presumably other smarts too).  The trouble is
these attributes themselves and how they are stored is fs specific, we
could always mandate that as of 2.5.x all filesystems _can_ support
some kind of extended API and defined a minimalist set of attributes
for all filesystems and then allow specific filesystems to have their
own.  Arguably if people are going to force ACLs upon the world, then
a common API would be nice across XFS, resierfs4, JFFS, etc.  (NTFS
can use an API specific to the FS itself as NTFS ACLs are much more
complex and different looking beasts that those from early POSIX
drafts).

For journalling filesystems, it would be really nice if setting an
attribute was all that was required to make rename(2) atomic (or at
the very least to make sure that if the rename system call returns,
the data has been written to non-volatile storage).



  --cw



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 16:22               ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-30 16:46                 ` ext3-2.4-0.9.4 Patrick J. LoPresti
  2001-07-30 17:11                 ` ext3-2.4-0.9.4 Lawrence Greenfield
@ 2001-07-31  0:16                 ` Matthias Andree
  2 siblings, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-31  0:16 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

On Mon, 30 Jul 2001, Rik van Riel wrote:

> Exactly what is wrong with doing fsync() on the
> directory ?

It's non-portable and a kludge.

> Why do you want us to turn link() and rename()
> into link_slowly() and rename_slowly() ?

Opening up the directory requires lots of inode lookups which are
unnecessary.

> Why can't you use a simple wrapper function to
> do this for you ?

Because it's more inefficient than necessary and it bloats the
application.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 15:51       ` ext3-2.4-0.9.4 Linus Torvalds
@ 2001-07-31  0:21         ` Matti Aarnio
  2001-07-31  1:23           ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-31 16:41           ` ext3-2.4-0.9.4 Linus Torvalds
  2001-07-31  0:57         ` ext3-2.4-0.9.4 Matthias Andree
  1 sibling, 2 replies; 662+ messages in thread
From: Matti Aarnio @ 2001-07-31  0:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, Jul 26, 2001 at 03:51:35PM +0000, Linus Torvalds wrote:
> To:	linux-kernel@vger.kernel.org
> From:	torvalds@transmeta.com (Linus Torvalds)
> Subject: Re: ext3-2.4-0.9.4
> Date:	Thu, 26 Jul 2001 15:51:35 +0000 (UTC)
....
> Use fsync() on the directory. 
> 
> Logical, isn't it?

  No.  I don't see why I should opendir() a directory, fsync()
that handle, and closedir() the handle.  I would definitely prefer:

       lsync(dirpath)

This could, even, behave like  lstat()  with the path: if the last name
segment is symlink, the sync is done on the i-node data of symlink, not
on what it (possibly) points to.

I didn't check if POSIX folks have thought of that.

> 		Linus

/Matti Aarnio

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 17:11                 ` ext3-2.4-0.9.4 Lawrence Greenfield
  2001-07-30 17:25                   ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-31  0:22                   ` Matthias Andree
  2001-08-03 17:24                   ` ext3-2.4-0.9.4 Jan Harkes
  2 siblings, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-31  0:22 UTC (permalink / raw)
  To: Lawrence Greenfield
  Cc: Rik van Riel, Patrick J. LoPresti, linux-kernel, Alan Cox,
	Chris Wedgwood, Chris Mason

On Mon, 30 Jul 2001, Lawrence Greenfield wrote:

> The idea that Linux fsync() doesn't actually make the file survive
> reboots is pretty ridiculous.

That doesn't apply to ReiserFS or ext3fs, it does apply to ext2fs and
possibly others.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 17:25                   ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-30 17:38                     ` ext3-2.4-0.9.4 Chris Wedgwood
  2001-07-30 17:49                     ` ext3-2.4-0.9.4 Lawrence Greenfield
@ 2001-07-31  0:25                     ` Matthias Andree
  2 siblings, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-31  0:25 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

On Mon, 30 Jul 2001, Rik van Riel wrote:

> > Thus BSD fsync() actually guarantees that when it returns, the file
> > (and all of it's filenames) will survive a reboot.
> 
> Note that this is very different from the "link() should be
> synchronous()" mantra we've been hearing over the last days.

Indeed, but this might still require MTA fixing probably, and opening a
file you just want to rename is quite expensive an operation.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 17:03                   ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-31  0:28                     ` Matthias Andree
  2001-07-31  0:33                       ` ext3-2.4-0.9.4 Rik van Riel
  0 siblings, 1 reply; 662+ messages in thread
From: Matthias Andree @ 2001-07-31  0:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

On Mon, 30 Jul 2001, Rik van Riel wrote:

> Hmmm, then maybe we'd just want some flag to fsync()
> telling the kernel to also sync the parent directory
> of the file and do whatever it needs to do to get the
> rename() or link() committed ?

Heck, you can't tell the kernel to do rename/link/open/unlink
synchronously in-band. This list doesn't care for other OS's. The
semantics FreeBSD (e. g.) offers ARE indeed documented.

This won't work out without kernel support. Portable reliability doesn't
come for free.

chattr +S is bad (slow). bloating all applications to include every
possible brain fart that the random FS inventor let go is even worse.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  0:28                     ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-31  0:33                       ` Rik van Riel
  0 siblings, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-31  0:33 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

On Tue, 31 Jul 2001, Matthias Andree wrote:
> On Mon, 30 Jul 2001, Rik van Riel wrote:
>
> > Hmmm, then maybe we'd just want some flag to fsync()
> > telling the kernel to also sync the parent directory
> > of the file and do whatever it needs to do to get the
> > rename() or link() committed ?
>
> Heck, you can't tell the kernel to do rename/link/open/unlink
> synchronously in-band. This list doesn't care for other OS's.
> The semantics FreeBSD (e. g.) offers ARE indeed documented.

Go back a few posts and read about the semantics
FreeBSD has when the filesystem is mounted with
softupdates.

Then take a deep breath.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-26 15:51       ` ext3-2.4-0.9.4 Linus Torvalds
  2001-07-31  0:21         ` ext3-2.4-0.9.4 Matti Aarnio
@ 2001-07-31  0:57         ` Matthias Andree
  2001-07-31  1:16           ` ext3-2.4-0.9.4 Rik van Riel
                             ` (2 more replies)
  1 sibling, 3 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-31  0:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, 26 Jul 2001, Linus Torvalds wrote:

> In article <20010726143002.E17244@emma1.emma.line.org>,
> Matthias Andree  <matthias.andree@stud.uni-dortmund.de> wrote:
> >
> >However, the remaining problem is being synchronous with respect to open
> >(fixed for ext3 with your fsync() as I understand it), rename, link and
> >unlink. With ext2, and as you write it, with ext3 as well, there is
> >currently no way to tell when the link/rename has been committed to
> >disk, unless you set mount -o sync or chattr +S or call sync() (the
> >former is not an option because it's far too expensive).
> 
> Congratulations. You have been brainwashed by Dan Bernstein.

No, I asked Wietse Venema what assumptions Postfix makes. Since he
refuses to fsync() directories, he has Postfix set chattr +S to enforce
the semantics he expects. No problem here.

> Use fsync() on the directory. 
> 
> Logical, isn't it?

Why go all the lengths to look up each single directory path component
again just to fsync() stuff that doesn't belong to you and that you
don't want synched, possibly the entire device?

Chase up to the root manually, because Linux' ext2 violates SUS v2
fsync() (which requires meta data synched BTW), as has been pointed out
(and fixed in ReiserFS and ext3)?

Admittedly, MTAs are (supposed to be) (per command of RFC-1123) more
paranoid than the average application - and per lack of standard whether
rename/link & Co. need to be synchronous or asynchronous, this is a
problem for the MTA.

So, please tell my why Single Unix Specification v2 specifies EIO for
rename. Asynchronous I/O cannot possibly trigger immediate EIO.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  0:57         ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-31  1:16           ` Rik van Riel
  2001-07-31  1:35           ` ext3-2.4-0.9.4 Mike Castle
  2001-08-01 16:02           ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2 siblings, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-31  1:16 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Linus Torvalds, linux-kernel

On Tue, 31 Jul 2001, Matthias Andree wrote:
> On Thu, 26 Jul 2001, Linus Torvalds wrote:
>
> > Congratulations. You have been brainwashed by Dan Bernstein.

[snip fsync() on directory ... on second thought this isn't enough]

> Chase up to the root manually, because Linux' ext2 violates SUS
> v2 fsync() (which requires meta data synched BTW), as has been
> pointed out (and fixed in ReiserFS and ext3)?

Agreed.  fsync() on the file needs to write the meta
data, this includes the directory and (if needed)
the parent directories all the way up to the root.

> So, please tell my why Single Unix Specification v2 specifies EIO for
> rename. Asynchronous I/O cannot possibly trigger immediate EIO.

Crap. An asynchronous rename() can hit the situation
where it cannot read the disk when searching for the
directory it wants to move the file to.

rename(/from/a/b/file, /to/d/f/file) can fail when
the system gets an IO access on reading "d".

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  0:21         ` ext3-2.4-0.9.4 Matti Aarnio
@ 2001-07-31  1:23           ` Rik van Riel
  2001-07-31  5:25             ` ext3-2.4-0.9.4 Lawrence Greenfield
  2001-07-31 21:29             ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-31 16:41           ` ext3-2.4-0.9.4 Linus Torvalds
  1 sibling, 2 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-31  1:23 UTC (permalink / raw)
  To: Matti Aarnio; +Cc: Linus Torvalds, linux-kernel

On Tue, 31 Jul 2001, Matti Aarnio wrote:
> On Thu, Jul 26, 2001 at 03:51:35PM +0000, Linus Torvalds wrote:

> > Use fsync() on the directory.
> >
> > Logical, isn't it?
>
>   No.  I don't see why I should opendir() a directory, fsync()
> that handle, and closedir() the handle.

And it wouldn't even be enough.  Who guarantees you that
the parent directory of this directory has been written
to disk and we won't lose the entry pointing to this
directory on a crash ?

> I would definitely prefer:
>
>        lsync(dirpath)

Nice idea.  Of course, fsync(file) also has the obligation
to make sure all the metadata of the file is written to
disk. Lots of people seem to be convinced this also includes
the metadata needed to _reach_ the file all the way from the
root of the filesystem...

> I didn't check if POSIX folks have thought of that.

Nice addition.  Easier to use than fsync() - no need to
open the file - and probably easier to implement in the
kernel because this way we'll be handing the whole path
to the kernel, whereas fsync() would have the dubious
task of finding out how this file can be traced all the
way down from the root of the filesystem.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 13:55               ` ext3-2.4-0.9.4 Alan Cox
  2001-07-30 14:38                 ` ext3-2.4-0.9.4 Patrick J. LoPresti
@ 2001-07-31  1:29                 ` Andrew McNamara
  1 sibling, 0 replies; 662+ messages in thread
From: Andrew McNamara @ 2001-07-31  1:29 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

>> This does not help.  The MTAs are doing fsync() on the temporary file
>> and then using the *subsequent* rename() as the committing operation.
>
>Which is quaint, because as we've pointed out repeatedly to you rename
>is not an atomic operation. Even on a simple BSD or ext2 style fs it can
>be two directory block writes,  metadata block writes, a bitmap write
>and a cylinder group write.

This is almost (but not quite) irrelevant. The receiving MTA simply
wants the fsync()/rename() system call to not return until everything
(including directory blocks) have been written to disk, at which point,
it says to the remote end "250 OK". If the receiving machine goes down
at any point up until this one, the sending system will resend the
message.  (Yes, the receiving system may have a corrupt directory, and
this is a problem).

 ---
Andrew McNamara (System Architect)

connect.com.au Pty Ltd
Lvl 3, 213 Miller St, North Sydney, NSW 2060, Australia
Phone: +61 2 9409 2117, Fax: +61 2 9409 2111

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  0:57         ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-31  1:16           ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-31  1:35           ` Mike Castle
  2001-07-31 21:27             ` ext3-2.4-0.9.4 Matthias Andree
  2001-08-01 16:02           ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2 siblings, 1 reply; 662+ messages in thread
From: Mike Castle @ 2001-07-31  1:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds

On Tue, Jul 31, 2001 at 02:57:00AM +0200, Matthias Andree wrote:
> So, please tell my why Single Unix Specification v2 specifies EIO for
> rename. Asynchronous I/O cannot possibly trigger immediate EIO.

It also specifies EIO as possible for write().

Are you saying that, since SUS2 specifies that write() is capable of
returning EIO, and asynchronous I/O cannot possibly trigger immediate EIO, 
that all calls to write() should by synchronous?

mrc
-- 
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  1:23           ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-31  5:25             ` Lawrence Greenfield
  2001-07-31 15:40               ` ext3-2.4-0.9.4 Matti Aarnio
  2001-07-31 21:30               ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-31 21:29             ` ext3-2.4-0.9.4 Matthias Andree
  1 sibling, 2 replies; 662+ messages in thread
From: Lawrence Greenfield @ 2001-07-31  5:25 UTC (permalink / raw)
  To: Matti Aarnio, Rik van Riel; +Cc: linux-kernel, Linus Torvalds

   Date: 	Mon, 30 Jul 2001 22:23:29 -0300 (BRST)
   From: Rik van Riel <riel@conectiva.com.br>
[...]
   > I would definitely prefer:
   >
   >        lsync(dirpath)
[...]
   Nice addition.  Easier to use than fsync() - no need to
   open the file - and probably easier to implement in the
   kernel because this way we'll be handing the whole path
   to the kernel, whereas fsync() would have the dubious
   task of finding out how this file can be traced all the
   way down from the root of the filesystem.

It's not as good as fsync() just doing what it's suppose to do.
You'll force applications that want to issue multiple link()s to issue
multiple lsync()s, forcing the kernel to serialize all of the disk
writes when the application just wants one file (and all of it's
associated filenames) to disk.

Yes, I understand that implementing fsync() so that it syncs all names
to reach the file is difficult.  But if you want the best performance,
you don't want to make applications issue multiple calls each of which
force their own synchronous writes.

Not to mention us whiny application writers won't be happy throwing
lsync()s all over the place.

Larry



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
  2001-07-29 10:00     ` Chris Wedgwood
@ 2001-07-31 15:18       ` Florian Weimer
  0 siblings, 0 replies; 662+ messages in thread
From: Florian Weimer @ 2001-07-31 15:18 UTC (permalink / raw)
  To: linux-kernel

Chris Wedgwood <cw@f00f.org> writes:

> People all need to appreciate sometimes vendors cannot released open
> source drivers even if they wanted too.  Sometimes vendors have the
> ability to released binary only drivers which are derived in part from
> source-code which they license --- but cannot share.

That's particularly true if there is no other documentation for the
hardware other than this reference source code.  This seems to be a
common situation, even with hardware which has good specs, technically
speaking.

-- 
Florian Weimer 	                  Florian.Weimer@RUS.Uni-Stuttgart.DE
University of Stuttgart           http://cert.uni-stuttgart.de/
RUS-CERT                          +49-711-685-5973/fax +49-711-685-5898

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption)
  2001-07-29 10:24     ` Matthew Gardiner
  2001-07-29 11:07       ` Chris Wedgwood
@ 2001-07-31 15:19       ` Florian Weimer
  1 sibling, 0 replies; 662+ messages in thread
From: Florian Weimer @ 2001-07-31 15:19 UTC (permalink / raw)
  To: linux-kernel

Matthew Gardiner <kiwiunixman@yahoo.co.nz> writes:

> 2. Regards to hardware manufacturers, what have the got to lose from 
> publishing the specs? nothing.

Some vendors do not have proper specs or have received them under NDA
themselves.

-- 
Florian Weimer 	                  Florian.Weimer@RUS.Uni-Stuttgart.DE
University of Stuttgart           http://cert.uni-stuttgart.de/
RUS-CERT                          +49-711-685-5973/fax +49-711-685-5898

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  5:25             ` ext3-2.4-0.9.4 Lawrence Greenfield
@ 2001-07-31 15:40               ` Matti Aarnio
  2001-07-31 16:35                 ` ext3-2.4-0.9.4 Anton Altaparmakov
  2001-07-31 21:30               ` ext3-2.4-0.9.4 Matthias Andree
  1 sibling, 1 reply; 662+ messages in thread
From: Matti Aarnio @ 2001-07-31 15:40 UTC (permalink / raw)
  To: Lawrence Greenfield; +Cc: linux-kernel

  The thing about filesystems, and how dimmly MTAs (should) consider
  some performance tweaks is something I have tried to describe at
  ZMailer's manual in part about its the queue:

      http://www.zmailer.org/zman/zadm-queues.html

On Tue, Jul 31, 2001 at 01:25:06AM -0400, Lawrence Greenfield wrote:
...
> It's not as good as fsync() just doing what it's suppose to do.
> You'll force applications that want to issue multiple link()s to issue
> multiple lsync()s, forcing the kernel to serialize all of the disk
> writes when the application just wants one file (and all of it's
> associated filenames) to disk.
> 
> Yes, I understand that implementing fsync() so that it syncs all names
> to reach the file is difficult.  But if you want the best performance,
> you don't want to make applications issue multiple calls each of which
> force their own synchronous writes.
> 
> Not to mention us whiny application writers won't be happy throwing
> lsync()s all over the place.
> 
> Larry

   I quite agree.

   Filesystems are not, unfortunately, rollbackfull logged and committable
   databases, even if we like to use them often in that way.

   An MTA with a fundamental design point of not using any privileged
   programs (no suid anything!) and least esoteric technology possible
   (for wide portability) can only use message submission means available
   to it everywhere -- implementing the queue inside a database system
   is definitely a possibility.   Possibly yielding higher performance
   than one using filesystem for it, but at what cost ??
   (I am thinking of SleepyCat DB multiaccess transaction supported
    version.)

/Matti Aarnio

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31 15:40               ` ext3-2.4-0.9.4 Matti Aarnio
@ 2001-07-31 16:35                 ` Anton Altaparmakov
  0 siblings, 0 replies; 662+ messages in thread
From: Anton Altaparmakov @ 2001-07-31 16:35 UTC (permalink / raw)
  To: Matti Aarnio; +Cc: Lawrence Greenfield, linux-kernel

On Tue, 31 Jul 2001, Matti Aarnio wrote:

>   The thing about filesystems, and how dimmly MTAs (should) consider
>   some performance tweaks is something I have tried to describe at
>   ZMailer's manual in part about its the queue:
> 
>       http://www.zmailer.org/zman/zadm-queues.html
> 
> On Tue, Jul 31, 2001 at 01:25:06AM -0400, Lawrence Greenfield wrote:
> ...
> > It's not as good as fsync() just doing what it's suppose to do.
> > You'll force applications that want to issue multiple link()s to issue
> > multiple lsync()s, forcing the kernel to serialize all of the disk
> > writes when the application just wants one file (and all of it's
> > associated filenames) to disk.
> > 
> > Yes, I understand that implementing fsync() so that it syncs all names
> > to reach the file is difficult.  But if you want the best performance,
> > you don't want to make applications issue multiple calls each of which
> > force their own synchronous writes.
> > 
> > Not to mention us whiny application writers won't be happy throwing
> > lsync()s all over the place.
> > 
> > Larry
> 
>    I quite agree.
> 
>    Filesystems are not, unfortunately, rollbackfull logged and committable
>    databases, even if we like to use them often in that way.

Well it depends on which file system you are talking about. NTFS is for
all intents and purposes a rollbackfull logged and committable
(relational) database and a file system at the same time. It's a shame M$
don't release the specs for it, otherwise it would be just what you are
looking for. - It will take us forever to reverse engineer the
journalling part of NTFS. You can see how long it is taking us just to
get the actual file system part.. and journalling on top of that is going
to be even worse. (Of course once we have the file system part there is
nothing to stop us doing our own thing with respect to journalling but
that's a different discussion.)

Anton

> 
>    An MTA with a fundamental design point of not using any privileged
>    programs (no suid anything!) and least esoteric technology possible
>    (for wide portability) can only use message submission means available
>    to it everywhere -- implementing the queue inside a database system
>    is definitely a possibility.   Possibly yielding higher performance
>    than one using filesystem for it, but at what cost ??
>    (I am thinking of SleepyCat DB multiaccess transaction supported
>     version.)
> 
> /Matti Aarnio
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  0:21         ` ext3-2.4-0.9.4 Matti Aarnio
  2001-07-31  1:23           ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-07-31 16:41           ` Linus Torvalds
  1 sibling, 0 replies; 662+ messages in thread
From: Linus Torvalds @ 2001-07-31 16:41 UTC (permalink / raw)
  To: Matti Aarnio; +Cc: linux-kernel


On Tue, 31 Jul 2001, Matti Aarnio wrote:
> >
> > Logical, isn't it?
>
>   No.  I don't see why I should opendir() a directory, fsync()
> that handle, and closedir() the handle.  I would definitely prefer:
>
>        lsync(dirpath)

Btw, you don't have to do opendir() - that just wastes time. Just do
something like

	int lsync(char *path)
	{
	        int err, fd;
	        fd = open(path, 0);
	        if (fd >= 0) {
	                err = fsync(fd);
	                close(fd);
	        }
	        return err;
	}

and you're done. But it won't do the symlink thing...

		Linus


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: VIA KT133A / athlon / MMX
  2001-07-28 12:47           ` Alan Cox
@ 2001-07-31 19:53             ` David Lang
  0 siblings, 0 replies; 662+ messages in thread
From: David Lang @ 2001-07-31 19:53 UTC (permalink / raw)
  To: Alan Cox; +Cc: cw, ppeiffer, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1006 bytes --]

I have not had a chance to examine all the BIOS settings but attached are
the lspci -vxx for 10 different systems with identical hardware configs.
the ones that end in .good have given me no problems, the three ending in
.bad die after a while. the framewall-b system dies with all LEDs on the
network card lit, all others die with no LEDs on.

David Lang

On Sat, 28 Jul 2001, Alan Cox wrote:

> Date: Sat, 28 Jul 2001 13:47:40 +0100 (BST)
> From: Alan Cox <alan@lxorguk.ukuu.org.uk>
> To: dlang@diginsite.com
> Cc: alan@lxorguk.ukuu.org.uk, cw@f00f.org, ppeiffer@free.fr,
>      linux-kernel@vger.kernel.org
> Subject: Re: VIA KT133A / athlon / MMX
>
> > I have a 1u box at my des that has two MSI boards in it with 1.2G athlons.
> > at the moment they are both running 2.4.5 (athlon optimized), one box has
> > no problems at all while the other dies (no video, no keyboard, etc)
> > within an hour of being booted.
>
> Same bios, same bios settings ?
>
> lspci -vxx on both show the same settings ?
>


[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 71680 bytes --]

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  1:35           ` ext3-2.4-0.9.4 Mike Castle
@ 2001-07-31 21:27             ` Matthias Andree
  0 siblings, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-31 21:27 UTC (permalink / raw)
  To: Mike Castle, linux-kernel, Linus Torvalds

On Mon, 30 Jul 2001, Mike Castle wrote:

> On Tue, Jul 31, 2001 at 02:57:00AM +0200, Matthias Andree wrote:
> > So, please tell my why Single Unix Specification v2 specifies EIO for
> > rename. Asynchronous I/O cannot possibly trigger immediate EIO.
> 
> It also specifies EIO as possible for write().
> 
> Are you saying that, since SUS2 specifies that write() is capable of
> returning EIO, and asynchronous I/O cannot possibly trigger immediate EIO, 
> that all calls to write() should by synchronous?

No, I'm wondering about the semantics. Of course, write() can be
synchronous (O_SYNC or fs mounted sync e. g.).

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  1:23           ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-31  5:25             ` ext3-2.4-0.9.4 Lawrence Greenfield
@ 2001-07-31 21:29             ` Matthias Andree
  2001-07-31 21:54               ` ext3-2.4-0.9.4 Mike Castle
  2001-07-31 23:46               ` ext3-2.4-0.9.4 Chris Wedgwood
  1 sibling, 2 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-31 21:29 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

On Mon, 30 Jul 2001, Rik van Riel wrote:

> > I didn't check if POSIX folks have thought of that.
> 
> Nice addition.  Easier to use than fsync() - no need to
> open the file - and probably easier to implement in the
> kernel because this way we'll be handing the whole path
> to the kernel, whereas fsync() would have the dubious
> task of finding out how this file can be traced all the
> way down from the root of the filesystem.

If I understand SUS v2 correctly, fsync() must sync meta data
corresponding to the file.

If Linux ext2 doesn't to that, it might be a good idea to change that so
it does.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  5:25             ` ext3-2.4-0.9.4 Lawrence Greenfield
  2001-07-31 15:40               ` ext3-2.4-0.9.4 Matti Aarnio
@ 2001-07-31 21:30               ` Matthias Andree
  1 sibling, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-07-31 21:30 UTC (permalink / raw)
  To: Lawrence Greenfield
  Cc: Matti Aarnio, Rik van Riel, linux-kernel, Linus Torvalds

On Tue, 31 Jul 2001, Lawrence Greenfield wrote:

> Not to mention us whiny application writers won't be happy throwing
> lsync()s all over the place.

Not portable -> won't happen usually.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31 21:29             ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-07-31 21:54               ` Mike Castle
  2001-07-31 23:46               ` ext3-2.4-0.9.4 Chris Wedgwood
  1 sibling, 0 replies; 662+ messages in thread
From: Mike Castle @ 2001-07-31 21:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rik van Riel

On Tue, Jul 31, 2001 at 11:29:47PM +0200, Matthias Andree wrote:
> If I understand SUS v2 correctly, fsync() must sync meta data
> corresponding to the file.


Where can I find a common definition for "meta data."

For example, I consider meta data to be things kept in the inode only
(size, timestamps, permissions).  Indirect blocks, maybe.  But, considering
how, in the unix world, file names are NOT associated with files, I have
never considered file names to be meta data.  Instead, file names is a set
of data associated with special files known as "directories."  So, it is
obvious, to me, that expecting fsync to sync changes to directory entries
is silly.

Obviously, however, you have a different definition of what meta data is.

Does SUS2 provide a definition for meta data?

A quick glance at the webside didn't turn anything up for me, but I would
not be surprised that I may have missed it.

mrc
-- 
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: my patches won't compile under 2.4.7
  2001-07-25 19:45     ` my patches won't compile under 2.4.7 Kirk Reiser
  2001-07-25 19:58       ` Alan Cox
@ 2001-07-31 21:54       ` Richard Gooch
  2001-08-01 11:14         ` Kirk Reiser
  2001-08-01 14:57         ` Richard Gooch
  1 sibling, 2 replies; 662+ messages in thread
From: Richard Gooch @ 2001-07-31 21:54 UTC (permalink / raw)
  To: Alan Cox; +Cc: Kirk Reiser, linux-kernel

Alan Cox writes:
> > 
> > As of 2.4.7 my patches to the kernel won't compile.  It appears to be
> > something to do with devfs_fs_kernel.h being part of miscdevices.h.  I
> > have sifted through the code but have not been able to determine
> > exactly why they won't work any more.  Here is the error output from
> > my compile:

I don't see why you're pointing the finger devfs_fs_kernel.h. Other
miscdevice drivers compile fine.

> > gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i586    -c -o speakup.o speakup.c
> > In file included from /usr/src/linux/include/linux/locks.h:8,
> >                  from /usr/src/linux/include/linux/devfs_fs_kernel.h:6,
> >                  from /usr/src/linux/include/linux/miscdevice.h:4,
> >                  from speakup.c:63:
> > /usr/src/linux/include/linux/pagemap.h:35: `currcons' undeclared here (not in a function)
> > /usr/src/linux/include/linux/pagemap.h:35: parse error before `.'
> > make[4]: *** [speakup.o] Error 1

Looking at my copy of include/linux/pagemap.h I see no instance of
"currcons" on line 35 or elsewhere.

> > I'm not sure even where to start trying to describe what I've looked
> > at and what I don't understand.  It appears that page_cache_alloc() is
> > now an inline function with an argument passed to it, where it used to
> > be a #define with no arguments.  I see that struct misc_device now has
> > a new member devfs_handle but the other drivers I've looked at rtc.c

This is not new. struct misc_device has had a "devfs_handle" field for
a long time. Since 2.3.46, in fact. So when you say above "since
2.4.7", I suspect you mean "after virgin 2.2.x". It would have helped
if you had specified this.

My guess is that your patch has some bad #define somewhere. Again, it
would have helped if you had sent the patch as well.

Anyway, I don't think this problem is even remotely related to devfs.
I suggest you post more complete information to the linux-kernel
mailing list. Then maybe someone there can help you.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31 21:29             ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-31 21:54               ` ext3-2.4-0.9.4 Mike Castle
@ 2001-07-31 23:46               ` Chris Wedgwood
  2001-07-31 23:53                 ` ext3-2.4-0.9.4 Rik van Riel
  1 sibling, 1 reply; 662+ messages in thread
From: Chris Wedgwood @ 2001-07-31 23:46 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel

On Tue, Jul 31, 2001 at 11:29:47PM +0200, Matthias Andree wrote:

    If I understand SUS v2 correctly, fsync() must sync meta data
    corresponding to the file.

    If Linux ext2 doesn't to that, it might be a good idea to change
    that so it does.

Define 'meta-data' --- linux sync's any inode and/or bitmap changes,
fsyn on a file will ensure it is intact but not that it can't get
lost.



  --cw


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31 23:46               ` ext3-2.4-0.9.4 Chris Wedgwood
@ 2001-07-31 23:53                 ` Rik van Riel
  0 siblings, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-07-31 23:53 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: linux-kernel

On Wed, 1 Aug 2001, Chris Wedgwood wrote:
> On Tue, Jul 31, 2001 at 11:29:47PM +0200, Matthias Andree wrote:
>
>     If I understand SUS v2 correctly, fsync() must sync meta data
>     corresponding to the file.
>
>     If Linux ext2 doesn't to that, it might be a good idea to change
>     that so it does.
>
> Define 'meta-data' --- linux sync's any inode and/or bitmap
> changes, fsyn on a file will ensure it is intact but not that it
> can't get lost.

Syntactically correct, but quite useless IMHO ;)

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: my patches won't compile under 2.4.7
  2001-07-31 21:54       ` Richard Gooch
@ 2001-08-01 11:14         ` Kirk Reiser
  2001-08-01 14:57         ` Richard Gooch
  1 sibling, 0 replies; 662+ messages in thread
From: Kirk Reiser @ 2001-08-01 11:14 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Alan Cox, linux-kernel

Actually it wasn't Alan pointing the finger it was me.  I was only
trying to figure out what the errors meant and they pointed to
devfs_fs_kernel.h.  The problem as I suspected at eh time was entirely
unrelated.  I moved my #include of misc_devices.h up and removed a
duplicate #include for linux/init.h and poof she compiled.  I am
starting to become a believer in voodoo computing again I guess.

On another note related to devfs though when I compile devfs in the
system just hangs.  I am wondering if I am registering my synth device
before devfs has memory allocated.  I register very early in the boot
process in console_init() and experienced similar problems before because I
don't think  kmalloc() may be available that early in the sequence.

The question then is, do you think that could be why the system is
hanging with devfs configured in?

  Kirk

-- 

Kirk Reiser				The Computer Braille Facility
e-mail: kirk@braille.uwo.ca		University of Western Ontario
phone: (519) 661-3061

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: my patches won't compile under 2.4.7
  2001-07-31 21:54       ` Richard Gooch
  2001-08-01 11:14         ` Kirk Reiser
@ 2001-08-01 14:57         ` Richard Gooch
  1 sibling, 0 replies; 662+ messages in thread
From: Richard Gooch @ 2001-08-01 14:57 UTC (permalink / raw)
  To: Kirk Reiser; +Cc: Alan Cox, linux-kernel

Kirk Reiser writes:
> On another note related to devfs though when I compile devfs in the
> system just hangs.  I am wondering if I am registering my synth device
> before devfs has memory allocated.  I register very early in the boot
> process in console_init() and experienced similar problems before because I
> don't think  kmalloc() may be available that early in the sequence.
> 
> The question then is, do you think that could be why the system is
> hanging with devfs configured in?

Yes. Calling kmalloc() before MM is set up is not allowed. See the
comments in drivers/char/console.c which talks about not calling
kmalloc() before console_init().

Simply move your driver registration after MM is set up. Use
module_init() to declare your initialisation function. This works for
both modules and built-in drivers. Registering a driver before MM
setup is considered bad practice.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-27  4:28                             ` ext3-2.4-0.9.4 Andrew Morton
@ 2001-08-01 15:51                               ` Stephen C. Tweedie
  0 siblings, 0 replies; 662+ messages in thread
From: Stephen C. Tweedie @ 2001-08-01 15:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andre Pang, linux-kernel, Stephen Tweedie

Hi,

On Fri, Jul 27, 2001 at 02:28:03PM +1000, Andrew Morton wrote:

> I believe that `dirsync' would provide synchronous metadata
> operations (ie: the metadata is crashproofed on-disk when
> the syscall returns), but non-sync data.  Correct?

Not quite: dirsync would provide synchronous metadata operations on
directories, but would make no guarantees about other file types.
That way we don't have the cost of doing sync updates to the inodes or
indirect blocks of regular files --- fsync() is adequate to do that on
demand.

Of course, fsync() is also sufficient to do syncing of directory
operations on demand, but it's a bit heavyweight for that purpose,
hence the request for dirsync (either as a chattr flag or as a mount
option.)

> If, however, the application is capable of doing a nice big
> write() (setvbuf!) then really, the two things will be pretty
> much the same.

Almost --- it's the same for create+write+close+fsync, but not for
rename or for unlink (in which case there's not necessarily going to
be a data fsync to force the directory operation out to disk.)

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-31  0:57         ` ext3-2.4-0.9.4 Matthias Andree
  2001-07-31  1:16           ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-31  1:35           ` ext3-2.4-0.9.4 Mike Castle
@ 2001-08-01 16:02           ` Stephen C. Tweedie
  2001-08-01 17:40             ` ext3-2.4-0.9.4 Kurt Roeckx
                               ` (2 more replies)
  2 siblings, 3 replies; 662+ messages in thread
From: Stephen C. Tweedie @ 2001-08-01 16:02 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel; +Cc: Stephen Tweedie, Matthias Andree

Hi,

> Chase up to the root manually, because Linux' ext2 violates SUS v2
> fsync() (which requires meta data synched BTW)

Please quote chapter and verse --- my reading of SUS shows no such
requirement.  

fsync is required to force "all currently queued I/O operations
associated with the file indicated by file descriptor fildes to the
synchronised I/O completion state."  But as you should know, directory
entries and files are NOT the same thing in Unix/SUS.  

Are we expected to fsync the metadata belonging to just the file
itself?  Or all symlinks to the file?  Or all hard links?  Answer, as
best I can determine --- just the file.  That's all SUS talks about.
There can be many ways of reaching that file in the directory
hierarchy, or there can be none, but fsync() doesn't talk at all about
the status of those dirents after the sync.

> , as has been pointed out
> (and fixed in ReiserFS and ext3)?

ext3 happens to provide the guarantee, but that's coincidental and
does not imply that I think of it as being "fixed".  It's just changed
behaviour relative to ext2.

> So, please tell my why Single Unix Specification v2 specifies EIO for
> rename. Asynchronous I/O cannot possibly trigger immediate EIO.

Yes it can --- we may need to read metadata to complete the rename,
and such reads can fail.  

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-01 16:02           ` ext3-2.4-0.9.4 Stephen C. Tweedie
@ 2001-08-01 17:40             ` Kurt Roeckx
  2001-08-02  0:17             ` ext3-2.4-0.9.4 Andrew McNamara
  2001-08-02  9:03             ` ext3-2.4-0.9.4 Matthias Andree
  2 siblings, 0 replies; 662+ messages in thread
From: Kurt Roeckx @ 2001-08-01 17:40 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-kernel

On Wed, Aug 01, 2001 at 05:02:30PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> > Chase up to the root manually, because Linux' ext2 violates SUS v2
> > fsync() (which requires meta data synched BTW)
> 
> Please quote chapter and verse --- my reading of SUS shows no such
> requirement.  
> 
> fsync is required to force "all currently queued I/O operations
> associated with the file indicated by file descriptor fildes to the
> synchronised I/O completion state."  But as you should know, directory
> entries and files are NOT the same thing in Unix/SUS.  

It goed on with "All I/O operations are completed as defined for
synchronised I/O file integrity completion.", whatever it all
means.

For fdatasync() it says:
"The fdatasync() function forces all currently queued I/O
operations associated with the file indicated by file descriptor
fildes to the synchronised I/O completion state.", which is just
the same as it says for fsync().

It also says:
"The functionality is as described for fsync() (with the symbol
_XOPEN_REALTIME defined), with the exception that all I/O
operations are completed as defined for synchronised I/O data
integrity completion."

It doesn't mention meta-data.

I have no idea what it all means.


Kurt


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 2.4.2 ext2fs corruption status
  2001-08-02  0:20   ` 2.4.2 ext2fs corruption status Alan Cox
@ 2001-08-01 19:40     ` Mohamed DOLLIAZAL
  0 siblings, 0 replies; 662+ messages in thread
From: Mohamed DOLLIAZAL @ 2001-08-01 19:40 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andreas Dilger, linux-kernel

Alan Cox wrote:

> > It may be that Red Hat has already released a new kernel RPM since that
> > time, or maybe you need to compile a new kernel.
>
> The official VIA workaround fix is now in 2.4.6ac5 and 2.4.7ac*. The fixes
> in the older kernels were mostly going to do the job but I dont know if they
> were perfect for all cases
>
> The -ac kernel tree also contains important fixes that avoid DMA timeouts
> potentially causing disk corruption by forgetting to write sectors

Hi Alan,

   I'am sorry I forgot to mention that the filesystem corruption happened on
SCSI disks.  I guess there is no DMA on the SCSI disks.
   Do you think that the VIA fixes that are included in the 2.4.6ac5 kernel or
above may solve my problem.

Thanks for your help,

Mohamed.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-01 16:02           ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2001-08-01 17:40             ` ext3-2.4-0.9.4 Kurt Roeckx
@ 2001-08-02  0:17             ` Andrew McNamara
  2001-08-02  9:03             ` ext3-2.4-0.9.4 Matthias Andree
  2 siblings, 0 replies; 662+ messages in thread
From: Andrew McNamara @ 2001-08-02  0:17 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-kernel

>Please quote chapter and verse --- my reading of SUS shows no such
>requirement.  
>
>fsync is required to force "all currently queued I/O operations
>associated with the file indicated by file descriptor fildes to the
>synchronised I/O completion state."  But as you should know, directory
>entries and files are NOT the same thing in Unix/SUS.  

But does fsync() have any meaning if it doesn't ensure the file is
visible within the filesystem? 

This all comes back to the fact that old UFS's made directory
operations syncronous, at a substantial cost in performance. Writing
the directory data wasn't necessary for them, because it was already
commited when the creat() call returned.

I can easily understand people's asthetic objection to having fsync
touch the directory object as well as the file, however what meaning
does fsync() have it it doesn't - under linux, it tells usermode "yes,
your object is committed, but it might be in lost+found next time you
want it", and with the syncronous UFS implementations, it tells
usermode "yes, your object is committed, you can find it where you left
it (unless the directory was corrupted)".

 ---
Andrew McNamara (System Architect)

connect.com.au Pty Ltd
Lvl 3, 213 Miller St, North Sydney, NSW 2060, Australia
Phone: +61 2 9409 2117, Fax: +61 2 9409 2111

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 2.4.2 ext2fs corruption status
       [not found] ` <no.id>
                     ` (35 preceding siblings ...)
  2001-07-29  7:05   ` binary modules (was Re: ReiserFS / 2.4.6 / Data Corruption) Richard Gooch
@ 2001-08-02  0:20   ` Alan Cox
  2001-08-01 19:40     ` Mohamed DOLLIAZAL
  2001-08-02  0:35   ` Memory Write Ordering Question Alan Cox
                     ` (166 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-08-02  0:20 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Mohamed DOLLIAZAL, linux-kernel

> It may be that Red Hat has already released a new kernel RPM since that
> time, or maybe you need to compile a new kernel.

The official VIA workaround fix is now in 2.4.6ac5 and 2.4.7ac*. The fixes
in the older kernels were mostly going to do the job but I dont know if they
were perfect for all cases

The -ac kernel tree also contains important fixes that avoid DMA timeouts
potentially causing disk corruption by forgetting to write sectors

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Memory Write Ordering Question
       [not found] ` <no.id>
                     ` (36 preceding siblings ...)
  2001-08-02  0:20   ` 2.4.2 ext2fs corruption status Alan Cox
@ 2001-08-02  0:35   ` Alan Cox
  2001-08-02 12:24   ` SMP possible with AMD CPUs? Alan Cox
                     ` (165 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-02  0:35 UTC (permalink / raw)
  To: James W. Lake; +Cc: "Linux Kernel Mailing List (E-mail)"

> I'm wondering if anyone has any idea what exactly is causing this.  The
> readl is a so-so work around.  I'd like to figure out how to do it
> correctly.  Does anyone who knows more about Intel CPU's and write
> ordering and PCI have any ideas?

Its entirely a PCI issue. PCI writes are posted and may be deferred. However
a write cannot pass another write to the device, nor a read, so your read
is the real solution.

The full horror is in the PCI specs which you can get on CD nowdays fairly
sanely. Basically PCI is a message passing system disguised as a bus, treat
it as the former and you wont get too badly hurt

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-01 16:02           ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2001-08-01 17:40             ` ext3-2.4-0.9.4 Kurt Roeckx
  2001-08-02  0:17             ` ext3-2.4-0.9.4 Andrew McNamara
@ 2001-08-02  9:03             ` Matthias Andree
  2001-08-02  9:51               ` ext3-2.4-0.9.4 Christoph Hellwig
  2001-08-02 17:26               ` ext3-2.4-0.9.4 Daniel Phillips
  2 siblings, 2 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-02  9:03 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: linux-kernel

On Wed, 01 Aug 2001, Stephen Tweedie wrote:

> > Chase up to the root manually, because Linux' ext2 violates SUS v2
> > fsync() (which requires meta data synched BTW)
> 
> Please quote chapter and verse --- my reading of SUS shows no such
> requirement.  
> 
> fsync is required to force "all currently queued I/O operations
> associated with the file indicated by file descriptor fildes to the
> synchronised I/O completion state."  But as you should know, directory
> entries and files are NOT the same thing in Unix/SUS.  

Read on: "All I/O operations are completed as defined for synchronised
I/O _file_ integrity completion.". To show what that means, see the
glossary.

http://www.opengroup.org/onlinepubs/007908799/xbd/glossary.html#tag_004_000_291

  "synchronised I/O data integrity completion

  [...]

  * For write, when the operation has been completed or diagnosed if
  unsuccessful.  The write is complete only when the data specified in
  the write request is successfully transferred and all file system
  information required to retrieve the data is successfully transferred.

  File attributes that are not necessary for data retrieval (access
  time, modification time, status change time) need not be successfully
  transferred prior to returning to the calling process.

  synchronised I/O file integrity completion

  Identical to a synchronised I/O data integrity completion with the
  addition that all file attributes relative to the I/O operation
  (including access time, modification time, status change time) will be
  successfully transferred prior to returning to the calling process."

As I understand it, the directory entry's st_ino is a file attribute
necessary for data retrieval and also contains the m/a/ctime, so it must
be flushed to disk on fsync() as well.

> There can be many ways of reaching that file in the directory
> hierarchy, or there can be none, but fsync() doesn't talk at all about
> the status of those dirents after the sync.

Well, if there's not a single dirent, you cannot retrieve the data, so
I'd assume at least one dirent needs to be flushed as well. If there's a
simple way to get unflushed dentries to disk (hard links included),
flush them. Not sure about symlinks, but since they don't share the
inode number, that might be rather difficult for the kernel (I didn't
check):

touch 1 ; ln 1 2 ; ln -s 1 3 ; ls -li

 303464 -rw-r--r--   2 emma     users           0 Aug  2 10:56 1
 303464 -rw-r--r--   2 emma     users           0 Aug  2 10:56 2
 303466 lrwxrwxrwx   1 emma     users           1 Aug  2 10:56 3 -> 1

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-02  9:03             ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-08-02  9:51               ` Christoph Hellwig
  2001-08-02  9:56                 ` ext3-2.4-0.9.4 Rik van Riel
  2001-08-02 17:26               ` ext3-2.4-0.9.4 Daniel Phillips
  1 sibling, 1 reply; 662+ messages in thread
From: Christoph Hellwig @ 2001-08-02  9:51 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel, sct

In article <20010802110341.B17927@emma1.emma.line.org> you wrote:
>
> http://www.opengroup.org/onlinepubs/007908799/xbd/glossary.html#tag_004_000_291
>
>   "synchronised I/O data integrity completion
>
>   [...]
>
>   * For write, when the operation has been completed or diagnosed if
>   unsuccessful.  The write is complete only when the data specified in
>   the write request is successfully transferred and all file system
>   information required to retrieve the data is successfully transferred.
>
>   File attributes that are not necessary for data retrieval (access
>   time, modification time, status change time) need not be successfully
>   transferred prior to returning to the calling process.

NOTE: _file_ attributes

>
>   synchronised I/O file integrity completion
>
>   Identical to a synchronised I/O data integrity completion with the
>   addition that all file attributes relative to the I/O operation
>   (including access time, modification time, status change time) will be
>   successfully transferred prior to returning to the calling process."
>
> As I understand it, the directory entry's st_ino is a file attribute
> necessary for data retrieval and also contains the m/a/ctime, so it must
> be flushed to disk on fsync() as well.
>

As long as you have an open fd, no directory entry is needed for
data retrieval.  In fact some fds never have a directory entry
(e.g. sockets - but these don't support fsync anyway) or do not have a
directory entry in their user-visble interface (e.g. posix shm).

And m/a/ctime is in the inode of the file, not in the directory enrty.
(at least for usual UNIX filesystems).

> Well, if there's not a single dirent, you cannot retrieve the data,

Of course you can, you can pass and fd for an unliked file everywhere
using AF_LOCAL descriptor passing.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-02  9:51               ` ext3-2.4-0.9.4 Christoph Hellwig
@ 2001-08-02  9:56                 ` Rik van Riel
  2001-08-02 12:47                   ` ext3-2.4-0.9.4 Eric W. Biederman
  0 siblings, 1 reply; 662+ messages in thread
From: Rik van Riel @ 2001-08-02  9:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Matthias Andree, linux-kernel, sct

On Thu, 2 Aug 2001, Christoph Hellwig wrote:

> > Well, if there's not a single dirent, you cannot retrieve the data,
>
> Of course you can, you can pass and fd for an unliked file
> everywhere using AF_LOCAL descriptor passing.

But this assumes the system doesn't crash, while
fsync() seems meant more as a protection against
the system going down unexpectedly ...

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: SMP possible with AMD CPUs?
       [not found] ` <no.id>
                     ` (37 preceding siblings ...)
  2001-08-02  0:35   ` Memory Write Ordering Question Alan Cox
@ 2001-08-02 12:24   ` Alan Cox
  2001-08-03  7:07     ` Eric W. Biederman
  2001-08-02 12:27   ` 2.4.2 ext2fs corruption status Alan Cox
                     ` (164 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-08-02 12:24 UTC (permalink / raw)
  To: Paul G. Allen; +Cc: linux-kernel

> 	a. The IDE is no longer a 7409 PCI ID but 7411 so it operates as a generic IDE (slow as hell).
[Should run full UDMA in -ac]

> 	b. The AGP is now ID 700C and is not detected unless the agpgart driver is loaded with agp_try_unsupported=1.

Send me the relevant pci idents and I'll add it

> 	d. The PCI bridge ID is different and (again) operates in a generic modeAgain send me the ids
> 	e. The Host bridge ID is now 700C and operates in a generic mode.

Send me the idents for these two

> 3. The BIOS (apparently) doesn't setup the MTRR properly on both CPUs making mtrr bitch about a mismatch.

The mtrr driver fixups should cure that - its a common bios bug.

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 2.4.2 ext2fs corruption status
       [not found] ` <no.id>
                     ` (38 preceding siblings ...)
  2001-08-02 12:24   ` SMP possible with AMD CPUs? Alan Cox
@ 2001-08-02 12:27   ` Alan Cox
  2001-08-02 12:33   ` 2.4 freezes on init Alan Cox
                     ` (163 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-02 12:27 UTC (permalink / raw)
  To: Mohamed DOLLIAZAL; +Cc: Alan Cox, Andreas Dilger, linux-kernel

>    I'am sorry I forgot to mention that the filesystem corruption happened on
> SCSI disks.  I guess there is no DMA on the SCSI disks.

Well there is but its off the scsi controller so should be ok

>    Do you think that the VIA fixes that are included in the 2.4.6ac5 kernel or
> above may solve my problem.

They might do, they might not. But they are worth checking

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 2.4 freezes on init
       [not found] ` <no.id>
                     ` (39 preceding siblings ...)
  2001-08-02 12:27   ` 2.4.2 ext2fs corruption status Alan Cox
@ 2001-08-02 12:33   ` Alan Cox
  2001-08-02 14:26   ` setsockopt(..,SO_RCVBUF,..) sets wrong value Alan Cox
                     ` (162 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-02 12:33 UTC (permalink / raw)
  To: Jakub Burgis; +Cc: linux-kernel

> However, I believe the kernel image that Mandrake 8's installer uses is
> a 2.4 kernel, yet that works fine. Is this a configuration setting I
> need to toggle, or am I stuck until I switch motherboard?

In the Red Hat case we have seen cases where the installer kernel worked and
not much else did. Install kernels are generally built with the very minimum
of reliance on bios features and for 386.

Typically that means they don't enable common problem items like APM, ACPI
and Athlon optimisation in conjunction with VIA chipsets.

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-02  9:56                 ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-08-02 12:47                   ` Eric W. Biederman
  0 siblings, 0 replies; 662+ messages in thread
From: Eric W. Biederman @ 2001-08-02 12:47 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Christoph Hellwig, Matthias Andree, linux-kernel, sct

Rik van Riel <riel@conectiva.com.br> writes:

> On Thu, 2 Aug 2001, Christoph Hellwig wrote:
> 
> > > Well, if there's not a single dirent, you cannot retrieve the data,
> >
> > Of course you can, you can pass and fd for an unliked file
> > everywhere using AF_LOCAL descriptor passing.
> 
> But this assumes the system doesn't crash, while
> fsync() seems meant more as a protection against
> the system going down unexpectedly ...

There is something to that.  However taking this argument to
it's logical extreme I have you have to not only sync every directory
in the current path of the file.  You have to sync your online file
index, because search engines is how we find things right?  

Since the filename in unix is not part of the files metadata it is a
perfectly sane semantic for fsck to drop the file into /lost+found, if
no one cared enough about the index/directory to update it.

In the general case if you have the guarantee that a filesystem does
safe directory updates.  So unless someone does an unlink you won't
loose your old link.  For most cases it doesn't matter as your
directory entry for the file is much older than the file itself, and
has been already synched.  MTA's are the exception to this where there
good filename is written only after the file is written.

The only other argument that seems to come from the MTA case is that
syscalls are slow, and a pain and programmers don't want to make two
or three syscalls just to do this.  Heck if you are doing a sync you
are waiting for a disk which is slow.  So speed doesn't really count.

There is probably an argument in there somewhere about batching up
I/O, so you have to wait a minimum amount of time for your sync.  But
until someone benchmarks, and tries a few different approatches I
won't believe that you need a kernel change even for that.

My question is what does fsync do on directories in other unix's.  It
would be really strange if it didn't behave similiarly to linux.
If forget wether it was AIX or HP-UX where doing a grep foo * would
also grep through "." .  So at least open works on other peoples directores.

Eric

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30  6:37 ` ext3-2.4-0.9.4 Philipp Matthias Hahn
@ 2001-08-02 13:58   ` Stephen C. Tweedie
  0 siblings, 0 replies; 662+ messages in thread
From: Stephen C. Tweedie @ 2001-08-02 13:58 UTC (permalink / raw)
  To: ext3-users; +Cc: Andrew Morton, lkml

Hi,

On Mon, Jul 30, 2001 at 08:37:07AM +0200, Philipp Matthias Hahn wrote:
> On Thu, 26 Jul 2001, Andrew Morton wrote:
> 
> > An update to the ext3 filesystem for 2.4 kernels is available at
> >
> > 	http://www.uow.edu.au/~andrewm/linux/ext3/
> I'm using ext3-0.9.4 with linux-2.4.7 / 2.4.8-pre1 and get some hangs on
> my dual P2-350:
> >From time to time I will have multiple CRON-Daemons in D-state and login
> hangs when logging in. It even happens during boot before my MTA is
> started.

Interesting.  Do you have the ability to hook up a serial console?  If
so, "alt-sysrq-T" to capture a backtrace of all the blocked processes
would be a great help.  Thanks.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: setsockopt(..,SO_RCVBUF,..) sets wrong value
       [not found] ` <no.id>
                     ` (40 preceding siblings ...)
  2001-08-02 12:33   ` 2.4 freezes on init Alan Cox
@ 2001-08-02 14:26   ` Alan Cox
  2001-08-02 14:35   ` kernel gdb for intel Alan Cox
                     ` (161 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-02 14:26 UTC (permalink / raw)
  To: Manfred Bartz; +Cc: linux-kernel

> When I do a setsockopt(..,SO_RCVBUF,..) and then read the value back
> with getsockopt(), the reported value is exactly twice of what I set.
> Running the same code on Solaris and on DEC UNIX reports back the
> exact size I set.
> Looking at the code it seems that the  *2  should not be there:

You are making assumptions not guaranteed in POSIX or SuS. In the Linux case
we deliberately allow more than requested as our memory accounting behaviour
for buffers is very different to BSD


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: kernel gdb for intel
       [not found] ` <no.id>
                     ` (41 preceding siblings ...)
  2001-08-02 14:26   ` setsockopt(..,SO_RCVBUF,..) sets wrong value Alan Cox
@ 2001-08-02 14:35   ` Alan Cox
  2001-08-03 10:07     ` Amit S. Kale
  2001-08-02 14:47   ` 3ware Escalade problems? Adaptec? Alan Cox
                     ` (160 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-08-02 14:35 UTC (permalink / raw)
  To: Brent Baccala; +Cc: linux-kernel

> - doesn't support SMP, since I don't have an Intel SMP box.  I'd guess
> what you'd want it to do is an smp_call_function that would halt all the
> processors and put them into some tight little loop while gdb fiddles
> things.  ideas?

With the old old stuff (pre 2.0) gdb stubs I ended up with two copies, one
per cpu on two serial ports. I found that most useful since I could force
events to happen.

Looks nice to me but about the only way you are likely to get Linus to take
in kernel debugging patches is to turn them into hex and disguise them as USB 
firmware ;)

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 3ware Escalade problems? Adaptec?
       [not found] ` <no.id>
                     ` (42 preceding siblings ...)
  2001-08-02 14:35   ` kernel gdb for intel Alan Cox
@ 2001-08-02 14:47   ` Alan Cox
  2001-08-02 15:03   ` [PATCH] make psaux reconnect adjustable Alan Cox
                     ` (159 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-02 14:47 UTC (permalink / raw)
  To: rothwell; +Cc: linux-kernel

> I've been pricing out a 3ware-based raid system for my own personal use. Are
> the problems wuth the Escalade cards bad enough to consider not using them
> with 2.4.7?

Im really attached to my 3ware cards, they are the best ide raid cards I've
used. The newer boxes I built just use software raid 0/1 which is easy now
that everyone throws 4 UDMA100 channels on their motherboards.

I've also done the i2o driver fixups for the Promise SuperTrak100 with a
card provided by Promise and that works in -ac but not yet Linus tree.
I'm more impressed with the 3ware than the promise card right now, although
it will depend on workload. The promise card has onboard caches and raid5 
hardware which the earlier 3ware didn't.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] make psaux reconnect adjustable
       [not found] ` <no.id>
                     ` (43 preceding siblings ...)
  2001-08-02 14:47   ` 3ware Escalade problems? Adaptec? Alan Cox
@ 2001-08-02 15:03   ` Alan Cox
  2001-08-02 15:08   ` [RFT] Support for ~2144 SCSI discs Alan Cox
                     ` (158 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-02 15:03 UTC (permalink / raw)
  To: Andries.Brouwer
  Cc: alan, garloff, torvalds, brent, linux-kernel, mantel, rubini

> who asked for this code): if what I say is correct you should
> always see 00 following the AA. So, there may exist a more cautious
> patch that will bite fewer people and does not react to AA but to
> the sequence AA 00.

2.2 has had the sysctl for ages, and it defaults to off

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [RFT] Support for ~2144 SCSI discs
       [not found] ` <no.id>
                     ` (44 preceding siblings ...)
  2001-08-02 15:03   ` [PATCH] make psaux reconnect adjustable Alan Cox
@ 2001-08-02 15:08   ` Alan Cox
  2001-08-02 15:13   ` Richard Gooch
                     ` (157 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-02 15:08 UTC (permalink / raw)
  To: Douglas Gilbert; +Cc: Richard Gooch, linux-kernel, linux-scsi

> I've seen GFP_KERNEL take 10 minutes in lk 2.4.6 . The 
> mm gets tweaked pretty often so it is difficult to know 
> exactly how it will react when memory is tight. A time 
> bound would be useful on GFP_KERNEL.

kmalloc with GFP_KERNEL has a 128K limit which avoids the bizarre behaviour
you get when you abuse get_free_pages.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [RFT] Support for ~2144 SCSI discs
       [not found] ` <no.id>
                     ` (45 preceding siblings ...)
  2001-08-02 15:08   ` [RFT] Support for ~2144 SCSI discs Alan Cox
@ 2001-08-02 15:13   ` Richard Gooch
  2001-08-02 15:31   ` Alan Cox
                     ` (156 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Richard Gooch @ 2001-08-02 15:13 UTC (permalink / raw)
  To: Alan Cox; +Cc: Douglas Gilbert, linux-kernel, linux-scsi

Alan Cox writes:
> > I've seen GFP_KERNEL take 10 minutes in lk 2.4.6 . The 
> > mm gets tweaked pretty often so it is difficult to know 
> > exactly how it will react when memory is tight. A time 
> > bound would be useful on GFP_KERNEL.
> 
> kmalloc with GFP_KERNEL has a 128K limit which avoids the bizarre
> behaviour you get when you abuse get_free_pages.

Last I heard, get_free_pages() also has a 128 kiB limit. So what's the
difference?

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [RFT] Support for ~2144 SCSI discs
       [not found] ` <no.id>
                     ` (46 preceding siblings ...)
  2001-08-02 15:13   ` Richard Gooch
@ 2001-08-02 15:31   ` Alan Cox
  2001-08-02 23:17     ` Douglas Gilbert
  2001-08-02 15:36   ` [RFT] #2 " Alan Cox
                     ` (155 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-08-02 15:31 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Alan Cox, Douglas Gilbert, linux-kernel, linux-scsi

> > kmalloc with GFP_KERNEL has a 128K limit which avoids the bizarre
> > behaviour you get when you abuse get_free_pages.
> 
> Last I heard, get_free_pages() also has a 128 kiB limit. So what's the
> difference?

get_free_pages doesnt have such a limit. Thats why sg had the problem it did

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [RFT] #2 Support for ~2144 SCSI discs
       [not found] ` <no.id>
                     ` (47 preceding siblings ...)
  2001-08-02 15:31   ` Alan Cox
@ 2001-08-02 15:36   ` Alan Cox
  2001-08-02 15:47   ` Richard Gooch
                     ` (154 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-02 15:36 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Andreas Dilger, linux-kernel, linux-scsi

> So, yes, you can already patch other subsystems to dynamically assign
> major numbers in 2.4.7. I'd like to see people do that. My patch for
> sd.c can also serve as a demonstration on how to use the new API.

Its a bit of an ugly hack but I guess its the best anyone can put together
for a 2.4 kernel tree. Going to a 32bit dev_t is going to make life so much
simpler do all of this without ugly hacks

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [RFT] #2 Support for ~2144 SCSI discs
       [not found] ` <no.id>
                     ` (48 preceding siblings ...)
  2001-08-02 15:36   ` [RFT] #2 " Alan Cox
@ 2001-08-02 15:47   ` Richard Gooch
  2001-08-02 16:34   ` Alan Cox
                     ` (153 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Richard Gooch @ 2001-08-02 15:47 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andreas Dilger, linux-kernel, linux-scsi

Alan Cox writes:
> > So, yes, you can already patch other subsystems to dynamically assign
> > major numbers in 2.4.7. I'd like to see people do that. My patch for
> > sd.c can also serve as a demonstration on how to use the new API.
> 
> Its a bit of an ugly hack but I guess its the best anyone can put
> together for a 2.4 kernel tree. Going to a 32bit dev_t is going to
> make life so much simpler do all of this without ugly hacks

My patch is definately 2.4 material. I see it as a temporary solution
until the whole block I/O subsystem is ripped out and replaced in 2.5.
Since 2.4 will be the latest production kernel for about two years, we
need to find ways of working around current limitations.

That said, in 2.5 I want to see us move away from using device numbers
as the fundamental device handle and move to device instance
structures. That's a lot cleaner, and BTW is devfs-neutral
(i.e. doesn't need devfs to work). Exposing a 32 bit dev_t to
user-space is acceptable, but internally it should be shunned.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [RFT] #2 Support for ~2144 SCSI discs
       [not found] ` <no.id>
                     ` (49 preceding siblings ...)
  2001-08-02 15:47   ` Richard Gooch
@ 2001-08-02 16:34   ` Alan Cox
  2001-08-02 17:00   ` Richard Gooch
                     ` (152 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-02 16:34 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Alan Cox, Andreas Dilger, linux-kernel, linux-scsi

> That said, in 2.5 I want to see us move away from using device numbers
> as the fundamental device handle and move to device instance
> structures. That's a lot cleaner, and BTW is devfs-neutral
> (i.e. doesn't need devfs to work). Exposing a 32 bit dev_t to
> user-space is acceptable, but internally it should be shunned.

You need it internally otherwise you are screwed the moment you have 65536
volumes mounted - because you run out of unique device identifiers for stat.

Fortunately 32bit dev_t (not kdev_t .. which I think is what you are talking
about and will I assume go pointer to struct) is only one syscall change

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [RFT] #2 Support for ~2144 SCSI discs
       [not found] ` <no.id>
                     ` (50 preceding siblings ...)
  2001-08-02 16:34   ` Alan Cox
@ 2001-08-02 17:00   ` Richard Gooch
  2001-08-02 17:34   ` [PATCH] make psaux reconnect adjustable Alan Cox
                     ` (151 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Richard Gooch @ 2001-08-02 17:00 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andreas Dilger, linux-kernel, linux-scsi

Alan Cox writes:
> > That said, in 2.5 I want to see us move away from using device numbers
> > as the fundamental device handle and move to device instance
> > structures. That's a lot cleaner, and BTW is devfs-neutral
> > (i.e. doesn't need devfs to work). Exposing a 32 bit dev_t to
> > user-space is acceptable, but internally it should be shunned.
> 
> You need it internally otherwise you are screwed the moment you have
> 65536 volumes mounted - because you run out of unique device
> identifiers for stat.

I consider that "external" use. The kernel doesn't need it, it just
provides unique numbers for user-space. The kernel just happens to
carry along the information so that user-space can get it as needed.

Aside: the idea of mounting >65536 volumes frightens me. Accidentally
do a "df" and go away for a coffee while your machine hammers away.

> Fortunately 32bit dev_t (not kdev_t .. which I think is what you are
> talking about and will I assume go pointer to struct) is only one
> syscall change

Looks like we agree. And as long as you have <65536 volumes, then
libc5 will continue to work just fine. Which is also good.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-02  9:03             ` ext3-2.4-0.9.4 Matthias Andree
  2001-08-02  9:51               ` ext3-2.4-0.9.4 Christoph Hellwig
@ 2001-08-02 17:26               ` Daniel Phillips
  2001-08-02 17:37                 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
                                   ` (2 more replies)
  1 sibling, 3 replies; 662+ messages in thread
From: Daniel Phillips @ 2001-08-02 17:26 UTC (permalink / raw)
  To: Matthias Andree, Stephen C. Tweedie; +Cc: linux-kernel

On Thursday 02 August 2001 11:03, Matthias Andree wrote:
> On Wed, 01 Aug 2001, Stephen Tweedie wrote:
> > Matthias Andree wrote:
> > > Chase up to the root manually, because Linux' ext2 violates SUS
> > > v2 fsync() (which requires meta data synched BTW)
> >
> > Please quote chapter and verse --- my reading of SUS shows no such
> > requirement.
> >
> > fsync is required to force "all currently queued I/O operations
> > associated with the file indicated by file descriptor fildes to the
> > synchronised I/O completion state."  But as you should know,
> > directory entries and files are NOT the same thing in Unix/SUS.
>
> Read on: "All I/O operations are completed as defined for
> synchronised I/O _file_ integrity completion.". To show what that
> means, see the glossary.
>
> http://www.opengroup.org/onlinepubs/007908799/xbd/glossary.html#tag_0
>04_000_291
>
>   "synchronised I/O data integrity completion
>
>   [...]
>
>   * For write, when the operation has been completed or diagnosed if
>   unsuccessful.  The write is complete only when the data specified
> in the write request is successfully transferred and all file system
> information required to retrieve the data is successfully
> transferred.
>
>   File attributes that are not necessary for data retrieval (access
>   time, modification time, status change time) need not be
> successfully transferred prior to returning to the calling process.
>
>   synchronised I/O file integrity completion
>
>   Identical to a synchronised I/O data integrity completion with the
>   addition that all file attributes relative to the I/O operation
>   (including access time, modification time, status change time) will
> be successfully transferred prior to returning to the calling
> process."
>
> As I understand it, the directory entry's st_ino is a file attribute
> necessary for data retrieval and also contains the m/a/ctime, so it
> must be flushed to disk on fsync() as well.

I believed you've summarized the SUS requirements very well.  Apart 
from legalistic arguments, SUS quite clearly states that fsync should 
not return until you are sure of having recorded not only the file's 
data, but the access path to it.  I interpret this as being able to 
"access the file by its name", and being able to guess by looking in 
lost+found doesn't count.  I don't see the point in niggling about that.

So, it seems clear that an fsync which leaves any window of 
vulnerability where an interruption can leave a file unlinked is not 
SUS-compliant.

> > There can be many ways of reaching that file in the directory
> > hierarchy, or there can be none, but fsync() doesn't talk at all
> > about the status of those dirents after the sync.

This is a legalistic argument.  I don't think we should be looking for 
loopholes in SUS here.  To achieve SUS compliance there are two 
reasonable courses: "fix SUS" or "fix sys_fsync".  Since what SUS 
clearly wants here seems emminently reasonable, I'd suggest putting the 
energy that's currently going into this thread into fixing fsync 
instead.

> Well, if there's not a single dirent, you cannot retrieve the data,
> so I'd assume at least one dirent needs to be flushed as well. If
> there's a simple way to get unflushed dentries to disk (hard links
> included)...

*All* hard links?  No, there is no general way to do that.  However, 
any hard links[1] in the path used to open the file - yes.  There is 
always a chain of parent dentries held locked in the dcache for any 
open file.

I don't know why it is hard or inefficient to implement this at the VFS 
level, though I'm sure there is a reason or this thread wouldn't 
exist.  Stephen, perhaps you could explain for the record why sys_fsync 
can't just walk the chain of dentry parent links doing fdatasync?  Does 
this create VFS or Ext3 locking problems?  Or maybe it repeats work 
that Ext3 is already supposed to have done?

> ...flush them. Not sure about symlinks, but since they don't
> share the inode number, that might be rather difficult for the kernel
> (I didn't check)

The prescription for symlinks is, if you want them safely on disk you 
have to explicitly fsync the containing directory.

[1] In Ext2, all filename dirents are "hard links", i.e., there is no 
way to tell which of the two names is the original after creating a new 
hard link.

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] make psaux reconnect adjustable
       [not found] ` <no.id>
                     ` (51 preceding siblings ...)
  2001-08-02 17:00   ` Richard Gooch
@ 2001-08-02 17:34   ` Alan Cox
  2001-08-02 19:41   ` [PATCH] vxfs fix Alan Cox
                     ` (150 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-02 17:34 UTC (permalink / raw)
  To: Andries.Brouwer; +Cc: garloff, alan, linux-kernel, mantel, rubini, torvalds

> Of course I hope that we'll handle this correctly at some point,
> without any options or parameters. In my eyes a sysctl is heavier
> infrastructure than a boot parameter, so I prefer the latter
> when a temporary fix is needed.

The input device infrastructure pending for 2.5 already handles all of
these issues

^ permalink raw reply	[flat|nested] 662+ messages in thread

* intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:26               ` ext3-2.4-0.9.4 Daniel Phillips
@ 2001-08-02 17:37                 ` Matthias Andree
  2001-08-02 18:35                   ` Alexander Viro
                                     ` (4 more replies)
  2001-08-02 17:54                 ` ext3-2.4-0.9.4 Alexander Viro
  2001-08-03  9:06                 ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2 siblings, 5 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-02 17:37 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Daniel Phillips wrote:

[file name must be flushed on fsync()]
> I don't know why it is hard or inefficient to implement this at the VFS 
> level, though I'm sure there is a reason or this thread wouldn't 
> exist.  Stephen, perhaps you could explain for the record why sys_fsync 
> can't just walk the chain of dentry parent links doing fdatasync?  Does 
> this create VFS or Ext3 locking problems?  Or maybe it repeats work 
> that Ext3 is already supposed to have done?

Well, the course was that I asked whether ext3 would do synchronous
directory updates, and some people jumped in and said that one should
fsync() the parent directory, however, since we figure from SUS, that's
invalid.

After some forth and back, we finally figured that at least ext2 is
implementing fsync() improperly.

So this part is covered.

The other thing is, that Linux is the only known system that does
asynchronous rename/link/unlink/symlink -- people have claimed it might
not be the only one, but failed to name systems.

So we need to assume that Linux is the only system that does
asynchronous rename/link/unlink/symlink, however a directory fsync() is
believed to be rather expensive.

Still, some people object to a dirsync mount option. But this has been
the actual reason for the thread - MTA authors are refusing to pamper
Linux and use chattr +S instead which gives unnecessary (premature) sync
operations on write() - but MTAs know how to fsync().

> The prescription for symlinks is, if you want them safely on disk you 
> have to explicitly fsync the containing directory.

Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
waste inodes on most systems).

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-02 17:26               ` ext3-2.4-0.9.4 Daniel Phillips
  2001-08-02 17:37                 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
@ 2001-08-02 17:54                 ` Alexander Viro
  2001-08-02 20:01                   ` ext3-2.4-0.9.4 Daniel Phillips
  2001-08-03  9:06                 ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2 siblings, 1 reply; 662+ messages in thread
From: Alexander Viro @ 2001-08-02 17:54 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Matthias Andree, Stephen C. Tweedie, linux-kernel



On Thu, 2 Aug 2001, Daniel Phillips wrote:

> I don't know why it is hard or inefficient to implement this at the VFS 
> level, though I'm sure there is a reason or this thread wouldn't 
> exist.  Stephen, perhaps you could explain for the record why sys_fsync 
> can't just walk the chain of dentry parent links doing fdatasync?  Does 
> this create VFS or Ext3 locking problems?  Or maybe it repeats work 
> that Ext3 is already supposed to have done?
 
Parent directory can be renamed. Which grandparent should we sync?
New one? Old one? Both? BTW, how about file itself getting renamed during
fsync()?

See the problem? And no, blocking all renames while fsync() happens is
not an answer - it's a DoS.
 
> [1] In Ext2, all filename dirents are "hard links", i.e., there is no 
> way to tell which of the two names is the original after creating a new 
> hard link.

s/Ext2/UNIX/.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37                 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
@ 2001-08-02 18:35                   ` Alexander Viro
  2001-08-02 18:47                     ` Matthias Andree
  2001-08-02 19:47                   ` Bill Rugolsky Jr.
                                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 662+ messages in thread
From: Alexander Viro @ 2001-08-02 18:35 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel



On Thu, 2 Aug 2001, Matthias Andree wrote:

> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.

How the fuck it's expensive? It does _exactly_ the same as file fsync() -
literally the same code. It doesn't write blocks that don't belong to
directory. It doesn't write blocks that are clean. IOW, it does the
minimal work possible.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 18:35                   ` Alexander Viro
@ 2001-08-02 18:47                     ` Matthias Andree
  2001-08-02 22:18                       ` Andreas Dilger
       [not found]                       ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
  0 siblings, 2 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-02 18:47 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Matthias Andree, Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Alexander Viro wrote:

> How the fuck it's expensive? It does _exactly_ the same as file fsync() -
> literally the same code. It doesn't write blocks that don't belong to
> directory. It doesn't write blocks that are clean. IOW, it does the
> minimal work possible.

fsync()ing the dir is not the minimal work possible, if e. g. temporary
files are open that don't need their names synched. Fsync()ing the
directory syncs also these temporary file NAMES that other processes may
have open (but that they unlink rather than fsync()).

Assume:

open -> asynchronous, but filename synched on fsync()
rename/link/unlink(/symlink) -> synchronous

This way, you never need to fsync() the directory, so you never sync()
entries of temporary files. You never lose important files (because the
application uses fsync() and the OS synchs rename/link etc.).

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] vxfs fix
       [not found] ` <no.id>
                     ` (52 preceding siblings ...)
  2001-08-02 17:34   ` [PATCH] make psaux reconnect adjustable Alan Cox
@ 2001-08-02 19:41   ` Alan Cox
  2001-08-02 20:57     ` Andreas Dilger
  2001-08-03 11:54   ` kernel gdb for intel Alan Cox
                     ` (149 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-08-02 19:41 UTC (permalink / raw)
  To: Andries.Brouwer; +Cc: torvalds, alan, hch, linux-kernel, viro

> 	From: Alan
> 
> 	Alternatively pass a flag to the mount command saying
> 	"this is a guesswork special" then V7 fs can just return 'not me'
> 
> Parse failure.

Let me try again:

When the read_super method is invoked
AND we are doing a mount without a defined type
	THEN
		Pass the fs a flag from the VFS saying so
	ENDIF

That way the file system can actually say "I cannot reliably check"

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37                 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
  2001-08-02 18:35                   ` Alexander Viro
@ 2001-08-02 19:47                   ` Bill Rugolsky Jr.
  2001-08-03 18:22                     ` Matthias Andree
       [not found]                   ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
                                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 662+ messages in thread
From: Bill Rugolsky Jr. @ 2001-08-02 19:47 UTC (permalink / raw)
  To: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.
> 
> So we need to assume that Linux is the only system that does
> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.
> 
> Still, some people object to a dirsync mount option. But this has been
> the actual reason for the thread - MTA authors are refusing to pamper
> Linux and use chattr +S instead which gives unnecessary (premature) sync
> operations on write() - but MTAs know how to fsync().

Let's inject a little reality into this discussion.  Filesystems are used
for something other than running MTA's written by stubborn "purists".

Solaris: Dell 600 MHz PIII 128MB RAM, largely quiescent:
         Solaris 8 mu4, UFS with logging

Linux:   VA Linux 800 MHZ PIII, 128MB RAM, largely quiescent
         RedHat Linux 7.1 w/ kernel-2.4.6-2.4 (2.4.6-ac5 + ext3-0.9.3).

660MB XFree86-4.1 build tree, cache primed with du -s in each case.

Here's something that we developers probably all do frequently: copy a
tree using hard links, so that we can patch it.

[solaris] find . | wc     
   33027   33027 1251671
[solaris] time find . -depth | cpio -pdul ../foo
0 blocks
 363.46s real    0.84s user   10.13s system 

Plain ext2:

[linux]# time find . -depth | cpio -pdul ../foo
0 blocks

real    0m3.823s user    0m0.240s sys     0m3.570s

Mounted ext3, ordered data mode.

[linux] time find . -depth | cpio -pdul ../foo
0 blocks

real    0m5.106s user    0m0.200s sys     0m3.700s

Mounted ext3, -o sync:

[root@ead51 bar]# time find . -depth | cpio -pdul ../foo
0 blocks

real    1m28.483s user    0m0.470s sys     0m4.410s 

=====================================================

Solaris8 UFS:   363.5 seconds
ext2:             3.8 seconds
ext3:             5.1 seconds
ext3 -o sync:    88.5 seconds

Got it?

Obviously, the last is the result of the poor interaction
of ext3+sync in 0.9.3, but Andrew Morton has already fixed that.
I will try again with 0.9.5 when I have a chance to upgrade that
machine.

I have no idea where BSD falls, but the basic point stands:  unused
features should not penalize other applications.  Andrew Morton has
figured out how to do this efficiently with ext3, and many kudos to him
for doing the work.  Absent that, why should I have to go get a cup of
coffee every time I want to patch a tree, just so some MTA can make
naive assumptions?

Regards,

   Bill Rugolsky

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-02 17:54                 ` ext3-2.4-0.9.4 Alexander Viro
@ 2001-08-02 20:01                   ` Daniel Phillips
  0 siblings, 0 replies; 662+ messages in thread
From: Daniel Phillips @ 2001-08-02 20:01 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Matthias Andree, Stephen C. Tweedie, linux-kernel

On Thursday 02 August 2001 19:54, Alexander Viro wrote:
> On Thu, 2 Aug 2001, Daniel Phillips wrote:
> > I don't know why it is hard or inefficient to implement this at the
> > VFS level, though I'm sure there is a reason or this thread
> > wouldn't exist.  Stephen, perhaps you could explain for the record
> > why sys_fsync can't just walk the chain of dentry parent links
> > doing fdatasync?  Does this create VFS or Ext3 locking problems? 
> > Or maybe it repeats work that Ext3 is already supposed to have
> > done?
>
> Parent directory can be renamed. Which grandparent should we sync?
> New one? Old one? Both?

Either one, or both, it doesn't matter since the application has not 
forced any serialization on this and can't assume any.

> BTW, how about file itself getting renamed during fsync()?

It doesn't matter.  If the application wants to race that way, let it.  
We're talking about ensuring access to the fsynced fd's inode.

> See the problem? And no, blocking all renames while fsync() happens
> is not an answer - it's a DoS.

We would have done our duty by fsyncing the inodes one at a time 
working up the dentry chain towards the root, and not trying to lock 
the whole chain.  If something happens while we're doing that it's an 
application race.

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] vxfs fix
  2001-08-02 19:41   ` [PATCH] vxfs fix Alan Cox
@ 2001-08-02 20:57     ` Andreas Dilger
  0 siblings, 0 replies; 662+ messages in thread
From: Andreas Dilger @ 2001-08-02 20:57 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andries.Brouwer, torvalds, hch, linux-kernel, viro

Alan writes:
> When the read_super method is invoked
> AND we are doing a mount without a defined type
> 	THEN
> 		Pass the fs a flag from the VFS saying so
> 	ENDIF
> 
> That way the file system can actually say "I cannot reliably check"

Isn't this what the "silent" option to read_super is for?  It may be that
it can only be used at root fs mount time.  Other than that, I don't
_think_ the kernel does autoprobing of filesystem types, so it is a
mount(8) issue to just not randomly try the V7 filesystem type.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 18:47                     ` Matthias Andree
@ 2001-08-02 22:18                       ` Andreas Dilger
  2001-08-02 23:11                         ` Matthias Andree
       [not found]                         ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
       [not found]                       ` <5.1.0.14.2.20010803002501.00ada0e0@pop.cus.cam.ac.uk>
  1 sibling, 2 replies; 662+ messages in thread
From: Andreas Dilger @ 2001-08-02 22:18 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Alexander Viro, Daniel Phillips, Stephen C. Tweedie, linux-kernel

Matthais Andree writes:
> fsync()ing the dir is not the minimal work possible, if e. g. temporary
> files are open that don't need their names synched. Fsync()ing the
> directory syncs also these temporary file NAMES that other processes may
> have open (but that they unlink rather than fsync()).
> 
> Assume:
> 
> open -> asynchronous, but filename synched on fsync()
> rename/link/unlink(/symlink) -> synchronous
> 
> This way, you never need to fsync() the directory, so you never sync()
> entries of temporary files. You never lose important files (because the
> application uses fsync() and the OS synchs rename/link etc.).

Do you read what you are writing?  How can a "synchronous" operation for
rename/link/unlink/symlink NOT also write out "temporary" files in the
same directory?  How does calling fsync() on the directory IF YOU REQUIRE
SYNCHRONOUS DIRECTORY OPERATIONS differ from making the specific operations
synchronous from within the kernel???

The only difference I can see is that making these specific operations
ALWAYS be synchronous hurts the common case when they can be async (see
Solaris UFS vs. Linux benchmark elsewhere in this thread), while requiring
an fsync() on the directory == only synchronous operation when it is
actually needed, and no "extra" performance hit.

The only slight point of contention is if you have very large directories
which span several filesystem blocks, in which case it _would_ be possible
to write out some blocks synchronously, while leaving other blocks dirty.
In practise however, you will either only be modifying a small number of
blocks (at the end of the directory) because an MTA usually only creates
files and doesn't delete them, and the actual speed of syncing several
blocks at one time is not noticably different than syncing only one.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 22:18                       ` Andreas Dilger
@ 2001-08-02 23:11                         ` Matthias Andree
       [not found]                         ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
  1 sibling, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-02 23:11 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Matthias Andree, Alexander Viro, Daniel Phillips,
	Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Andreas Dilger wrote:

> > open -> asynchronous, but filename synched on fsync()
> > rename/link/unlink(/symlink) -> synchronous
> > 
> > This way, you never need to fsync() the directory, so you never sync()
> > entries of temporary files. You never lose important files (because the
> > application uses fsync() and the OS synchs rename/link etc.).
> 
> Do you read what you are writing?  How can a "synchronous" operation for
> rename/link/unlink/symlink NOT also write out "temporary" files in the
> same directory?  How does calling fsync() on the directory IF YOU REQUIRE
> SYNCHRONOUS DIRECTORY OPERATIONS differ from making the specific operations
> synchronous from within the kernel???

Can people please try to understand? Can people please start to THINK
before flaming?

I did not say that open() is to be synchronous. I did not write ANYTHING
of fsync()ing directories, I'm trying to get rid of this requirement.

Thus, if the kernel does rename/link synchronously, you'd never ever
fsync() a directory. To synch a filename to disk, you'd just fsync() the
filedescriptor (with a SUS compliant system, that is, i. e. ext3 or
reiserfs, but not ext2).

Now, if someone opens a temporary file, and nukes it later -- unlink()
--, and doesn't want it visible, he never calls fsync() for the file.

However, if some other process then fsync()s the directory, you start
synching the temporary file dirent -> unnecessary, is nuked later on
with an unlink().

That's why fsync() on the directory is on no account the minimum work.

> The only difference I can see is that making these specific operations
> ALWAYS be synchronous hurts the common case when they can be async (see
> Solaris UFS vs. Linux benchmark elsewhere in this thread), while requiring
> an fsync() on the directory == only synchronous operation when it is
> actually needed, and no "extra" performance hit.

In case you haven't noticed, this is about reliability without need to
fsync() the directory that doesn't all belong to your single, stupid
process but may have lots of asynchronous data of other processes -
temporary files for instance. You synch() that as well, which is
unnecessary and brings down other processes' performance.

In case you haven't noticed the other issue:

The whole thread is a FEATURE REQUEST for a dirsync mount option, for
MTAs and other software which requires reliable file systems, where the
name is negotiable. It aims to REDUCE OVERHEAD since chattr +S which is
the only workaround for synch-dirs - and it synchs synchronous files and
writes as well, and rendering things slower than necessary, since
write() can be buffered until you fsync() (and you want that to cut off
seek times).

Call the option bsd_slow_dirs if you like, I don't care. Given the
option, the administrator/user has the choice, currently, he hasn't. He
cannot possibly change all applications ported from other Unices.

Note: hindering this option doesn't get Linux anywhere. Pure file
system benchmarks are not worth a single bit of entropy unless Linux is
benchmarked chattr +S -- it's unreliable otherwise.

I cannot remember how often I explained this during the course of this
thread. Every other day, some ignorant comes out of its cavern and
discusses the whole thing over and over again.

And, once again, fsync()ing the directory is not an option for portable
applications. It's unnecessary on every other system (until someone
shows a production-ready system which by default has asynchronous
directory updates as well, but no-one has so far.)

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [RFT] Support for ~2144 SCSI discs
  2001-08-02 15:31   ` Alan Cox
@ 2001-08-02 23:17     ` Douglas Gilbert
  0 siblings, 0 replies; 662+ messages in thread
From: Douglas Gilbert @ 2001-08-02 23:17 UTC (permalink / raw)
  To: Alan Cox; +Cc: Richard Gooch, linux-kernel, linux-scsi

Alan Cox wrote:
> 
> > > kmalloc with GFP_KERNEL has a 128K limit which avoids the bizarre
> > > behaviour you get when you abuse get_free_pages.
> >
> > Last I heard, get_free_pages() also has a 128 kiB limit. So what's the
> > difference?
> 
> get_free_pages doesnt have such a limit. Thats why sg had the problem it did

Alan,
That is incorrect.

The failure I got with the sg driver with cdrdao
and cdda2wav was with 32 KB buffers, lots of them.
cdda2wav in RH 7.1 was trying to get 100 MB of them!

If you look at the sg driver you will find that it never
attempts a get_free_pages greater than SG_SCATTER_SZ (32 KB).
So that unkillable lockup on those apps demonstrates
rather well that GFP_KERNEL is dangerous.

Doug Gilbert

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]                     ` <20010803021642.B9845@emma1.emma.line.org>
@ 2001-08-03  7:03                       ` Eric W. Biederman
  2001-08-03  8:39                         ` Matthias Andree
  0 siblings, 1 reply; 662+ messages in thread
From: Eric W. Biederman @ 2001-08-03  7:03 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Paul Jakma, linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> On Fri, 03 Aug 2001, Paul Jakma wrote:
> 
> > if the prime directive of MTAs is data integrity paranoia, then
> > surely the best assumption for an MTA to make is that
> > rename/link/unlink/symlink /are/ asynchronous in the general case?
> 
> They do on Linux, use chattr +S, and are much slower than e. g. on
> FreeBSD. Well. Not that I'd written THAT for the first time...

Actually given that this thread keeps coming up, but no one does anything
about it.  I'm tempted to suggest we remove chatrr +S support from ext2.
Then there will be enough pain that someone will fix the MTA instead of
moaning that kernel is slow...

That should be an easy patch to make...

Eric

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: SMP possible with AMD CPUs?
  2001-08-02 12:24   ` SMP possible with AMD CPUs? Alan Cox
@ 2001-08-03  7:07     ` Eric W. Biederman
  0 siblings, 0 replies; 662+ messages in thread
From: Eric W. Biederman @ 2001-08-03  7:07 UTC (permalink / raw)
  To: Alan Cox; +Cc: Paul G. Allen, linux-kernel

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
 
> > 3. The BIOS (apparently) doesn't setup the MTRR properly on both CPUs making
> mtrr bitch about a mismatch.
> 
> 
> The mtrr driver fixups should cure that - its a common bios bug.

There is some truth in that.  But note AMD hasn't released all of the
documentation related to their MTRR's so we can't rely on linux fixing
all of those BIOS bugs.   In this case it happens to be different
caching on the BIOS chip, from different cpus.

An interesting question is what is 0x1e in the AMD fixed mtrr's.

Eric

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37                 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
                                     ` (2 preceding siblings ...)
       [not found]                   ` <Pine.LNX.4.33.0108030051070.1703-100000@fogarty.jakma.org>
@ 2001-08-03  8:30                   ` Stephen C. Tweedie
  2001-08-03 18:28                     ` Matthias Andree
  2001-08-03  8:50                   ` David Weinehall
  4 siblings, 1 reply; 662+ messages in thread
From: Stephen C. Tweedie @ 2001-08-03  8:30 UTC (permalink / raw)
  To: Daniel Phillips, Stephen C. Tweedie, linux-kernel

Hi,

On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:

> So this part is covered.
> 
> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.

Not true.  There are tons of others.

The issue was that synchronous directory updates are *optional* on
many systems (Linux included), but that Linux's support for that is
really inefficient since it ends up syncing file metadata updates too
(and it's much more efficient to use fsync for that.)

> Still, some people object to a dirsync mount option.

Who?  People who have discussed this in the past have certainly not
objected to my knowledge.  It would clearly help situations like this
(as would a dirsync chattr option.)

> > The prescription for symlinks is, if you want them safely on disk you 
> > have to explicitly fsync the containing directory.
> 
> Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
> waste inodes on most systems).

Irrelevant.   We're talking about what makes sensible semantics, not
what assumptions any specific application makes.  It makes no sense to
say that dirsync won't affect symlinks just because some existing
applications don't rely on that!

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  7:03                       ` Eric W. Biederman
@ 2001-08-03  8:39                         ` Matthias Andree
  2001-08-03  9:57                           ` Christoph Hellwig
  2001-08-04  7:55                           ` Eric W. Biederman
  0 siblings, 2 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-03  8:39 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Matthias Andree, Paul Jakma, linux-kernel

On Fri, 03 Aug 2001, Eric W. Biederman wrote:

> Actually given that this thread keeps coming up, but no one does anything
> about it.  I'm tempted to suggest we remove chatrr +S support from ext2.
> Then there will be enough pain that someone will fix the MTA instead of
> moaning that kernel is slow...

They'd just drop Linux from the list of supported OS's, Linux will
disappoint people who trusted it, nothing is gained. Deliberate breakage
will not happen, because it would not help anyone except people with
twisted minds.

NO-ONE, including you, has come up with SERIOUS objections against a
dirsync option, except "is it really so much slower than chattr +S? show
figures" -- ext3 is being tuned to be fast in spite of chattr +S.

Reconsider your position.

Stop trolling please.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 17:37                 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
                                     ` (3 preceding siblings ...)
  2001-08-03  8:30                   ` Stephen C. Tweedie
@ 2001-08-03  8:50                   ` David Weinehall
  2001-08-03 18:31                     ` Matthias Andree
  2001-08-03 19:59                     ` Albert D. Cahalan
  4 siblings, 2 replies; 662+ messages in thread
From: David Weinehall @ 2001-08-03  8:50 UTC (permalink / raw)
  To: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> On Thu, 02 Aug 2001, Daniel Phillips wrote:
> 
> [file name must be flushed on fsync()]
> > I don't know why it is hard or inefficient to implement this at the VFS 
> > level, though I'm sure there is a reason or this thread wouldn't 
> > exist.  Stephen, perhaps you could explain for the record why sys_fsync 
> > can't just walk the chain of dentry parent links doing fdatasync?  Does 
> > this create VFS or Ext3 locking problems?  Or maybe it repeats work 
> > that Ext3 is already supposed to have done?
> 
> Well, the course was that I asked whether ext3 would do synchronous
> directory updates, and some people jumped in and said that one should
> fsync() the parent directory, however, since we figure from SUS, that's
> invalid.
> 
> After some forth and back, we finally figured that at least ext2 is
> implementing fsync() improperly.
> 
> So this part is covered.

Yup, and this should be fixed imho.

> The other thing is, that Linux is the only known system that does
> asynchronous rename/link/unlink/symlink -- people have claimed it might
> not be the only one, but failed to name systems.

And this is a feature, not a bug.

> So we need to assume that Linux is the only system that does
> asynchronous rename/link/unlink/symlink, however a directory fsync() is
> believed to be rather expensive.

A directory fsync() might be expensive on non-Linux filesystems...

> Still, some people object to a dirsync mount option. But this has been
> the actual reason for the thread - MTA authors are refusing to pamper
> Linux and use chattr +S instead which gives unnecessary (premature) sync
> operations on write() - but MTAs know how to fsync().

So what you mean is that MTA authors refuse to pamper Linux through use
of fsync of the directory, but can accept to "pamper" Linux through use
of chattr +S?! This seem ridiculous.  It seems equally ridiculous to
demand that Linux should pamper for MTA authors that can't implement
fsync on the directory instead of writing BSD-specific code.

[snip]

To me this seems mostly like a way of saying "Hey, we've finally found
a way to make Linux look really bad compared to BSD-systems; let's
complain instead of writing alternative code that suits Linux systems
better than this code does." A lot like all the discussions on threads,
ueally.

Then again, I'm probably just extra grouchy today because it rained when
I rode my bike to work.


/David Weinehall
  _                                                                 _
 // David Weinehall <tao@acc.umu.se> /> Northern lights wander      \\
//  Project MCA Linux hacker        //  Dance across the winter sky //
\>  http://www.acc.umu.se/~tao/    </   Full colour fire           </

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-02 17:26               ` ext3-2.4-0.9.4 Daniel Phillips
  2001-08-02 17:37                 ` intermediate summary of ext3-2.4-0.9.4 thread Matthias Andree
  2001-08-02 17:54                 ` ext3-2.4-0.9.4 Alexander Viro
@ 2001-08-03  9:06                 ` Stephen C. Tweedie
  2001-08-03 14:08                   ` ext3-2.4-0.9.4 Daniel Phillips
  2 siblings, 1 reply; 662+ messages in thread
From: Stephen C. Tweedie @ 2001-08-03  9:06 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Matthias Andree, Stephen C. Tweedie, linux-kernel

Hi,

On Thu, Aug 02, 2001 at 07:26:16PM +0200, Daniel Phillips wrote:

> I believed you've summarized the SUS requirements very well.  Apart 
> from legalistic arguments,

Umm, this is a specification.  It is *supposed* to be interpreted
legalistically!

> SUS quite clearly states that fsync should 
> not return until you are sure of having recorded not only the file's 
> data, but the access path to it.  I interpret this as being able to 
> "access the file by its name", and being able to guess by looking in 
> lost+found doesn't count.

But that is just an interpretation.  There's nothing in the spec which
forces that interpretation.

fsync forces the data to disk.  There may be one or more pathnames
which the application also relies on, and if the application does care
about those, the application will have to ensure that they are synced
too.

Look, I agree that your interpretation is useful.  It's just not an
unambiguous requirement of the spec.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]                         ` <5.1.0.14.2.20010803025916.053e2ec0@pop.cus.cam.ac.uk>
@ 2001-08-03  9:16                           ` Matthias Andree
  0 siblings, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-03  9:16 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: linux-kernel

On Fri, 03 Aug 2001, Anton Altaparmakov wrote:

[dirsync chattr/mount options]
> Me neither. With regards to the parallel discussion on SUS compliance it is 
> probably a good idea to have such a thing in some form anyway (although if 
> I understood the discussion correctly, we really want this to happen by 
> default, not just when some flag is set but then again I never read the 
> standards...).

The standard doesn't really command the behaviour, as it seems, but we
might want to look again after SUS v3 has been released (supposed to
happen later this year) - the SUS compliance was rather on fsync than on
rename/link.

However, I'd rather not choose the default for somebody else, because he
may have different requirements, a compile-time switch to set the
default should be fine, THIS one might indeed default to dirsync/noasync
unless changed by make {x,menu,}config.

Assuming that the chattr +S is accompanied by a corresponding -o sync
mount option, I'd expect that the dirsync option be available as chattr
option and as mount option, and choosing default mount options should be
rather easy.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:39                         ` Matthias Andree
@ 2001-08-03  9:57                           ` Christoph Hellwig
  2001-08-04  7:55                           ` Eric W. Biederman
  1 sibling, 0 replies; 662+ messages in thread
From: Christoph Hellwig @ 2001-08-03  9:57 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Matthias Andree, Paul Jakma, linux-kernel, Eric W. Biederman

In article <20010803103954.A11584@emma1.emma.line.org> you wrote:
> They'd just drop Linux from the list of supported OS's, Linux will
> disappoint people who trusted it, nothing is gained. Deliberate breakage
> will not happen, because it would not help anyone except people with
> twisted minds.

Who cares?  There are more than enough sane mailer around..

> NO-ONE, including you, has come up with SERIOUS objections against a
> dirsync option, except "is it really so much slower than chattr +S? show
> figures" -- ext3 is being tuned to be fast in spite of chattr +S.

Talk is cheap.  Code up a non-invasive dirsync option and submit it to
Linus.  I don't see any reason why it won't be accepted..

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: kernel gdb for intel
  2001-08-02 14:35   ` kernel gdb for intel Alan Cox
@ 2001-08-03 10:07     ` Amit S. Kale
  0 siblings, 0 replies; 662+ messages in thread
From: Amit S. Kale @ 2001-08-03 10:07 UTC (permalink / raw)
  To: Alan Cox; +Cc: Brent Baccala, linux-kernel

Alan Cox wrote:
> 
> > - doesn't support SMP, since I don't have an Intel SMP box.  I'd guess
> > what you'd want it to do is an smp_call_function that would halt all the
> > processors and put them into some tight little loop while gdb fiddles
> > things.  ideas?
> 
> With the old old stuff (pre 2.0) gdb stubs I ended up with two copies, one
> per cpu on two serial ports. I found that most useful since I could force
> events to happen.

I can't get this. How can two gdb stubs work correctly
on two serial ports? In absence of any globals, there won't be
any data corruption, though there are inherent assumptions in 
the kernel about progress on all cpus. If GKL, page cache lock etc
are held by one cpu, the other cpu will not be able to make
any/much progress.

Are two gdb stubs useful for debugging some particular kind
of problem? If they are I can think about how I can
add them to current x86 kgdb (kgdb.sourceforge.net).
-- 
Amit Kale
Veritas Software ( http://www.veritas.com )

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: kernel gdb for intel
       [not found] ` <no.id>
                     ` (53 preceding siblings ...)
  2001-08-02 19:41   ` [PATCH] vxfs fix Alan Cox
@ 2001-08-03 11:54   ` Alan Cox
  2001-08-03 17:02   ` DoS with tmpfs #3 Alan Cox
                     ` (148 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-03 11:54 UTC (permalink / raw)
  To: Amit S. Kale; +Cc: Alan Cox, Brent Baccala, linux-kernel

> I can't get this. How can two gdb stubs work correctly
> on two serial ports? In absence of any globals, there won't be
> any data corruption, though there are inherent assumptions in 
> the kernel about progress on all cpus. If GKL, page cache lock etc
> are held by one cpu, the other cpu will not be able to make
> any/much progress.

That is fine. It'll get stuck in a lock. One thing it was useful for was
exactly that - getting a given point and checking the locking cases worked


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03  9:06                 ` ext3-2.4-0.9.4 Stephen C. Tweedie
@ 2001-08-03 14:08                   ` Daniel Phillips
  2001-08-03 14:52                     ` ext3-2.4-0.9.4 Stephen C. Tweedie
                                       ` (3 more replies)
  0 siblings, 4 replies; 662+ messages in thread
From: Daniel Phillips @ 2001-08-03 14:08 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Matthias Andree, Stephen C. Tweedie, linux-kernel

On Friday 03 August 2001 11:06, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, Aug 02, 2001 at 07:26:16PM +0200, Daniel Phillips wrote:
> > I believe you've summarized the SUS requirements very well.  Apart
> > from legalistic arguments,
>
> Umm, this is a specification.  It is *supposed* to be interpreted
> legalistically!

I'm saying that, when the intent is clear as it is in this case then 
trying to find loopholes in the form of expression is not useful.  
Better to argue that SUS is wrong than to pretend it didn't say what it 
did.

> > SUS quite clearly states that fsync should
> > not return until you are sure of having recorded not only the
> > file's data, but the access path to it.  I interpret this as being
> > able to "access the file by its name", and being able to guess by
> > looking in lost+found doesn't count.
>
> But that is just an interpretation.  There's nothing in the spec
> which forces that interpretation.

Well, look at the name "lost+found".  It's lost, maybe we found the 
data but the name is gone for good.  On the other hand, if we start 
with the same on-disk state that fsck had before creating the 
lost+found, but use a journal to recover the name, it *does* count 
because we have a deterministic mechanism for making fsync's promise 
come true.

> fsync forces the data to disk.  There may be one or more pathnames
> which the application also relies on, and if the application does
> care about those, the application will have to ensure that they are
> synced too.
>
> Look, I agree that your interpretation is useful.  It's just not an
> unambiguous requirement of the spec.

OK, fine, this damn English language and all that.  Do we agree that 
"access" means "be able to open by name"?

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 14:08                   ` ext3-2.4-0.9.4 Daniel Phillips
@ 2001-08-03 14:52                     ` Stephen C. Tweedie
  2001-08-03 15:11                     ` ext3-2.4-0.9.4 David S. Miller
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 662+ messages in thread
From: Stephen C. Tweedie @ 2001-08-03 14:52 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Stephen C. Tweedie, Matthias Andree, linux-kernel

On Fri, Aug 03, 2001 at 04:08:20PM +0200, Daniel Phillips wrote:

> I'm saying that, when the intent is clear as it is in this case then 
> trying to find loopholes in the form of expression is not useful. 

The intent is perfectly clear regarding the data.  It is totally
undefined regarding names. 
 
> Well, look at the name "lost+found".  It's lost, maybe we found the 
> data but the name is gone for good.

That's fine --- it satisfies the SUS requirements.

> On the other hand, if we start 
> with the same on-disk state that fsck had before creating the 
> lost+found, but use a journal to recover the name, it *does* count 
> because we have a deterministic mechanism for making fsync's promise 
> come true.

That's an implementation decision, not a comment on the spec.

> > Look, I agree that your interpretation is useful.  It's just not an
> > unambiguous requirement of the spec.
> 
> OK, fine, this damn English language and all that.  Do we agree that 
> "access" means "be able to open by name"?

The word "access" isn't used there in the spec, so it doesn't matter.
The spec only refers to "all file system information required to
retrieve the data."  Integrity of the data is the only thing
guaranteed, not integrity of the namespace.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 14:08                   ` ext3-2.4-0.9.4 Daniel Phillips
  2001-08-03 14:52                     ` ext3-2.4-0.9.4 Stephen C. Tweedie
@ 2001-08-03 15:11                     ` David S. Miller
  2001-08-03 15:25                       ` ext3-2.4-0.9.4 Daniel Phillips
  2001-08-03 15:18                     ` ext3-2.4-0.9.4 Jan Harkes
  2001-08-03 16:05                     ` ext3-2.4-0.9.4 Rik van Riel
  3 siblings, 1 reply; 662+ messages in thread
From: David S. Miller @ 2001-08-03 15:11 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Daniel Phillips, Matthias Andree, linux-kernel


Stephen C. Tweedie writes:
 > The word "access" isn't used there in the spec, so it doesn't matter.
 > The spec only refers to "all file system information required to
 > retrieve the data."  Integrity of the data is the only thing
 > guaranteed, not integrity of the namespace.

In fact this interpretation would have a severe performance impact
for any implementation.

If you include "metadata of the 'path'" in "all filesystem
information..." then you have to basically sync each and every element
on the path(s) to that file.  This means walking each dentry in the
alias list for the inode, and then walk from each of those to the root
sync'ing along the way.

That would be a rediculious requirement.

Later,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 14:08                   ` ext3-2.4-0.9.4 Daniel Phillips
  2001-08-03 14:52                     ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2001-08-03 15:11                     ` ext3-2.4-0.9.4 David S. Miller
@ 2001-08-03 15:18                     ` Jan Harkes
  2001-08-03 15:47                       ` ext3-2.4-0.9.4 Daniel Phillips
  2001-08-03 16:05                     ` ext3-2.4-0.9.4 Rik van Riel
  3 siblings, 1 reply; 662+ messages in thread
From: Jan Harkes @ 2001-08-03 15:18 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Stephen C. Tweedie, Matthias Andree, linux-kernel

On Fri, Aug 03, 2001 at 04:08:20PM +0200, Daniel Phillips wrote:
> > fsync forces the data to disk.  There may be one or more pathnames
> > which the application also relies on, and if the application does
> > care about those, the application will have to ensure that they are
> > synced too.
> >
> > Look, I agree that your interpretation is useful.  It's just not an
> > unambiguous requirement of the spec.
> 
> OK, fine, this damn English language and all that.  Do we agree that 
> "access" means "be able to open by name"?

No, until recently the device/inode number pair used to work very nicely
for both Coda and knfsd when they wanted to access a file. But it only
works from within the kernel where you can use iget. It's just that with
some of the newer filesystems the inode numbers are no longer unique, so
it became something more like device/inum/opaque handle (i.e. iget4).

As far as the fsync semantics are concerned. Yeah, they would be useful,
but only to avoid one directory fsync call that will tell the kernel
_exactly_ what the process wants instead of letting it figure it out by
itself. The argument I saw in this thread that fsync(dir) has too much
overhead because it might sync unrelated data is not very useful,
because that unrelated data will be synced anyways when it's not a
journalling fs.

Name to file object mapping is not part of the metadata associated with
a file. It is the contents of the directory and can only be modified by
directory operations, not operations on the file or filehandle.
open(file, O_CREAT) is really split up into create(parent, file) an
operation on the parent directory, and an open(file) operation which
returns the filehandle.

I also don't see why a rename operation, which operates on the source
and destination parent directories would have to not only look up the
file object but also somehow register with all open filehandles for that
object that both olddir and newdir need to be written back to disk
during the fsync as well. Or should that go the other way around, where
the filehandle has to chase down which directory it was renamed to?

Taken from another part of this thread Alexander and Daniel,
> > That there isn't THE directory in which the file resides. There might
> > be several, and setting the bit on one of them at random and expect
> > it to work is a _lot_ of work. For no real use.
>
> There is only one chain of directories from the fd's dentry up to the
> root, that's the one to sync.

Using the dentry chain is not reliable, for instance instead of moving
dentries around Coda simply unhashes dentries when state on the server
changes. Another process (perhaps on a different machine) might have
moved one of the ancestor directories from one location to another, or
even relinked/unlinked the file that we're working with (ln old new; rm
old) in which case the dentry path is lost, but ideally you'd expect
fsync to sync the 'new' name if it supposedly keeps the namespace
consistent.

Working on a distributed filesystem with somewhat weaker than UNIX
semantics might have skewed my vision. In Coda not every client will be
able to figure out which are all of the possibly paths that can lead to
a file object. And although we currently try our best to block
hardlinked directories they could possibly exist, making the problems
even worse.

Jan


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 15:11                     ` ext3-2.4-0.9.4 David S. Miller
@ 2001-08-03 15:25                       ` Daniel Phillips
  2001-08-03 17:06                         ` ext3-2.4-0.9.4 Bill Rugolsky Jr.
  0 siblings, 1 reply; 662+ messages in thread
From: Daniel Phillips @ 2001-08-03 15:25 UTC (permalink / raw)
  To: David S. Miller, Stephen C. Tweedie; +Cc: Matthias Andree, linux-kernel

On Friday 03 August 2001 17:11, David S. Miller wrote:
> Stephen C. Tweedie writes:
>  > The word "access" isn't used there in the spec, so it doesn't matter.
>  > The spec only refers to "all file system information required to
>  > retrieve the data."  Integrity of the data is the only thing
>  > guaranteed, not integrity of the namespace.
>
> In fact this interpretation would have a severe performance impact
> for any implementation.
>
> If you include "metadata of the 'path'" in "all filesystem
> information..." then you have to basically sync each and every element
> on the path(s) to that file.  This means walking each dentry in the
> alias list for the inode, and then walk from each of those to the root
> sync'ing along the way.
>
> That would be a rediculious requirement.

As I read the (excerpts from the) SUS, this isn't required, only that
at least one namespace path from the root to the fsynced file is
preserved.  I can imagine an efficient implementation for this.

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 15:18                     ` ext3-2.4-0.9.4 Jan Harkes
@ 2001-08-03 15:47                       ` Daniel Phillips
  2001-08-03 15:50                         ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2001-08-03 16:16                         ` ext3-2.4-0.9.4 Jan Harkes
  0 siblings, 2 replies; 662+ messages in thread
From: Daniel Phillips @ 2001-08-03 15:47 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Stephen C. Tweedie, Matthias Andree, linux-kernel

On Friday 03 August 2001 17:18, Jan Harkes wrote:
> On Fri, Aug 03, 2001 at 04:08:20PM +0200, Daniel Phillips wrote:
> > > fsync forces the data to disk.  There may be one or more pathnames
> > > which the application also relies on, and if the application does
> > > care about those, the application will have to ensure that they are
> > > synced too.
> > >
> > > Look, I agree that your interpretation is useful.  It's just not an
> > > unambiguous requirement of the spec.
> >
> > OK, fine, this damn English language and all that.  Do we agree that
> > "access" means "be able to open by name"?
>
> No, until recently the device/inode number pair used to work very nicely
> for both Coda and knfsd when they wanted to access a file. But it only
> works from within the kernel where you can use iget. It's just that with
> some of the newer filesystems the inode numbers are no longer unique, so
> it became something more like device/inum/opaque handle (i.e. iget4).

We are talking about "after waking up from an unexpected interruption".
So, is this still relevant?

> As far as the fsync semantics are concerned. Yeah, they would be useful,
> but only to avoid one directory fsync call that will tell the kernel
> _exactly_ what the process wants instead of letting it figure it out by
> itself.

No, it's not just that.  It's not enough to fsync just the parent, the
parent's parent may have been relinked and not comitted.  Also, the
kernel has the advantage of the locked chain of dcache entries
available to it.  Finally, there is the saving of multiple syscall
overhead, which I didn't mention first time around because it's not a
lot compared to media access overhead.

> The argument I saw in this thread that fsync(dir) has too much
> overhead because it might sync unrelated data is not very useful,
> because that unrelated data will be synced anyways when it's not a
> journalling fs.

Yes, and I gave a detailed explanation earlier of why such overhead
will not amount to much.

> Name to file object mapping is not part of the metadata associated with
> a file. It is the contents of the directory and can only be modified by
> directory operations, not operations on the file or filehandle.

SUS doesn't just pronounce on the file metadata.  Quoting from earlier
in the thread:

----------
>   "synchronised I/O data integrity completion
>
>   [...]
>
>   * For write, when the operation has been completed or diagnosed if
>   unsuccessful.  The write is complete only when the data specified in
>   the write request is successfully transferred and all file system
>   information required to retrieve the data is successfully transferred.
----------

> open(file, O_CREAT) is really split up into create(parent, file) an
> operation on the parent directory, and an open(file) operation which
> returns the filehandle.
>
> I also don't see why a rename operation, which operates on the source
> and destination parent directories would have to not only look up the
> file object but also somehow register with all open filehandles for that
> object that both olddir and newdir need to be written back to disk
> during the fsync as well.

They don't both have to, either one will be good enough.  However,
"neither" is not good enough, according to SUS.

> Or should that go the other way around, where
> the filehandle has to chase down which directory it was renamed to?
>
> Taken from another part of this thread Alexander and Daniel,
>
> > > That there isn't THE directory in which the file resides. There might
> > > be several, and setting the bit on one of them at random and expect
> > > it to work is a _lot_ of work. For no real use.
> >
> > There is only one chain of directories from the fd's dentry up to the
> > root, that's the one to sync.
>
> Using the dentry chain is not reliable, for instance instead of moving
> dentries around Coda simply unhashes dentries when state on the server
> changes.

Could you be more specific about this, are you saying there are cases
where there is no valid parent link from a dcache entry?

> Another process (perhaps on a different machine) might have
> moved one of the ancestor directories from one location to another, or
> even relinked/unlinked the file that we're working with (ln old new; rm
> old) in which case the dentry path is lost, but ideally you'd expect
> fsync to sync the 'new' name if it supposedly keeps the namespace
> consistent.

I doesn't matter which one it syncs.

> Working on a distributed filesystem with somewhat weaker than UNIX
> semantics might have skewed my vision. In Coda not every client will be
> able to figure out which are all of the possibly paths that can lead to
> a file object. And although we currently try our best to block
> hardlinked directories they could possibly exist, making the problems
> even worse.

We don't need all the paths, and not any specific path, just a path.

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 15:47                       ` ext3-2.4-0.9.4 Daniel Phillips
@ 2001-08-03 15:50                         ` Stephen C. Tweedie
  2001-08-03 16:24                           ` ext3-2.4-0.9.4 Daniel Phillips
  2001-08-03 18:11                           ` ext3-2.4-0.9.4 Matthias Andree
  2001-08-03 16:16                         ` ext3-2.4-0.9.4 Jan Harkes
  1 sibling, 2 replies; 662+ messages in thread
From: Stephen C. Tweedie @ 2001-08-03 15:50 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Jan Harkes, Stephen C. Tweedie, Matthias Andree, linux-kernel

Hi,

On Fri, Aug 03, 2001 at 05:47:17PM +0200, Daniel Phillips wrote:

> > As far as the fsync semantics are concerned. Yeah, they would be useful,
> > but only to avoid one directory fsync call that will tell the kernel
> > _exactly_ what the process wants instead of letting it figure it out by
> > itself.
> 
> No, it's not just that.  It's not enough to fsync just the parent, the
> parent's parent may have been relinked and not comitted.  Also, the
> kernel has the advantage of the locked chain of dcache entries
> available to it.

Not necessarily.  If the file has been hard-linked and then the
original link removed, we don't have a valid dcache entry any more
(and yes, this is a common construct for some applications --- it is
often used to work around NFS atomicity problems.)

An MTA using such a construct and expecting fsync to do the right
thing will *not* work if you follow the dcache chain on fsync as you
propose here.

> We don't need all the paths, and not any specific path, just a path.

Exactly, because fsync makes absolutely no gaurantees about the
namespace.  So a lost+found path is quite sufficient.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 14:08                   ` ext3-2.4-0.9.4 Daniel Phillips
                                       ` (2 preceding siblings ...)
  2001-08-03 15:18                     ` ext3-2.4-0.9.4 Jan Harkes
@ 2001-08-03 16:05                     ` Rik van Riel
  3 siblings, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-08-03 16:05 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Stephen C. Tweedie, Matthias Andree, linux-kernel

On Fri, 3 Aug 2001, Daniel Phillips wrote:
> On Friday 03 August 2001 11:06, Stephen C. Tweedie wrote:
> > On Thu, Aug 02, 2001 at 07:26:16PM +0200, Daniel Phillips wrote:

> > Look, I agree that your interpretation is useful.  It's just not an
> > unambiguous requirement of the spec.
>
> OK, fine, this damn English language and all that.  Do we agree that
> "access" means "be able to open by name"?

If we didn't agree on this, Linux would have had an
open_by_inode() system call long ago.

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 15:47                       ` ext3-2.4-0.9.4 Daniel Phillips
  2001-08-03 15:50                         ` ext3-2.4-0.9.4 Stephen C. Tweedie
@ 2001-08-03 16:16                         ` Jan Harkes
  2001-08-03 16:54                           ` ext3-2.4-0.9.4 Daniel Phillips
  1 sibling, 1 reply; 662+ messages in thread
From: Jan Harkes @ 2001-08-03 16:16 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: sct, matthias.andree, linux-kernel

On Fri, Aug 03, 2001 at 05:47:17PM +0200, Daniel Phillips wrote:
> On Friday 03 August 2001 17:18, Jan Harkes wrote:
> > No, until recently the device/inode number pair used to work very nicely
> > for both Coda and knfsd when they wanted to access a file. But it only
> > works from within the kernel where you can use iget. It's just that with
> > some of the newer filesystems the inode numbers are no longer unique, so
> > it became something more like device/inum/opaque handle (i.e. iget4).
> 
> We are talking about "after waking up from an unexpected interruption".
> So, is this still relevant?

The NFS client hasn't been interrupted, and it's filehandle will still
identifies the file object by device/inode.

> > Name to file object mapping is not part of the metadata associated with
> > a file. It is the contents of the directory and can only be modified by
> > directory operations, not operations on the file or filehandle.
> 
> SUS doesn't just pronounce on the file metadata.  Quoting from earlier
> in the thread:
> 
> ----------
> >   "synchronised I/O data integrity completion
> >
> >   [...]
> >
> >   * For write, when the operation has been completed or diagnosed if
> >   unsuccessful.  The write is complete only when the data specified in
> >   the write request is successfully transferred and all file system
> >   information required to retrieve the data is successfully transferred.
> ----------

So that would be the file data, and it's on-disk inode information,
indirect blocks etc. All the information that the file system needs to
retrieve the data is then available, i.e. what is required for iget()
to succeed.

Ok, iget isn't exported to userspace, but fsck will place the file in a
user reachable location.

> > I also don't see why a rename operation, which operates on the source
> > and destination parent directories would have to not only look up the
> > file object but also somehow register with all open filehandles for that
> > object that both olddir and newdir need to be written back to disk
> > during the fsync as well.
> 
> They don't both have to, either one will be good enough.  However,
> "neither" is not good enough, according to SUS.

Ehh, sync only olddir and you just lost any path leading to the file.
Sync only newdir and the file is reachable from two locations, but it's
linkcount is too low.

> > Using the dentry chain is not reliable, for instance instead of moving
> > dentries around Coda simply unhashes dentries when state on the server
> > changes.
> 
> Could you be more specific about this, are you saying there are cases
> where there is no valid parent link from a dcache entry?

No the dcache entry could have a 'stale' fileobject associated with it
that has been superceded by a different object. This dentry is unhashed,
so that the next lookup will instantiate a new dentry which references
the new object. So syncing the stale object is useless, because it
doesn't really exist anymore, but the kernel (and actually the userspace
daemon on the client) doesn't know what the new object is until it is
accessed.

> > Working on a distributed filesystem with somewhat weaker than UNIX
> > semantics might have skewed my vision. In Coda not every client will be
> > able to figure out which are all of the possibly paths that can lead to
> > a file object. And although we currently try our best to block
> > hardlinked directories they could possibly exist, making the problems
> > even worse.
> 
> We don't need all the paths, and not any specific path, just a path.

Even if that path leads to a name that got removed, thereby forcing the
object into lost+found? I thought the MTA did something like,

fd = open(tmp/file)
write(fd)
fsync(fd)
link(tmp/file, new/file)
fsync(fd)		*1
unlink(tmp/file)

*1 If this fsync only syncs the path leading to tmp/file, and the unlink
tmp/file is written back to disk, which is likely because we're only
creating/syncing stuff in tmp.  Now, until new/file is written there is
no path information leading to the file anymore which makes this as
'safe' as not syncing path name information at all.

Now if the application would use the directory sync, it can actually
tell the kernel that that new/file name is the interesting one to keep
and that tmp doesn't even need to be written to disk at all.

Jan


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]                         ` <20010803021406.A9845@emma1.emma.line.org>
@ 2001-08-03 16:20                           ` Jan Harkes
  2001-08-03 22:48                           ` Andreas Dilger
  1 sibling, 0 replies; 662+ messages in thread
From: Jan Harkes @ 2001-08-03 16:20 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

On Fri, Aug 03, 2001 at 02:14:06AM +0200, Matthias Andree wrote:
> On Fri, 03 Aug 2001, Anton Altaparmakov wrote:
> > filedescriptor to be synced to disk, the ONLY possible way to do this it to 
> > sync the parent directory in order to commit the file name to disk. On some 
> 
> Do I really need to sync the WHOLE parent directory? Not just the
> relevant part? My directories hardly have only 1 disk block.

Only dirty blocks are written back to disk, i.e. the parts of the
directory that were modified by adding/removing names. It should be
pretty efficient.

> > to explicitly sync the directory filedescriptor afterwards.
> 
> Which is non-portable and will not be done by many application
> programmers which just use chattr +S instead (makes things S)afe and
> S)low) - and spoil performance that way since it makes not only
> directory writes synchronous, but file (data) writes as well.

"chattr +S" is about as portable as adding fsync(parent), or even worse
as it only works on an ext2 file system. So I'm assuming that this is
just a nice exercise in annoying people.

Jan


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 15:50                         ` ext3-2.4-0.9.4 Stephen C. Tweedie
@ 2001-08-03 16:24                           ` Daniel Phillips
  2001-08-03 18:11                           ` ext3-2.4-0.9.4 Matthias Andree
  1 sibling, 0 replies; 662+ messages in thread
From: Daniel Phillips @ 2001-08-03 16:24 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Jan Harkes, Stephen C. Tweedie, Matthias Andree, linux-kernel

On Friday 03 August 2001 17:50, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, Aug 03, 2001 at 05:47:17PM +0200, Daniel Phillips wrote:
> > > As far as the fsync semantics are concerned. Yeah, they would be
> > > useful, but only to avoid one directory fsync call that will tell the
> > > kernel _exactly_ what the process wants instead of letting it figure it
> > > out by itself.
> >
> > No, it's not just that.  It's not enough to fsync just the parent, the
> > parent's parent may have been relinked and not comitted.  Also, the
> > kernel has the advantage of the locked chain of dcache entries
> > available to it.
>
> Not necessarily.  If the file has been hard-linked and then the
> original link removed, we don't have a valid dcache entry any more
> (and yes, this is a common construct for some applications --- it is
> often used to work around NFS atomicity problems.)

But in that case, the file was opened using the hard link, then the link
was deleted.  Fine.  The user is trying to tell us it's ok to lose the
linked file.  Whether or not it can be accessed through another path
is irrelevant.

> An MTA using such a construct and expecting fsync to do the right
> thing will *not* work if you follow the dcache chain on fsync as you
> propose here.

OK, this case where the walk to the root should fail is a "duh", and
exposes a corner case SUS didn't cover (at least not in the excerpts
I saw).  But this case is a userland race, the right thing to do is
just stop the walk.  This doesn't detract from the value of doing
the walk in the important case that the chain is intact.

> > We don't need all the paths, and not any specific path, just a path.
>
> Exactly, because fsync makes absolutely no gaurantees about the
> namespace.  So a lost+found path is quite sufficient.

Dunno, I think that's a statement that should be held up for further
scrutiny.

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 16:16                         ` ext3-2.4-0.9.4 Jan Harkes
@ 2001-08-03 16:54                           ` Daniel Phillips
  0 siblings, 0 replies; 662+ messages in thread
From: Daniel Phillips @ 2001-08-03 16:54 UTC (permalink / raw)
  To: Jan Harkes; +Cc: sct, matthias.andree, linux-kernel

On Friday 03 August 2001 18:16, Jan Harkes wrote:
> On Fri, Aug 03, 2001 at 05:47:17PM +0200, Daniel Phillips wrote:
> > On Friday 03 August 2001 17:18, Jan Harkes wrote:
> > > No, until recently the device/inode number pair used to work very
> > > nicely for both Coda and knfsd when they wanted to access a file. But
> > > it only works from within the kernel where you can use iget. It's just
> > > that with some of the newer filesystems the inode numbers are no longer
> > > unique, so it became something more like device/inum/opaque handle
> > > (i.e. iget4).
> >
> > We are talking about "after waking up from an unexpected interruption".
> > So, is this still relevant?
>
> The NFS client hasn't been interrupted, and it's filehandle will still
> identifies the file object by device/inode.

Interesting, but at best relevant only to network mounts.

> > > Name to file object mapping is not part of the metadata associated with
> > > a file. It is the contents of the directory and can only be modified by
> > > directory operations, not operations on the file or filehandle.
> >
> > SUS doesn't just pronounce on the file metadata.  Quoting from earlier
> > in the thread:
> >
> > ----------
> >
> > >   "synchronised I/O data integrity completion
> > >
> > >   [...]
> > >
> > >   * For write, when the operation has been completed or diagnosed if
> > >   unsuccessful.  The write is complete only when the data specified in
> > >   the write request is successfully transferred and all file system
> > >   information required to retrieve the data is successfully
> > > transferred.
> >
> > ----------
>
> So that would be the file data, and it's on-disk inode information,
> indirect blocks etc. All the information that the file system needs to
> retrieve the data is then available, i.e. what is required for iget()
> to succeed.

In spite of your interesting network filesystem example I'm not willing
to accept that access by inode is enough.  It's going pretty far out on
a limb, don't you agree?  I doubt that SUS explicitly allows the inum to
be interpreted as "information that the file system needs to retrieve
the data".

> Ok, iget isn't exported to userspace, but fsck will place the file in a
> user reachable location.

Hmmph.  It used to have a name, now it doesn't, and somebody did an fsync
on the file, which returned indicating success.  Do you think that's
right?

> > > I also don't see why a rename operation, which operates on the source
> > > and destination parent directories would have to not only look up the
> > > file object but also somehow register with all open filehandles for
> > > that object that both olddir and newdir need to be written back to disk
> > > during the fsync as well.
> >
> > They don't both have to, either one will be good enough.  However,
> > "neither" is not good enough, according to SUS.
>
> Ehh, sync only olddir and you just lost any path leading to the file.

You sync the one the dcache entry points at.

> Sync only newdir and the file is reachable from two locations, but it's
> linkcount is too low.

SUS didn't say the filesystem integrity had to be perfect, just that
"all file system information required to retrieve the data is
successfully transferred".

> > > Using the dentry chain is not reliable, for instance instead of moving
> > > dentries around Coda simply unhashes dentries when state on the server
> > > changes.
> >
> > Could you be more specific about this, are you saying there are cases
> > where there is no valid parent link from a dcache entry?
>
> No the dcache entry could have a 'stale' fileobject associated with it
> that has been superceded by a different object. This dentry is unhashed,
> so that the next lookup will instantiate a new dentry which references
> the new object. So syncing the stale object is useless, because it
> doesn't really exist anymore, but the kernel (and actually the userspace
> daemon on the client) doesn't know what the new object is until it is
> accessed.

This happens because the dentry was unlinked, but not when it is renamed,
right?  In that case it's ok to fail the walk IMHO, the user explicitly
left a window when it's ok to interpret the fsynced file as "gone".

> > > Working on a distributed filesystem with somewhat weaker than UNIX
> > > semantics might have skewed my vision. In Coda not every client will be
> > > able to figure out which are all of the possibly paths that can lead to
> > > a file object. And although we currently try our best to block
> > > hardlinked directories they could possibly exist, making the problems
> > > even worse.
> >
> > We don't need all the paths, and not any specific path, just a path.
>
> Even if that path leads to a name that got removed, thereby forcing the
> object into lost+found? I thought the MTA did something like,

We'd better get confirmation from the MTA expert in the thread.

> fd = open(tmp/file)
> write(fd)
> fsync(fd)
> link(tmp/file, new/file)
> fsync(fd)		*1
> unlink(tmp/file)
>
> *1 If this fsync only syncs the path leading to tmp/file, and the unlink
> tmp/file is written back to disk, which is likely because we're only
> creating/syncing stuff in tmp.  Now, until new/file is written there is
> no path information leading to the file anymore which makes this as
> 'safe' as not syncing path name information at all.

Nice clear example!  Yes, in essence we would have synced the original
path twice.  If this is what the MTA is really doing I'm willing to join
the "MTA is broken" camp.

> Now if the application would use the directory sync, it can actually
> tell the kernel that that new/file name is the interesting one to keep
> and that tmp doesn't even need to be written to disk at all.

Yep.  Or if it used rename, which updates the dcache entries, instead
of link/unlink.

--
Daniel

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: DoS with tmpfs #3
       [not found] ` <no.id>
                     ` (54 preceding siblings ...)
  2001-08-03 11:54   ` kernel gdb for intel Alan Cox
@ 2001-08-03 17:02   ` Alan Cox
  2001-08-04 23:15   ` Question regarding ACPI Alan Cox
                     ` (147 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-03 17:02 UTC (permalink / raw)
  To: Ivan Kalvatchev; +Cc: linux-kernel

> The same horrible think happens to ramfs, but this
> could be expected. Ramfs don't have size check so that
> hack cannot be used for it.  In this case ramfs must
> be marked as dangerous. 

Ramfs and tmpfs in the -ac tree should behave a lot better. The 
fact you see high pages being a factor sounds to me like a VM rather than
a tmpfs bug. Specifically you should have seen KDE apps terminating with
out of memory kills. 

In paticular in the -ac tree ramfs supports setting limits on the max fs
size, which is essential if you want to use it on something like an iPAQ
where ramfs is a real useful fs to have.

tmpfs would I suspect also benefit immensely from quota support

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 15:25                       ` ext3-2.4-0.9.4 Daniel Phillips
@ 2001-08-03 17:06                         ` Bill Rugolsky Jr.
  2001-08-03 17:22                           ` ext3-2.4-0.9.4 Bill Rugolsky Jr.
  0 siblings, 1 reply; 662+ messages in thread
From: Bill Rugolsky Jr. @ 2001-08-03 17:06 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

On Fri, Aug 03, 2001 at 05:25:09PM +0200, Daniel Phillips wrote:
> As I read the (excerpts from the) SUS, this isn't required, only that
> at least one namespace path from the root to the fsynced file is
> preserved.  I can imagine an efficient implementation for this.

I might be way off base here; if so tell me.
Let's litter some sample code with fsync():

	fd = open("tmp/x",O_CREAT|O_WRONLY);
	...
	fsync(fd);
	rename("tmp/x","spool/x");
	fsync(fd);
	close(fd);

This looks safe and correct, given your semantics.  It is, unless the
link count of that file rises above 1, from say a mail admin quite
reasonably doing

	ln tmp/x peek/x

in the interval specified by "...".  In that case, it's not required
that "tmp/x" or "spool/x" ever hit disk.  So the code is only correct
if the file only has a single link, or we specify ordered meta-data
updates for open/rename/link/...  Following Stephen's argument, is 
"peek/x" any better than "lost+found"?  With more than one admin?

On older non-BSD systems (SYS3?,SYSV 3.x?) that don't do rename(), files can't
be moved without bumping their link counts:

	fd = open("tmp/x",O_CREAT|O_WRONLY);
	...
	fsync(fd);
	link("tmp/x","spool/x");
	fsync(fd);  /* <- possibly a NOP with your semantics */
	unlink("tmp/x");
	fsync(fd);
	close(fd);

Again, this will fail to preserve your desired semantics on crash,
unless we have ordered meta-data updates, or the stronger synchronous
link() requirement.

I think the semantics that you propose might be marginally useful, but
I don't think SUS requires it; my understanding (and that of a
close friend and former POSIX reviewer) has always been that inodes are
distinct from directory entries, and that fsync() preserves the fields
that stat() returns: mode, uid, gid, size, {a,c,m}time.

Regards,

   Bill Rugolsky

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 17:06                         ` ext3-2.4-0.9.4 Bill Rugolsky Jr.
@ 2001-08-03 17:22                           ` Bill Rugolsky Jr.
  0 siblings, 0 replies; 662+ messages in thread
From: Bill Rugolsky Jr. @ 2001-08-03 17:22 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

On Fri, Aug 03, 2001 at 01:06:01PM -0400, Bill Rugolsky Jr. wrote:
> 	ln tmp/x peek/x

In fact, make that

	ln tmp/x peek/x
			<- MTA does the fsync()
	less peek/x
			<- MTA closes the file.
	rm peek/x
		     (peek gets written asynchronously to disk)
	*CRASH*
                           
Then we are back to "lost+found", nothing gained.

Regards,

   Bill Rugolsky

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-07-30 17:11                 ` ext3-2.4-0.9.4 Lawrence Greenfield
  2001-07-30 17:25                   ` ext3-2.4-0.9.4 Rik van Riel
  2001-07-31  0:22                   ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-08-03 17:24                   ` Jan Harkes
       [not found]                     ` <mit.lcs.mail.linux-kernel/20010803132457.A30127@cs.cmu.edu>
  2 siblings, 1 reply; 662+ messages in thread
From: Jan Harkes @ 2001-08-03 17:24 UTC (permalink / raw)
  To: Lawrence Greenfield, Daniel Phillips; +Cc: linux-kernel

On Fri, Aug 03, 2001 at 06:54:12PM +0200, Daniel Phillips wrote:
> On Friday 03 August 2001 18:16, Jan Harkes wrote:
> > On Fri, Aug 03, 2001 at 05:47:17PM +0200, Daniel Phillips wrote:
> > > On Friday 03 August 2001 17:18, Jan Harkes wrote:
> > > > Working on a distributed filesystem with somewhat weaker than UNIX
> > > > semantics might have skewed my vision. In Coda not every client will be
> > > > able to figure out which are all of the possibly paths that can lead to
> > > > a file object. And although we currently try our best to block
> > > > hardlinked directories they could possibly exist, making the problems
> > > > even worse.
> > >
> > > We don't need all the paths, and not any specific path, just a path.
> >
> > Even if that path leads to a name that got removed, thereby forcing the
> > object into lost+found? I thought the MTA did something like,
> 
> We'd better get confirmation from the MTA expert in the thread.
> 
> > fd = open(tmp/file)
> > write(fd)
> > fsync(fd)
> > link(tmp/file, new/file)
> > fsync(fd)		*1
> > unlink(tmp/file)
> >
> > *1 If this fsync only syncs the path leading to tmp/file, and the unlink
> > tmp/file is written back to disk, which is likely because we're only
> > creating/syncing stuff in tmp.  Now, until new/file is written there is
> > no path information leading to the file anymore which makes this as
> > 'safe' as not syncing path name information at all.
> 
> Nice clear example!  Yes, in essence we would have synced the original
> path twice.  If this is what the MTA is really doing I'm willing to join
> the "MTA is broken" camp.

Here is the relevant mail,

On Mon, Jul 30, 2001 at 01:11:32PM -0400, Lawrence Greenfield wrote:
} BSD softupdates allows you to call fsync() on the file, and this will
} sync the directories all the way up to the root if necessary.
} 
} Thus BSD fsync() actually guarantees that when it returns, the file
} (and all of it's filenames) will survive a reboot.
} 
} Sendmail does:
} fd = open(tmp)
} write(fd)
} fsync(fd)
} rename(tmp, final)
} fsync(fd)
} 
} Cyrus IMAP does:
} fd = open(tmp)
} write(fd)
} fsync(fd)
} link(tmp, final1)
} link(tmp, final2)
} link(tmp, final3)
} fsync(fd)
} close(fd)
} unlink(tmp)
} 
} The idea that Linux fsync() doesn't actually make the file survive
} reboots is pretty ridiculous.

As you can see, the 'sync a path leading to the file' semantics from SuS
don't work in the Cyrus IMAP case as is specifically requires to have
_all_ paths committed to disk before fsync returns.

On Fri, Aug 03, 2001 at 06:54:12PM +0200, Daniel Phillips wrote:
> On Friday 03 August 2001 18:16, Jan Harkes wrote:
> > Now if the application would use the directory sync, it can actually
> > tell the kernel that that new/file name is the interesting one to keep
> > and that tmp doesn't even need to be written to disk at all.
> 
> Yep.  Or if it used rename, which updates the dcache entries, instead
> of link/unlink.

MTA's that want to do reliable deliveries using multiple processes (or
on a networked filesystem) tend to not use rename because it implicitly
unlinks the target if it already exists and this could lead to loss of
mail that was already considered as being delivered.

Jan


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 15:50                         ` ext3-2.4-0.9.4 Stephen C. Tweedie
  2001-08-03 16:24                           ` ext3-2.4-0.9.4 Daniel Phillips
@ 2001-08-03 18:11                           ` Matthias Andree
  2001-08-06  2:13                             ` ext3-2.4-0.9.4 Zilvinas Valinskas
  1 sibling, 1 reply; 662+ messages in thread
From: Matthias Andree @ 2001-08-03 18:11 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Daniel Phillips, Jan Harkes, Matthias Andree, linux-kernel

On Fri, 03 Aug 2001, Stephen Tweedie wrote:

> > We don't need all the paths, and not any specific path, just a path.
> 
> Exactly, because fsync makes absolutely no gaurantees about the
> namespace.  So a lost+found path is quite sufficient.

MTA authors don't share this. lost+found is "invisible" for the
application that created the file.

I have yet to meet a distribution which scans lost+found at boot time
and syslogs found files or sends root a mail.

So, effectively, lost+found will NOT be sufficient. Discarding file
names at will is not a good thing.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-02 19:47                   ` Bill Rugolsky Jr.
@ 2001-08-03 18:22                     ` Matthias Andree
  0 siblings, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-03 18:22 UTC (permalink / raw)
  To: Bill Rugolsky Jr.; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Thu, 02 Aug 2001, Bill Rugolsky Jr. wrote:

> I have no idea where BSD falls, but the basic point stands:  unused
> features should not penalize other applications.  Andrew Morton has
> figured out how to do this efficiently with ext3, and many kudos to him
> for doing the work.  Absent that, why should I have to go get a cup of
> coffee every time I want to patch a tree, just so some MTA can make
> naive assumptions?

The whole idea is to have a switch to turn on BSD-style synchronous
directory update semantics. Nothing more, nothing you would not be able
to get rid off. In fact, you can mount file systems async on BSD as
well, but you'd better not have the machine crash. Irrecoverable file
system damage can result. As a compromise, softupdates are nearly as
fast as async, but FS damage is guaranteed to be recoverable.

In either case (async or soft-updates), files can end up in lost+found
after the control had been returned to the application that called open
or link.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:30                   ` Stephen C. Tweedie
@ 2001-08-03 18:28                     ` Matthias Andree
  0 siblings, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-03 18:28 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Daniel Phillips, linux-kernel

On Fri, 03 Aug 2001, Stephen Tweedie wrote:

> > > The prescription for symlinks is, if you want them safely on disk you 
> > > have to explicitly fsync the containing directory.
> > 
> > Yes, and it doesn't matter, since MTAs don't use symlinks (symlinks
> > waste inodes on most systems).
> 
> Irrelevant.   We're talking about what makes sensible semantics, not
> what assumptions any specific application makes.  It makes no sense to
> say that dirsync won't affect symlinks just because some existing
> applications don't rely on that!

It's rather my imagination that tracking hard links might be easier than
symlinks because hard links share the inode number. A more advanced (and
complex) implementation might prove the imagination wrong. I don't want
to consider which one is more efficient.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:50                   ` David Weinehall
@ 2001-08-03 18:31                     ` Matthias Andree
  2001-08-03 19:59                     ` Albert D. Cahalan
  1 sibling, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-03 18:31 UTC (permalink / raw)
  To: David Weinehall; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Fri, 03 Aug 2001, David Weinehall wrote:

> On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:
> > Still, some people object to a dirsync mount option. But this has been
> > the actual reason for the thread - MTA authors are refusing to pamper
> > Linux and use chattr +S instead which gives unnecessary (premature) sync
> > operations on write() - but MTAs know how to fsync().
> 
> So what you mean is that MTA authors refuse to pamper Linux through use
> of fsync of the directory, but can accept to "pamper" Linux through use
> of chattr +S?! This seem ridiculous.  It seems equally ridiculous to
> demand that Linux should pamper for MTA authors that can't implement
> fsync on the directory instead of writing BSD-specific code.

It's a maintenance issue.

You effectively start wrapping up all relevant syscalls and have
system-specific interfaces. One wants the directory fsync()ed, the other
offers a special other trick to get the data flushed... what useful is
portability then if systems are so different?

> To me this seems mostly like a way of saying "Hey, we've finally found
> a way to make Linux look really bad compared to BSD-systems; let's

No wonder if the application chooses fully-synchronous operation on
Linux.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03 19:59                     ` Albert D. Cahalan
@ 2001-08-03 19:54                       ` Gregory Maxwell
  2001-08-04  3:30                       ` don't feed the trolls (was: intermediate summary of ext3-2.4-0.9.4 thread) Matthias Andree
  1 sibling, 0 replies; 662+ messages in thread
From: Gregory Maxwell @ 2001-08-03 19:54 UTC (permalink / raw)
  To: Albert D. Cahalan
  Cc: David Weinehall, Daniel Phillips, Stephen C. Tweedie, linux-kernel

On Fri, Aug 03, 2001 at 03:59:02PM -0400, Albert D. Cahalan wrote:
[snip]
> Somebody can create a big MTA list, listing the good and bad ones.
> Then we get the Linux-hostile MTAs out of the Linux distributions,
> demanding compliance like we do for filesystem layout. We also hunt
> down Linux-related web pages that mention these MTAs and get the
> pages changed or removed. The point is to make these MTAs just
> disappear, never to be seen again. Nice MTAs get promoted.

Think we could just get their authors to 'disappear'? It might be more cost
effective, and I can think of at least one example where removing the author
would have other benefits beyond MTAs. :) :)

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:50                   ` David Weinehall
  2001-08-03 18:31                     ` Matthias Andree
@ 2001-08-03 19:59                     ` Albert D. Cahalan
  2001-08-03 19:54                       ` Gregory Maxwell
  2001-08-04  3:30                       ` don't feed the trolls (was: intermediate summary of ext3-2.4-0.9.4 thread) Matthias Andree
  1 sibling, 2 replies; 662+ messages in thread
From: Albert D. Cahalan @ 2001-08-03 19:59 UTC (permalink / raw)
  To: David Weinehall; +Cc: Daniel Phillips, Stephen C. Tweedie, linux-kernel

David Weinehall writes:
> On Thu, Aug 02, 2001 at 07:37:50PM +0200, Matthias Andree wrote:

>> Still, some people object to a dirsync mount option. But this has been
>> the actual reason for the thread - MTA authors are refusing to pamper
>> Linux and use chattr +S instead which gives unnecessary (premature) sync
>> operations on write() - but MTAs know how to fsync().
>
> So what you mean is that MTA authors refuse to pamper Linux through use
> of fsync of the directory, but can accept to "pamper" Linux through use
> of chattr +S?! This seem ridiculous.  It seems equally ridiculous to
> demand that Linux should pamper for MTA authors that can't implement
> fsync on the directory instead of writing BSD-specific code.
>
> [snip]
>
> To me this seems mostly like a way of saying "Hey, we've finally found
> a way to make Linux look really bad compared to BSD-systems; let's
> complain instead of writing alternative code that suits Linux systems
> better than this code does." A lot like all the discussions on threads,
> ueally.

This is just completely true. One wonders why we seem to enjoy
getting screwed this way. We shouldn't be patching these MTAs or
hacking Linux to act like BSD. We should be avoiding these MTAs.

Somebody can create a big MTA list, listing the good and bad ones.
Then we get the Linux-hostile MTAs out of the Linux distributions,
demanding compliance like we do for filesystem layout. We also hunt
down Linux-related web pages that mention these MTAs and get the
pages changed or removed. The point is to make these MTAs just
disappear, never to be seen again. Nice MTAs get promoted.



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
       [not found]                     ` <mit.lcs.mail.linux-kernel/20010803132457.A30127@cs.cmu.edu>
@ 2001-08-03 21:21                       ` Patrick J. LoPresti
  2001-08-04  3:13                         ` ext3-2.4-0.9.4 Matthias Andree
  2001-08-07  2:09                         ` ext3-2.4-0.9.4 James Antill
  0 siblings, 2 replies; 662+ messages in thread
From: Patrick J. LoPresti @ 2001-08-03 21:21 UTC (permalink / raw)
  To: linux-kernel

Jan Harkes <jaharkes@cs.cmu.edu> writes:

> Here is the relevant mail,
> 
> On Mon, Jul 30, 2001 at 01:11:32PM -0400, Lawrence Greenfield wrote:
> } BSD softupdates allows you to call fsync() on the file, and this will
> } sync the directories all the way up to the root if necessary.
> } 
> } Thus BSD fsync() actually guarantees that when it returns, the file
> } (and all of it's filenames) will survive a reboot.
> } 
> } Sendmail does:
> } fd = open(tmp)
> } write(fd)
> } fsync(fd)
> } rename(tmp, final)
> } fsync(fd)
> } 
> } Cyrus IMAP does:
> } fd = open(tmp)
> } write(fd)
> } fsync(fd)
> } link(tmp, final1)
> } link(tmp, final2)
> } link(tmp, final3)
> } fsync(fd)
> } close(fd)
> } unlink(tmp)
> } 
> } The idea that Linux fsync() doesn't actually make the file survive
> } reboots is pretty ridiculous.
> 
> As you can see, the 'sync a path leading to the file' semantics from SuS
> don't work in the Cyrus IMAP case as is specifically requires to have
> _all_ paths committed to disk before fsync returns.

To fill in more of the table, Qmail does:

  fd = open(tmp)
  write(fd)
  fsync(fd)
  link(tmp,final)
  close(fd)

...and Postfix does:

  fd = open(final)
  write(fd)
  (should be an "fsync(fd)" here, but I cannot find it)
  fchmod(fd,+execute)
  fsync(fd)
  close(fd)

Postfix apparently uses the execute bit to indicate that delivery is
complete.  I am probably misreading the source (version 20010228
Patchlevel 3), but I do not see any fsync() between the write and the
fchmod.  Surely it is there or this delivery scheme is not reliable on
any system, since without an intervening fsync() the writes to the
data and the permissions can happen out of order.

Anyway, it is certainly true that it is largely useless to have
fsync() commit only one path to a file; many applications expect to be
able to force a simple link(x,y) to be committed to disk.

I know this thread is already much too long, but I am still not sure I
understand the conclusions.  I *think* it is true that:

  1) People disagree about what SuS mandates, but at least a few
     critical developers (e.g., sct) say it definitely does not
     require synchronizing directory entries for fsync().

  2) It would be fairly easy and efficient for fsync() to chase one
     chain of directory entries up to the root, but a lot harder and
     slower to find and commit all of them.

  3) Most (?) core developers, including Linus (?), would not object
     to "dirsync" as a mount option and/or directory attribute, but
     somebody has to rise to the occasion and create the patches.

Is this an accurate summary?

 - Pat

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
       [not found]                         ` <20010803021406.A9845@emma1.emma.line.org>
  2001-08-03 16:20                           ` Jan Harkes
@ 2001-08-03 22:48                           ` Andreas Dilger
  1 sibling, 0 replies; 662+ messages in thread
From: Andreas Dilger @ 2001-08-03 22:48 UTC (permalink / raw)
  To: Matthias Andree
  Cc: Anton Altaparmakov, Matthias Andree, Andreas Dilger,
	Alexander Viro, Daniel Phillips, Stephen C. Tweedie,
	linux-kernel



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 21:21                       ` ext3-2.4-0.9.4 Patrick J. LoPresti
@ 2001-08-04  3:13                         ` Matthias Andree
  2001-08-04  3:20                           ` ext3-2.4-0.9.4 Rik van Riel
  2001-08-04  3:50                           ` ext3-2.4-0.9.4 Patrick J. LoPresti
  2001-08-07  2:09                         ` ext3-2.4-0.9.4 James Antill
  1 sibling, 2 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-04  3:13 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: linux-kernel

On Fri, 03 Aug 2001, Patrick J. LoPresti wrote:

> To fill in more of the table, Qmail does:
> 
>   fd = open(tmp)
>   write(fd)
>   fsync(fd)
>   link(tmp,final)
>   close(fd)

http://cr.yp.to/qmail/faq/reliability.html

> ...and Postfix does:
> 
>   fd = open(final)
>   write(fd)
>   (should be an "fsync(fd)" here, but I cannot find it)
>   fchmod(fd,+execute)
>   fsync(fd)
>   close(fd)

> Postfix apparently uses the execute bit to indicate that delivery is
> complete.  I am probably misreading the source (version 20010228
> Patchlevel 3), but I do not see any fsync() between the write and the
> fchmod.  Surely it is there or this delivery scheme is not reliable on
> any system, since without an intervening fsync() the writes to the
> data and the permissions can happen out of order.

Not really. The error code if fsync() or close failed are propagated
back to the caller who then decides what to do. smtpd.c nukes the file.
postdrop.c/sendmail.c do not, but the pickup daemon will see that the
file had problems on sync and discard it.

I'm asking Wietse off-list how reliable this approach is and will report
back privately. It should be fairly reliable.

> Anyway, it is certainly true that it is largely useless to have
> fsync() commit only one path to a file; many applications expect to be
> able to force a simple link(x,y) to be committed to disk.

BSD FFS + softupdates sync all file names, traversing from the mount
point down to the actual directory entries that need to be synched.

>   1) People disagree about what SuS mandates, but at least a few
>      critical developers (e.g., sct) say it definitely does not
>      require synchronizing directory entries for fsync().
> 
>   2) It would be fairly easy and efficient for fsync() to chase one
>      chain of directory entries up to the root, but a lot harder and
>      slower to find and commit all of them.

For BSD FFS + softupdates, this is already done.

>   3) Most (?) core developers, including Linus (?), would not object
>      to "dirsync" as a mount option and/or directory attribute, but
>      somebody has to rise to the occasion and create the patches.
> 
> Is this an accurate summary?

It looks so to me. After the MTA behaviour has been dug up, the dirsync
option could be even weaker if fsync() behaved like FFS + softupdates:
sync the directory entries, including those of link and rename, as well.

The only things to consider would be unlink and symlink. symlinks are
tough since you cannot open() them. Not sure about unlink, looks as if
there's really no way apart from fsync(2)ing the directory or sync(2)ing
the world for these two unless there's a dirsync option coming up.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-04  3:50                           ` ext3-2.4-0.9.4 Patrick J. LoPresti
@ 2001-08-04  3:14                             ` Gregory Maxwell
  2001-08-04  4:26                             ` ext3-2.4-0.9.4 Mike Castle
  2001-08-04  4:29                             ` ext3-2.4-0.9.4 Matthias Andree
  2 siblings, 0 replies; 662+ messages in thread
From: Gregory Maxwell @ 2001-08-04  3:14 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: Matthias Andree, linux-kernel

On Fri, Aug 03, 2001 at 11:50:23PM -0400, Patrick J. LoPresti wrote:
> > http://cr.yp.to/qmail/faq/reliability.html
> 
> ...which is consistent.  Qmail is assuming that the link() is
> synchronous, as it was back in the "Good Old Days" of stock FFS.

That isn't the only cruft there from the "Good Old Days":

"Battery backups will keep your server alive, letting you park the disk to
avoid a head crash, when the power goes out."

What the hell? Self-parking drives predate qmail by quite a long time.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-04  3:13                         ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-08-04  3:20                           ` Rik van Riel
  2001-08-04  3:50                           ` ext3-2.4-0.9.4 Patrick J. LoPresti
  1 sibling, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-08-04  3:20 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Patrick J. LoPresti, linux-kernel

On Sat, 4 Aug 2001, Matthias Andree wrote:

> http://cr.yp.to/qmail/faq/reliability.html

You should teach the guy about MTAs one day.

Softupdates and other filesystems are perfectly
safe with mailers which work with the filesystem
instead of prescribing the filesystem authors
how they should do their work ;)

Oh, most of those other mailers are RFC compliant
too, but that's a separate issue ;)

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 662+ messages in thread

* don't feed the trolls (was: intermediate summary of ext3-2.4-0.9.4 thread)
  2001-08-03 19:59                     ` Albert D. Cahalan
  2001-08-03 19:54                       ` Gregory Maxwell
@ 2001-08-04  3:30                       ` Matthias Andree
  2001-08-04 21:22                         ` Albert D. Cahalan
  1 sibling, 1 reply; 662+ messages in thread
From: Matthias Andree @ 2001-08-04  3:30 UTC (permalink / raw)
  To: Albert D. Cahalan; +Cc: linux-kernel

On Fri, 03 Aug 2001, Albert D. Cahalan wrote:

> This is just completely true. One wonders why we seem to enjoy
> getting screwed this way. We shouldn't be patching these MTAs or
> hacking Linux to act like BSD. We should be avoiding these MTAs.

Oh, you should make a start avoiding any MTAs because that way, this
list would get rid of one trouble maker after all.

Don't feed the trolls.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-04  3:13                         ` ext3-2.4-0.9.4 Matthias Andree
  2001-08-04  3:20                           ` ext3-2.4-0.9.4 Rik van Riel
@ 2001-08-04  3:50                           ` Patrick J. LoPresti
  2001-08-04  3:14                             ` ext3-2.4-0.9.4 Gregory Maxwell
                                               ` (2 more replies)
  1 sibling, 3 replies; 662+ messages in thread
From: Patrick J. LoPresti @ 2001-08-04  3:50 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> On Fri, 03 Aug 2001, Patrick J. LoPresti wrote:
> 
> > To fill in more of the table, Qmail does:
> > 
> >   fd = open(tmp)
> >   write(fd)
> >   fsync(fd)
> >   link(tmp,final)
> >   close(fd)
> 
> http://cr.yp.to/qmail/faq/reliability.html

...which is consistent.  Qmail is assuming that the link() is
synchronous, as it was back in the "Good Old Days" of stock FFS.

> > ...and Postfix does:
> > 
> >   fd = open(final)
> >   write(fd)
> >   (should be an "fsync(fd)" here, but I cannot find it)
> >   fchmod(fd,+execute)
> >   fsync(fd)
> >   close(fd)
> 
> > Postfix apparently uses the execute bit to indicate that delivery is
> > complete.  I am probably misreading the source (version 20010228
> > Patchlevel 3), but I do not see any fsync() between the write and the
> > fchmod.  Surely it is there or this delivery scheme is not reliable on
> > any system, since without an intervening fsync() the writes to the
> > data and the permissions can happen out of order.
> 
> Not really. The error code if fsync() or close failed are propagated
> back to the caller who then decides what to do. smtpd.c nukes the
> file.

That is not the problem.  The problem is that the system could start
flushing blocks to disk after the call to fchmod and before the call
to fsync.  If so, the system could write the mode bits first and then
crash before writing the data, leaving the execute bit set on the file
but without valid data within.  This could result in a corrupted mail
message.

To avoid this, Postfix *must* do fsync() or fdatasync() after the
write() and before the fchmod()+fsync(); that will insure that the
execute bit implies valid ("committed") data in the file.  I was
unable to find any such call to fsync() or fdatasync(), but as I
mentioned, I am probably simply misreading the code.

> > Anyway, it is certainly true that it is largely useless to have
> > fsync() commit only one path to a file; many applications expect to be
> > able to force a simple link(x,y) to be committed to disk.
> 
> BSD FFS + softupdates sync all file names, traversing from the mount
> point down to the actual directory entries that need to be synched.

...and the Linux developers continue to insist that this is "stupid".
Ah, the joys of gaps in standards.

> It looks so to me. After the MTA behaviour has been dug up, the
> dirsync option could be even weaker if fsync() behaved like FFS +
> softupdates: sync the directory entries, including those of link and
> rename, as well.

Ideally, this would be an option you could set per-application (as
opposed to per-directory or per-mountpoint), because we are really
talking about allowing Linux to support applications written for BSD
file system semantics.  It is not obvious to me what the best
implementation for that would be, though.  Maybe just a compile-time
option to choose the appropriate open/link/rename/etc. operations.

 - Pat

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-04  3:50                           ` ext3-2.4-0.9.4 Patrick J. LoPresti
  2001-08-04  3:14                             ` ext3-2.4-0.9.4 Gregory Maxwell
@ 2001-08-04  4:26                             ` Mike Castle
  2001-08-04  4:30                               ` ext3-2.4-0.9.4 Rik van Riel
  2001-08-04  4:29                             ` ext3-2.4-0.9.4 Matthias Andree
  2 siblings, 1 reply; 662+ messages in thread
From: Mike Castle @ 2001-08-04  4:26 UTC (permalink / raw)
  To: linux-kernel

On Fri, Aug 03, 2001 at 11:50:23PM -0400, Patrick J. LoPresti wrote:
> Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:
> 
> > On Fri, 03 Aug 2001, Patrick J. LoPresti wrote:
> > 
> > > To fill in more of the table, Qmail does:
> > > 
> > >   fd = open(tmp)
> > >   write(fd)
> > >   fsync(fd)
> > >   link(tmp,final)
> > >   close(fd)
> > 
> > http://cr.yp.to/qmail/faq/reliability.html
> 
> ...which is consistent.  Qmail is assuming that the link() is
> synchronous, as it was back in the "Good Old Days" of stock FFS.

Which, from my reading of the archives, even BSD folk say is a "Bad 
Thing(tm)."

mrc

-- 
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-04  3:50                           ` ext3-2.4-0.9.4 Patrick J. LoPresti
  2001-08-04  3:14                             ` ext3-2.4-0.9.4 Gregory Maxwell
  2001-08-04  4:26                             ` ext3-2.4-0.9.4 Mike Castle
@ 2001-08-04  4:29                             ` Matthias Andree
  2001-08-06 16:10                               ` fsync() on directories (was Re: ext3-2.4-0.9.4) Patrick J. LoPresti
  2 siblings, 1 reply; 662+ messages in thread
From: Matthias Andree @ 2001-08-04  4:29 UTC (permalink / raw)
  To: Patrick J. LoPresti; +Cc: Matthias Andree, linux-kernel

On Fri, 03 Aug 2001, Patrick J. LoPresti wrote:

> To avoid this, Postfix *must* do fsync() or fdatasync() after the
> write() and before the fchmod()+fsync(); that will insure that the
> execute bit implies valid ("committed") data in the file.  I was
> unable to find any such call to fsync() or fdatasync(), but as I
> mentioned, I am probably simply misreading the code.

Thanks for clarifying, I'm asking Wietse to figure if Postfix's
queue file format is sufficient to check integrity.

> > It looks so to me. After the MTA behaviour has been dug up, the
> > dirsync option could be even weaker if fsync() behaved like FFS +
> > softupdates: sync the directory entries, including those of link and
> > rename, as well.
> 
> Ideally, this would be an option you could set per-application (as
> opposed to per-directory or per-mountpoint), because we are really
> talking about allowing Linux to support applications written for BSD
> file system semantics.  It is not obvious to me what the best
> implementation for that would be, though.  Maybe just a compile-time
> option to choose the appropriate open/link/rename/etc. operations.

To add to that confusion and alternatvies:
HAVE: async, sync
SUGGEST: 1. BSD dirsync, 2. "weak" unlink/symlink dirsync and have
fsync() track and sync pending link/rename as well, 3. make just symlink
dirsync, 4. be confused of all the options.

Where this could be set: directory inode flag, mount option, process
flag (like umask), include file.


Seriously, if fsync() syncs the effects of link and rename as well,
there's no need to make them synchronous unconditionally except if one
wants to offer a "traditional BSD synchronous directory semantics" mode. 

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-04  4:26                             ` ext3-2.4-0.9.4 Mike Castle
@ 2001-08-04  4:30                               ` Rik van Riel
  0 siblings, 0 replies; 662+ messages in thread
From: Rik van Riel @ 2001-08-04  4:30 UTC (permalink / raw)
  To: Mike Castle; +Cc: linux-kernel

On Fri, 3 Aug 2001, Mike Castle wrote:
> On Fri, Aug 03, 2001 at 11:50:23PM -0400, Patrick J. LoPresti wrote:
> > Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:
> > >
> > > http://cr.yp.to/qmail/faq/reliability.html
> >
> > ...which is consistent.  Qmail is assuming that the link() is
> > synchronous, as it was back in the "Good Old Days" of stock FFS.
>
> Which, from my reading of the archives, even BSD folk say is a "Bad
> Thing(tm)."

...which is consistent, looking at the other things
qmail does in strange ways ;)

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: intermediate summary of ext3-2.4-0.9.4 thread
  2001-08-03  8:39                         ` Matthias Andree
  2001-08-03  9:57                           ` Christoph Hellwig
@ 2001-08-04  7:55                           ` Eric W. Biederman
  1 sibling, 0 replies; 662+ messages in thread
From: Eric W. Biederman @ 2001-08-04  7:55 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Paul Jakma, linux-kernel

Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> On Fri, 03 Aug 2001, Eric W. Biederman wrote:
> 
> > Actually given that this thread keeps coming up, but no one does anything
> > about it.  I'm tempted to suggest we remove chatrr +S support from ext2.
> > Then there will be enough pain that someone will fix the MTA instead of
> > moaning that kernel is slow...
> 
> They'd just drop Linux from the list of supported OS's, Linux will
> disappoint people who trusted it, nothing is gained. Deliberate breakage
> will not happen, because it would not help anyone except people with
> twisted minds.

There are some other uses for a fully synchronous disk accesses so I'm
not going to run out and do it.  The point is that work arounds for
strange programs is not a right, it is a nice optional feature.
 
> NO-ONE, including you, has come up with SERIOUS objections against a
> dirsync option, except "is it really so much slower than chattr +S? show
> figures" -- ext3 is being tuned to be fast in spite of chattr +S.

Clear objects against dirsync.  
- Extra code maintenance, makes the fs less reliable 
  (A reason for removing even synchrouns fs operations BTW).
- Unnecessary. fsync(dir) works today.
- dirsync is unlikely to be faster than fsync(file); fsync(dir) [not chattr +S]
  You really need something that can say remember these 5 syscalls,
  and sync the all their changes to disk togeter to really get an
  improvement in sync speed.
- I don't see anyone volunteering to write the code.

> Reconsider your position.

Nope.  Right now I would rather
a) Patch the mail programs to do the needed fsync(dir)
b) Totally remove synchrous disk updates from my OS, and
   make life really painful.
Before adding a dirsync option.

> Stop trolling please.

It wasn't trying to troll, just get this conversation on some productive
grounds.  I think supporting the MTA's is good, so long as it is a two
way relationship.

If someone went out and tried using fsync(dir) and then saying it
sucked we could definentily have more peace.

Using dirsync, and chattr +S hide the real problems that need to get
fixed.  Getting a good reliable, and high performance way to commit
actions to a filesystem.  

We already have one work around on linux that will work reliably.  So
now let's see if we can get a functional high performance solution.

And oh btw, new, functional high performance solutions are not
portable because they haven't been implemented in every operating
system.  Full understanding of the problem, and the solutions are two
new for the implementations to have gotten around.

Eric




^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: don't feed the trolls (was: intermediate summary of ext3-2.4-0.9.4 thread)
  2001-08-04  3:30                       ` don't feed the trolls (was: intermediate summary of ext3-2.4-0.9.4 thread) Matthias Andree
@ 2001-08-04 21:22                         ` Albert D. Cahalan
  2001-08-09 11:58                           ` Matthias Andree
  0 siblings, 1 reply; 662+ messages in thread
From: Albert D. Cahalan @ 2001-08-04 21:22 UTC (permalink / raw)
  To: /dev/null; +Cc: Albert D. Cahalan, linux-kernel

Matthias Andree writes:
> On Fri, 03 Aug 2001, Albert D. Cahalan wrote:

>> This is just completely true. One wonders why we seem to enjoy
>> getting screwed this way. We shouldn't be patching these MTAs or
>> hacking Linux to act like BSD. We should be avoiding these MTAs.
>
> Oh, you should make a start avoiding any MTAs because that way, this
> list would get rid of one trouble maker after all.
>
> Don't feed the trolls.

That wasn't intended to be a troll, though I do realize that it
could cause some noise -- including your post. Plenty of noise is
already being generated trying to accommodate hostile MTA authors.

Seriously, consider:

1. there are MTA authors that actively promote BSD over Linux
2. Linux users and distributions promote their MTA software

There is no sense in this. It is masochism and suicide.
It is worse than a waste of time to accommodate these MTAs.

Getting back on topic... while non-inherited ext2 attributes might
be nice, I'm sure the ext2/VFS authors don't need to be pestered
about it, and certainly not because of some lame software making
non-standard assumptions about filesystem behavior.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Question regarding ACPI
       [not found] ` <no.id>
                     ` (55 preceding siblings ...)
  2001-08-03 17:02   ` DoS with tmpfs #3 Alan Cox
@ 2001-08-04 23:15   ` Alan Cox
  2001-08-05  0:46   ` Error when compiling 2.4.7ac6 Alan Cox
                     ` (146 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-04 23:15 UTC (permalink / raw)
  To: Matthew Gardiner; +Cc: Mr Kernel Dude

> Will ACPI error be corrected. Since I started compiling my own kernel, 
> version 2.4.6, ACPI hasn't worked. Has the kernel maintainer given up on 
> it? or is it still in the ACPI think tank?

You probably want to drop a mail to andrew.grover@intel.com and also see
http://phobos.fs.tum.de/acpi/index.html which is the mailing list for
ACPI. 

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Error when compiling 2.4.7ac6
       [not found] ` <no.id>
                     ` (56 preceding siblings ...)
  2001-08-04 23:15   ` Question regarding ACPI Alan Cox
@ 2001-08-05  0:46   ` Alan Cox
  2001-08-05  1:01   ` MTRR and Athlon Processors Alan Cox
                     ` (145 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-05  0:46 UTC (permalink / raw)
  To: Matthew Gardiner; +Cc: Mr Kernel Dude

> gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes 
> -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common 
> pipe -mpreferred-stack-boundary=2 -march=i686    -DEXPORT_SYMTAB -c check.c
> In file included from check.c:28:
> ldm.h:100: warning: `SYS_IND' redefined
> ldm.h:84: warning: this is the location of the previous definition
> ldm.h:104: warning: `NR_SECTS' redefined
> ldm.h:88: warning: this is the location of the previous definition
> ldm.h:109: warning: `START_SECT' redefined
> ldm.h:92: warning: this is the location of the previous definition
> gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes 
> -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common 
> pipe -mpreferred-stack-boundary=2 -march=i686    -c -o msdos.o msdos.c
> rm -f partitions.o

Thanks - fixed 

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: MTRR and Athlon Processors
       [not found] ` <no.id>
                     ` (57 preceding siblings ...)
  2001-08-05  0:46   ` Error when compiling 2.4.7ac6 Alan Cox
@ 2001-08-05  1:01   ` Alan Cox
  2001-08-05  1:02     ` Paul G. Allen
  2001-08-05  1:39   ` Error when compiling 2.4.7ac6 Anton Altaparmakov
                     ` (144 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-08-05  1:01 UTC (permalink / raw)
  To: Paul G. Allen; +Cc: Linux kernel developer's mailing list

> Is the mtrr code supposed to work properly for Athlon (Model 4) in
> kernel 2.4.7?
> 
> I still get mtrr errors/warnings.

Mismatched mtrr warnings indicate bios writers who cannot read
specifications. The kernel will fix up after it

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: MTRR and Athlon Processors
  2001-08-05  1:01   ` MTRR and Athlon Processors Alan Cox
@ 2001-08-05  1:02     ` Paul G. Allen
  2001-08-05  2:28       ` Dave Jones
  0 siblings, 1 reply; 662+ messages in thread
From: Paul G. Allen @ 2001-08-05  1:02 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux kernel developer's mailing list

Alan Cox wrote:
> 
> > Is the mtrr code supposed to work properly for Athlon (Model 4) in
> > kernel 2.4.7?
> >
> > I still get mtrr errors/warnings.
> 
> Mismatched mtrr warnings indicate bios writers who cannot read
> specifications. 

This does not surprise me. In fact, I need to check Tyan's web site for
a BIOS update.

> The kernel will fix up after it

I also get this message:

Jul 29 03:33:00 keroon kernel: mtrr: type mismatch for f8000000,4000000
old: write-back new: write-combining

This happens quite often, especially with the agpgart and NVdriver
modules.

PGA

-- 
Paul G. Allen
UNIX Admin II/Network Security
Akamai Technologies, Inc.
www.akamai.com

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Error when compiling 2.4.7ac6
       [not found] ` <no.id>
                     ` (58 preceding siblings ...)
  2001-08-05  1:01   ` MTRR and Athlon Processors Alan Cox
@ 2001-08-05  1:39   ` Anton Altaparmakov
  2001-08-05  1:43   ` Alan Cox
                     ` (143 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Anton Altaparmakov @ 2001-08-05  1:39 UTC (permalink / raw)
  To: Alan Cox; +Cc: Matthew Gardiner, Mr Kernel Dude

At 01:46 05/08/2001, Alan Cox wrote:
> > gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes
> > -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common
> > pipe -mpreferred-stack-boundary=2 -march=i686    -DEXPORT_SYMTAB -c check.c
> > In file included from check.c:28:
> > ldm.h:100: warning: `SYS_IND' redefined
> > ldm.h:84: warning: this is the location of the previous definition
> > ldm.h:104: warning: `NR_SECTS' redefined
> > ldm.h:88: warning: this is the location of the previous definition
> > ldm.h:109: warning: `START_SECT' redefined
> > ldm.h:92: warning: this is the location of the previous definition
> > gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes
> > -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common
> > pipe -mpreferred-stack-boundary=2 -march=i686    -c -o msdos.o msdos.c
> > rm -f partitions.o
>
>Thanks - fixed

It's quite funny gcc-2.96 doesn't give these warnings. Perhaps it sees that 
the defines are identical and shuts up?

Anton


-- 
   "Nothing succeeds like success." - Alexandre Dumas
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Error when compiling 2.4.7ac6
       [not found] ` <no.id>
                     ` (59 preceding siblings ...)
  2001-08-05  1:39   ` Error when compiling 2.4.7ac6 Anton Altaparmakov
@ 2001-08-05  1:43   ` Alan Cox
  2001-08-05  1:58   ` Anton Altaparmakov
                     ` (142 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-05  1:43 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Alan Cox, Matthew Gardiner, Mr Kernel Dude

> It's quite funny gcc-2.96 doesn't give these warnings. Perhaps it sees that 
> the defines are identical and shuts up?

They are actually not identical - the bracketing varies

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Error when compiling 2.4.7ac6
       [not found] ` <no.id>
                     ` (60 preceding siblings ...)
  2001-08-05  1:43   ` Alan Cox
@ 2001-08-05  1:58   ` Anton Altaparmakov
  2001-08-05 13:04   ` SMP Support for AMD Athlon MP motherboards Alan Cox
                     ` (141 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Anton Altaparmakov @ 2001-08-05  1:58 UTC (permalink / raw)
  To: Alan Cox; +Cc: Alan Cox, Matthew Gardiner, Mr Kernel Dude

At 02:43 05/08/2001, Alan Cox wrote:
> > It's quite funny gcc-2.96 doesn't give these warnings. Perhaps it sees 
> that
> > the defines are identical and shuts up?
>
>They are actually not identical - the bracketing varies

Oh, ok. I didn't know they hadn't been copied verbatim...

Anton


-- 
   "Nothing succeeds like success." - Alexandre Dumas
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: MTRR and Athlon Processors
  2001-08-05  1:02     ` Paul G. Allen
@ 2001-08-05  2:28       ` Dave Jones
  2001-08-05  2:35         ` Paul G. Allen
  0 siblings, 1 reply; 662+ messages in thread
From: Dave Jones @ 2001-08-05  2:28 UTC (permalink / raw)
  To: Paul G. Allen; +Cc: Alan Cox, Linux kernel developer's mailing list

On Sat, 4 Aug 2001, Paul G. Allen wrote:

> Jul 29 03:33:00 keroon kernel: mtrr: type mismatch for f8000000,4000000
> old: write-back new: write-combining
>
> This happens quite often, especially with the agpgart and NVdriver
> modules.

iirc, this is a problem with the nvidia module, and there's nothing
the kernel can do about it. Complain to nvidia.

regards,

Dave.

-- 
| Dave Jones.        http://www.suse.de/~davej
| SuSE Labs


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: MTRR and Athlon Processors
  2001-08-05  2:28       ` Dave Jones
@ 2001-08-05  2:35         ` Paul G. Allen
  0 siblings, 0 replies; 662+ messages in thread
From: Paul G. Allen @ 2001-08-05  2:35 UTC (permalink / raw)
  Cc: Linux kernel developer's mailing list

Dave Jones wrote:
> 
> On Sat, 4 Aug 2001, Paul G. Allen wrote:
> 
> > Jul 29 03:33:00 keroon kernel: mtrr: type mismatch for f8000000,4000000
> > old: write-back new: write-combining
> >
> > This happens quite often, especially with the agpgart and NVdriver
> > modules.
> 
> iirc, this is a problem with the nvidia module, and there's nothing
> the kernel can do about it. Complain to nvidia.
> 

If I knew who to complain to, I would. I used to have a contact there,
but I seem to have lost his e-mail address. :(

(BTW, There's no update to the Tyan [Pheonix] BIOS as yet either.)

PGA

-- 
Paul G. Allen
UNIX Admin II/Network Security
Akamai Technologies, Inc.
www.akamai.com

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: SMP Support for AMD Athlon MP motherboards
       [not found] ` <no.id>
                     ` (61 preceding siblings ...)
  2001-08-05  1:58   ` Anton Altaparmakov
@ 2001-08-05 13:04   ` Alan Cox
  2001-08-05 13:20   ` 3c509: broken(verified) Alan Cox
                     ` (140 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-05 13:04 UTC (permalink / raw)
  To: Andre Tomt; +Cc: linux-kernel

> Whats the degree of support in Linux for such an AMD mobo? Is the Athlo=
> n MP
> architecture supported at all yet?

It is supported yes. You should have seen both processors and stability

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 3c509: broken(verified)
       [not found] ` <no.id>
                     ` (62 preceding siblings ...)
  2001-08-05 13:04   ` SMP Support for AMD Athlon MP motherboards Alan Cox
@ 2001-08-05 13:20   ` Alan Cox
  2001-08-05 14:23     ` Nico Schottelius
  2001-08-06 13:51   ` Problem in Interrupt Handling Alan Cox
                     ` (139 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-08-05 13:20 UTC (permalink / raw)
  To: Nico Schottelius; +Cc: Linux Kernel Mailing List

> The driver for the 3c509 of 2.4.7 is broken:
> It is impossible to set the transmitter type.
> modprobe 3c509 xcvr=X, where X is
> 0,1,2,3,4 doesn't make a difference.

Looking at the code it should set the type fine. The only bug I can see is
that it will report the default type set in the eeprom not the type you
asked.

If thats the case (please check) then its trivial to fix

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 3c509: broken(verified)
  2001-08-05 13:20   ` 3c509: broken(verified) Alan Cox
@ 2001-08-05 14:23     ` Nico Schottelius
  2001-08-05 16:00       ` safemode
  0 siblings, 1 reply; 662+ messages in thread
From: Nico Schottelius @ 2001-08-05 14:23 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List

Alan Cox wrote:

> > The driver for the 3c509 of 2.4.7 is broken:
> > It is impossible to set the transmitter type.
> > modprobe 3c509 xcvr=X, where X is
> > 0,1,2,3,4 doesn't make a difference.
>
> Looking at the code it should set the type fine. The only bug I can see is
> that it will report the default type set in the eeprom not the type you
> asked.
>
> If thats the case (please check) then its trivial to fix

While I tried to setup the driver I always let one machine
outside ping it.

It is not just the message.

ozean:~ # modprobe 3c509 ; ifconfig eth1 192.168.4.17 up

eth1: 3c5x9 at 0x360, BNC port, address  00 60 97 39 43 b9, IRQ 5.
3c509.c:1.18 12Mar2001 becker@scyld.com
http://www.scyld.com/network/3c509.html

- the light on the hub keeps off, no ping answer

ozean:~ # ifconfig eth1 down ; rmmod 3c509;

ozean:~ # modprobe 3c509 xcvr=4 debug=4

## xcvr=4 is TP (found on scyld.com/network/3c509.html)


3c509.c:1.18 12Mar2001 becker@scyld.com
http://www.scyld.com/network/3c509.html
eth1: Setting Rx mode to 1 addresses.
  3c509 EEPROM word 7 0x6d50.
  3c509 EEPROM word 0 0x0060.
  3c509 EEPROM word 1 0x9739.
  3c509 EEPROM word 2 0x43b9.
  3c509 EEPROM word 8 0xc096.
  3c509 EEPROM word 9 0x5000.

eth1: 3c5x9 at 0x360, BNC port, address  00 60 97 39 43 b9, IRQ 5.
3c509.c:1.18 12Mar2001 becker@scyld.com
http://www.scyld.com/network/3c509.html
  3c509 EEPROM word 7 0xffff.
eth1: Opening, IRQ 5     status@36e 0000.
eth1: Opened 3c509  IRQ 5  status 2000.
eth1: Setting Rx mode to 1 addresses.

ozean:~ # ifconfig eth1 192.168.4.17 up

- ping does not work, no light is seen


That's it! The cable & the hub are okay.


Nico



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 3c509: broken(verified)
  2001-08-05 14:23     ` Nico Schottelius
@ 2001-08-05 16:00       ` safemode
  2001-08-06 15:54         ` Nico Schottelius
  0 siblings, 1 reply; 662+ messages in thread
From: safemode @ 2001-08-05 16:00 UTC (permalink / raw)
  To: Nico Schottelius; +Cc: Linux Kernel Mailing List

On Sunday 05 August 2001 10:23, Nico Schottelius wrote:
> Alan Cox wrote:
> > > The driver for the 3c509 of 2.4.7 is broken:
> > > It is impossible to set the transmitter type.
> > > modprobe 3c509 xcvr=X, where X is
> > > 0,1,2,3,4 doesn't make a difference.
> >
> > Looking at the code it should set the type fine. The only bug I can see
> > is that it will report the default type set in the eeprom not the type
> > you asked.
> >
> > If thats the case (please check) then its trivial to fix
>
> While I tried to setup the driver I always let one machine
> outside ping it.
>
> It is not just the message.
>
> ozean:~ # modprobe 3c509 ; ifconfig eth1 192.168.4.17 up
>
> eth1: 3c5x9 at 0x360, BNC port, address  00 60 97 39 43 b9, IRQ 5.
> 3c509.c:1.18 12Mar2001 becker@scyld.com
> http://www.scyld.com/network/3c509.html
>
> - the light on the hub keeps off, no ping answer
>
> ozean:~ # ifconfig eth1 down ; rmmod 3c509;
>
> ozean:~ # modprobe 3c509 xcvr=4 debug=4
>
> ## xcvr=4 is TP (found on scyld.com/network/3c509.html)
>
>
> 3c509.c:1.18 12Mar2001 becker@scyld.com
> http://www.scyld.com/network/3c509.html
> eth1: Setting Rx mode to 1 addresses.
>   3c509 EEPROM word 7 0x6d50.
>   3c509 EEPROM word 0 0x0060.
>   3c509 EEPROM word 1 0x9739.
>   3c509 EEPROM word 2 0x43b9.
>   3c509 EEPROM word 8 0xc096.
>   3c509 EEPROM word 9 0x5000.
>
> eth1: 3c5x9 at 0x360, BNC port, address  00 60 97 39 43 b9, IRQ 5.
> 3c509.c:1.18 12Mar2001 becker@scyld.com
> http://www.scyld.com/network/3c509.html
>   3c509 EEPROM word 7 0xffff.
> eth1: Opening, IRQ 5     status@36e 0000.
> eth1: Opened 3c509  IRQ 5  status 2000.
> eth1: Setting Rx mode to 1 addresses.
>
> ozean:~ # ifconfig eth1 192.168.4.17 up
>
> - ping does not work, no light is seen
>
>
> That's it! The cable & the hub are okay.
>
>
> Nico
>

i was just using a 3c509 in my friend's old 486 and it was working fine with 
2.4.7.   Just modprobed it and set up the ips and it was able to ping and be 
pinged.   

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 18:11                           ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-08-06  2:13                             ` Zilvinas Valinskas
  0 siblings, 0 replies; 662+ messages in thread
From: Zilvinas Valinskas @ 2001-08-06  2:13 UTC (permalink / raw)
  To: Stephen C. Tweedie, Daniel Phillips, Jan Harkes, linux-kernel

On Fri, Aug 03, 2001 at 08:11:12PM +0200, Matthias Andree wrote:
> On Fri, 03 Aug 2001, Stephen Tweedie wrote:
> 
> > > We don't need all the paths, and not any specific path, just a path.
> > 
> > Exactly, because fsync makes absolutely no gaurantees about the
> > namespace.  So a lost+found path is quite sufficient.
> 
> MTA authors don't share this. lost+found is "invisible" for the
> application that created the file.
> 
> I have yet to meet a distribution which scans lost+found at boot time
> and syslogs found files or sends root a mail.

Debian Woody ...
> 
> So, effectively, lost+found will NOT be sufficient. Discarding file
> names at will is not a good thing.
> 
> -- 
> Matthias Andree
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Zilvinas Valinskas

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Problem in Interrupt Handling ....
       [not found] ` <no.id>
                     ` (63 preceding siblings ...)
  2001-08-05 13:20   ` 3c509: broken(verified) Alan Cox
@ 2001-08-06 13:51   ` Alan Cox
  2001-08-06 23:23   ` Virtual memory error when restarting X Alan Cox
                     ` (138 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-06 13:51 UTC (permalink / raw)
  To: Venu Gopal Krishna Vemula; +Cc: linux-kernel

> serial communication adapter which is based on
> interrrupt driven IO, top half handles registering the
> Immediate task queue and  acknowledging to PIC, bottom
> half performs the actual task of interrupt handling. 

Why are you touching the PIC at all - the kernel handles the PIC for you.
Indeed you IRQ might not even be coming from a PIC

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 3c509: broken(verified)
  2001-08-05 16:00       ` safemode
@ 2001-08-06 15:54         ` Nico Schottelius
  2001-08-06 22:30           ` Nicholas Knight
  0 siblings, 1 reply; 662+ messages in thread
From: Nico Schottelius @ 2001-08-06 15:54 UTC (permalink / raw)
  To: safemode; +Cc: Linux Kernel Mailing List

> i was just using a 3c509 in my friend's old 486 and it was working fine with
> 2.4.7.   Just modprobed it and set up the ips and it was able to ping and be
> pinged.

Did you use twisted pair or coax (bnc) ?

This problems occurs (at least ) when trying to use TP.

Nico

ps: Alan, do you have an solution ?


^ permalink raw reply	[flat|nested] 662+ messages in thread

* fsync() on directories (was Re: ext3-2.4-0.9.4)
  2001-08-04  4:29                             ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-08-06 16:10                               ` Patrick J. LoPresti
  0 siblings, 0 replies; 662+ messages in thread
From: Patrick J. LoPresti @ 2001-08-06 16:10 UTC (permalink / raw)
  To: Matthias Andree; +Cc: linux-kernel

Are the Linux "fsync() the directory" semantics documented anywhere?
I mean other than in the source code and on mailing lists.

It might be easier to convince MTA authors to support these semantics
if there were an "official" document describing them and giving some
guarantee that Linux will contine to support them in the future.  It
would be nice to see this described in linux/Documentation or in the
fsync() man page or both.  Without this, it is hard for a software
author to know that Linux's behavior here is not just an
implementation artifact.

 - Pat

P.S.  Is fdatasync() on a directory guaranteed to do anything?  Just
curious.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 3c509: broken(verified)
  2001-08-06 15:54         ` Nico Schottelius
@ 2001-08-06 22:30           ` Nicholas Knight
  0 siblings, 0 replies; 662+ messages in thread
From: Nicholas Knight @ 2001-08-06 22:30 UTC (permalink / raw)
  To: Nico Schottelius, safemode; +Cc: Linux Kernel Mailing List

On Monday 06 August 2001 08:54 am, Nico Schottelius wrote:
> > i was just using a 3c509 in my friend's old 486 and it was working
> > fine with 2.4.7.   Just modprobed it and set up the ips and it was
> > able to ping and be pinged.
>
> Did you use twisted pair or coax (bnc) ?
>
> This problems occurs (at least ) when trying to use TP.
>
> Nico
>
> ps: Alan, do you have an solution ?

For what it's worth, I'm using a 3c509 card on vanilla 2.4.7 right now, 
using standard twisted pair patch cable, and it works fine. I've used it 
both as a module and compiled in (using compiled in at the moment) on 
2.4.5 and 2.4.7, I've also previously used it on 2.4.3, both compiled in 
and as a module.

The motherboard is a Soyo K7VIA w/single ISA slot, VIA Apollo Pro KX133 
chipset, using an Athlon processor.

The card is connected to a hub and communicates fine with both my other 
system and my cable modem, using DHCP.

You mention the problem is being unable to change the media, I was 
unaware this was even possible with the current 3c509 driver, and most 
people do it on 3c509's and other PNP cards of this sort (such as NE2000 
clones)  by using a DOS boot diskette and the DOS utilities provided by 
the manufacturer.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Virtual memory error when restarting X
       [not found] ` <no.id>
                     ` (64 preceding siblings ...)
  2001-08-06 13:51   ` Problem in Interrupt Handling Alan Cox
@ 2001-08-06 23:23   ` Alan Cox
  2001-08-06 23:54   ` [PATCH] one of $BIGNUM devfs races Alan Cox
                     ` (137 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-06 23:23 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: linux-kernel

> Often after restarting X I see the messages similar to "NV0: still have vm que at nv_close(): 0x4023b000 to 0x40245000" in the logs. I presume that these are being caused by the properitry nvidia drivers that I use with X. However I have also noticed that "Unable to handle kernel paging request at virtual address 6b336b50" and the like have also been turning up. I'm wondering whether the graphics drivers problems could be being caused by a vm problem. The oops that
> is enclosed does not appear to be readily repeatable...  but leaving X causes sometimes the system to lock solid with only SysRq getting through. 

Talk to Nvidia. Its their obfuscated/binary driver set, nobody else can help
you fix it.

Alan



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
       [not found] ` <no.id>
                     ` (65 preceding siblings ...)
  2001-08-06 23:23   ` Virtual memory error when restarting X Alan Cox
@ 2001-08-06 23:54   ` Alan Cox
  2001-08-06 23:55   ` Richard Gooch
                     ` (136 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-06 23:54 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Alexander Viro, Linus Torvalds, Alan Cox, linux-kernel

> Linus: please don't apply.
> Alan: I notice you've put Al's patch into 2.4.7-ac8. Please remove it.

I'll remove it when your preferred fixes are ready. Until then its better
than leaving it broken.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
       [not found] ` <no.id>
                     ` (66 preceding siblings ...)
  2001-08-06 23:54   ` [PATCH] one of $BIGNUM devfs races Alan Cox
@ 2001-08-06 23:55   ` Richard Gooch
  2001-08-06 23:59   ` Richard Gooch
                     ` (135 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Richard Gooch @ 2001-08-06 23:55 UTC (permalink / raw)
  To: Alan Cox; +Cc: Alexander Viro, Linus Torvalds, linux-kernel

Alan Cox writes:
> > Linus: please don't apply.
> > Alan: I notice you've put Al's patch into 2.4.7-ac8. Please remove it.
> 
> I'll remove it when your preferred fixes are ready. Until then its
> better than leaving it broken.

OK, fair enough. When is your next merge with Linus scheduled? I'd
prefer to get a few races fixed before shipping a patch, but I can try
to plan for an earlier release if necessary.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
       [not found] ` <no.id>
                     ` (67 preceding siblings ...)
  2001-08-06 23:55   ` Richard Gooch
@ 2001-08-06 23:59   ` Richard Gooch
  2001-08-07 14:17   ` Encrypted Swap Alan Cox
                     ` (134 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Richard Gooch @ 2001-08-06 23:59 UTC (permalink / raw)
  To: Alan Cox; +Cc: Alexander Viro, Linus Torvalds, linux-kernel

Alan Cox writes:
> > OK, fair enough. When is your next merge with Linus scheduled? I'd
> > prefer to get a few races fixed before shipping a patch, but I can try
> > to plan for an earlier release if necessary.
> 
> I send stuff Linus regularly and sometimes it goes in and sometimes
> it doesn't. Stuff with active maintainers I don't send on to Linus
> unless asked too - hence joystick. input and much of USB are so far
> behind in Linus tree

So does that mean you won't try to merge Al's patch?

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
       [not found] ` <no.id>
@ 2001-08-06 23:59 Alan Cox
  2001-08-09  4:09 ` How/when to send patches - (was Re: [PATCH] one of $BIGNUM devfs races) Neil Brown
       [not found] ` <no.id>
  203 siblings, 2 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-06 23:59 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Alan Cox, Alexander Viro, Linus Torvalds, linux-kernel

> OK, fair enough. When is your next merge with Linus scheduled? I'd
> prefer to get a few races fixed before shipping a patch, but I can try
> to plan for an earlier release if necessary.

I send stuff Linus regularly and sometimes it goes in and sometimes it
doesn't. Stuff with active maintainers I don't send on to Linus unless asked
too - hence joystick. input and much of USB are so far behind in Linus tree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ext3-2.4-0.9.4
  2001-08-03 21:21                       ` ext3-2.4-0.9.4 Patrick J. LoPresti
  2001-08-04  3:13                         ` ext3-2.4-0.9.4 Matthias Andree
@ 2001-08-07  2:09                         ` James Antill
  1 sibling, 0 replies; 662+ messages in thread
From: James Antill @ 2001-08-07  2:09 UTC (permalink / raw)
  To: linux-kernel

"Patrick J. LoPresti" <patl@cag.lcs.mit.edu> writes:


[snip sendmail/cyrus/qmail/postfix]

 Just in case anyone cares here's what exim does (AFAICS)...

 int fd1 = open(f1);
 write(fd1);
 fsync(fd1);
 
 int fd2 = open(tmp);
 write(fd2);
 fsync(fd2);
 rename(tmp, f2); // Good at this point.

 So that seems to rely on all dir operations being sync.

 Ps. I did a patch for exim to do the dir sync though...

http://www.and.org/exim-3.31-dirfsync.patch

-- 
# James Antill -- james@and.org
:0:
* ^From: .*james@and\.org
/dev/null

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Encrypted Swap
       [not found] ` <no.id>
                     ` (68 preceding siblings ...)
  2001-08-06 23:59   ` Richard Gooch
@ 2001-08-07 14:17   ` Alan Cox
  2001-08-07 15:16     ` Crutcher Dunnavant
  2001-08-07 16:22   ` [PATCH] one of $BIGNUM devfs races Alan Cox
                     ` (133 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-08-07 14:17 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Crutcher Dunnavant, linux-kernel

> A relatively cheap way might be a custom pci
> card with a self-destruct RAM bank for
> storing the decryption keys.  Opening the 
> safe cause the card to zero the RAM.  

IBM sell crypto PCI cards with anti tamper environments, they have
development drivers on their oss site too

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Encrypted Swap
  2001-08-07 14:17   ` Encrypted Swap Alan Cox
@ 2001-08-07 15:16     ` Crutcher Dunnavant
  2001-08-07 16:01       ` Chris Wedgwood
  0 siblings, 1 reply; 662+ messages in thread
From: Crutcher Dunnavant @ 2001-08-07 15:16 UTC (permalink / raw)
  To: linux-kernel

++ 07/08/01 15:17 +0100 - Alan Cox:
> > A relatively cheap way might be a custom pci
> > card with a self-destruct RAM bank for
> > storing the decryption keys.  Opening the 
> > safe cause the card to zero the RAM.  
> 
> IBM sell crypto PCI cards with anti tamper environments, they have
> development drivers on their oss site too

Ohh. Some college buddies and I were considering the difficulty involved from
making a pci card with an onboard giger counter and radiatian source (say, from
a smoke detector) wrapped up in some lead.

Sure, there are simpler ways to build chaotic circuits, but a radioactive
peripheral is cool!

-- 
Crutcher        <crutcher@datastacks.com>
GCS d--- s+:>+:- a-- C++++$ UL++++$ L+++$>++++ !E PS+++ PE Y+ PGP+>++++
    R-(+++) !tv(+++) b+(++++) G+ e>++++ h+>++ r* y+>*$

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Encrypted Swap
  2001-08-07 15:16     ` Crutcher Dunnavant
@ 2001-08-07 16:01       ` Chris Wedgwood
  0 siblings, 0 replies; 662+ messages in thread
From: Chris Wedgwood @ 2001-08-07 16:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: Crutcher Dunnavant

On Tue, Aug 07, 2001 at 11:16:35AM -0400, Crutcher Dunnavant wrote:

    Ohh. Some college buddies and I were considering the difficulty
    involved from making a pci card with an onboard giger counter and
    radiatian source (say, from a smoke detector) wrapped up in some
    lead.

    Sure, there are simpler ways to build chaotic circuits, but a
    radioactive peripheral is cool!

Why?  There are plenty of cards out that that have hardware and more.
The one in the machine I presently use is limited to 1Mbit/second of
data from a noise-diode (fed through various functions in hardware to
further obfuscate the data).




  --cw


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
       [not found] ` <no.id>
                     ` (69 preceding siblings ...)
  2001-08-07 14:17   ` Encrypted Swap Alan Cox
@ 2001-08-07 16:22   ` Alan Cox
  2001-08-07 19:04   ` Alan Cox
                     ` (132 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-07 16:22 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Alan Cox, Alexander Viro, Linus Torvalds, linux-kernel

> > unless asked too - hence joystick. input and much of USB are so far
> > behind in Linus tree
> 
> So does that mean you won't try to merge Al's patch?

Correct. I'd just get in your way if I did

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
       [not found] ` <no.id>
                     ` (70 preceding siblings ...)
  2001-08-07 16:22   ` [PATCH] one of $BIGNUM devfs races Alan Cox
@ 2001-08-07 19:04   ` Alan Cox
  2001-08-07 19:16     ` Alexander Viro
  2001-08-07 19:09   ` Richard Gooch
                     ` (131 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Alan Cox @ 2001-08-07 19:04 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Alexander Viro, linux-kernel

> > Very interesting. pwd should be using getcwd(2), which doesn't
> > give a damn for inode numbers. If you have seriously old pwd binary
> > that tries to track the thing down to root by hands - yes, it doesn't
> > work.
> 
> Hm. strace suggests my pwd is walking up the path. But WTF would it
> break? 2.4.7 was fine. What did I break?

Sounds like you are using libc5. The old style pwd should be reliable but
its much slower and can't see across protected directory paths

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
       [not found] ` <no.id>
                     ` (71 preceding siblings ...)
  2001-08-07 19:04   ` Alan Cox
@ 2001-08-07 19:09   ` Richard Gooch
  2001-08-07 19:20     ` Alexander Viro
  2001-08-07 20:03   ` cpu not detected(x86) Alan Cox
                     ` (130 subsequent siblings)
  203 siblings, 1 reply; 662+ messages in thread
From: Richard Gooch @ 2001-08-07 19:09 UTC (permalink / raw)
  To: Alan Cox; +Cc: Alexander Viro, linux-kernel

Alan Cox writes:
> > > Very interesting. pwd should be using getcwd(2), which doesn't
> > > give a damn for inode numbers. If you have seriously old pwd binary
> > > that tries to track the thing down to root by hands - yes, it doesn't
> > > work.
> > 
> > Hm. strace suggests my pwd is walking up the path. But WTF would it
> > break? 2.4.7 was fine. What did I break?
> 
> Sounds like you are using libc5. The old style pwd should be
> reliable but its much slower and can't see across protected
> directory paths

Yes, I use libc5. And I don't care about old pwd being slower. And I
certainly don't want to break it, even if I wasn't using it.
By "protected directory paths", you mean a directory with read access?

Well, rx access is available for the whole path. And the inums looked
fine. So the breakage is odd.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
  2001-08-07 19:04   ` Alan Cox
@ 2001-08-07 19:16     ` Alexander Viro
  2001-08-08 21:16       ` H. Peter Anvin
  0 siblings, 1 reply; 662+ messages in thread
From: Alexander Viro @ 2001-08-07 19:16 UTC (permalink / raw)
  To: Alan Cox; +Cc: Richard Gooch, linux-kernel



On Tue, 7 Aug 2001, Alan Cox wrote:

> > > Very interesting. pwd should be using getcwd(2), which doesn't
> > > give a damn for inode numbers. If you have seriously old pwd binary
> > > that tries to track the thing down to root by hands - yes, it doesn't
> > > work.
> > 
> > Hm. strace suggests my pwd is walking up the path. But WTF would it
> > break? 2.4.7 was fine. What did I break?
> 
> Sounds like you are using libc5. The old style pwd should be reliable but
> its much slower and can't see across protected directory paths

It is not reliable. E.g. on NFS inumbers are not unique - 32 bits is
not enough.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
  2001-08-07 19:09   ` Richard Gooch
@ 2001-08-07 19:20     ` Alexander Viro
  0 siblings, 0 replies; 662+ messages in thread
From: Alexander Viro @ 2001-08-07 19:20 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Alan Cox, linux-kernel



On Tue, 7 Aug 2001, Richard Gooch wrote:

> Yes, I use libc5. And I don't care about old pwd being slower. And I

So fix getcwd(3) in libc5. BFD... Or use your ->dentry in devfs_readdir() -
then you can get the consistency you want for existing inodes and that
will allow b0rken getcwd() to work.

It _is_ b0rken - it relies on unique 32-bit number for inodes. That's
not guaranteed on NFS and there's nothing we could do about that.


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: cpu not detected(x86)
       [not found] ` <no.id>
                     ` (72 preceding siblings ...)
  2001-08-07 19:09   ` Richard Gooch
@ 2001-08-07 20:03   ` Alan Cox
  2001-08-08 11:50   ` [kbuild-devel] Announce: Kernel Build for 2.5, Release 1 is Alan Cox
                     ` (129 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-07 20:03 UTC (permalink / raw)
  To: Grover, Andrew
  Cc: 'Dave Jones', Nico Schottelius, Linux Kernel Mailing List

> Longer-term, we need to change the kernel to not use the TSC for udelay, but
> to use the PM Timer, if ACPI is going to be monkeying with CPU power states.

That can be done, and may be a help. 

The TSC timer isnt a very good source on many non intel chips that stop it
to get the best power figures. It also helps with SMP because on an SMP box
the tsc values may not calibrate.

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [kbuild-devel] Announce: Kernel Build for 2.5, Release 1 is
       [not found] ` <no.id>
                     ` (73 preceding siblings ...)
  2001-08-07 20:03   ` cpu not detected(x86) Alan Cox
@ 2001-08-08 11:50   ` Alan Cox
  2001-08-08 15:20   ` [PATCH] parport_pc.c PnP BIOS sanity check Alan Cox
                     ` (128 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-08 11:50 UTC (permalink / raw)
  To: cate; +Cc: Keith Owens, kbuild-devel, linux-kernel

> If generating some support files requires some non common tools,
> it is the right thing to ship the two files (source and generated).

Its often easiest. Justin does this with the Adaptec driver now and it makes
life both simple for those who want to build kernels and handy for those
who want to hack the stuff.

> BTW we cannot ship the generated file without the source files,
> because of GPL.

If its part of the kernel tools you want to make it available, that doesn't 
mean it has to be shipped with the kernel. 

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] parport_pc.c PnP BIOS sanity check
       [not found] ` <no.id>
                     ` (74 preceding siblings ...)
  2001-08-08 11:50   ` [kbuild-devel] Announce: Kernel Build for 2.5, Release 1 is Alan Cox
@ 2001-08-08 15:20   ` Alan Cox
  2001-08-08 16:13     ` Richard B. Johnson
  2001-08-08 21:58     ` H. Peter Anvin
  2001-08-08 16:53   ` [Dri-devel] Re: DRM Linux kernel merge (update) needed, soon Alan Cox
                     ` (127 subsequent siblings)
  203 siblings, 2 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-08 15:20 UTC (permalink / raw)
  To: Thomas Hood; +Cc: linux-kernel

> The following would seem to be required to protect against
> the case in which PnP BIOS reports an IRQ of 0 for a 
> parport with disabled IRQ.      // Thomas  jdthood_AT_yahoo.co.uk

IRQ 0 is a legal valid IRQ. I suspect the problem is that pnpbios shouldnt
be reporting an IRQ or we should be using some kind of NO_IRQ cookie

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] parport_pc.c PnP BIOS sanity check
  2001-08-08 15:20   ` [PATCH] parport_pc.c PnP BIOS sanity check Alan Cox
@ 2001-08-08 16:13     ` Richard B. Johnson
  2001-08-08 21:58     ` H. Peter Anvin
  1 sibling, 0 replies; 662+ messages in thread
From: Richard B. Johnson @ 2001-08-08 16:13 UTC (permalink / raw)
  To: Alan Cox; +Cc: Thomas Hood, linux-kernel

On Wed, 8 Aug 2001, Alan Cox wrote:

> > The following would seem to be required to protect against
> > the case in which PnP BIOS reports an IRQ of 0 for a 
> > parport with disabled IRQ.      // Thomas  jdthood_AT_yahoo.co.uk
> 
> IRQ 0 is a legal valid IRQ. I suspect the problem is that pnpbios shouldnt
> be reporting an IRQ or we should be using some kind of NO_IRQ cookie

IRQ0 will never by reported by a PCI bus device because it means that
no IRQ is used (they figured that IRQ0 would always be used for something
else). Maybe PnP BIOS also presumes this? If so, the use of IRQ0 to
mean "no IRQ" is valid, although misleading.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

    I was going to compile a list of innovations that could be
    attributed to Microsoft. Once I realized that Ctrl-Alt-Del
    was handled in the BIOS, I found that there aren't any.



^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [Dri-devel] Re: DRM Linux kernel merge (update) needed, soon.
       [not found] ` <no.id>
                     ` (75 preceding siblings ...)
  2001-08-08 15:20   ` [PATCH] parport_pc.c PnP BIOS sanity check Alan Cox
@ 2001-08-08 16:53   ` Alan Cox
  2001-08-08 23:02   ` 386 boot problems with 2.4.7 and 2.4.7-ac9 Alan Cox
                     ` (126 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-08 16:53 UTC (permalink / raw)
  To: Jeff Hartmann; +Cc: Gareth Hughes, DRI-Devel, Linux Kernel List

> I sent a patch to Linus and Alan this morning.

Tweaked to allow either 4.0 or 4.1 DRM to be built (most folks need 4.0
still) and merged

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: ReiserFS / 2.4.6 / Data Corruption
  2001-07-27 13:39       ` Alan Cox
  2001-07-27 13:47         ` bvermeul
  2001-07-28 14:16         ` Matthew Gardiner
@ 2001-08-08 18:42         ` Stephen C. Tweedie
  2 siblings, 0 replies; 662+ messages in thread
From: Stephen C. Tweedie @ 2001-08-08 18:42 UTC (permalink / raw)
  To: Alan Cox
  Cc: bvermeul, Hans Reiser, Erik Mouw, Steve Kieu, Sam Thompson, kernel

Hi,

On Fri, Jul 27, 2001 at 02:39:37PM +0100, Alan Cox wrote:

> > I've been doing that most of the time. But I sometimes forget that.
> > But as I said, it's not something I expected from a journalled filesystem.
> 
> You misunderstand journalling then
> 
> A journalling file system can offer different levels of guarantee. With 
> metadata only journalling you don't take any real performance hit but your
> file system is always consistent on reboot (consistent as in fsck would pass
> it) but it makes no guarantee that data blocks got written.

The default behaviour of ext3 does make this guarantee, for what it's
worth.  If you want the more relaxed mode which doesn't enforce the
flushing of data blocks before a commit, you need to mount with "-o
data=writeback".

> Full data journalling will give you what you expect but at a performance hit
> for many applications.

You can achieve the necessary ordering to avoid stale data blocks
after a crash without the penalty of writing all the data to the
journal.

Cheers,
 Stephen

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
  2001-08-07 19:16     ` Alexander Viro
@ 2001-08-08 21:16       ` H. Peter Anvin
  2001-08-08 21:47         ` Alexander Viro
  2001-08-08 23:29         ` Richard Gooch
  0 siblings, 2 replies; 662+ messages in thread
From: H. Peter Anvin @ 2001-08-08 21:16 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.GSO.4.21.0108071510390.18565-100000@weyl.math.psu.edu>
By author:    Alexander Viro <viro@math.psu.edu>
In newsgroup: linux.dev.kernel
> 
> It is not reliable. E.g. on NFS inumbers are not unique - 32 bits is
> not enough.
> 

Unfortunately there is a whole bunch of other things too that rely on
it, and *HAVE* to rely on it -- (st_dev, st_ino) are defined to
specify the identity of a file, and if the current types aren't large
enough we *HAVE* to go to new types.  THERE IS NO OTHER WAY TO TEST
FOR FILE IDENTITY IN UNIX, and being able to perform such a test is
vital for many things, including security and hard link detection
(think tar, cpio, cp -a.)

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
  2001-08-08 21:16       ` H. Peter Anvin
@ 2001-08-08 21:47         ` Alexander Viro
  2001-08-08 23:29         ` Richard Gooch
  1 sibling, 0 replies; 662+ messages in thread
From: Alexander Viro @ 2001-08-08 21:47 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel



On 8 Aug 2001, H. Peter Anvin wrote:

> Followup to:  <Pine.GSO.4.21.0108071510390.18565-100000@weyl.math.psu.edu>
> By author:    Alexander Viro <viro@math.psu.edu>
> In newsgroup: linux.dev.kernel
> > 
> > It is not reliable. E.g. on NFS inumbers are not unique - 32 bits is
> > not enough.
> > 
> 
> Unfortunately there is a whole bunch of other things too that rely on
> it, and *HAVE* to rely on it -- (st_dev, st_ino) are defined to
> specify the identity of a file, and if the current types aren't large
> enough we *HAVE* to go to new types.  THERE IS NO OTHER WAY TO TEST
> FOR FILE IDENTITY IN UNIX, and being able to perform such a test is
> vital for many things, including security and hard link detection

Indeed, but it still doesn't help libc5 getcwd(3), which uses 32 bit
values.

> (think tar, cpio, cp -a.)

I'd rather not.  Too bloody depressive... (If you want details - let's
take it off-list).


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] parport_pc.c PnP BIOS sanity check
  2001-08-08 15:20   ` [PATCH] parport_pc.c PnP BIOS sanity check Alan Cox
  2001-08-08 16:13     ` Richard B. Johnson
@ 2001-08-08 21:58     ` H. Peter Anvin
  2001-08-08 22:12       ` Russell King
  2001-08-10  9:18       ` Eric W. Biederman
  1 sibling, 2 replies; 662+ messages in thread
From: H. Peter Anvin @ 2001-08-08 21:58 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <E15UV8M-0005SE-00@the-village.bc.nu>
By author:    Alan Cox <alan@lxorguk.ukuu.org.uk>
In newsgroup: linux.dev.kernel
>
> > The following would seem to be required to protect against
> > the case in which PnP BIOS reports an IRQ of 0 for a 
> > parport with disabled IRQ.      // Thomas  jdthood_AT_yahoo.co.uk
> 
> IRQ 0 is a legal valid IRQ. I suspect the problem is that pnpbios shouldnt
> be reporting an IRQ or we should be using some kind of NO_IRQ cookie
>

IRQ 0 is hardwired to the system timer in PC systems, though, so it
could simply be assumed that IRQ 0 will never be used for any other
purposes.

Reminds me back in the days when you had to worry about DRQs as well;
DRQ 0 was hardwired in the original PC but then became available in
the AT; there was a whole bunch of things that assumed DRQ 0 wasn't
usable, even though it was perfectly fine.  Not to mention the
motherboard I had which would lock up solid if anything ever used
DRQ 5.

Good riddance, all this crap...

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] parport_pc.c PnP BIOS sanity check
  2001-08-08 21:58     ` H. Peter Anvin
@ 2001-08-08 22:12       ` Russell King
  2001-08-10  9:18       ` Eric W. Biederman
  1 sibling, 0 replies; 662+ messages in thread
From: Russell King @ 2001-08-08 22:12 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

On Wed, Aug 08, 2001 at 02:58:12PM -0700, H. Peter Anvin wrote:
> IRQ 0 is hardwired to the system timer in PC systems, though, so it
                                         ^^^^^^^^^^^^^

Linux doesn't run on only PC systems though, and other systems use
IRQ0 as the (superio-based) parallel port IRQ.

> Good riddance, all this crap...

Indeed - please check the ARM port for our solution to this.  We've
had the NO_IRQ construct for literally years in include/asm-arm/irq.h:

#define NO_IRQ  ((unsigned int)(-1))

Naturally, a similar NO_DMA is defined in dma.h.  The sooner we can get
rid of the "IRQ0 cannot be used" crap from the kernel the better.

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: 386 boot problems with 2.4.7 and 2.4.7-ac9
       [not found] ` <no.id>
                     ` (76 preceding siblings ...)
  2001-08-08 16:53   ` [Dri-devel] Re: DRM Linux kernel merge (update) needed, soon Alan Cox
@ 2001-08-08 23:02   ` Alan Cox
  2001-08-09  9:08   ` Swapping for diskless nodes Alan Cox
                     ` (125 subsequent siblings)
  203 siblings, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-08 23:02 UTC (permalink / raw)
  To: Carl-Johan Kjellander; +Cc: linux-kernel

> This is the panic from 2.4.7-ac9 compiled with gcc-2.96-85 (Red Hat).
> 
> ksymoops 2.4.0 on i686 2.4.7.  Options used

Thanks. For some reason it crashed probing the simple boot flag ACPI
structure. I'll try and work out how and why then send you a diff

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: [PATCH] one of $BIGNUM devfs races
  2001-08-08 21:16       ` H. Peter Anvin
  2001-08-08 21:47         ` Alexander Viro
@ 2001-08-08 23:29         ` Richard Gooch
  1 sibling, 0 replies; 662+ messages in thread
From: Richard Gooch @ 2001-08-08 23:29 UTC (permalink / raw)
  To: Alexander Viro; +Cc: H. Peter Anvin, linux-kernel

Alexander Viro writes:
> 
> On 8 Aug 2001, H. Peter Anvin wrote:
> 
> > Followup to:  <Pine.GSO.4.21.0108071510390.18565-100000@weyl.math.psu.edu>
> > By author:    Alexander Viro <viro@math.psu.edu>
> > In newsgroup: linux.dev.kernel
> > > 
> > > It is not reliable. E.g. on NFS inumbers are not unique - 32 bits is
> > > not enough.
> > 
> > Unfortunately there is a whole bunch of other things too that rely on
> > it, and *HAVE* to rely on it -- (st_dev, st_ino) are defined to
> > specify the identity of a file, and if the current types aren't large
> > enough we *HAVE* to go to new types.  THERE IS NO OTHER WAY TO TEST
> > FOR FILE IDENTITY IN UNIX, and being able to perform such a test is
> > vital for many things, including security and hard link detection
> 
> Indeed, but it still doesn't help libc5 getcwd(3), which uses 32 bit
> values.

FYI: the problem that spawned this sub-thread is fixed. The
devfs-patch-v185 that I released last night fixes this. So the libc5
getcwd(3) is fine with 32 bit inums on devfs.

Filesystems with larger inums are left as an exercise for the reader
:-)

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 662+ messages in thread

* How/when to send patches - (was  Re: [PATCH] one of $BIGNUM devfs races)
  2001-08-06 23:59 [PATCH] one of $BIGNUM devfs races Alan Cox
@ 2001-08-09  4:09 ` Neil Brown
  2001-08-09  5:39   ` Linus Torvalds
  2001-08-09  7:42   ` Alan Cox
       [not found] ` <no.id>
  1 sibling, 2 replies; 662+ messages in thread
From: Neil Brown @ 2001-08-09  4:09 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linus Torvalds, linux-kernel

On Tuesday August 7, alan@lxorguk.ukuu.org.uk wrote:
> > OK, fair enough. When is your next merge with Linus scheduled? I'd
> > prefer to get a few races fixed before shipping a patch, but I can try
> > to plan for an earlier release if necessary.
> 
> I send stuff Linus regularly and sometimes it goes in and sometimes it
> doesn't. Stuff with active maintainers I don't send on to Linus unless asked
> too - hence joystick. input and much of USB are so far behind in Linus tree

This is something I would like to understand better.

Sometimes I send patches to Linus, and a new prepatch comes out within
hours that contains them.
Sometimes I send patches to Linus and it's like sending them to
/dev/null. Sometimes I resend.  Sometimes it helps.

So I wonder "is he busy? does he have other priorities? does he have a
broken mail system?  is he being rude" in decreasing order of
likelyhood from "very" to "very un-".

So I thought I would try sending to Alan and Linus.  Then they
appeared in an -ac patch, but not in a pre patch.

I thought that might be close enough, but if Alan doesn't plan to
forward them the Linus, then it isn't.


Now I am happy to just resent the pending patches every time a pre
patch comes out that doesn't contain then, but I want to be sure that
isn't going to negatively impact Linus at all.

Comments?

NeilBrown

(I'm talking about patches to fs/nfsd and drivers/md)

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: How/when to send patches - (was  Re: [PATCH] one of $BIGNUM devfs races)
  2001-08-09  4:09 ` How/when to send patches - (was Re: [PATCH] one of $BIGNUM devfs races) Neil Brown
@ 2001-08-09  5:39   ` Linus Torvalds
  2001-08-09 20:36     ` Rik van Riel
  2001-08-09  7:42   ` Alan Cox
  1 sibling, 1 reply; 662+ messages in thread
From: Linus Torvalds @ 2001-08-09  5:39 UTC (permalink / raw)
  To: Neil Brown; +Cc: Alan Cox, linux-kernel


On Thu, 9 Aug 2001, Neil Brown wrote:
>
> This is something I would like to understand better.
>
> Sometimes I send patches to Linus, and a new prepatch comes out within
> hours that contains them.
> Sometimes I send patches to Linus and it's like sending them to
> /dev/null. Sometimes I resend.  Sometimes it helps.

Re-sending is always the right thing to do. Sometimes it takes a few
times, and you can add a small exasperated message at the top by the third
time ("Don't you love me any more?").

> Now I am happy to just resent the pending patches every time a pre
> patch comes out that doesn't contain then, but I want to be sure that
> isn't going to negatively impact Linus at all.

It's not. Sometimes (like now), I have other priorities, and right now for
example I've been concentrating on the VM balancing issues (and, in all
honesty, sometimes the "other priorities" aren't Linux issues at all ;).

When that happens, any other patches may still be merged, but they might
equally well just end up staying pending in my mailbox. And if they stay
there for more than a day they are basically so stale that I'll likely
never see them again.

I _seldom_ have pending patches over a pre-patch, so while it is possible
that I'm still mulling over your old patch when a new pre-patch comes out,
it's much more likely that the right answer is just to re-send. Maybe with
a slightly bigger explanation on why the patch is such a good and worthy
patch ;^)

And it's absolutely not worth it to worry about filling up my mailbox with
patches. Rule of thumb is: "if it's not really _really_ important, try to
keep one pre-patch or 48 hours between re-sends". And if it is _really_
important, ping me as often as you like.

		Linus


^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: How/when to send patches - (was  Re: [PATCH] one of $BIGNUM devfs races)
  2001-08-09  4:09 ` How/when to send patches - (was Re: [PATCH] one of $BIGNUM devfs races) Neil Brown
  2001-08-09  5:39   ` Linus Torvalds
@ 2001-08-09  7:42   ` Alan Cox
  1 sibling, 0 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-09  7:42 UTC (permalink / raw)
  To: Neil Brown; +Cc: Alan Cox, Linus Torvalds, linux-kernel

> So I thought I would try sending to Alan and Linus.  Then they
> appeared in an -ac patch, but not in a pre patch.
> 
> I thought that might be close enough, but if Alan doesn't plan to
> forward them the Linus, then it isn't.

I can forward fs/nfs stuff to Linus if you want me to add it to the stuff
I do forward, ditto md (non lvm) stuff. In many ways -ac is far enough, if
it gets to -ac it gets to most folks and its there for all the vendors

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Swapping for diskless nodes
       [not found] ` <no.id>
                     ` (77 preceding siblings ...)
  2001-08-08 23:02   ` 386 boot problems with 2.4.7 and 2.4.7-ac9 Alan Cox
@ 2001-08-09  9:08   ` Alan Cox
  2001-08-09 10:50     ` Ingo Oeser
                       ` (3 more replies)
  2001-08-09 15:14   ` Alan Cox
                     ` (124 subsequent siblings)
  203 siblings, 4 replies; 662+ messages in thread
From: Alan Cox @ 2001-08-09  9:08 UTC (permalink / raw)
  To: Dirk W. Steinberg; +Cc: linux-kernel

> what is the best/recommended way to do remote swapping via the network
> for diskless workstations or compute nodes in clusters in Linux 2.4?=20
> Last time i checked was linux 2.2, and there were some races related=20
> to network swapping back then. Has this been fixed for 2.4?

The best answer probably is "don't". Networks are high latency things for
paging and paging is latency sensitive. If performance is not an issue then
the nbd driver ought to work. You may need to check it uses the right
GFP_ levels to avoid deadlocks and you might need to up the amount of atomic
pool memory. Hopefully other hacks arent needed

[The general case of network swap is basically insoluble but its possible to
 make it perfectly usable as Sun proved]

Alan

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Swapping for diskless nodes
  2001-08-09  9:08   ` Swapping for diskless nodes Alan Cox
@ 2001-08-09 10:50     ` Ingo Oeser
  2001-08-09 13:12       ` Dirk W. Steinberg
  2001-08-09 20:47       ` Rik van Riel
  2001-08-09 14:17     ` Dirk W. Steinberg
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 662+ messages in thread
From: Ingo Oeser @ 2001-08-09 10:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

On Thu, Aug 09, 2001 at 10:08:37AM +0100, Alan Cox wrote:
> > what is the best/recommended way to do remote swapping via the network
> > for diskless workstations or compute nodes in clusters in Linux 2.4?=20
> > Last time i checked was linux 2.2, and there were some races related=20
> > to network swapping back then. Has this been fixed for 2.4?
> 
> The best answer probably is "don't". Networks are high latency things for
> paging and paging is latency sensitive. If performance is not an issue then
> the nbd driver ought to work. You may need to check it uses the right
> GFP_ levels to avoid deadlocks and you might need to up the amount of atomic
> pool memory. Hopefully other hacks arent needed

While we are on it: I have an old machine with 64MB of RAM and a
new, fast machine with 1GB of RAM. 

Sometimes I need more RAM on the old one and asked myself,
whether I could first swap over network to the other one, into
its tmpfs, before digging into real swap on a hard disk.

I have only three machines attached to this small internal
100Mbit LAN.

Both machines use Kernel 2.4.x.

Are there any races I have to consider?

Thanks & Regards

Ingo Oeser
-- 
In der Wunschphantasie vieler Mann-Typen [ist die Frau] unsigned und
operatorvertraeglich. --- Dietz Proepper in dasr

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: don't feed the trolls (was: intermediate summary of ext3-2.4-0.9.4 thread)
  2001-08-04 21:22                         ` Albert D. Cahalan
@ 2001-08-09 11:58                           ` Matthias Andree
  0 siblings, 0 replies; 662+ messages in thread
From: Matthias Andree @ 2001-08-09 11:58 UTC (permalink / raw)
  To: Albert D. Cahalan; +Cc: linux-kernel

On Sat, 04 Aug 2001, Albert D. Cahalan wrote:

> Seriously, consider:
> 
> 1. there are MTA authors that actively promote BSD over Linux
> 2. Linux users and distributions promote their MTA software

I do not endorse this behaviour (particularly, qmail not supporting
softupdates is rather ridiculous), but I understand that MTA authors
would rather want to rely on fsync() also bringing related meta data do
disk (as ext3 and reiserfs for Linux 2.4 already do even across a
rename()!) than to add dir=open("directory"); fsync(dir); close(dir) all
over the place.

> Getting back on topic... while non-inherited ext2 attributes might

What would they be good for? Make MTA that have in the past achieved
reliable behaviour with chattr +S unreliable?

> be nice, I'm sure the ext2/VFS authors don't need to be pestered
> about it, and certainly not because of some lame software making
> non-standard assumptions about filesystem behavior.

Well, the software documents its requirements and assumptions. I don't
see anything nonstandard with relying on fsync(). If ext2fs doesn't meet
the assumptions without chattr +S or mount -o sync, but allows to
enforce this behaviour chattr +S, deliberately breaking ext2 attributes
inheritance will make Linux deliberately unsuitable for this MTA -- or
at least, slow it down through the need to use mount -o sync.

Deliberately breaking things just to show somebody else "you cannot even
rely that chattr behaviour is invariant" is ridiculous and definitely
not the right way to go.

If the MTA author chooses chattr +S over fsync-directory, what's wrong
with that?

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Swapping for diskless nodes
  2001-08-09 10:50     ` Ingo Oeser
@ 2001-08-09 13:12       ` Dirk W. Steinberg
  2001-08-09 20:47       ` Rik van Riel
  1 sibling, 0 replies; 662+ messages in thread
From: Dirk W. Steinberg @ 2001-08-09 13:12 UTC (permalink / raw)
  To: Ingo Oeser; +Cc: linux-kernel, linux-mm, Alan Cox

I'd like to second that example where you have weak diskless nodes and
a big server with a lot of memory. The important point here is that the
remote paging does not need to really write to the remote disk, especially
not synchronously. The page could eventually be migrated to the remote
disk asynchronously, or maybe not at all if there is no memory pressure
at the remote system.

In such a scenario I would disagree with Alan that network paging is 
high latency as compared to disk access. I have a fully switched 100 Mpbs
full-duplex ethernet network, and sending a page across the net into
the memory of a fast server could have much less latency that writing 
that page out to a local old, slow IDE disk. Clusters could even have
special high-bandwidth, low latency networks that could be used for
remote paging.

In a perfect world, all nodes in a cluster would be able to dynamically 
share a pool of "cluster swap" space, so any locally available swap that
is not used could be utilized by other nodes in the cluster.

/ Dirk

Ingo Oeser wrote:
> On Thu, Aug 09, 2001 at 10:08:37AM +0100, Alan Cox wrote:
> > > what is the best/recommended way to do remote swapping via the network
> > > for diskless workstations or compute nodes in clusters in Linux 2.4?=20
> > > Last time i checked was linux 2.2, and there were some races related=20
> > > to network swapping back then. Has this been fixed for 2.4?
> >
> > The best answer probably is "don't". Networks are high latency things for
> > paging and paging is latency sensitive. If performance is not an issue then
> > the nbd driver ought to work. You may need to check it uses the right
> > GFP_ levels to avoid deadlocks and you might need to up the amount of atomic
> > pool memory. Hopefully other hacks arent needed
> 
> While we are on it: I have an old machine with 64MB of RAM and a
> new, fast machine with 1GB of RAM.
> 
> Sometimes I need more RAM on the old one and asked myself,
> whether I could first swap over network to the other one, into
> its tmpfs, before digging into real swap on a hard disk.
> 
> I have only three machines attached to this small internal
> 100Mbit LAN.
> 
> Both machines use Kernel 2.4.x.

^ permalink raw reply	[flat|nested] 662+ messages in thread

* Re: Swapping for diskless nodes
  2001-08-09  9:08   ` Swapping for diskless nodes Alan Cox
  2001-08-09 10:50     ` Ingo Oeser
@ 2001-08-09 14:17     ` Dirk W. Steinberg
  2001-08-09 14:36       ` Andreas Haumer
  2001-08-09 19:27     ` Pavel Machek
  2001-08-09 20:38     ` Rik van Riel