* [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Dax Kelson @ 2002-07-12 16:21 UTC (permalink / raw)
To: linux-kernel

Tested:

ext3 data=ordered
ext3 data=writeback
reiserfs
reiserfs notail

http://www.gurulabs.com/ext3-reiserfs.html

Any suggestions or comments appreciated.

Dax Kelson
Guru Labs

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Andreas Dilger @ 2002-07-12 17:05 UTC (permalink / raw)
To: Dax Kelson; +Cc: linux-kernel

On Jul 12, 2002 10:21 -0600, Dax Kelson wrote:
> ext3 data=ordered
> ext3 data=writeback
> reiserfs
> reiserfs notail
>
> http://www.gurulabs.com/ext3-reiserfs.html
>
> Any suggestions or comments appreciated.

Did you try data=journal mode on ext3?  For real-life sync-I/O workloads
like mail (i.e. not benchmarks where the system is 100% busy) you can see
considerable performance benefits from doing the sync I/O directly to the
journal, instead of partly to the journal and partly to the rest of the
filesystem.

The reason "real life" matters here is that data=journal mode writes all
file data to disk twice - once to the journal and again to the
filesystem - so you must have some slack in your disk bandwidth in order
to benefit from the increased throughput on the part of the mail
transport.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: kwijibo @ 2002-07-12 17:26 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Dax Kelson, linux-kernel

I compared reiserfs with notails and with tails to ext3 in journaled
mode about a month ago.  Strangely enough, the machine that was being
built is eventually slated for a mail machine.  I used postmark to
simulate the mail environment.

Benchmarks are available here:
http://labs.zianet.com

Let me know if I am missing any info on there.

Steven

Andreas Dilger wrote:
> On Jul 12, 2002 10:21 -0600, Dax Kelson wrote:
>> ext3 data=ordered
>> ext3 data=writeback
>> reiserfs
>> reiserfs notail
>>
>> http://www.gurulabs.com/ext3-reiserfs.html
>>
>> Any suggestions or comments appreciated.
>
> Did you try data=journal mode on ext3?  For real-life sync-I/O workloads
> like mail (i.e. not benchmarks where the system is 100% busy) you can
> have considerable performance benefits from doing the sync I/O directly
> to the journal instead of partly to the journal and partly to the rest
> of the filesystem.
>
> The reason why "real life" is important here is because the data=journal
> mode writes all the files to disk twice - once to the journal and again
> to the filesystem, so you must have some "slack" in your disk bandwidth
> in order to benefit from this increased throughput on the part of the
> mail transport.
>
> Cheers, Andreas
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Andreas Dilger @ 2002-07-12 17:36 UTC (permalink / raw)
To: kwijibo; +Cc: Dax Kelson, linux-kernel

On Jul 12, 2002 11:26 -0600, kwijibo@zianet.com wrote:
> I compared reiserfs with notails and with tails to
> ext3 in journaled mode about a month ago.
> Strangely enough the machine that was being
> built is eventually slated for a mail machine.  I used
> postmark to simulate the mail environment.
>
> Benchmarks are available here:
> http://labs.zianet.com
>
> Let me know if I am missing any info on there.

Yes, I saw this benchmark when it was first posted.  It isn't clear from
the web pages whether you are using data=journal for ext3.  Note that
this is only a benefit for sync-I/O workloads like mail and NFS, not for
other types of usage.

Also, for sync-I/O workloads you can get a big boost by using an
external journal device.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Chris Mason @ 2002-07-12 20:34 UTC (permalink / raw)
To: Dax Kelson; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

On Fri, 2002-07-12 at 12:21, Dax Kelson wrote:
> Tested:
>
> ext3 data=ordered
> ext3 data=writeback
> reiserfs
> reiserfs notail
>
> http://www.gurulabs.com/ext3-reiserfs.html
>
> Any suggestions or comments appreciated.

postmark is an interesting workload, but it does not do fsync or renames
on the working set, and postfix does lots of both while delivering.
postmark does do a good job of showing the difference between lots of
files in one directory (great for reiserfs) and lots of directories with
fewer files in each (better for ext3).

Andreas Dilger already mentioned -o data=journal on ext3; you can try
the beta reiserfs patches that add support for data=journal and
data=ordered at:

ftp.suse.com/pub/people/mason/patches/data-logging

They improve reiserfs performance for just about everything, but
data=journal is especially good for fsync/O_SYNC heavy workloads.

Andrew Morton sent me a benchmark of his that tries to simulate postfix.
He has posted it to l-k before, but a quick google search found only
dead links, so I'm attaching it.  What I like about his synctest is that
the results are consistent and you can play with various
fsync/rename/unlink options.

-chris

[-- Attachment #2: synctest.c --]
[-- Type: text/x-c, Size: 7672 bytes --]

/*
 * Test and benchmark synchronous operations.
 * stolen from Andrew Morton
 */
#undef _XOPEN_SOURCE	/* MAP_ANONYMOUS */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <stdarg.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <sys/mman.h>

/*
 * Lots of yummy globals!
 */
char *progname, *dirname;
int verbose, use_fsync, use_osync;
int fsync_dir;
int n_threads = 1, n_iters = 100;
int *child_status;
int this_child_index;
int dir_fd;
int show_tids;
int threads_per_dir = 1;
int thread_group;
int do_unlink;
int rename_pass;

#define N_FILES 100
#define UNLINK_LAG 30
#define RENAME_PASSES 3

void show(char *fmt, ...)
{
	if (verbose) {
		va_list ap;

		va_start(ap, fmt);
		vfprintf(stdout, fmt, ap);
		fflush(stdout);
		va_end(ap);
	}
}

/*
 * - Create a file.
 * - Write some data to it
 * - Maybe fsync() it.
 * - Close it
 * - Maybe fsync() its parent dir
 * - rename() it.
 * - maybe fsync() its parent dir
 * - rename() it.
 * - maybe fsync() its parent dir
 * - rename() it.
 * - maybe fsync() its parent dir
 * - UNLINK_LAG files later, maybe unlink it.
 * - maybe fsync() its parent dir
 *
 * Repeat the above N_FILES times
 */

char *mk_dirname(void)
{
	char *ret = malloc(strlen(dirname) + 64);

	sprintf(ret, "%s/%05d", dirname, thread_group);
	return ret;
}

char *mk_filename(int fileno)
{
	char *ret = malloc(strlen(dirname) + 64);

	sprintf(ret, "%s/%05d/%05d-%05d",
		dirname, thread_group, getpid(), fileno);
	return ret;
}

char *mk_new_filename(int fileno, int pass)
{
	char *ret = malloc(strlen(dirname) + 64);

	sprintf(ret, "%s/%05d/%02d-%05d-%05d",
		dirname, thread_group, pass, getpid(), fileno);
	return ret;
}

void sync_dir(void)
{
	if (fsync_dir) {
		show("fsync(%s)\n", dirname);
		if (fsync(dir_fd) < 0) {
			fprintf(stderr, "%s: failed to fsync dir `%s': %s\n",
				progname, dirname, strerror(errno));
			exit(1);
		}
	}
}

void make_dir(void)
{
	char *n = mk_dirname();

	show("mkdir(%s)\n", n);
	if (mkdir(n, 0777) < 0) {
		fprintf(stderr, "%s: Cannot make directory `%s': %s\n",
			progname, n, strerror(errno));
		exit(1);
	}
	free(n);
}

void remove_dir(void)
{
	char *n = mk_dirname();

	show("rmdir(%s)\n", n);
	rmdir(n);
	free(n);
}

void write_stuff_to(int fd, char *name)
{
	static char buf[500000];
	static int to_write = 5000;

	show("write %d bytes to `%s'\n", to_write, name);
	if (write(fd, buf, to_write) != to_write) {
		fprintf(stderr, "%s: failed to write %d bytes to `%s': %s\n",
			progname, to_write, name, strerror(errno));
		exit(1);
	}
	to_write *= 1.1;
	if (to_write > 250000)
		to_write = 5000;
}

void unlink_one_file(int fileno, int pass)
{
	if (do_unlink) {
		char *name = mk_new_filename(fileno, pass);

		show("unlink(%s)\n", name);
		if (unlink(name) < 0) {
			fprintf(stderr, "%s: failed to unlink `%s': %s\n",
				progname, name, strerror(errno));
			exit(1);
		}
		sync_dir();
		free(name);
	}
}

void do_one_file(int fileno)
{
	char *name = mk_filename(fileno);
	int fd, flags;

	flags = O_RDWR|O_CREAT|O_TRUNC;
	if (use_osync)
		flags |= O_SYNC;

	show("open(%s)\n", name);
	fd = open(name, flags, 0666);
	if (fd < 0) {
		fprintf(stderr, "%s: failed to create file `%s': %s\n",
			progname, name, strerror(errno));
		exit(1);
	}
	write_stuff_to(fd, name);
	if (use_fsync) {
		show("fsync(%s)\n", name);
		if (fsync(fd) < 0) {
			fprintf(stderr, "%s: failed to fsync `%s': %s\n",
				progname, name, strerror(errno));
			exit(1);
		}
	}
	show("close(%s)\n", name);
	if (close(fd) < 0) {
		fprintf(stderr, "%s: failed to close `%s': %s\n",
			progname, name, strerror(errno));
		exit(1);
	}
	sync_dir();

	for (rename_pass = 0; rename_pass < RENAME_PASSES; rename_pass++) {
		char *newname = mk_new_filename(fileno, rename_pass);

		show("rename(%s, %s)\n", name, newname);
		if (rename(name, newname) < 0) {
			fprintf(stderr,
				"%s: failed to rename `%s' to `%s': %s\n",
				progname, name, newname, strerror(errno));
			exit(1);
		}
		sync_dir();
		free(name);
		name = newname;
	}
	rename_pass--;
	free(name);
}

void do_child(void)
{
	int fileno;
	char *dn = mk_dirname();
	int dotcount;

	dir_fd = open(dn, O_RDONLY);
	if (dir_fd < 0) {
		fprintf(stderr, "%s: failed to open dir `%s': %s\n",
			progname, dn, strerror(errno));
		exit(1);
	}
	free(dn);

	dotcount = N_FILES / 10;
	if (dotcount == 0)
		dotcount = 1;

	for (fileno = 0; fileno < N_FILES; fileno++) {
		if (fileno % dotcount == 0) {
			printf(".");
			fflush(stdout);
		}
		do_one_file(fileno);
		if (fileno >= UNLINK_LAG)
			unlink_one_file(fileno - UNLINK_LAG,
					RENAME_PASSES - 1);
	}
	for (fileno = N_FILES - UNLINK_LAG; fileno < N_FILES; fileno++)
		unlink_one_file(fileno, RENAME_PASSES - 1);
}

void doit(void)
{
	int child;
	int children_left;

	child_status = (int *)mmap(0, n_threads * sizeof(*child_status),
				   PROT_READ|PROT_WRITE,
				   MAP_SHARED|MAP_ANONYMOUS, -1, 0);
	if (child_status == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	memset(child_status, 0, n_threads * sizeof(*child_status));

	thread_group = -1;
	for (this_child_index = 0;
	     this_child_index < n_threads;
	     this_child_index++) {
		if (this_child_index % threads_per_dir == 0) {
			thread_group++;
			make_dir();
		}
		if (fork() == 0) {
			int iter;

			for (iter = 0; iter < n_iters; iter++)
				do_child();
			child_status[this_child_index] = 1;
			exit(0);
		}
	}

	/* Parent */
	children_left = n_threads;
	while (children_left) {
		int status;

		if (wait3(&status, 0, 0) < 0) {
			if (errno != EINTR) {
				perror("wait3");
				exit(1);
			}
			continue;
		}
		for (child = 0; child < n_threads; child++) {
			if (child_status[child] == 1) {
				child_status[child] = 2;
				printf("*");
				fflush(stdout);
				children_left--;
			}
		}
	}
	for (thread_group = 0;
	     thread_group < (n_threads / threads_per_dir);
	     thread_group++)
		remove_dir();
	printf("\n");
}

void usage(void)
{
	fprintf(stderr,
		"Usage: %s [-fFosuv] [-p threads-per-dir] [-n iters] [-t threads] dirname\n",
		progname);
	fprintf(stderr, "     -f:        Use fsync() on close\n");
	fprintf(stderr, "     -F:        Use fsync() on parent dir\n");
	fprintf(stderr, "     -n:        Number of iterations\n");
	fprintf(stderr, "     -o:        Open files O_SYNC\n");
	fprintf(stderr, "     -p:        Number of threads per directory\n");
	fprintf(stderr, "     -t:        Number of threads\n");
	fprintf(stderr, "     -u:        Unlink files during test\n");
	fprintf(stderr, "     -v:        Verbose\n");
	fprintf(stderr, "     dirname:   Directory to run tests in\n");
	exit(1);
}

int main(int argc, char *argv[])
{
	int c;

	progname = argv[0];
	while ((c = getopt(argc, argv, "vFfout:n:p:")) != -1) {
		switch (c) {
		case 'f':
			use_fsync++;
			break;
		case 'F':
			fsync_dir++;
			break;
		case 'n':
			n_iters = strtol(optarg, NULL, 10);
			break;
		case 'o':
			use_osync++;
			break;
		case 'p':
			threads_per_dir = strtol(optarg, NULL, 10);
			break;
		case 't':
			n_threads = strtol(optarg, NULL, 10);
			break;
		case 'u':
			do_unlink++;
			break;
		case 'v':
			verbose++;
			break;
		}
	}
	if (optind == argc)
		usage();
	dirname = argv[optind++];
	if (optind != argc)
		usage();
	doit();
	exit(0);
}
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Daniel Phillips @ 2002-07-13 4:44 UTC (permalink / raw)
To: Dax Kelson, linux-kernel

On Friday 12 July 2002 18:21, Dax Kelson wrote:
> Any suggestions or comments appreciated.

"it is clear that IF your server is stable and not prone to crashing,
and/or you have the write cache on your hard drives battery backed, you
should strongly consider using the writeback journaling mode of Ext3
versus ordered."

You probably want to suggest a UPS there rather than a battery-backed
disk cache, since the writeback caching is predominantly on the cpu
side.

--
Daniel
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Dax Kelson @ 2002-07-14 20:40 UTC (permalink / raw)
To: linux-kernel

On Fri, 2002-07-12 at 10:21, Dax Kelson wrote:
> Any suggestions or comments appreciated.

Thanks for the feedback.  Look for more testing from us soon addressing
the suggestions brought up.

Dax
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Sam Vilain @ 2002-07-15 8:26 UTC (permalink / raw)
To: Dax Kelson; +Cc: linux-kernel

Dax Kelson <dax@gurulabs.com> wrote:
> > Any suggestions or comments appreciated.
> Thanks for the feedback.  Look for more testing from us soon addressing
> the suggestions brought up.

One more thing - can I just make the comment that testing freshly
formatted filesystems is not going to show up ext2's real weaknesses,
which appear on old filesystems - particularly those that have been
allowed to fill up.

I timed *15 minutes* for a system I admin to unlink a single 1G file on
a fairly old ext2 filesystem the other day (perhaps ext3 would have
improved this, I'm not sure).  It took 30 minutes to scan a snort log
directory on ext2, but less than 2 minutes on reiser - and only 3
seconds once it was in the buffercache.

You are testing for a mail server - how many mailboxes are in your
spool directory for the tests?  Try it with about five to ten thousand
mailboxes and see how your results vary.

--
Sam Vilain, sam@vilain.net     WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E       GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

  Although Mr Chavez 'was democratically elected,' one had to bear in
  mind that 'Legitimacy is something that is conferred not just by a
  majority of the voters'" - The office of George "Dubya" Bush
  commenting on the Venezuelan election
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Alan Cox @ 2002-07-15 12:30 UTC (permalink / raw)
To: Sam Vilain; +Cc: Dax Kelson, linux-kernel

On Mon, 2002-07-15 at 09:26, Sam Vilain wrote:
> You are testing for a mail server - how many mailboxes are in your spool
> directory for the tests?  Try it with about five to ten thousand
> mailboxes and see how your results vary.

If your mail server can't get hierarchical mail spools right, get one
that can.

Alan
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Sam Vilain @ 2002-07-15 12:02 UTC (permalink / raw)
To: Alan Cox; +Cc: dax, linux-kernel

Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > You are testing for a mail server - how many mailboxes are in your spool
> > directory for the tests?  Try it with about five to ten thousand
> > mailboxes and see how your results vary.
> If your mail server can't get hierarchical mail spools right, get one
> that can.

Translation:

"Yes, we know that there is no directory hashing in ext2/3.  You'll have
to find another solution to the problem, I'm afraid.  Why not ease the
burden on the filesystem by breaking up the task for it, and giving it
to it in small pieces?  That way it's much less likely to choke."

:-)

Sure, you could set up hierarchical mail spools.  But it sure stinks of
a temporary solution to a long-term problem.  What about the next
application that grows to massive proportions?

Hey, while I've got your attention, how do you go about debugging your
kernel?  I'm trying to add fair scheduling to the new O(1) scheduler -
something of a token bucket filter counting jiffies used by a
process/user/s_context (in scheduler_tick()) and tweaking their
priority accordingly (in effective_prio()).  It'd be really nice if I
could run it under UML or something like that so I can trace through it
with gdb, but I couldn't get the UML patch to apply to your tree.  Any
hints?

--
Sam Vilain, sam@vilain.net     WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E       GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Alan Cox @ 2002-07-15 13:23 UTC (permalink / raw)
To: Sam Vilain; +Cc: dax, linux-kernel

On Mon, 2002-07-15 at 13:02, Sam Vilain wrote:
> "Yes, we know that there is no directory hashing in ext2/3.  You'll have
> to find another solution to the problem, I'm afraid.  Why not ease the
> burden on the filesystem by breaking up the task for it, and giving it
> to it in small pieces?  That way it's much less likely to choke."

Actually there are several other reasons for it.  It sucks a lot less
when you need to use ls and friends to inspect part of the spool.  It
also makes it much easier to split the mail spool over multiple disks
as it grows, without having to backup/restore the spool area.

> Sure, you could set up hierarchical mail spools.  But it sure stinks of
> a temporary solution to a long-term problem.  What about the next
> application that grows to massive proportions?

JFS?

> Hey, while I've got your attention, how do you go about debugging your
> kernel?  I'm trying to add fair scheduling to the new O(1) scheduler -
> something of a token bucket filter counting jiffies used by a
> process/user/s_context (in scheduler_tick()) and tweaking their
> priority accordingly (in effective_prio()).  It'd be really nice if I
> could run it under UML or something like that so I can trace through it
> with gdb, but I couldn't get the UML patch to apply to your tree.  Any
> hints?

The UML tree and my tree don't quite merge easily.  Your best bet is to
grab the Red Hat Limbo beta packages for the kernel source, which if I
remember rightly are both -ac based and include the option to build UML.

Alan
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Chris Mason @ 2002-07-15 13:40 UTC (permalink / raw)
To: Alan Cox; +Cc: Sam Vilain, dax, linux-kernel

On Mon, 2002-07-15 at 09:23, Alan Cox wrote:
> On Mon, 2002-07-15 at 13:02, Sam Vilain wrote:
> > "Yes, we know that there is no directory hashing in ext2/3.  You'll have
> > to find another solution to the problem, I'm afraid.  Why not ease the
> > burden on the filesystem by breaking up the task for it, and giving it
> > to it in small pieces?  That way it's much less likely to choke."
>
> Actually there are several other reasons for it.  It sucks a lot less
> when you need to use ls and friends to inspect part of the spool.  It
> also makes it much easier to split the mail spool over multiple disks
> as it grows, without having to backup/restore the spool area.

Another good reason is i_sem.  If you've got more than one process
doing something to that directory, you spend lots of time waiting for
the semaphore.

I think it was Andrew that reminded me i_sem is held during fsync, so
fsync(dir) to make things safe after a rename can slow things down.
reiserfs only needs fsync(file); ext3 needs fsync(anything on the fs).
If ext3 would promise to make fsync(file) sufficient forever, it might
help the MTA authors tune.

-chris
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Andrew Morton @ 2002-07-15 19:40 UTC (permalink / raw)
To: Chris Mason; +Cc: Alan Cox, Sam Vilain, dax, linux-kernel

Chris Mason wrote:
> ...
> If ext3 would promise to make fsync(file) sufficient forever, it might
> help the MTA authors tune.

ext3 promises.  This side-effect is bolted firmly into the design of
ext3 and it's hard to see any way in which it will go away.
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Andrea Arcangeli @ 2002-07-15 15:12 UTC (permalink / raw)
To: Sam Vilain; +Cc: Alan Cox, dax, linux-kernel, Jeff Dike

On Mon, Jul 15, 2002 at 01:02:01PM +0100, Sam Vilain wrote:
> Hey, while I've got your attention, how do you go about debugging your
> kernel?  I'm trying to add fair scheduling to the new O(1) scheduler -
> something of a token bucket filter counting jiffies used by a
> process/user/s_context (in scheduler_tick()) and tweaking their
> priority accordingly (in effective_prio()).  It'd be really nice if I
> could run it under UML or something like that so I can trace through it
> with gdb, but I couldn't get the UML patch to apply to your tree.  Any
> hints?

-aa ships with both uml and the O(1) scheduler.  I need uml for
everything non-hardware-related, so expect it to always be up to date
there.  However, since I merged the O(1) scheduler there is the
annoyance that sometimes wakeup events don't arrive until kupdate
reschedules, or something like that (of course only with uml, not with
real hardware).  Also, pressing keys is enough to unblock it.  I haven't
debugged it hard yet.  According to Jeff it's a problem with cli that
masks signals.

Andrea
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Andreas Dilger @ 2002-07-15 16:03 UTC (permalink / raw)
To: Sam Vilain; +Cc: Alan Cox, dax, linux-kernel

On Jul 15, 2002 13:02 +0100, Sam Vilain wrote:
> "Yes, we know that there is no directory hashing in ext2/3.  You'll
> have to find another solution to the problem, I'm afraid.  Why not ease
> the burden on the filesystem by breaking up the task for it, and giving
> it to it in small pieces?  That way it's much less likely to choke."

Amusingly, there IS directory hashing available for ext2 and ext3, and
it is just as fast as reiserfs hashed directories.  See:

http://people.nl.linux.org/~phillips/htree/paper/htree.html

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Daniel Phillips @ 2002-07-15 16:12 UTC (permalink / raw)
To: Andreas Dilger, Sam Vilain; +Cc: Alan Cox, dax, linux-kernel

On Monday 15 July 2002 18:03, Andreas Dilger wrote:
> Amusingly, there IS directory hashing available for ext2 and ext3, and
> it is just as fast as reiserfs hashed directories.  See:
>
> http://people.nl.linux.org/~phillips/htree/paper/htree.html

Faster, last time I checked.  I really must test against XFS and JFS at
some point.

--
Daniel
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Sam Vilain @ 2002-07-15 17:48 UTC (permalink / raw)
To: Andreas Dilger; +Cc: dax, linux-kernel

Andreas Dilger <adilger@clusterfs.com> wrote:
> Amusingly, there IS directory hashing available for ext2 and ext3, and
> it is just as fast as reiserfs hashed directories.  See:
> http://people.nl.linux.org/~phillips/htree/paper/htree.html

You learn something new every day.  So, with that in mind - what has
reiserfs got that ext2 doesn't?

  - tail merging, giving much more efficient space usage for lots of
    small files.
  - B*Tree allocation, offering "a 1/3rd reduction in internal
    fragmentation in return for slightly more complicated insertion
    and deletion algorithms" (from the htree paper).
  - online resizing in the main kernel (ext2 needs a patch -
    http://ext2resize.sourceforge.net/).
  - resizing does not require the use of `ext2prepare' run on the
    filesystem while unmounted to resize over arbitrary boundaries.
  - directory hashing in the main kernel.

On the flipside, ext2 over reiserfs:

  - support for attributes without a patch or a 2.4.19-pre4+ kernel.
  - support for filesystem quotas without a patch.
  - there is a `dump' command (but it's useless, because it hangs when
    you run it on mounted filesystems - come on, who REALLY unmounts
    their filesystems for a nightly dump?  You need a 3-way mirror to
    do it while guaranteeing filesystem availability...).

I'd be very interested in seeing postmark results without the
hierarchical directory structure (which an unpatched postfix doesn't
support), with about 5000 mailboxes, with and without the htree patch
(or with the htree patch but without that directory indexed, if that
is possible).

--
Sam Vilain, sam@vilain.net     WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E       GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

  Try to be the best of what you are, even if what you are is no good.
  ASHLEIGH BRILLIANT
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Mathieu Chouquet-Stringer @ 2002-07-15 18:47 UTC (permalink / raw)
To: Sam Vilain; +Cc: linux-kernel

sam@vilain.net (Sam Vilain) writes:
> - there is a `dump' command (but it's useless, because it hangs when you
>   run it on mounted filesystems - come on, who REALLY unmounts their
>   filesystems for a nightly dump?  You need a 3-way mirror to do it
>   while guaranteeing filesystem availability...)

According to everybody, dump is deprecated (and it shouldn't work
reliably with 2.4; in two words: "forget it")...

--
Mathieu Chouquet-Stringer              E-Mail: mathieu@newview.com
  It is exactly because a man cannot do a thing that he is a proper
  judge of it.  -- Oscar Wilde
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Sam Vilain @ 2002-07-15 19:26 UTC (permalink / raw)
To: Mathieu Chouquet-Stringer; +Cc: linux-kernel

Mathieu Chouquet-Stringer <mathieu@newview.com> wrote:
> > - there is a `dump' command (but it's useless, because it hangs when you
> >   run it on mounted filesystems - come on, who REALLY unmounts their
> >   filesystems for a nightly dump?  You need a 3-way mirror to do it
> >   while guaranteeing filesystem availability...)
> According to everybody, dump is deprecated (and it shouldn't work
> reliably with 2.4; in two words: "forget it")...

It's a shame, because `tar' doesn't save things like inode attributes
and places unnecessary load on the VFS layer.  It also takes
considerably longer than dump did on one backup server I admin - like
~12 hours to back up ~26G in ~414k inodes to a tape capable of about
1MB/sec.  But that's probably the old directory hashing thing again;
there are some reeeeaaallllllly large directories on that machine...

Ah, the joys of legacy.

--
Sam Vilain, sam@vilain.net     WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E       GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

  If you think the United States has stood still, who built the largest
  shopping center in the world?
  RICHARD M NIXON
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks
From: Stelian Pop @ 2002-07-16 8:18 UTC (permalink / raw)
To: Mathieu Chouquet-Stringer; +Cc: linux-kernel

On Mon, Jul 15, 2002 at 02:47:04PM -0400, Mathieu Chouquet-Stringer wrote:
> According to everybody, dump is deprecated (and it shouldn't work
> reliably with 2.4; in two words: "forget it")...

This needs to be "according to Linus, dump is deprecated".  Given the
interest Linus has manifested in backups, I wouldn't really rely on his
statement :-)

Stelian.
--
Stelian Pop <stelian.pop@fr.alcove.com>
Alcove - http://www.alcove.com
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 8:18 ` Stelian Pop @ 2002-07-16 12:22 ` Gerhard Mack 2002-07-16 12:49 ` Stelian Pop 0 siblings, 1 reply; 90+ messages in thread From: Gerhard Mack @ 2002-07-16 12:22 UTC (permalink / raw) To: Stelian Pop; +Cc: Mathieu Chouquet-Stringer, linux-kernel On Tue, 16 Jul 2002, Stelian Pop wrote: > On Mon, Jul 15, 2002 at 02:47:04PM -0400, Mathieu Chouquet-Stringer wrote: > > > According to everybody, dump is deprecated (and it shouldn't work reliably > > with 2.4, in two words: "forget it")... > > This needs to be "according to Linus, dump is deprecated". Given the > interest Linus has manifested for backups, I wouldn't really rely on > his statement :-) Either way dump is not likely to give you a reliable backup when used with a 2.4.x kernel. Gerhard -- Gerhard Mack gmack@innerfire.net <>< As a computer I find your faith in technology amusing. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 12:22 ` Gerhard Mack @ 2002-07-16 12:49 ` Stelian Pop 2002-07-16 15:11 ` Gerhard Mack 0 siblings, 1 reply; 90+ messages in thread From: Stelian Pop @ 2002-07-16 12:49 UTC (permalink / raw) To: Gerhard Mack; +Cc: Mathieu Chouquet-Stringer, linux-kernel On Tue, Jul 16, 2002 at 08:22:53AM -0400, Gerhard Mack wrote: > > This needs to be "according to Linus, dump is deprecated". Given the > > interest Linus has manifested for backups, I wouldn't really rely on > > his statement :-) > > Either way dump is not likely to give you a reliable backup when used > with a 2.4.x kernel. Since you are so well informed, maybe you could share your knowledge with us. I'm the dump maintainer, so I'll be very interested in knowing how it is that dump works for me and many other users... :-) Stelian. -- Stelian Pop <stelian.pop@fr.alcove.com> Alcove - http://www.alcove.com ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 12:49 ` Stelian Pop @ 2002-07-16 15:11 ` Gerhard Mack 2002-07-16 15:22 ` Andrea Arcangeli 2002-07-16 15:39 ` Stelian Pop 0 siblings, 2 replies; 90+ messages in thread From: Gerhard Mack @ 2002-07-16 15:11 UTC (permalink / raw) To: Stelian Pop; +Cc: Mathieu Chouquet-Stringer, linux-kernel On Tue, 16 Jul 2002, Stelian Pop wrote: > Date: Tue, 16 Jul 2002 14:49:56 +0200 > From: Stelian Pop <stelian.pop@fr.alcove.com> > To: Gerhard Mack <gmack@innerfire.net> > Cc: Mathieu Chouquet-Stringer <mathieu@newview.com>, > linux-kernel@vger.kernel.org > Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks > > On Tue, Jul 16, 2002 at 08:22:53AM -0400, Gerhard Mack wrote: > > > > This needs to be "according to Linus, dump is deprecated". Given the > > > interest Linus has manifested for backups, I wouldn't really rely on > > > his statement :-) > > > > Either way dump is not likely to give you a reliable backup when used > > with a 2.4.x kernel. > > Since you are so well informed, maybe you could share your knowledge > with us. > > I'm the dump maintainer, so I'll be very interested in knowing how > comes that dump works for me and many other users... :-) > I'll save myself the trouble, since Linus said it better than I could: Note that dump simply won't work reliably at all even in 2.4.x: the buffer cache and the page cache (where all the actual data is) are not coherent. This is only going to get even worse in 2.5.x, when the directories are moved into the page cache as well. So anybody who depends on "dump" getting backups right is already playing russian rulette with their backups. It's not at all guaranteed to get the right results - you may end up having stale data in the buffer cache that ends up being "backed up". In other words you have a backup system that works some of the time or even most of the time... brilliant! Gerhard -- Gerhard Mack gmack@innerfire.net <>< As a computer I find your faith in technology amusing.
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 15:11 ` Gerhard Mack @ 2002-07-16 15:22 ` Andrea Arcangeli 2002-07-16 15:39 ` Stelian Pop 1 sibling, 0 replies; 90+ messages in thread From: Andrea Arcangeli @ 2002-07-16 15:22 UTC (permalink / raw) To: Gerhard Mack; +Cc: Stelian Pop, Mathieu Chouquet-Stringer, linux-kernel On Tue, Jul 16, 2002 at 11:11:20AM -0400, Gerhard Mack wrote: > On Tue, 16 Jul 2002, Stelian Pop wrote: > > > Date: Tue, 16 Jul 2002 14:49:56 +0200 > > From: Stelian Pop <stelian.pop@fr.alcove.com> > > To: Gerhard Mack <gmack@innerfire.net> > > Cc: Mathieu Chouquet-Stringer <mathieu@newview.com>, > > linux-kernel@vger.kernel.org > > Subject: Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks > > > > On Tue, Jul 16, 2002 at 08:22:53AM -0400, Gerhard Mack wrote: > > > > > > This needs to be "according to Linus, dump is deprecated". Given the > > > > interest Linus has manifested for backups, I wouldn't really rely on > > > > his statement :-) > > > > > > Either way dump is not likely to give you a reliable backup when used > > > with a 2.4.x kernel. > > > > Since you are so well informed, maybe you could share your knowledge > > with us. > > > > I'm the dump maintainer, so I'll be very interested in knowing how > > comes that dump works for me and many other users... :-) > > > > I'll save myself the trouble when Linus said it better than I could: > > Note that dump simply won't work reliably at all even in > 2.4.x: the buffer cache and the page cache (where all the > actual data is) are not coherent. This is only going to > get even worse in 2.5.x, when the directories are moved > into the page cache as well. > > So anybody who depends on "dump" getting backups right is > already playing russian rulette with their backups. It's > not at all guaranteed to get the right results - you may > end up having stale data in the buffer cache that ends up > being "backed up". 
> In other words you have a backup system that works some of the time or > even most of the time... brilliant! just to clarify, the above implicitly assumes the fs is mounted read-write while you're dumping it. if the fs is mounted readonly or if it's unmounted, there is no problem with dumping it. Also note that dump has the same problem with read-write mounted fs also in 2.2, and I guess in 2.0 too, it's nothing new in 2.4, it just becomes more visible the more dirty logical caches we have. Andrea ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 15:11 ` Gerhard Mack 2002-07-16 15:22 ` Andrea Arcangeli @ 2002-07-16 15:39 ` Stelian Pop 2002-07-16 19:45 ` Matthias Andree 1 sibling, 1 reply; 90+ messages in thread From: Stelian Pop @ 2002-07-16 15:39 UTC (permalink / raw) To: Gerhard Mack; +Cc: Mathieu Chouquet-Stringer, linux-kernel On Tue, Jul 16, 2002 at 11:11:20AM -0400, Gerhard Mack wrote: > In other words you have a backup system that works some of the time or > even most of the time... brilliant! Dump is a backup system that works 100% of the time when used as it was designed to: on unmounted filesystems (or mounted R/O). It is indeed brilliant to have it work, even most of the time, in conditions it wasn't designed for. Stelian. -- Stelian Pop <stelian.pop@fr.alcove.com> Alcove - http://www.alcove.com ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 15:39 ` Stelian Pop @ 2002-07-16 19:45 ` Matthias Andree 2002-07-16 20:04 ` Shawn 0 siblings, 1 reply; 90+ messages in thread From: Matthias Andree @ 2002-07-16 19:45 UTC (permalink / raw) To: linux-kernel; +Cc: Stelian Pop, Gerhard Mack, Mathieu Chouquet-Stringer On Tue, 16 Jul 2002, Stelian Pop wrote: > On Tue, Jul 16, 2002 at 11:11:20AM -0400, Gerhard Mack wrote: > > > In other words you have a backup system that works some of the time or > > even most of the time... brilliant! > > Dump is a backup system that works 100% of the time when used as > it was designed to: on unmounted filesystems (or mounted R/O). Practical question: how do I get a file system mounted R/O for backup with dump without putting that system into single-user mode? Particularly when running automated backups, this is an issue. I cannot kill all writers (syslog, Postfix, INN, CVS server, ...) on my production machines just for the sake of taking a backup. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 19:45 ` Matthias Andree @ 2002-07-16 20:04 ` Shawn 2002-07-16 20:11 ` Mathieu Chouquet-Stringer 0 siblings, 1 reply; 90+ messages in thread From: Shawn @ 2002-07-16 20:04 UTC (permalink / raw) To: linux-kernel, Stelian Pop, Gerhard Mack, Mathieu Chouquet-Stringer You don't. This is where you have a filesystem where syslog, xinetd, blogd, bloatd-config-d2, raffle-ticketd DO NOT LIVE. People forget so easily the wonders of multiple partitions. On 07/16, Matthias Andree said something like: > On Tue, 16 Jul 2002, Stelian Pop wrote: > > > On Tue, Jul 16, 2002 at 11:11:20AM -0400, Gerhard Mack wrote: > > > > > In other words you have a backup system that works some of the time or > > > even most of the time... brilliant! > > > > Dump is a backup system that works 100% of the time when used as > > it was designed to: on unmounted filesystems (or mounted R/O). > > Practical question: how do I get a file system mounted R/O for backup > with dump without putting that system into single-user mode? > Particularly when running automated backups, this is an issue. I cannot > kill all writers (syslog, Postfix, INN, CVS server, ...) on my > production machines just for the sake of taking a backup. -- Shawn Leas core@enodev.com So, do you live around here often? -- Stephen Wright ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 20:04 ` Shawn @ 2002-07-16 20:11 ` Mathieu Chouquet-Stringer 2002-07-16 20:22 ` Shawn 0 siblings, 1 reply; 90+ messages in thread From: Mathieu Chouquet-Stringer @ 2002-07-16 20:11 UTC (permalink / raw) To: Shawn; +Cc: linux-kernel, Stelian Pop, Gerhard Mack On Tue, Jul 16, 2002 at 03:04:22PM -0500, Shawn wrote: > You don't. > > This is where you have a filesystem where syslog, xinetd, blogd, > bloatd-config-d2, raffle-ticketd DO NOT LIVE. > > People forget so easily the wonders of multiple partitions. I'm sorry, but I don't understand how it's going to change anything. For sure, it makes your life easier because you don't have to shutdown all your programs that have files opened in R/W mode. But in the end, you will have to shutdown something to remount the partition in R/O mode and usually you don't want or can't afford to do that. -- Mathieu Chouquet-Stringer E-Mail : mathieu@newview.com It is exactly because a man cannot do a thing that he is a proper judge of it. -- Oscar Wilde ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 20:11 ` Mathieu Chouquet-Stringer @ 2002-07-16 20:22 ` Shawn 2002-07-16 20:27 ` Mathieu Chouquet-Stringer 2002-07-17 11:45 ` Matthias Andree 0 siblings, 2 replies; 90+ messages in thread From: Shawn @ 2002-07-16 20:22 UTC (permalink / raw) To: Mathieu Chouquet-Stringer, Shawn, linux-kernel, Stelian Pop, Gerhard Mack In this case, can you use a RAID mirror or something, then break it? Also, there's the LVM snapshot at the block layer someone already mentioned, which when used with smaller partitions is less overhead. (less FS delta) This problem isn't that complex. On 07/16, Mathieu Chouquet-Stringer said something like: > On Tue, Jul 16, 2002 at 03:04:22PM -0500, Shawn wrote: > > You don't. > > > > This is where you have a filesystem where syslog, xinetd, blogd, > > bloatd-config-d2, raffle-ticketd DO NOT LIVE. > > > > People forget so easily the wonders of multiple partitions. > > I'm sorry, but I don't understand how it's going to change anything. For > sure, it makes your life easier because you don't have to shutdown all your > programs that have files opened in R/W mode. But in the end, you will have > to shutdown something to remount the partition in R/O mode and usually you > don't want or can't afford to do that. > > -- > Mathieu Chouquet-Stringer E-Mail : mathieu@newview.com > It is exactly because a man cannot do a thing that he is a > proper judge of it. > -- Oscar Wilde -- Shawn Leas core@enodev.com I bought my brother some gift-wrap for Christmas. I took it to the Gift Wrap department and told them to wrap it, but in a different print so he would know when to stop unwrapping. -- Stephen Wright ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 20:22 ` Shawn @ 2002-07-16 20:27 ` Mathieu Chouquet-Stringer 2002-07-17 11:45 ` Matthias Andree 1 sibling, 0 replies; 90+ messages in thread From: Mathieu Chouquet-Stringer @ 2002-07-16 20:27 UTC (permalink / raw) To: Shawn; +Cc: linux-kernel, Stelian Pop, Gerhard Mack On Tue, Jul 16, 2002 at 03:22:31PM -0500, Shawn wrote: > In this case, can you use a RAID mirror or something, then break it? > > Also, there's the LVM snapshot at the block layer someone already > mentioned, which when used with smaller partions is less overhead. > (less FS delta) > > This problem isn't that complex. I agree but I guess that if Matthias asked the question that way, he probably meant he doesn't have a raid mirror or "something" (as you say)... If you didn't plan your install (meaning you don't have the nice raid or anything else), you're basically screwed... -- Mathieu Chouquet-Stringer E-Mail : mathieu@newview.com It is exactly because a man cannot do a thing that he is a proper judge of it. -- Oscar Wilde ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 20:22 ` Shawn 2002-07-16 20:27 ` Mathieu Chouquet-Stringer @ 2002-07-17 11:45 ` Matthias Andree 2002-07-17 19:02 ` Andreas Dilger 1 sibling, 1 reply; 90+ messages in thread From: Matthias Andree @ 2002-07-17 11:45 UTC (permalink / raw) To: linux-kernel On Tue, 16 Jul 2002, Shawn wrote: > In this case, can you use a RAID mirror or something, then break it? > > Also, there's the LVM snapshot at the block layer someone already > mentioned, which when used with smaller partions is less overhead. > (less FS delta) All these "solutions" don't work out, I cannot remount R/O my partition, and LVM low-level snapshots or breaking a RAID mirror simply won't work out. I would have to remount r/o the partition to get a consistent image in the first place, so the first step must fail already... ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-17 11:45 ` Matthias Andree @ 2002-07-17 19:02 ` Andreas Dilger 2002-07-18 9:29 ` Matthias Andree 2002-07-19 8:29 ` Matthias Andree 0 siblings, 2 replies; 90+ messages in thread From: Andreas Dilger @ 2002-07-17 19:02 UTC (permalink / raw) To: linux-kernel On Jul 17, 2002 13:45 +0200, Matthias Andree wrote: > On Tue, 16 Jul 2002, Shawn wrote: > > In this case, can you use a RAID mirror or something, then break it? > > > > Also, there's the LVM snapshot at the block layer someone already > > mentioned, which when used with smaller partions is less overhead. > > (less FS delta) > > All these "solutions" don't work out, I cannot remount R/O my partition, > and LVM low-level snapshots or breaking a RAID mirror simply won't work > out. I would have to remount r/o the partition to get a consistent image > in the first place, so the first step must fail already... Have you been reading my emails at all? LVM snapshots DO ensure that the snapshot filesystem is consistent for journaled filesystems. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-17 19:02 ` Andreas Dilger @ 2002-07-18 9:29 ` Matthias Andree 0 siblings, 0 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-18 9:29 UTC (permalink / raw) To: linux-kernel On Wed, 17 Jul 2002, Andreas Dilger wrote: > On Jul 17, 2002 13:45 +0200, Matthias Andree wrote: > > On Tue, 16 Jul 2002, Shawn wrote: > > > In this case, can you use a RAID mirror or something, then break it? > > > > > > Also, there's the LVM snapshot at the block layer someone already > > > mentioned, which when used with smaller partions is less overhead. > > > (less FS delta) > > > > All these "solutions" don't work out, I cannot remount R/O my partition, > > and LVM low-level snapshots or breaking a RAID mirror simply won't work > > out. I would have to remount r/o the partition to get a consistent image > > in the first place, so the first step must fail already... > > Have you been reading my emails at all? LVM snapshots DO ensure that > the snapshot filesystem is consistent for journaled filesystems. My apologies: I have been busy and only reading partial threads, and had not come across your LVM-snapshot related mails when I wrote the previous mail. -- Matthias Andree ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-17 19:02 ` Andreas Dilger 2002-07-18 9:29 ` Matthias Andree @ 2002-07-19 8:29 ` Matthias Andree 2002-07-19 16:39 ` Andreas Dilger 1 sibling, 1 reply; 90+ messages in thread From: Matthias Andree @ 2002-07-19 8:29 UTC (permalink / raw) To: linux-kernel On Wed, 17 Jul 2002, Andreas Dilger wrote: > On Jul 17, 2002 13:45 +0200, Matthias Andree wrote: > > On Tue, 16 Jul 2002, Shawn wrote: > > > In this case, can you use a RAID mirror or something, then break it? > > > > > > Also, there's the LVM snapshot at the block layer someone already > > > mentioned, which when used with smaller partions is less overhead. > > > (less FS delta) > > > > All these "solutions" don't work out, I cannot remount R/O my partition, > > and LVM low-level snapshots or breaking a RAID mirror simply won't work > > out. I would have to remount r/o the partition to get a consistent image > > in the first place, so the first step must fail already... > > Have you been reading my emails at all? LVM snapshots DO ensure that > the snapshot filesystem is consistent for journaled filesystems. What kernel version is necessary to achieve this on production kernels (i. e. 2.4)? Does "consistent" mean "fsck proof"? Here's what I tried, on Linux-2.4.19-pre10-ac3 (IIRC) (ext3fs): (from memory, history not available, different machine): lvcreate --snapshot snap /dev/vg0/home e2fsck -f /dev/vg0/snap dump -0 ... It reported zero dtime for one file and two bitmap differences. Does "consistent" mean "consistent after you replay the log"? If so, that's still a losing game, because I cannot fsck the snapshot (it's R/O in the LVM case at least) to replay the journal -- and I don't assume dump 0.4b29 (which I'm using) goes fishing in the journal, though I have not checked the dump source code. dump did not complain however, and given what e2fsck had to complain about, I'd happily force-mount such a file system when just a deletion has not completed.
-- Matthias Andree ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-19 8:29 ` Matthias Andree @ 2002-07-19 16:39 ` Andreas Dilger 2002-07-19 20:01 ` Shawn 0 siblings, 1 reply; 90+ messages in thread From: Andreas Dilger @ 2002-07-19 16:39 UTC (permalink / raw) To: linux-kernel On Jul 19, 2002 10:29 +0200, Matthias Andree wrote: > What kernel version is necessary to achieve this on production kernels > (i. e. 2.4)? > > Does "consistent" mean "fsck proof"? > > Here's what I tried, on Linux-2.4.19-pre10-ac3 (IIRC) (ext3fs): > > (from memory, history not available, different machine): > lvcreate --snapshot snap /dev/vg0/home > e2fsck -f /dev/vg0/snap > dump -0 ... > > It reported zero dtime for one file and two bitmap differences. That is because one critical piece is missing from 2.4, the VFS lock patch. It is part of the LVM sources at sistina.com. Chris Mason has been trying to get it in, but it is delayed until 2.4.19 is out. > dump did not complain however, and given what e2fsck had to complain, > I'd happily force mount such a file system when just a deletion has not > completed. You cannot mount a dirty ext3 filesystem from read-only media. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ ^ permalink raw reply [flat|nested] 90+ messages in thread
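For reference, the snapshot-backup sequence being debated here can be sketched in a few commands. The volume group, snapshot size, and tape device names below are hypothetical, the whole sequence needs root and the LVM tools, and on 2.4 it also needs the Sistina VFS-lock patch Andreas mentions for the snapshot to be consistent:

```shell
# Hypothetical names throughout: vg0/home, a 512M snapshot, /dev/nst0 tape.
# Create a read-only snapshot of the live logical volume:
lvcreate --snapshot --size 512M --name snap /dev/vg0/home

# Optional sanity check; -n keeps e2fsck read-only so it works on the
# read-only snapshot device:
e2fsck -fn /dev/vg0/snap

# Dump the quiescent snapshot instead of the live filesystem:
dump -0 -f /dev/nst0 /dev/vg0/snap

# Drop the snapshot once the backup is done:
lvremove -f /dev/vg0/snap
```

This is only a sketch of the workflow under discussion, not something the thread participants posted verbatim; a real backup script would also check exit codes and size the snapshot to survive the write load during the dump.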
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-19 16:39 ` Andreas Dilger @ 2002-07-19 20:01 ` Shawn 2002-07-19 20:47 ` Andreas Dilger 0 siblings, 1 reply; 90+ messages in thread From: Shawn @ 2002-07-19 20:01 UTC (permalink / raw) To: linux-kernel On 07/19, Andreas Dilger said something like: > On Jul 19, 2002 10:29 +0200, Matthias Andree wrote: > > What kernel version is necessary to achieve this on production kernels > > (i. e. 2.4)? > > > > Does "consistent" mean "fsck proof"? > > > > Here's what I tried, on Linux-2.4.19-pre10-ac3 (IIRC) (ext3fs): > > > > (from memory, history not available, different machine): > > lvcreate --snapshot snap /dev/vg0/home > > e2fsck -f /dev/vg0/snap > > dump -0 ... > > > > It reported zero dtime for one file and two bitmap differences. > > That is because one critical piece is missing from 2.4, the VFS lock > patch. It is part of the LVM sources at sistina.com. Chris Mason has > been trying to get it in, but it is delayed until 2.4.19 is out. > > > dump did not complain however, and given what e2fsck had to complain, > > I'd happily force mount such a file system when just a deletion has not > > completed. > > You cannot mount a dirty ext3 filesystem from read-only media. I thought you could "mount -t ext2" ext3 volumes, and thought you could force mount ext2. I'm no Andreas Dilger, so don't take this like I'm disagreeing... -- Shawn Leas core@enodev.com I went to the bank and asked to borrow a cup of money. They said, "What for?" I said, "I'm going to buy some sugar." -- Stephen Wright ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-19 20:01 ` Shawn @ 2002-07-19 20:47 ` Andreas Dilger 0 siblings, 0 replies; 90+ messages in thread From: Andreas Dilger @ 2002-07-19 20:47 UTC (permalink / raw) To: Shawn; +Cc: linux-kernel On Jul 19, 2002 15:01 -0500, Shawn wrote: > On 07/19, Andreas Dilger said something like: > > You cannot mount a dirty ext3 filesystem from read-only media. > > I thought you could "mount -t ext2" ext3 volumes, and thought you could > force mount ext2. This is true if the ext3 filesystem is unmounted cleanly. Otherwise there is a flag in the superblock which tells the kernel it can't mount the filesystem because there is something there it doesn't understand (namely the dirty journal with all of the recent changes). This flag (EXT3_FEATURE_INCOMPAT_RECOVERY) is cleared when the filesystem is unmounted properly, when e2fsck or a r/w mount recovers the journal, and not coincidentally when an LVM snapshot is created. In case you are more curious, there are a couple of paragraphs in linux/Documentation/filesystems/ext2.txt about the compat flags, which are really one of the great features of ext2. You may think that an overstatement, but without the feature flags, none of the other enhancements that have been added to ext2 over the last few years (and in the next few years too) would have been so easily done. As for mounting a dirty ext2 filesystem, yes that is possible with only a warning at mount time. That is why nobody has put much effort into adding the snapshot hooks into ext2 yet. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 17:48 ` Sam Vilain 2002-07-15 18:47 ` Mathieu Chouquet-Stringer @ 2002-07-15 21:14 ` Andreas Dilger 2002-07-17 18:41 ` bill davidsen 2002-07-16 8:15 ` [ANNOUNCE] Ext3 vs Reiserfs benchmarks Stelian Pop 2 siblings, 1 reply; 90+ messages in thread From: Andreas Dilger @ 2002-07-15 21:14 UTC (permalink / raw) To: Sam Vilain; +Cc: dax, linux-kernel On Jul 15, 2002 18:48 +0100, Sam Vilain wrote: > Andreas Dilger <adilger@clusterfs.com> wrote: > > > Amusingly, there IS directory hashing available for ext2 and ext3, and > > it is just as fast as reiserfs hashed directories. See: > > http://people.nl.linux.org/~phillips/htree/paper/htree.html > > You learn something new every day. So, with that in mind - what has > reiserfs got that ext2 doesn't? > > - tail merging, giving much more efficient space usage for lots of small > files. Well, there was a tail merging patch for ext2, but it has been dropped for now. In reality, any benchmarks with reiserfs (except the very-small-files case) will run with tail packing disabled because it kills performance. > - B*Tree allocation offering ``a 1/3rd reduction in internal > fragmentation in return for slightly more complicated insertions and > deletion algorithms'' (from the htree paper). > - online resizing in the main kernel (ext2 needs a patch - > http://ext2resize.sourceforge.net/). Yes, I wrote it... > - Resizing does not require the use of `ext2prepare' run on the > filesystem while unmounted to resize over arbitrary boundaries. That is coming this summer. It will be part of some changes to support "meta blockgroups", and the resizing comes for free at the same time. > - directory hashing in the main kernel Probably will happen in 2.5, as Andrew is already testing htree support for ext3. It is also in the ext3 CVS tree for 2.4, so I wouldn't be surprised if it shows up in 2.4 also. 
> On the flipside, ext2 over reiserfs: > > - support for attributes without a patch or 2.4.19-pre4+ kernel > - support for filesystem quotas without a patch > - there is a `dump' command (but it's useless, because it hangs when you > run it on mounted filesystems - come on, who REALLY unmounts their > filesystems for a nightly dump? You need a 3 way mirror to do it > while guaranteeing filesystem availability...) Well, the dump can only be inconsistent for files that are being changed during the dump itself. As for hanging the system, that would be a bug regardless of whether it was dump or "dd" reading from the block device. A bug related to this was fixed, probably in 2.4.19-preX somewhere. > I'd be very interested in seeing postmark results without the > hierarchical directory structure (which an unpatched postfix doesn't > support), with about 5000 mailboxes with and without the htree patch > (or with the htree patch but without that directory indexed, if that > is possible). Let me know what you find. It is possible to use an htree-patched kernel and not have indexed directories - just don't mount with "-o index". Note that there is a data-corrupting bug somewhere in the ext3 htree code, so I wouldn't suggest using indexed directories except for test. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 21:14 ` Andreas Dilger @ 2002-07-17 18:41 ` bill davidsen 2002-07-17 19:47 ` [ANNOUNCE] Ext3 vs Reiserfs benchmarks (whither dump?) Lew Wolfgang 0 siblings, 1 reply; 90+ messages in thread From: bill davidsen @ 2002-07-17 18:41 UTC (permalink / raw) To: linux-kernel In article <20020715211448.GI442@clusterfs.com>, Andreas Dilger <adilger@clusterfs.com> wrote: | Well, the dump can only be inconsistent for files that are being changed | during the dump itself. As for hanging the system, that would be a bug | regardless of whether it was dump or "dd" reading from the block device. | A bug related to this was fixed, probably in 2.4.19-preX somewhere. Any dump on a live f/s would seem to have the problem that files are changing as they are read and may not be consistent. I suppose there could be some kind of "fsync and journal lock" on a file, allowing all writes to a file to be journaled while the file is backed up. However, such things don't scale well for big files with lots of writes, and the file, while unchanging, may not be valid. Backups of running files are best done by the application, like Oracle as a for-instance. Neither the o/s nor the backup can be sure when/if the data is in a valid state. Tar has this problem, although not the same issues with data on the fly in buffers. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks (whither dump?) 2002-07-17 18:41 ` bill davidsen @ 2002-07-17 19:47 ` Lew Wolfgang 0 siblings, 0 replies; 90+ messages in thread From: Lew Wolfgang @ 2002-07-17 19:47 UTC (permalink / raw) To: linux-kernel Hi Folks, As an old dump user (dumpster?) I have to admit that we've avoided ext3 and Reiserfs because of this issue. We couldn't live without the "Tower of Hanoi". I remember using, many years ago (SunOS 3.4), a patched dump binary that allowed safe dumps from live UFS filesystems. I don't remember all the details (it was 16 years ago) but this dump would somehow compare files before and after writing to tape. If there was a difference it would back out the dumped file and preserve the consistency of the tape. I don't remember if it would go back and try the file again. I haven't the foggiest notion if this would work in these modern times, I'm just offering it as food for thought. Regards, Lew Wolfgang ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 17:48 ` Sam Vilain 2002-07-15 18:47 ` Mathieu Chouquet-Stringer 2002-07-15 21:14 ` Andreas Dilger @ 2002-07-16 8:15 ` Stelian Pop 2002-07-16 12:27 ` Matthias Andree 2 siblings, 1 reply; 90+ messages in thread From: Stelian Pop @ 2002-07-16 8:15 UTC (permalink / raw) To: Sam Vilain; +Cc: dax, linux-kernel On Mon, Jul 15, 2002 at 06:48:05PM +0100, Sam Vilain wrote: > On the flipside, ext2 over reiserfs: [...] > - there is a `dump' command (but it's useless, because it hangs when you > run it on mounted filesystems - come on, who REALLY unmounts their > filesystems for a nightly dump? You need a 3 way mirror to do it > while guaranteeing filesystem availability...) dump(8) doesn't hang when dumping mounted filesystems. You are referring to a genuine bug which was fixed some time ago. However, on some rare occasions, dump can save corrupted data when saving a mounted and generally highly active filesystem. Even then, in 99% of the cases it doesn't really matter because the corrupted files will get saved by the next incremental dump. Come on, who REALLY expects to have consistent backups without either unmounting the filesystem or using some snapshot techniques ? Stelian. -- Stelian Pop <stelian.pop@fr.alcove.com> Alcove - http://www.alcove.com ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 8:15 ` [ANNOUNCE] Ext3 vs Reiserfs benchmarks Stelian Pop @ 2002-07-16 12:27 ` Matthias Andree 2002-07-16 12:43 ` Stelian Pop 0 siblings, 1 reply; 90+ messages in thread From: Matthias Andree @ 2002-07-16 12:27 UTC (permalink / raw) To: linux-kernel; +Cc: Stelian Pop, Sam Vilain, dax On Tue, 16 Jul 2002, Stelian Pop wrote: > Come on, who REALLY expects to have consistent backups without > either unmounting the filesystem or using some snapshot techniques ? Those who use [s|g]tar, cpio, afio, dsmc (Tivoli distributed storage manager), ... Low-level snapshots don't do any good, they just freeze the "halfway there" on-disk structure. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 12:27 ` Matthias Andree @ 2002-07-16 12:43 ` Stelian Pop 2002-07-16 12:53 ` Matthias Andree 0 siblings, 1 reply; 90+ messages in thread From: Stelian Pop @ 2002-07-16 12:43 UTC (permalink / raw) To: linux-kernel, Sam Vilain, dax On Tue, Jul 16, 2002 at 02:27:56PM +0200, Matthias Andree wrote: > > Come on, who REALLY expects to have consistent backups without > > either unmounting the filesystem or using some snapshot techniques? > > Those who use [s|g]tar, cpio, afio, dsmc (Tivoli distributed storage > manager), ... > > Low-level snapshots don't do any good, they just freeze the "halfway > there" on-disk structure. But [s|g]tar, cpio, afio (don't know about dsmc) also freeze the "halfway there" data, but at the file level instead (application instead of filesystem)... Stelian. -- Stelian Pop <stelian.pop@fr.alcove.com> Alcove - http://www.alcove.com ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 12:43 ` Stelian Pop @ 2002-07-16 12:53 ` Matthias Andree 2002-07-16 13:05 ` Christoph Hellwig 2002-07-17 18:51 ` [ANNOUNCE] Ext3 vs Reiserfs benchmarks bill davidsen 0 siblings, 2 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-16 12:53 UTC (permalink / raw) To: linux-kernel; +Cc: Stelian Pop, Sam Vilain, dax On Tue, 16 Jul 2002, Stelian Pop wrote: > > Low-level snapshots don't do any good, they just freeze the "halfway > > there" on-disk structure. > > But [s|g]tar, cpio, afio (don't know about dsmc) also freeze the > "halfway there" data, but at the file level instead (application > instead of filesystem)... Not if some day somebody implements file system level snapshots for Linux. Until then, better have garbled file contents constrained to a file than random data as on-disk layout changes with hefty directory updates. dsmc fstat()s the file it is currently reading regularly and retries the dump as the file changes, and gives up if it is updated too often. Not sure about the server side, and certainly not a useful option for sequential devices that you directly write on. Looks like a cache for the biggest file is necessary. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 12:53 ` Matthias Andree @ 2002-07-16 13:05 ` Christoph Hellwig 2002-07-16 19:38 ` Matthias Andree 2002-07-17 18:51 ` [ANNOUNCE] Ext3 vs Reiserfs benchmarks bill davidsen 1 sibling, 1 reply; 90+ messages in thread From: Christoph Hellwig @ 2002-07-16 13:05 UTC (permalink / raw) To: linux-kernel, Stelian Pop, Sam Vilain, dax On Tue, Jul 16, 2002 at 02:53:01PM +0200, Matthias Andree wrote: > Not if some day somebody implements file system level snapshots for > Linux. Until then, better have garbled file contents constrained to a > file than random data as on-disk layout changes with hefty directory > updates. or the blockdevice-level snapshots already implemented in Linux.. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 13:05 ` Christoph Hellwig @ 2002-07-16 19:38 ` Matthias Andree 2002-07-16 19:49 ` Andreas Dilger 2002-07-16 20:11 ` Thunder from the hill 0 siblings, 2 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-16 19:38 UTC (permalink / raw) To: linux-kernel; +Cc: Christoph Hellwig, Stelian Pop, Sam Vilain, dax On Tue, 16 Jul 2002, Christoph Hellwig wrote: > On Tue, Jul 16, 2002 at 02:53:01PM +0200, Matthias Andree wrote: > > Not if some day somebody implements file system level snapshots for > > Linux. Until then, better have garbled file contents constrained to a > > file than random data as on-disk layout changes with hefty directory > > updates. > > or the blockdevice-level snapshots already implemented in Linux.. That would require three atomic steps: 1. mount read-only, flushing all pending updates 2. take snapshot 3. mount read-write and then backup the snapshot. A snapshot of a live file system won't do, it can be as inconsistent as it desires -- whether your corrupt target is moving or not, dumping it is not of much use. -- Matthias Andree ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 19:38 ` Matthias Andree @ 2002-07-16 19:49 ` Andreas Dilger 2002-07-16 20:11 ` Thunder from the hill 1 sibling, 0 replies; 90+ messages in thread From: Andreas Dilger @ 2002-07-16 19:49 UTC (permalink / raw) To: linux-kernel, Christoph Hellwig, Stelian Pop, Sam Vilain, dax On Jul 16, 2002 21:38 +0200, Matthias Andree wrote: > On Tue, 16 Jul 2002, Christoph Hellwig wrote: > > On Tue, Jul 16, 2002 at 02:53:01PM +0200, Matthias Andree wrote: > > > Not if some day somebody implements file system level snapshots for > > > Linux. Until then, better have garbled file contents constrained to a > > > file than random data as on-disk layout changes with hefty directory > > > updates. > > > > or the blockdevice-level snapshots already implemented in Linux.. > > That would require three atomic steps: > > 1. mount read-only, flushing all pending updates > 2. take snapshot > 3. mount read-write > > and then backup the snapshot. A snapshots of a live file system won't > do, it can be as inconsistent as it desires -- if your corrupt target is > moving or not, dumping it is not of much use. Luckily, there is already an interface which does this - sync_supers_lockfs(), which the LVM code will use if it is patched in. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 19:38 ` Matthias Andree 2002-07-16 19:49 ` Andreas Dilger @ 2002-07-16 20:11 ` Thunder from the hill 2002-07-16 21:06 ` Matthias Andree 1 sibling, 1 reply; 90+ messages in thread From: Thunder from the hill @ 2002-07-16 20:11 UTC (permalink / raw) To: Matthias Andree Cc: linux-kernel, Christoph Hellwig, Stelian Pop, Sam Vilain, dax Hi, On Tue, 16 Jul 2002, Matthias Andree wrote: > > or the blockdevice-level snapshots already implemented in Linux.. > > That would require three atomic steps: > > 1. mount read-only, flushing all pending updates > 2. take snapshot > 3. mount read-write > > and then backup the snapshot. A snapshots of a live file system won't > do, it can be as inconsistent as it desires -- if your corrupt target is > moving or not, dumping it is not of much use. Well, couldn't we just kindof lock the file system so that while backing up no writes get through to the real filesystem? This will possibly require a lot of memory (or another space to write to), but it might be done? Regards, Thunder -- (Use http://www.ebb.org/ungeek if you can't decode) ------BEGIN GEEK CODE BLOCK------ Version: 3.12 GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$ N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G e++++ h* r--- y- ------END GEEK CODE BLOCK------ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 20:11 ` Thunder from the hill @ 2002-07-16 21:06 ` Matthias Andree 2002-07-16 21:23 ` Andreas Dilger 2002-07-16 22:19 ` Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) stoffel 0 siblings, 2 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-16 21:06 UTC (permalink / raw) To: linux-kernel [-- Attachment #1: msg.pgp --] [-- Type: application/pgp, Size: 1226 bytes --] ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 21:06 ` Matthias Andree @ 2002-07-16 21:23 ` Andreas Dilger 2002-07-16 21:38 ` Thunder from the hill ` (2 more replies) 2002-07-16 22:19 ` Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) stoffel 1 sibling, 3 replies; 90+ messages in thread From: Andreas Dilger @ 2002-07-16 21:23 UTC (permalink / raw) To: linux-kernel On Jul 16, 2002 23:06 +0200, Matthias Andree wrote: > On Tue, 16 Jul 2002, Thunder from the hill wrote: > > On Tue, 16 Jul 2002, Matthias Andree wrote: > > > That would require three atomic steps: > > > > > > 1. mount read-only, flushing all pending updates > > > 2. take snapshot > > > 3. mount read-write > > > > > > and then backup the snapshot. A snapshots of a live file system won't > > > do, it can be as inconsistent as it desires -- if your corrupt target is > > > moving or not, dumping it is not of much use. > > > > Well, couldn't we just kindof lock the file system so that while backing > > up no writes get through to the real filesystem? This will possibly > > require a lot of memory (or another space to write to), but it might be > > done? > > But you would want to backup a consistent file system, so when entering > the freeze or snapshot mode, you must flush all pending data in such a > way that the snapshot is consistent (i. e. needs not fsck action > whatsoever). This is all done already for both LVM and EVMS snapshots. The filesystem (ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is frozen, the snapshot is created, and the filesystem becomes active again. It takes a second or less. Then dump will guarantee 100% correct backups of the snapshot filesystem. You would have to do a backup on the snapshot to guarantee 100% correctness even with tar. Most people don't care, because they don't even do backups in the first place, until they have lost a lot of their data and they learn. 
Even without snapshots, while dump isn't guaranteed to be 100% correct for rapidly changing filesystems, I have been using it for years on both 2.2 and 2.4 without any problems on my home systems. I have even restored data from those same backups... Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ ^ permalink raw reply [flat|nested] 90+ messages in thread
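The freeze/snapshot/thaw cycle Andreas describes can be sketched as a toy in-memory model. Everything below is invented for illustration (it is not the LVM or EVMS code): writes arriving during the freeze window are held back, so the snapshot is a clean point-in-time image.

```python
import copy

class ToyFilesystem:
    """Toy model of a freezable filesystem.  Writes apply immediately
    unless the filesystem is frozen; while frozen they are queued and
    the on-"disk" state stays consistent, so a snapshot taken during
    the freeze window is a clean point-in-time image."""

    def __init__(self):
        self.blocks = {}      # block number -> contents ("on disk")
        self.pending = []     # writes held back during a freeze
        self.frozen = False

    def write(self, blockno, data):
        if self.frozen:
            self.pending.append((blockno, data))  # held, not on disk yet
        else:
            self.blocks[blockno] = data

    def freeze(self):
        # a real filesystem would flush its journal here
        self.frozen = True

    def snapshot(self):
        assert self.frozen, "snapshotting a live filesystem is unsafe"
        return copy.deepcopy(self.blocks)

    def thaw(self):
        self.frozen = False
        for blockno, data in self.pending:
            self.blocks[blockno] = data
        self.pending.clear()

fs = ToyFilesystem()
fs.write(0, b"mail spool data")
fs.freeze()                    # flush + freeze: takes "a second or less"
fs.write(1, b"held")           # a write arriving during the freeze window
snap = fs.snapshot()           # consistent image; does not see the held write
fs.thaw()                      # the held write now reaches the disk
fs.write(0, b"new data")       # later activity never affects the snapshot
```

A backup run against `snap` then sees only the frozen, consistent state, which is the property dump needs.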
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 21:23 ` Andreas Dilger @ 2002-07-16 21:38 ` Thunder from the hill 2002-07-17 11:47 ` Matthias Andree 2002-07-18 14:50 ` Bill Davidsen 2 siblings, 0 replies; 90+ messages in thread From: Thunder from the hill @ 2002-07-16 21:38 UTC (permalink / raw) To: Andreas Dilger; +Cc: linux-kernel Hi, On Tue, 16 Jul 2002, Andreas Dilger wrote: > This is all done already for both LVM and EVMS snapshots. The filesystem > (ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is > frozen, the snapshot is created, and the filesystem becomes active again. > It takes a second or less. Anyway, we could do that in parallel if we did it like that: sync -> significant data is being written lock -> data writes stay cached, but aren't written snapshot unlock -> data is getting written now unmount the snapshot (clean it) write the modified snapshot to disk... Regards, Thunder -- (Use http://www.ebb.org/ungeek if you can't decode) ------BEGIN GEEK CODE BLOCK------ Version: 3.12 GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$ N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G e++++ h* r--- y- ------END GEEK CODE BLOCK------ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 21:23 ` Andreas Dilger 2002-07-16 21:38 ` Thunder from the hill @ 2002-07-17 11:47 ` Matthias Andree 2002-07-18 14:50 ` Bill Davidsen 2 siblings, 0 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-17 11:47 UTC (permalink / raw) To: linux-kernel On Tue, 16 Jul 2002, Andreas Dilger wrote: > This is all done already for both LVM and EVMS snapshots. The filesystem > (ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is > frozen, the snapshot is created, and the filesystem becomes active again. > It takes a second or less. Then dump will guarantee 100% correct backups > of the snapshot filesystem. You would have to do a backup on the snapshot > to guarantee 100% correctness even with tar. Sure. On some machines, they will go with dsmc anyhow which reads the file and rereads if it changes under dsmc's hands. -- Matthias Andree ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 21:23 ` Andreas Dilger 2002-07-16 21:38 ` Thunder from the hill 2002-07-17 11:47 ` Matthias Andree @ 2002-07-18 14:50 ` Bill Davidsen 2002-07-18 15:09 ` Rik van Riel 2 siblings, 1 reply; 90+ messages in thread From: Bill Davidsen @ 2002-07-18 14:50 UTC (permalink / raw) To: Andreas Dilger; +Cc: linux-kernel On Tue, 16 Jul 2002, Andreas Dilger wrote: > This is all done already for both LVM and EVMS snapshots. The filesystem > (ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is > frozen, the snapshot is created, and the filesystem becomes active again. > It takes a second or less. Then dump will guarantee 100% correct backups > of the snapshot filesystem. You would have to do a backup on the snapshot > to guarantee 100% correctness even with tar. I think I'm missing a part of this, the "a snapshot is created" sounds a lot like "here a miracle occurs." Where is this snapshot saved? And how do you take it in one sec regardless of f/s size? Is this one of those theoretical things which requires two mirrored copies of the f/s so you will still have RAID-1 after you break one? Or are changes journaled somewhere until the snapshot is transferred to external media? And how do you force applications to stop with their files in a valid state? -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-18 14:50 ` Bill Davidsen @ 2002-07-18 15:09 ` Rik van Riel 0 siblings, 0 replies; 90+ messages in thread From: Rik van Riel @ 2002-07-18 15:09 UTC (permalink / raw) To: Bill Davidsen; +Cc: Andreas Dilger, linux-kernel On Thu, 18 Jul 2002, Bill Davidsen wrote: > I think I'm missing a part of this, the "a snapshot is created" sounds a > lot like "here a miracle occurs." Where is this snapshot saved? And how > do you take it in one sec regardless of f/s size? LVM. Systems like LVM already provide a logical->physical block mapping on disk, so they might as well provide multiple mappings. If the live filesystem writes to a particular disk block, the snapshot will keep referencing the old blocks while the filesystem gets to work on its own data. Copy on Write snapshots for block devices... regards, Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) 2002-07-16 21:06 ` Matthias Andree 2002-07-16 21:23 ` Andreas Dilger @ 2002-07-16 22:19 ` stoffel 2002-07-16 22:33 ` Thunder from the hill ` (2 more replies) 1 sibling, 3 replies; 90+ messages in thread From: stoffel @ 2002-07-16 22:19 UTC (permalink / raw) To: Matthias Andree; +Cc: linux-kernel It's really quite simple in theory to do proper backups. But you need to have application support to make it work in most cases. It would flow like this: 1. lock application(s), flush any outstanding transactions. 2. lock filesystems, flush any outstanding transactions. 3a. lock mirrored volume, flush any outstanding transactions, break mirror. --or-- 3b. snapshot filesystem to another volume. 4. unlock volume 5. unlock filesystem 6. unlock application(s). 7. do backup against quiescent volume/filesystem. In reality, people didn't lock filesystems (remount R/O) unless they had to (ClearCase, Oracle, any DBMS, etc are the exceptions), since the time hit was too much. The chances of getting a bad backup on user home directories or mail spools wasn't worth the extra cost to be sure to get a clean backup. For the exceptions, that's why god made backup windows and such. These days, those windows are minuscule, so the seven steps outlined above are what needs to happen these days for a truly reliable backup of important data. John John Stoffel - Senior Unix Systems Administrator - Lucent Technologies stoffel@lucent.com - http://www.lucent.com - 978-399-0479 ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) 2002-07-16 22:19 ` Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) stoffel @ 2002-07-16 22:33 ` Thunder from the hill 2002-07-18 15:04 ` Bill Davidsen 2002-07-19 15:28 ` Sam Vilain 2 siblings, 0 replies; 90+ messages in thread From: Thunder from the hill @ 2002-07-16 22:33 UTC (permalink / raw) To: stoffel; +Cc: Matthias Andree, linux-kernel Hi, I do it like this: -> Reconfigure port switch to use B server -> Backup A server -> Replay B server journals on A server -> Switch to A server -> Backup B server -> Replay A server journals on B server -> Reconfigure port switch to dynamic mode Regards, Thunder -- (Use http://www.ebb.org/ungeek if you can't decode) ------BEGIN GEEK CODE BLOCK------ Version: 3.12 GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$ N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G e++++ h* r--- y- ------END GEEK CODE BLOCK------ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) 2002-07-16 22:19 ` Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) stoffel 2002-07-16 22:33 ` Thunder from the hill @ 2002-07-18 15:04 ` Bill Davidsen 2002-07-18 15:27 ` Rik van Riel 2002-07-18 15:50 ` stoffel 2 siblings, 2 replies; 90+ messages in thread From: Bill Davidsen @ 2002-07-18 15:04 UTC (permalink / raw) To: stoffel; +Cc: Linux Kernel Mailing List On Tue, 16 Jul 2002 stoffel@lucent.com wrote: > 3a. lock mirrored volume, flush any outstanding transactions, break > mirror. > --or-- > 3b. snapshot filesystem to another volume. Good summary. The problem is that 3a either requires a double mirror or leaving the f/s unmirrored, and 3b can take a very long time for a big f/s. In general much of this can be addressed by only backing up small f/s and using an application backup utility to backup the big stuff. Fortunately the most common problem apps are databases, and they include this capability. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) 2002-07-18 15:04 ` Bill Davidsen @ 2002-07-18 15:27 ` Rik van Riel 2002-07-18 15:50 ` stoffel 1 sibling, 0 replies; 90+ messages in thread From: Rik van Riel @ 2002-07-18 15:27 UTC (permalink / raw) To: Bill Davidsen; +Cc: stoffel, Linux Kernel Mailing List On Thu, 18 Jul 2002, Bill Davidsen wrote: > On Tue, 16 Jul 2002 stoffel@lucent.com wrote: > > > 3a. lock mirrored volume, flush any outstanding transactions, break > > mirror. > > --or-- > > 3b. snapshot filesystem to another volume. > > Good summary. The problem is that 3a either requires a double morror or > leaving the f/s un mirrored, and 3b can take a very long time for a big > f/s. 3b should be fairly quick since you only need to do an in-memory copy of some LVM metadata. Rik -- Bravely reimplemented by the knights who say "NIH". http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) 2002-07-18 15:04 ` Bill Davidsen 2002-07-18 15:27 ` Rik van Riel @ 2002-07-18 15:50 ` stoffel 2002-07-18 16:29 ` Bill Davidsen 1 sibling, 1 reply; 90+ messages in thread From: stoffel @ 2002-07-18 15:50 UTC (permalink / raw) To: Bill Davidsen; +Cc: stoffel, Linux Kernel Mailing List Bill> On Tue, 16 Jul 2002 stoffel@lucent.com wrote: >> 3a. lock mirrored volume, flush any outstanding transactions, break >> mirror. >> --or-- >> 3b. snapshot filesystem to another volume. Bill> Good summary. The problem is that 3a either requires a double Bill> mirror or leaving the f/s unmirrored, and 3b can take a very Bill> long time for a big f/s. Yup, 3a isn't a totally perfect solution, though triple mirrors (if you can afford them) work well. We actually do this for some servers where we can't afford the application down time of locking the DB for extended times, but we also don't have triple mirrors either. It's a tradeoff. I really prefer 3b, since it's more efficient, faster, and more robust. To snapshot a filesystem, all you need to do is: - create backing store for the snapshot, usually around 10-15% of the size of the original volume. Depends on volatility of data. - lock the app(s). - lock the filesystem and flush pending transactions. - copy the metadata describing the filesystem - insert a COW handler into the FS block write path - mount the snapshot elsewhere - unlock the FS - unlock the app Whenever the app writes a block into the FS, copy the original block to the backing store, then write the new block to storage. All the backups see the quiescent data store, so it can do a clean backup. When you're done, just unmount the snapshot and delete it, then remove the backing store. There is an overhead for doing this, but it's better than having to unmirror/remirror whole block devices to do a backup. And cheaper in terms of disk space too. 
Bill> In general much of this can be addressed by only backing up Bill> small f/s and using an application backup utility to backup the Bill> big stuff. Fortunately the most common problem apps are Bill> databases, and they include this capability. Define what a small file system is these days, since it could be 100gb for some people. *grin*. It's a matter of making the tools scale well so that the data can be secured properly. To do a proper backup requires that all layers talk to each other, and have some means of doing a RW lock and flush of pending transactions. If you have that, you can do it. If you don't, you need to either go to single user mode, re-mount RO, or pray. John ^ permalink raw reply [flat|nested] 90+ messages in thread
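The copy-on-write step in John's list can be sketched in miniature. The model below is invented for illustration (it is not LVM's implementation): the first write to any block after the snapshot copies the old contents to the backing store, and snapshot reads check the backing store first, so the backup always sees the point-in-time image however much the live filesystem changes.

```python
class CowSnapshot:
    """Toy copy-on-write snapshot over a "block device" (a plain list).
    Only blocks that change after the snapshot consume backing store,
    which is why 10-15% of the volume size is usually enough."""

    def __init__(self, device):
        self.device = device
        self.backing = {}     # block number -> original contents

    def write(self, blockno, data):
        # the COW handler in the block write path:
        if blockno not in self.backing:
            self.backing[blockno] = self.device[blockno]  # save old block once
        self.device[blockno] = data                       # then write through

    def read_snapshot(self, blockno):
        # snapshot view: backing store first, else the unchanged device block
        return self.backing.get(blockno, self.device[blockno])

    def read_live(self, blockno):
        return self.device[blockno]

disk = [b"a", b"b", b"c"]
snap = CowSnapshot(disk)
snap.write(1, b"B")           # the live fs keeps changing during the backup
backup = [snap.read_snapshot(i) for i in range(len(disk))]
```

Deleting the snapshot is just dropping `backing`; no unmirror/remirror pass over the whole device is needed.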
* Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) 2002-07-18 15:50 ` stoffel @ 2002-07-18 16:29 ` Bill Davidsen 0 siblings, 0 replies; 90+ messages in thread From: Bill Davidsen @ 2002-07-18 16:29 UTC (permalink / raw) To: stoffel; +Cc: Linux Kernel Mailing List On Thu, 18 Jul 2002 stoffel@lucent.com wrote: > I really prefer 3b, since it's more efficient, faster, and more > robust. To snapshot a filesystem, all you need to do is: > > - create backing store for the snapshot, usually around 10-15% of the > size of the original volume. Depends on volatility of data. > - lock the app(s). > - lock the filesystem and flush pending transactions. > - copy the metadata describing the filesystem > - insert a COW handler into the FS block write path > - mount the snapshot elsewhere > - unlock the FS > - unlock the app > > Whenever the app writes a block into the FS, copy the original block > to the backing store, then write the new block to storage. Okay, other than the overhead and having enough filespace for Tbkup sec (min, hr, day) of operation this is practical. In general most times you would be doing an incremental, and the time would not be much. > Bill> In general much of this can be addressed by only backing up > Bill> small f/s and using an application backup utility to backup the > Bill> big stuff. Fortunately the most common problem apps are > Bill> databases, and they include this capability. > > Define what a small file system is these days, since it could be 100gb > for some people. *grin*. It's a matter of making the tools scale > well so that the data can be secured properly. Obviously a small f/s is one you can backup without operator intervention to change media and in a reasonable time, which might be 10min..few hours depending on your taste. That's kind of my rule of thumb, you're welcome to suggest others, but if someone has to change media I can't call it small any more. 
> To do a proper backup requires that all layers talk to each other, and > have some means of doing a RW lock and flush of pending transactions. > If you have that, you can do it. If you don't, you need to either > go to single user mode, re-mount RO, or pray. With some people, praying or ignoring the problem is popular. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) 2002-07-16 22:19 ` Backups done right (was [ANNOUNCE] Ext3 vs Reiserfs benchmarks) stoffel 2002-07-16 22:33 ` Thunder from the hill 2002-07-18 15:04 ` Bill Davidsen @ 2002-07-19 15:28 ` Sam Vilain 2 siblings, 0 replies; 90+ messages in thread From: Sam Vilain @ 2002-07-19 15:28 UTC (permalink / raw) To: stoffel; +Cc: matthias.andree, linux-kernel stoffel@lucent.com wrote: > 1. lock application(s), flush any outstanding transactions. > 2. lock filesystems, flush any outstanding transactions. > 3a. lock mirrored volume, flush any outstanding transactions, break > mirror. > 3b. snapshot filesystem to another volume. Or, to avoid the penalty of locking everything and bringing it down and stuff: 1. set a flag. 2. start backing up blocks (read them raw of course, don't want to load those stressed higher level systems) 3. If something wants to write to a block, quickly back up the old contents of the block before you write the new contents. Unless of course you've already backed up that block. Of course, step 3 does place a bit more unschedulable load on the disk. Heck, when the backups have just started, you're doubling the latency of the devices. You can avoid this with a transaction journal; in fact, the cockier RDBMSes out there (eg, DMSII) don't even bother to do this and assume that your transaction journal is on a mirrored device - and hence there's no point in backing up the old data, you just want to do one sweep of the disk - and replay the journal to get current. 
(note: implicit assumption: you're dealing with applications using synchronous I/O, where it needs to be written to all mirrors before it's trusted to be stored) Ah, moot points - the Linux MD/LVM drivers are far too unsophisticated to have journal devices ;-) -- Sam Vilain, sam@vilain.net WWW: http://sam.vilain.net/ 7D74 2A09 B2D3 C30F F78E GPG: http://sam.vilain.net/sam.asc 278A A425 30A9 05B5 2F13 Law of Computability Applied to Social Sciences: If at first you don't suceed, transform your data set. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 12:53 ` Matthias Andree 2002-07-16 13:05 ` Christoph Hellwig @ 2002-07-17 18:51 ` bill davidsen 2002-07-18 9:32 ` Matthias Andree 1 sibling, 1 reply; 90+ messages in thread From: bill davidsen @ 2002-07-17 18:51 UTC (permalink / raw) To: linux-kernel In article <20020716125301.GI4576@merlin.emma.line.org>, Matthias Andree <matthias.andree@stud.uni-dortmund.de> wrote: | dsmc fstat()s the file it is currently reading regularly and retries the | dump as the changes, and gives up if it is updated too often. Not sure | about the server side, and certainly not a useful option for sequential | devices that you directly write on. Looks like a cache for the biggest | file is necessary. Which doesn't address the issue of data in files A, B and C, with indices in X and Y. This only works if you flush and freeze all the files at one time, making a perfect backup of one at a time results in corruption if the database is busy. My favorite example is usenet news on INN, a bunch of circular spools, a linear history with two index files, 30-40k overview files, and all of it changing with perhaps 3.5MB/sec data and 20-50/sec index writes. Far better done with an application backup! The point is, backups are hard, for many systems dump is optimal because it's fast. After that I like cpio (-Hcrc) but that's personal preference. All have fail cases on volatile data. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-17 18:51 ` [ANNOUNCE] Ext3 vs Reiserfs benchmarks bill davidsen @ 2002-07-18 9:32 ` Matthias Andree 0 siblings, 0 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-18 9:32 UTC (permalink / raw) To: linux-kernel On Wed, 17 Jul 2002, bill davidsen wrote: > In article <20020716125301.GI4576@merlin.emma.line.org>, > Matthias Andree <matthias.andree@stud.uni-dortmund.de> wrote: > > | dsmc fstat()s the file it is currently reading regularly and retries the > | dump as the changes, and gives up if it is updated too often. Not sure > | about the server side, and certainly not a useful option for sequential > | devices that you directly write on. Looks like a cache for the biggest > | file is necessary. > > Which doesn't address the issue of data in files A, B and C, with > indices in X and Y. This only works if you flush and freeze all the > files at one time, making a perfect backup of one at a time results in > corruption if the database is busy. Right, but this would have to be taken up with Tivoli "do snapshot as dsmc starts, backup from snapshot and discard snapshot on exit" > My favorite example is usenet news on INN, a bunch of circular spools, a > linear history with two index files, 30-40k overview files, and all of > it changing with perhaps 3.5MB/sec data and 20-50/sec index writes. Far > better done with an application backup! In that case, when you are restoring from backups, you can also regenerate index files (at least with tradspool, I never looked at the "News in Dosen" aggregated spools like CNFS or whatever). It's really hard if you have .dir/.pag style dbm data bases that don't mirror some other single-file format. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 12:30 ` Alan Cox 2002-07-15 12:02 ` Sam Vilain @ 2002-07-15 12:09 ` Matti Aarnio 1 sibling, 0 replies; 90+ messages in thread From: Matti Aarnio @ 2002-07-15 12:09 UTC (permalink / raw) To: Sam Vilain; +Cc: Dax Kelson, linux-kernel On Mon, Jul 15, 2002 at 01:30:51PM +0100, Alan Cox wrote: > On Mon, 2002-07-15 at 09:26, Sam Vilain wrote: > > You are testing for a mail server - how many mailboxes are in your spool > > directory for the tests? Try it with about five to ten thousand > > mailboxes and see how your results vary. > > If your mail server can't get hierarchical mail spools right, get one > that can. Long ago (10-15 internet-years ago..) I followed testing of FFS-family of filesystems in Squid cache. We noticed on Solaris machines using UFS that when the directory data size grew above the number of blocks directly addressable by the direct-index pointers in the i-node, system speed plummeted. (Or perhaps it was something a bit smaller, like 32 kB) Consider: 4 kB block size, 12 direct indexes: 48 kB directory size. Spend 16 bytes for each file name + auxiliary data: 3000 files/subdirs Optimal would be to store the files inside only the first block, e.g. the directory shall not grow over 4k (or 1k, or ..) Name subdirs as: 00 thru 7F (128+2, 12 bytes ?) Possibly do that in 2 layers: 128^2 = 16384 subdirs, each with 50 long named users (even more files?): 820 000 users. Tune the subdir hashing function to suit your application, and you should be happy. Putting all your eggs in one basket (files in one directory) is not a smart thing. > Alan /Matti Aarnio ^ permalink raw reply [flat|nested] 90+ messages in thread
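Matti's two-level 00..7F layout can be sketched like this. The hash choice and the helper name are illustrative assumptions, not taken from any particular mail server; the point is only that every mailbox lands in a small, bounded directory.

```python
import hashlib
import posixpath

def spool_path(spool_root, mailbox):
    """Map a mailbox name to a two-level hashed spool directory, with
    128 (00..7f) subdirectories per level as in Matti's scheme, so
    128^2 = 16384 leaf directories share the load and no directory
    outgrows the inode's direct blocks.  The hash function here (md5
    of the name) is an arbitrary choice; tune it to your application."""
    digest = hashlib.md5(mailbox.encode()).digest()
    d1 = digest[0] & 0x7F          # first level: 00..7f
    d2 = digest[1] & 0x7F          # second level: 00..7f
    return posixpath.join(spool_root, f"{d1:02x}", f"{d2:02x}", mailbox)

p = spool_path("/var/spool/mail", "dax")   # e.g. /var/spool/mail/xx/yy/dax
```

With ~50 mailboxes per leaf directory this layout handles the roughly 820 000 users Matti estimates.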
[parent not found: <20020712162306$aa7d@traf.lcs.mit.edu>]
[parent not found: <mit.lcs.mail.linux-kernel/20020712162306$aa7d@traf.lcs.mit.edu>]
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks [not found] ` <mit.lcs.mail.linux-kernel/20020712162306$aa7d@traf.lcs.mit.edu> @ 2002-07-15 15:22 ` Patrick J. LoPresti 2002-07-15 17:31 ` Chris Mason ` (3 more replies) 0 siblings, 4 replies; 90+ messages in thread From: Patrick J. LoPresti @ 2002-07-15 15:22 UTC (permalink / raw) To: linux-kernel Consider this argument: Given: On ext3, fsync() of any file on a partition commits all outstanding transactions on that partition to the log. Given: data=ordered forces pending data writes for a file to happen before related transactions are committed to the log. Therefore: With data=ordered, fsync() of any file on a partition syncs the outstanding writes of EVERY file on that partition. Is this argument correct? If so, it suggests that data=ordered is actually the *worst* possible journalling mode for a mail spool. One other thing. I think this statement is misleading: IF your server is stable and not prone to crashing, and/or you have the write cache on your hard drives battery backed, you should strongly consider using the writeback journaling mode of Ext3 versus ordered. This makes it sound like data=writeback is somehow unsafe when machines crash. I do not think this is true. If your application (e.g., Postfix) is written correctly (which it is), so it calls fsync() when it is supposed to, then data=writeback is *exactly* as safe as any other journalling mode. "Battery backed caches" and the like have nothing to do with it. And if your application is written incorrectly, then other journalling modes will reduce but not eliminate the chances for things to break catastrophically on a crash. So if the partition is dedicated to correct applications, like a mail spool is, then data=writeback is perfectly safe. If it is faster, too, then it really is a no-brainer. - Pat ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 15:22 ` Patrick J. LoPresti @ 2002-07-15 17:31 ` Chris Mason 2002-07-15 18:33 ` Matthias Andree ` (2 subsequent siblings) 3 siblings, 0 replies; 90+ messages in thread From: Chris Mason @ 2002-07-15 17:31 UTC (permalink / raw) To: Patrick J. LoPresti; +Cc: linux-kernel On Mon, 2002-07-15 at 11:22, Patrick J. LoPresti wrote: > Consider this argument: > > Given: On ext3, fsync() of any file on a partition commits all > outstanding transactions on that partition to the log. > > Given: data=ordered forces pending data writes for a file to happen > before related transactions are committed to the log. > > Therefore: With data=ordered, fsync() of any file on a partition > syncs the outstanding writes of EVERY file on that > partition. > > Is this argument correct? If so, it suggests that data=ordered is > actually the *worst* possible journalling mode for a mail spool. > Yes. In practice this doesn't hurt as much as it could, because ext3 does a good job of letting more writers come in before forcing the commit. What hurts you is when a forced commit comes in the middle of creating the file. A data write that could have been contiguous gets broken into two or more writes instead. > One other thing. I think this statement is misleading: > > IF your server is stable and not prone to crashing, and/or you > have the write cache on your hard drives battery backed, you > should strongly consider using the writeback journaling mode of > Ext3 versus ordered. > > This makes it sound like data=writeback is somehow unsafe when > machines crash. I do not think this is true. If your application > (e.g., Postfix) is written correctly (which it is), so it calls > fsync() when it is supposed to, then data=writeback is *exactly* as > safe as any other journalling mode. Almost. data=writeback makes it possible for the old contents of a block to end up in a newly grown file. 
There are a few ways this can screw you up: 1) that newly grown file is someone's inbox, and the old contents of the new block include someone else's private message. 2) That newly grown file is a control file for the application, and the application expects it to contain valid data within (think sendmail). > "Battery backed caches" and the > like have nothing to do with it. Nope, battery backed caches don't make data=writeback more or less safe (with respect to the data anyway). They do make data=ordered and data=journal more safe. > And if your application is written > incorrectly, then other journalling modes will reduce but not > eliminate the chances for things to break catastrophically on a crash. > > So if the partition is dedicated to correct applications, like a mail > spool is, then data=writeback is perfectly safe. If it is faster, > too, then it really is a no-brainer. For mail servers, data=journal is your friend. ext3 sometimes needs a bigger log for it (reiserfs data=journal patches don't), but the performance increase can be significant. -chris ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 15:22 ` Patrick J. LoPresti 2002-07-15 17:31 ` Chris Mason @ 2002-07-15 18:33 ` Matthias Andree [not found] ` <20020715173337$acad@traf.lcs.mit.edu> 2002-07-16 7:07 ` Dax Kelson 3 siblings, 0 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-15 18:33 UTC (permalink / raw) To: linux-kernel On Mon, 15 Jul 2002, Patrick J. LoPresti wrote: > One other thing. I think this statement is misleading: > > IF your server is stable and not prone to crashing, and/or you > have the write cache on your hard drives battery backed, you > should strongly consider using the writeback journaling mode of > Ext3 versus ordered. > > This makes it sound like data=writeback is somehow unsafe when > machines crash. I do not think this is true. If your application Well, if your fsync() completes... > (e.g., Postfix) is written correctly (which it is), so it calls > fsync() when it is supposed to, then data=writeback is *exactly* as > safe as any other journalling mode. "Battery backed caches" and the > like have nothing to do with it. And if your application is written > incorrectly, then other journalling modes will reduce but not > eliminate the chances for things to break catastrophically on a crash. ...then you're right. If the machine crashes amidst the fsync() operation, but has scheduled meta data before file contents, then journal recovery can present you a file that contains bogus data which will confuse some applications. I believe Postfix will recover from this condition either way, see its file is hosed and ignore or discard it (depending on what it is), but software that blindly relies on a special format without checking will barf. All of this assumes two things: 1. the application actually calls fsync() 2. the application can detect if fsync() succeeded before the crash (like fsync -> fchmod -> fsync, structured file contents, whatever). 
> So if the partition is dedicated to correct applications, like a mail > spool is, then data=writeback is perfectly safe. If it is faster, > too, then it really is a no-brainer. These ordering promises also apply to applications that do not call fsync() or that cannot detect hosed files. Been there, seen that, with CVS on unpatched ReiserFS as of Linux-2.4.19-presomething: suddenly one ,v file contained NUL blocks. The server barfed, the (remote!) client segfaulted... yes, it's almost as bad as it can get. Not catastrophic, tape backup available, but it gave some time to restore the file and investigate this issue nonetheless. It boiled down to "nobody's fault, but missing feature". With data=ordered or data=journal, I would have either had my old ,v file around or a proper new one. I'm now using Chris Mason's data-logging patches to try and see how things work out, I had one crash with an old version, then updated to the -11 version and have yet to see something break again. I'd certainly appreciate if these patches were merged early in 2.4.20-pre so they get some testing and can be in 2.4.20 and Linux had two file systems with data=ordered to choose from. Disclaimer: I don't know anything except the bare existence, about XFS or JFS. Feel free to add comments. -- Matthias Andree ^ permalink raw reply [flat|nested] 90+ messages in thread
[parent not found: <20020715173337$acad@traf.lcs.mit.edu>]
[parent not found: <mit.lcs.mail.linux-kernel/20020715173337$acad@traf.lcs.mit.edu>]
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks [not found] ` <mit.lcs.mail.linux-kernel/20020715173337$acad@traf.lcs.mit.edu> @ 2002-07-15 19:13 ` Patrick J. LoPresti 2002-07-15 20:55 ` Matthias Andree 2002-07-15 21:14 ` Chris Mason 0 siblings, 2 replies; 90+ messages in thread From: Patrick J. LoPresti @ 2002-07-15 19:13 UTC (permalink / raw) To: linux-kernel Chris Mason <mason@suse.com> writes: > > One other thing. I think this statement is misleading: > > > > IF your server is stable and not prone to crashing, and/or you > > have the write cache on your hard drives battery backed, you > > should strongly consider using the writeback journaling mode of > > Ext3 versus ordered. > > > > This makes it sound like data=writeback is somehow unsafe when > > machines crash. I do not think this is true. If your application > > (e.g., Postfix) is written correctly (which it is), so it calls > > fsync() when it is supposed to, then data=writeback is *exactly* as > > safe as any other journalling mode. > > Almost. data=writeback makes it possible for the old contents of a > block to end up in a newly grown file. Only if the application is already broken. > There are a few ways this can screw you up: > > 1) that newly grown file is someone's inbox, and the old contents of the > new block include someone else's private message. > > 2) That newly grown file is a control file for the application, and the > application expects it to contain valid data within (think sendmail). In a correctly-written application, neither of these things can happen. (See my earlier message today on fsync() and MTAs.) To get a file onto disk reliably, the application must 1) flush the data, and then 2) flush a "validity" indicator. This could be a sequence like: create temp file flush data to temp file rename temp file flush rename operation In this sequence, the file's existence under a particular name is the indicator of its validity. 
If you skip either of these flush operations, you are not behaving reliably. Skipping the first flush means the validity indicator might hit the disk before the data; so after a crash, you might see invalid data in an allegedly valid file. Skipping the second flush means you do not know that the validity indicator has been set, so you cannot report success to whoever is waiting for this "reliable write" to happen. It is possible to make an application which relies on data=ordered semantics; for example, skipping the "flush data to temp file" step above. But such an application would be broken for every version of Unix *except* Linux in data=ordered mode. I would call that an incorrect application. > Nope, battery backed caches don't make data=writeback more or less safe > (with respect to the data anyway). They do make data=ordered and > data=journal more safe. A theorist would say that "more safe" is a sloppy concept. Either an operation is safe or it is not. As I said in my last message, data=ordered (and data=journal) can reduce the risk for poorly written apps. But they cannot eliminate that risk, and for a correctly written app, data=writeback is 100% as safe. - Pat ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 19:13 ` Patrick J. LoPresti @ 2002-07-15 20:55 ` Matthias Andree 2002-07-15 21:23 ` Patrick J. LoPresti 2002-07-15 22:55 ` Alan Cox 2002-07-15 21:14 ` Chris Mason 1 sibling, 2 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-15 20:55 UTC (permalink / raw) To: linux-kernel On Mon, 15 Jul 2002, Patrick J. LoPresti wrote: > In a correctly-written application, neither of these things can > happen. (See my earlier message today on fsync() and MTAs.) To get a > file onto disk reliably, the application must 1) flush the data, and > then 2) flush a "validity" indicator. This could be a sequence like: > > create temp file > flush data to temp file > rename temp file > flush rename operation > > In this sequence, the file's existence under a particular name is the > indicator of its validity. Assume that most applications are broken then. I assume that most will just call close() or fclose() and exit() right away. Does fclose() imply fsync()? Some applications will not even check the [f]close() return value... > It is possible to make an application which relies on data=ordered > semantics; for example, skipping the "flush data to temp file" step > above. But such an application would be broken for every version of > Unix *except* Linux in data=ordered mode. I would call that an > incorrect application. Or very specific, at least. > > Nope, battery backed caches don't make data=writeback more or less safe > > (with respect to the data anyway). They do make data=ordered and > > data=journal more safe. > > A theorist would say that "more safe" is a sloppy concept. Either an > operation is safe or it is not. As I said in my last message, > data=ordered (and data=journal) can reduce the risk for poorly written > apps. But they cannot eliminate that risk, and for a correctly > written app, data=writeback is 100% as safe. IF that application uses a marker to mark completion. 
If it does not, data=ordered will be the safe bet, regardless of fsync() or not. The machine can crash BEFORE the fsync() is called. -- Matthias Andree ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 20:55 ` Matthias Andree @ 2002-07-15 21:23 ` Patrick J. LoPresti 2002-07-15 21:38 ` Thunder from the hill 2002-07-15 21:59 ` Ketil Froyn 2002-07-15 22:55 ` Alan Cox 1 sibling, 2 replies; 90+ messages in thread From: Patrick J. LoPresti @ 2002-07-15 21:23 UTC (permalink / raw) To: linux-kernel Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes: > I assume that most will just call close() or fclose() and exit() right > away. Does fclose() imply fsync()? Not according to my close(2) man page: A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.) Note that this means writing a truly reliable shell or Perl script is tricky. I suppose you can "use POSIX qw(fsync);" in Perl. But what do you do for a shell script? /bin/sync :-) ? > Some applications will not even check the [f]close() return value... Such applications are broken, of course. > > It is possible to make an application which relies on data=ordered > > semantics; for example, skipping the "flush data to temp file" step > > above. But such an application would be broken for every version of > > Unix *except* Linux in data=ordered mode. I would call that an > > incorrect application. > > Or very specific, at least. Hm. Does BSD with soft updates guarantee anything about write ordering on fsync()? In particular, does it promise to commit the data before the metadata? > > A theorist would say that "more safe" is a sloppy concept. Either an > > operation is safe or it is not. As I said in my last message, > > data=ordered (and data=journal) can reduce the risk for poorly written > > apps. 
But they cannot eliminate that risk, and for a correctly > > written app, data=writeback is 100% as safe. > > IF that application uses a marker to mark completion. If it does not, > data=ordered will be the safe bet, regardless of fsync() or not. The > machine can crash BEFORE the fsync() is called. Without marking completion, there is no safe bet. Without calling fsync(), you *never* know when the data will hit the disk. It is very hard to build a reliable system that way... For an MTA, for example, you can never safely inform the remote mailer that you have accepted the message. But this problem goes beyond MTAs; very few applications live in a vacuum. Reliable systems are tricky. I guess this is why Oracle and Sybase make all that money. - Pat ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 21:23 ` Patrick J. LoPresti @ 2002-07-15 21:38 ` Thunder from the hill 2002-07-16 12:31 ` Matthias Andree 2002-07-15 21:59 ` Ketil Froyn 1 sibling, 1 reply; 90+ messages in thread From: Thunder from the hill @ 2002-07-15 21:38 UTC (permalink / raw) To: Patrick J. LoPresti; +Cc: linux-kernel Hi, On 15 Jul 2002, Patrick J. LoPresti wrote: > Note that this means writing a truly reliable shell or Perl script is > tricky. I suppose you can "use POSIX qw(fsync);" in Perl. But what do > you do for a shell script? /bin/sync :-) ? Write a binary (/usr/bin/fsync) which opens a fd, fsync it, close it, be done with it. Regards, Thunder -- (Use http://www.ebb.org/ungeek if you can't decode) ------BEGIN GEEK CODE BLOCK------ Version: 3.12 GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$ N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G e++++ h* r--- y- ------END GEEK CODE BLOCK------ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 21:38 ` Thunder from the hill @ 2002-07-16 12:31 ` Matthias Andree 2002-07-16 15:53 ` Thunder from the hill 0 siblings, 1 reply; 90+ messages in thread From: Matthias Andree @ 2002-07-16 12:31 UTC (permalink / raw) To: linux-kernel On Mon, 15 Jul 2002, Thunder from the hill wrote: > Hi, > > On 15 Jul 2002, Patrick J. LoPresti wrote: > > Note that this means writing a truly reliable shell or Perl script is > > tricky. I suppose you can "use POSIX qw(fsync);" in Perl. But what do > > you do for a shell script? /bin/sync :-) ? > > Write a binary (/usr/bin/fsync) which opens a fd, fsync it, close it, be > done with it. Or steal one from FreeBSD (written by Paul Saab), fix the err() function and be done with it. .../usr.bin/fsync/fsync.{1,c} Interesting side note -- mind the O_RDONLY: for (i = 1; i < argc; ++i) { if ((fd = open(argv[i], O_RDONLY)) < 0) err(1, "open %s", argv[i]); if (fsync(fd) != 0) err(1, "fsync %s", argv[i]); close(fd); } ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 12:31 ` Matthias Andree @ 2002-07-16 15:53 ` Thunder from the hill 2002-07-16 19:26 ` Matthias Andree 0 siblings, 1 reply; 90+ messages in thread From: Thunder from the hill @ 2002-07-16 15:53 UTC (permalink / raw) To: Matthias Andree; +Cc: linux-kernel Hi, On Tue, 16 Jul 2002, Matthias Andree wrote: > > Write a binary (/usr/bin/fsync) which opens a fd, fsync it, close it, be > > done with it. > > Or steal one from FreeBSD (written by Paul Saab), fix the err() function > and be done with it. > > .../usr.bin/fsync/fsync.{1,c} > > Interesting side note -- mind the O_RDONLY: > > for (i = 1; i < argc; ++i) { > if ((fd = open(argv[i], O_RDONLY)) < 0) > err(1, "open %s", argv[i]); > > if (fsync(fd) != 0) > err(1, "fsync %s", argv[1]); > close(fd); > } Pretty much the thing I had in mind, except that the close return code is disregarded here... Regards, Thunder -- (Use http://www.ebb.org/ungeek if you can't decode) ------BEGIN GEEK CODE BLOCK------ Version: 3.12 GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$ N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G e++++ h* r--- y- ------END GEEK CODE BLOCK------ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 15:53 ` Thunder from the hill @ 2002-07-16 19:26 ` Matthias Andree 2002-07-16 19:38 ` Thunder from the hill 0 siblings, 1 reply; 90+ messages in thread From: Matthias Andree @ 2002-07-16 19:26 UTC (permalink / raw) To: linux-kernel On Tue, 16 Jul 2002, Thunder from the hill wrote: > > if (fsync(fd) != 0) > > err(1, "fsync %s", argv[1]); > > close(fd); > > } > > Pretty much the thing I had in mind, except that the close return code is > disregarded here... Indeed, but OTOH, what error is close to report when the file is opened read-only? ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 19:26 ` Matthias Andree @ 2002-07-16 19:38 ` Thunder from the hill 0 siblings, 0 replies; 90+ messages in thread From: Thunder from the hill @ 2002-07-16 19:38 UTC (permalink / raw) To: Matthias Andree; +Cc: linux-kernel Hi, On Tue, 16 Jul 2002, Matthias Andree wrote: > Indeed, but OTOH, what error is close to report when the file is opened > read-only? Well, you can still get EIO, EINTR, EBADF. Whatever you say, disregarding the close return code is never any good. Regards, Thunder -- (Use http://www.ebb.org/ungeek if you can't decode) ------BEGIN GEEK CODE BLOCK------ Version: 3.12 GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$ N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G e++++ h* r--- y- ------END GEEK CODE BLOCK------ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 21:23 ` Patrick J. LoPresti 2002-07-15 21:38 ` Thunder from the hill @ 2002-07-15 21:59 ` Ketil Froyn 2002-07-15 23:08 ` Matti Aarnio 1 sibling, 1 reply; 90+ messages in thread From: Ketil Froyn @ 2002-07-15 21:59 UTC (permalink / raw) To: Patrick J. LoPresti; +Cc: linux-kernel On 15 Jul 2002, Patrick J. LoPresti wrote: > Without calling fsync(), you *never* know when the data will hit the > disk. Doesn't bdflush ensure that data is written to disk within 30 seconds or some tunable number of seconds? Ketil ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 21:59 ` Ketil Froyn @ 2002-07-15 23:08 ` Matti Aarnio 2002-07-16 12:33 ` Matthias Andree 0 siblings, 1 reply; 90+ messages in thread From: Matti Aarnio @ 2002-07-15 23:08 UTC (permalink / raw) To: Ketil Froyn; +Cc: linux-kernel On Mon, Jul 15, 2002 at 11:59:48PM +0200, Ketil Froyn wrote: > On 15 Jul 2002, Patrick J. LoPresti wrote: > > Without calling fsync(), you *never* know when the data will hit the > > disk. > > Doesn't bdflush ensure that data is written to disk within 30 seconds or > some tunable number of seconds? It TRIES TO, it does not guarantee anything. The MTA systems are an example of software suites which have transaction requirements. The goal is usually stated as: must not fail to deliver. Practical implementations without full-blown, all-encompassing transactions will usually mean that the message "will be delivered at least once", i.e. double delivery can happen. One view of MTA behaviour is moving the message from one substate to another during its processing. These days, usually, the transaction database for MTAs is the UNIX filesystem. For ZMailer I have considered (although not actually done - yet) using SleepyCat DB files for the transaction subsystem. There are great challenges in failure compartmentalisation and integrity when using that kind of integrated database mechanism. Getting SEGV is potentially a _very_ bad thing! > Ketil /Matti Aarnio ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 23:08 ` Matti Aarnio @ 2002-07-16 12:33 ` Matthias Andree 0 siblings, 0 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-16 12:33 UTC (permalink / raw) To: linux-kernel On Tue, 16 Jul 2002, Matti Aarnio wrote: > These days, usually, the transaction database for MTAs is UNIX > filesystem. For ZMailer I have considered (although not actually > done - yet) using SleepyCat DB files for the transaction subsystem. > There are great challenges in failure compartementalisation, and > integrity, when using that kind of integrated database mechanisms. > Getting SEGV is potentially _very_ bad thing! Read: lethal to the spool. Has SleepyCat DB learned to recover from ENOSPC in the meanwhile? I had a db1.85 file corrupt after ENOSPC once... -- Matthias Andree ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 20:55 ` Matthias Andree 2002-07-15 21:23 ` Patrick J. LoPresti @ 2002-07-15 22:55 ` Alan Cox 2002-07-15 21:58 ` Matthias Andree 1 sibling, 1 reply; 90+ messages in thread From: Alan Cox @ 2002-07-15 22:55 UTC (permalink / raw) To: Matthias Andree; +Cc: linux-kernel On Mon, 2002-07-15 at 21:55, Matthias Andree wrote: > I assume that most will just call close() or fclose() and exit() right > away. Does fclose() imply fsync()? It doesn't. > Some applications will not even check the [f]close() return value... We are only interested in reliable code. Anything else is already fatally broken. -- quote -- Not checking the return value of close is a common but nevertheless serious programming error. File system implementations which use techniques as ``write-behind'' to increase performance may lead to write(2) succeeding, although the data has not been written yet. The error status may be reported at a later write operation, but it is guaranteed to be reported on closing the file. Not checking the return value when closing the file may lead to silent loss of data. This can especially be observed with NFS and disk quotas. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 22:55 ` Alan Cox @ 2002-07-15 21:58 ` Matthias Andree 0 siblings, 0 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-15 21:58 UTC (permalink / raw) To: linux-kernel On Mon, 15 Jul 2002, Alan Cox wrote: > We are only interested in reliable code. Anything else is already > fatally broken. > > -- quote -- > Not checking the return value of close is a common but > nevertheless serious programming error. File system As in 6. on http://www.apocalypse.org/pub/u/paul/docs/commandments.html (The Ten Commandments for C Programmers, by Henry Spencer). ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 19:13 ` Patrick J. LoPresti 2002-07-15 20:55 ` Matthias Andree @ 2002-07-15 21:14 ` Chris Mason 2002-07-15 21:31 ` Patrick J. LoPresti 2002-07-16 12:35 ` Matthias Andree 1 sibling, 2 replies; 90+ messages in thread From: Chris Mason @ 2002-07-15 21:14 UTC (permalink / raw) To: Patrick J. LoPresti; +Cc: linux-kernel On Mon, 2002-07-15 at 15:13, Patrick J. LoPresti wrote: > > 1) that newly grown file is someone's inbox, and the old contents of the > > new block include someone else's private message. > > > > 2) That newly grown file is a control file for the application, and the > > application expects it to contain valid data within (think sendmail). > > In a correctly-written application, neither of these things can > happen. (See my earlier message today on fsync() and MTAs.) To get a > file onto disk reliably, the application must 1) flush the data, and > then 2) flush a "validity" indicator. This could be a sequence like: > > create temp file > flush data to temp file > rename temp file > flush rename operation Yes, most mtas do this for queue files, I'm not sure how many do it for the actual spool file. mail server authors are more than welcome to recommend the best safety/performance combo for their product, and to ask the FS guys which combinations are safe. -chris ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 21:14 ` Chris Mason @ 2002-07-15 21:31 ` Patrick J. LoPresti 2002-07-15 22:12 ` Richard A Nelson 2002-07-16 1:02 ` Lawrence Greenfield 2002-07-16 12:35 ` Matthias Andree 1 sibling, 2 replies; 90+ messages in thread From: Patrick J. LoPresti @ 2002-07-15 21:31 UTC (permalink / raw) To: Chris Mason; +Cc: linux-kernel Chris Mason <mason@suse.com> writes: > Yes, most mtas do this for queue files, I'm not sure how many do it for > the actual spool file. Maybe the control files are small enough to fit in one disk block, making the operations atomic in practice. Or something. > mail server authors are more than welcome to recommend the best > safety/performance combo for their product, and to ask the FS guys > which combinations are safe. Yeah, but it's a shame if those combinations require performance hits like "synchronous directory updates" or, worse, "fsync() == sync()". I really wish MTA authors would just support Linux's "fsync the directory" approach. It is simple, reliable, and fast. Yes, it does require Linux-specific support in the application, but that's what application authors should expect when there is a gap in the standards. - Pat ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 21:31 ` Patrick J. LoPresti @ 2002-07-15 22:12 ` Richard A Nelson 2002-07-16 1:02 ` Lawrence Greenfield 1 sibling, 0 replies; 90+ messages in thread From: Richard A Nelson @ 2002-07-15 22:12 UTC (permalink / raw) To: Patrick J. LoPresti; +Cc: Chris Mason, linux-kernel On 15 Jul 2002, Patrick J. LoPresti wrote: > I really wish MTA authors would just support Linux's "fsync the > directory" approach. It is simple, reliable, and fast. Yes, it does > require Linux-specific support in the application, but that's what > application authors should expect when there is a gap in the > standards. This is exactly what sendmail did in its 8.12.0 release (2001/09/08) -- Rick Nelson "...very few phenomena can pull someone out of Deep Hack Mode, with two noted exceptions: being struck by lightning, or worse, your *computer* being struck by lightning." (By Matt Welsh) ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 21:31 ` Patrick J. LoPresti 2002-07-15 22:12 ` Richard A Nelson @ 2002-07-16 1:02 ` Lawrence Greenfield [not found] ` <mit.lcs.mail.linux-kernel/200207160102.g6G12BiH022986@lin2.andrew.cmu.edu> 1 sibling, 1 reply; 90+ messages in thread From: Lawrence Greenfield @ 2002-07-16 1:02 UTC (permalink / raw) To: Patrick J. LoPresti; +Cc: linux-kernel From: "Patrick J. LoPresti" <patl@curl.com> Date: 15 Jul 2002 17:31:07 -0400 [...] I really wish MTA authors would just support Linux's "fsync the directory" approach. It is simple, reliable, and fast. Yes, it does require Linux-specific support in the application, but that's what application authors should expect when there is a gap in the standards. Actually, it's not all that simple (you have to find the enclosing directories of any files you're modifying, which might require string manipulation) or necessarily all that fast (you're doubling the number of system calls and now the application is imposing an ordering on the filesystem that didn't exist before). It's only necessary for ext2. Modern Linux filesystems (such as ext3 or reiserfs) don't require it. Finally: ext2 isn't safe even if you do call fsync() on the directory! Let's consider: some filesystem operation modifies two different blocks. This operation is safe if block A is written before block B. . FFS guarantees this by performing the writes synchronously: block A is written when it is changed, followed by block B when it is changed. . Journalling filesystems (ext3, reiserfs) guarantee this by journalling the operation and forcing that journal entry to disk before either A or B can be modified. . What does ext2 do (in the default mode)? It modifies A, it modifies B, and then leaves it up to the buffer cache to write them back---and the buffer cache might decide to write B before A. We're finally getting to some decent shared semantics on filesystems. 
Reiserfs, ext3, FFS w/ softupdates, vxfs, etc., all work with just fsync()ing the file (though an fsync() is required after a link() or rename() operation). Let's encourage all filesystems to provide these semantics and make it slightly easier on us stupid application programmers. Larry ^ permalink raw reply [flat|nested] 90+ messages in thread
[parent not found: <mit.lcs.mail.linux-kernel/200207160102.g6G12BiH022986@lin2.andrew.cmu.edu>]
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks [not found] ` <mit.lcs.mail.linux-kernel/200207160102.g6G12BiH022986@lin2.andrew.cmu.edu> @ 2002-07-16 1:43 ` Patrick J. LoPresti 2002-07-16 1:56 ` Thunder from the hill ` (2 more replies) 0 siblings, 3 replies; 90+ messages in thread From: Patrick J. LoPresti @ 2002-07-16 1:43 UTC (permalink / raw) To: linux-kernel Lawrence Greenfield <leg+@andrew.cmu.edu> writes: > Actually, it's not all that simple (you have to find the enclosing > directories of any files you're modifying, which might require string > manipulation) No, you have to find the directories you are modifying. And the application knows darn well which directories it is modifying. Don't speculate. Show some sample code, and let's see how hard it would be to use the "Linux way". I am betting on "not hard at all". > or necessarily all that fast (you're doubling the number of system > calls and now the application is imposing an ordering on the > filesystem that didn't exist before). No, you are not doubling the number of system calls. As I have tried to point out repeatedly, doing this stuff reliably and portably already requires a sequence like this: write data flush data write "validity" indicator (e.g., rename() or fchmod()) flush validity indicator On Linux, flushing a rename() means calling fsync() on the directory instead of the file. That's it. Doing that instead of fsync'ing the file adds at most two system calls (to open and close the directory), and those can be amortized over many operations on that directory (think "mail spool"). So the system call overhead is non-existent. As for "imposing an ordering on the filesystem that didn't exist before", that is complete nonsense. This is imposing *precisely* the ordering required for reliable operation; no more, no less. 
Relying on mount options, "chattr +S", or journaling artifacts for your ordering is the inefficient approach; since they impose extra ordering, they can never be faster and will usually be slower. > It's only necessary for ext2. Modern Linux filesystems (such as ext3 > or reiserfs) don't require it. Only because they take the performance hit of flushing the whole log to disk on every fsync(). Combine that with "data=ordered" and see what happens to your performance. (Perhaps "data=ordered" should be called "fsync=sync".) I would rather get back the performance and convince application authors to understand what they are doing. > Finally: ext2 isn't safe even if you do call fsync() on the directory! Wrong. write temp file fsync() temp file rename() temp file to actual file fsync() directory No matter where this crashes, it is perfectly safe on ext2. (If not, ext2 is badly broken.) The worst that can happen after a crash is that the file might exist with both the old name and the new name. But an application can detect this case on startup and clean it up. - Pat ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 1:43 ` Patrick J. LoPresti @ 2002-07-16 1:56 ` Thunder from the hill 2002-07-16 12:47 ` Matthias Andree 2002-07-16 21:09 ` James Antill 2 siblings, 0 replies; 90+ messages in thread From: Thunder from the hill @ 2002-07-16 1:56 UTC (permalink / raw) To: Patrick J. LoPresti; +Cc: linux-kernel Hi, On 15 Jul 2002, Patrick J. LoPresti wrote: > Doing that instead of fsync'ing the > file adds at most two system calls (to open and close the directory), Keep the directory fd open all the time, and flush it when needed. This gets rid of the repeated dd = open(dir, ...); fsync(dd); close(dd); sequence: you just have one dd = open(dir, ...); up front, then fsync(dd); fsync(dd); ... and finally a single close(dd); Not too much of an overhead, is it? Regards, Thunder -- (Use http://www.ebb.org/ungeek if you can't decode) ------BEGIN GEEK CODE BLOCK------ Version: 3.12 GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$ N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G e++++ h* r--- y- ------END GEEK CODE BLOCK------ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 1:43 ` Patrick J. LoPresti 2002-07-16 1:56 ` Thunder from the hill @ 2002-07-16 12:47 ` Matthias Andree 2002-07-16 21:09 ` James Antill 2 siblings, 0 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-16 12:47 UTC (permalink / raw) To: linux-kernel On Mon, 15 Jul 2002, Patrick J. LoPresti wrote: > On Linux, flushing a rename() means calling fsync() on the directory > instead of the file. That's it. Doing that instead of fsync'ing the > file adds at most two system calls (to open and close the directory), > and those can be amortized over many operations on that directory > (think "mail spool"). So the system call overhead is non-existent. Indeed, but I can also leave the file descriptor open on any file system on any system except SOME of Linux'. (Ok, this precludes systems that don't offer POSIX synchronous completion semantics, but these systems don't necessarily have fsync() either). > ordering required for reliable operation; no more, no less. Relying > on mount options, "chattr +S", or journaling artifacts for your > ordering is the inefficient approach; since they impose extra > ordering, they can never be faster and will usually be slower. It is sometimes the only way, if the application is unaware. I hope I'm not starting a flame war if I mention qmail now, which is not even softupdates aware. Without chattr +S or mount -o sync, nothing is to be gained. OTOH, where mount -o sync only makes directory updates synchronous, it's not too expensive, which is why the +D approach is still useful there. > > It's only necessary for ext2. Modern Linux filesystems (such as ext3 > > or reiserfs) don't require it. > > Only because they take the performance hit of flushing the whole log > to disk on every fsync(). Combine that with "data=ordered" and see > what happens to your performance. (Perhaps "data=ordered" should be > called "fsync=sync".) 
> I would rather get back the performance and > convince application authors to understand what they are doing. 1. data=ordered is more than fsync=sync. It guarantees that data blocks are flushed before flushing the meta data blocks that reference the data blocks. Try this on ext2fs and lose. 2. sync() is unreliable, it can return control to the caller earlier than what is sound. It can "complete" at any time it desires without having completed. (Probably so it can ever return as new blocks are written by another process, but at least SUS v2 did not detail on this). 3. Application authors do not desire fsync=sync semantics, but they want to rely on "fsync(fd) also syncs recent renames". It comes as a now-guaranteed side effect of how ext3fs works, so I am told. I'm not sure how the ext3fs journal works internally, but it'd be fine with all applications if only that part of a file system be synched that is really relevant to the current fsync(fd). No more. It seems as though fsync==sync is an artifact that ext2 also suffers from. -- Matthias Andree ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-16 1:43 ` Patrick J. LoPresti 2002-07-16 1:56 ` Thunder from the hill 2002-07-16 12:47 ` Matthias Andree @ 2002-07-16 21:09 ` James Antill 2 siblings, 0 replies; 90+ messages in thread From: James Antill @ 2002-07-16 21:09 UTC (permalink / raw) To: Lawrence Greenfield, Patrick J. LoPresti; +Cc: linux-kernel "Patrick J. LoPresti" <patl@curl.com> writes: > Lawrence Greenfield <leg+@andrew.cmu.edu> writes: > > > Actually, it's not all that simple (you have to find the enclosing > > directories of any files you're modifying, which might require string > > manipulation) > > No, you have to find the directories you are modifying. And the > application knows darn well which directories it is modifying. > > Don't speculate. Show some sample code, and let's see how hard it > would be to use the "Linux way". I am betting on "not hard at all". I added fsync() on directories to exim-3.31, it took about 2hrs of coding and another hour testing it (with strace) to make sure it was doing the right thing. That was from almost never having seen the source before. The only reason it took that long was because that version of exim altered the spool in a couple of different places. Forward porting to 3.951 took about 20 minutes IIRC (that version only plays with the spool in one place). -- # James Antill -- james@and.org :0: * ^From: .*james@and\.org /dev/null ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 21:14 ` Chris Mason 2002-07-15 21:31 ` Patrick J. LoPresti @ 2002-07-16 12:35 ` Matthias Andree 1 sibling, 0 replies; 90+ messages in thread From: Matthias Andree @ 2002-07-16 12:35 UTC (permalink / raw) To: linux-kernel On Mon, 15 Jul 2002, Chris Mason wrote: > On Mon, 2002-07-15 at 15:13, Patrick J. LoPresti wrote: > > > > 1) that newly grown file is someone's inbox, and the old contents of the > > > new block include someone else's private message. > > > > > > 2) That newly grown file is a control file for the application, and the > > > application expects it to contain valid data within (think sendmail). > > > > In a correctly-written application, neither of these things can > > happen. (See my earlier message today on fsync() and MTAs.) To get a > > file onto disk reliably, the application must 1) flush the data, and > > then 2) flush a "validity" indicator. This could be a sequence like: > > > > create temp file > > flush data to temp file > > rename temp file > > flush rename operation > > Yes, most mtas do this for queue files, I'm not sure how many do it for > the actual spool file. > mail server authors are more than welcome to Less. For one, Postfix' local(8) daemon relies on synchronous directory update for Maildir spools. For mbox spool, the problem is less prevalent, because spool files usually exist already and fsync() is sufficient (and fsync() is done before local(8) reports success to the queue manager). -- Matthias Andree ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [ANNOUNCE] Ext3 vs Reiserfs benchmarks 2002-07-15 15:22 ` Patrick J. LoPresti ` (2 preceding siblings ...) [not found] ` <20020715173337$acad@traf.lcs.mit.edu> @ 2002-07-16 7:07 ` Dax Kelson 3 siblings, 0 replies; 90+ messages in thread From: Dax Kelson @ 2002-07-16 7:07 UTC (permalink / raw) To: Patrick J. LoPresti; +Cc: linux-kernel On Mon, 2002-07-15 at 09:22, Patrick J. LoPresti wrote: > One other thing. I think this statement is misleading: > > IF your server is stable and not prone to crashing, and/or you > have the write cache on your hard drives battery backed, you > should strongly consider using the writeback journaling mode of > Ext3 versus ordered. I rewrote that statement on the website. Dax Kelson Guru Labs ^ permalink raw reply [flat|nested] 90+ messages in thread