* very poor ext3 write performance on big filesystems?
@ 2008-02-18 12:57 Tomasz Chmielewski
  2008-02-18 14:03 ` Andi Kleen
  2008-02-19  9:24 ` Vladislav Bolkhovitin
  0 siblings, 2 replies; 28+ messages in thread
From: Tomasz Chmielewski @ 2008-02-18 12:57 UTC (permalink / raw)
  To: LKML, LKML

I have a 1.2 TB filesystem (of which 750 GB is used) which holds
almost 200 million files.
1.2 TB doesn't make this filesystem that big, but 200 million files
is a decent number.


Most of the files are hardlinked multiple times, some of them are
hardlinked thousands of times.


Recently I began removing some unneeded files (or hardlinks), and to 
my surprise it takes much longer than I initially expected.


After the cache is emptied (echo 3 > /proc/sys/vm/drop_caches) I can 
usually remove about 50000-200000 files with moderate performance. I see 
up to 5000 kB/s read/written from/to the disk; the iowait ("wa") reported 
by top is usually 20-70%.


After that, iowait grows to 99%, and the disk write speed drops to 
50 kB/s - 200 kB/s (fifty to two hundred kilobytes per second).


Is it normal for the write speed to drop to only a few dozen 
kilobytes per second? Is it because of that many seeks? Can it be 
optimized somehow? The machine has loads of free memory; perhaps it 
could be used better?


Also, writing big files is very slow - it takes more than 4 minutes to 
write and sync a 655 MB file (so only about 2.5 MB/s) - 
fragmentation perhaps?

+ dd if=/dev/zero of=testfile bs=64k count=10000
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 3,12109 seconds, 210 MB/s
+ sync
0.00user 2.14system 4:06.76elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+883minor)pagefaults 0swaps


# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda              1,2T  697G  452G  61% /mnt/iscsi_backup

# df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda                154M     20M    134M   13% /mnt/iscsi_backup




-- 
Tomasz Chmielewski
http://wpkg.org


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 12:57 very poor ext3 write performance on big filesystems? Tomasz Chmielewski
@ 2008-02-18 14:03 ` Andi Kleen
  2008-02-18 14:16   ` Theodore Tso
  2008-02-19  9:24 ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2008-02-18 14:03 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: LKML, LKML

Tomasz Chmielewski <mangoo@wpkg.org> writes:
>
> Is it normal to expect the write speed go down to only few dozens of
> kilobytes/s? Is it because of that many seeks? Can it be somehow
> optimized? 

I have similar problems on my linux source partition which also
has a lot of hard linked files (although probably not quite
as many as you do). It seems like hard linking prevents
some of the heuristics ext* uses to generate non fragmented
disk layouts and the resulting seeking makes things slow.

What has helped a bit was to recreate the file system with -O ^dir_index;
dir_index seems to cause more seeks.
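
If you want to try that on the existing filesystem instead of
recreating it, something along these lines should work (untested
here, and the device name is only a placeholder):

umount /dev/sdX
tune2fs -O ^dir_index /dev/sdX    # turn the feature off
e2fsck -fD /dev/sdX               # rebuild and optimize the directories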

Keeping enough free space is also a good idea, because that gives the
file system code better choices on where to place data.

-Andi


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 14:03 ` Andi Kleen
@ 2008-02-18 14:16   ` Theodore Tso
  2008-02-18 15:02     ` Tomasz Chmielewski
                       ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Theodore Tso @ 2008-02-18 14:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tomasz Chmielewski, LKML, LKML

[-- Attachment #1: Type: text/plain, Size: 1907 bytes --]

On Mon, Feb 18, 2008 at 03:03:44PM +0100, Andi Kleen wrote:
> Tomasz Chmielewski <mangoo@wpkg.org> writes:
> >
> > Is it normal to expect the write speed go down to only few dozens of
> > kilobytes/s? Is it because of that many seeks? Can it be somehow
> > optimized? 
> 
> I have similar problems on my linux source partition which also
> has a lot of hard linked files (although probably not quite
> as many as you do). It seems like hard linking prevents
> some of the heuristics ext* uses to generate non fragmented
> disk layouts and the resulting seeking makes things slow.

ext3 tries to keep inodes in the same block group as their containing
directory.  If you have lots of hard links, obviously it can't really
do that, especially since we don't have a good way at mkdir time to
tell the filesystem, "Psst!  This is going to be a hard link clone of
that directory over there, put it in the same block group".

> What has helped a bit was to recreate the file system with -O^dir_index
> dir_index seems to cause more seeks.

Part of it may have simply been recreating the filesystem, not
necessarily removing the dir_index feature.  Dir_index speeds up
individual lookups, but it slows down workloads that do a readdir
followed by a stat of all of the files in the directory.  You can work
around this by calling readdir(), sorting all of the entries by inode
number, and then calling open or stat or whatever.  So this can help
out for workloads that are doing find or rm -r on a dir_index
filesystem.  Basically, it helps for some things, hurts for others.
Once things are in the cache it doesn't matter of course.
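
In shell terms the idea is roughly the following (only a sketch; the
path is a placeholder and it doesn't bother with the "." and ".."
entries or with spaces in names):

find /some/dir -maxdepth 1 -printf '%i %p\n' | sort -n | \
	awk '{print $2}' | xargs stat > /dev/null

That stats the entries in inode order instead of readdir order.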

The following ld_preload can help in some cases.  Mutt has this hack
built in for maildir directories, which helps.

> Also keeping enough free space is also a good idea because that
> allows the file system code better choices on where to place data.

Yep, that too.

					- Ted


[-- Attachment #2: spd_readdir.c --]
[-- Type: text/x-csrc, Size: 6551 bytes --]

/*
 * readdir accelerator
 *
 * (C) Copyright 2003, 2004 by Theodore Ts'o.
 *
 * Compile using the command:
 *
 * gcc -o spd_readdir.so -shared spd_readdir.c -ldl
 *
 * Use it by setting the LD_PRELOAD environment variable:
 * 
 * export LD_PRELOAD=/usr/local/sbin/spd_readdir.so
 *
 * %Begin-Header%
 * This file may be redistributed under the terms of the GNU Public
 * License.
 * %End-Header%
 * 
 */

#define ALLOC_STEPSIZE	100
#define MAX_DIRSIZE	0

#define DEBUG

#ifdef DEBUG
#define DEBUG_DIR(x)	{if (do_debug) { x; }}
#else
#define DEBUG_DIR(x)
#endif

#define _GNU_SOURCE
#define __USE_LARGEFILE64

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <errno.h>
#include <dlfcn.h>

struct dirent_s {
	unsigned long long d_ino;
	long long d_off;
	unsigned short int d_reclen;
	unsigned char d_type;
	char *d_name;
};

struct dir_s {
	DIR	*dir;
	int	num;
	int	max;
	struct dirent_s *dp;
	int	pos;
	int	fd;
	struct dirent ret_dir;
	struct dirent64 ret_dir64;
};

static int (*real_closedir)(DIR *dir) = 0;
static DIR *(*real_opendir)(const char *name) = 0;
static struct dirent *(*real_readdir)(DIR *dir) = 0;
static struct dirent64 *(*real_readdir64)(DIR *dir) = 0;
static off_t (*real_telldir)(DIR *dir) = 0;
static void (*real_seekdir)(DIR *dir, off_t offset) = 0;
static int (*real_dirfd)(DIR *dir) = 0;
static unsigned long max_dirsize = MAX_DIRSIZE;
static int num_open = 0;
#ifdef DEBUG
static int do_debug = 0;
#endif

static void setup_ptr()
{
	char *cp;

	real_opendir = dlsym(RTLD_NEXT, "opendir");
	real_closedir = dlsym(RTLD_NEXT, "closedir");
	real_readdir = dlsym(RTLD_NEXT, "readdir");
	real_readdir64 = dlsym(RTLD_NEXT, "readdir64");
	real_telldir = dlsym(RTLD_NEXT, "telldir");
	real_seekdir = dlsym(RTLD_NEXT, "seekdir");
	real_dirfd = dlsym(RTLD_NEXT, "dirfd");
	if ((cp = getenv("SPD_READDIR_MAX_SIZE")) != NULL) {
		max_dirsize = atol(cp);
	}
#ifdef DEBUG
	if (getenv("SPD_READDIR_DEBUG"))
		do_debug++;
#endif
}

static void free_cached_dir(struct dir_s *dirstruct)
{
	int i;

	if (!dirstruct->dp)
		return;

	for (i=0; i < dirstruct->num; i++) {
		free(dirstruct->dp[i].d_name);
	}
	free(dirstruct->dp);
	dirstruct->dp = 0;
}	

static int ino_cmp(const void *a, const void *b)
{
	const struct dirent_s *ds_a = (const struct dirent_s *) a;
	const struct dirent_s *ds_b = (const struct dirent_s *) b;
	ino_t i_a, i_b;
	
	i_a = ds_a->d_ino;
	i_b = ds_b->d_ino;

	if (ds_a->d_name[0] == '.') {
		if (ds_a->d_name[1] == 0)
			i_a = 0;
		else if ((ds_a->d_name[1] == '.') && (ds_a->d_name[2] == 0))
			i_a = 1;
	}
	if (ds_b->d_name[0] == '.') {
		if (ds_b->d_name[1] == 0)
			i_b = 0;
		else if ((ds_b->d_name[1] == '.') && (ds_b->d_name[2] == 0))
			i_b = 1;
	}

	return (i_a - i_b);
}


DIR *opendir(const char *name)
{
	DIR *dir;
	struct dir_s	*dirstruct;
	struct dirent_s *ds, *dnew;
	struct dirent64 *d;
	struct stat st;

	if (!real_opendir)
		setup_ptr();

	DEBUG_DIR(printf("Opendir(%s) (%d open)\n", name, num_open++));
	dir = (*real_opendir)(name);
	if (!dir)
		return NULL;

	dirstruct = malloc(sizeof(struct dir_s));
	if (!dirstruct) {
		(*real_closedir)(dir);
		errno = ENOMEM;
		return NULL;
	}
	dirstruct->num = 0;
	dirstruct->max = 0;
	dirstruct->dp = 0;
	dirstruct->pos = 0;
	dirstruct->dir = 0;

	if (max_dirsize && (stat(name, &st) == 0) && 
	    (st.st_size > max_dirsize)) {
		DEBUG_DIR(printf("Directory size %ld, using direct readdir\n",
				 st.st_size));
		dirstruct->dir = dir;
		return (DIR *) dirstruct;
	}

	while ((d = (*real_readdir64)(dir)) != NULL) {
		if (dirstruct->num >= dirstruct->max) {
			dirstruct->max += ALLOC_STEPSIZE;
			DEBUG_DIR(printf("Reallocating to size %d\n", 
					 dirstruct->max));
			dnew = realloc(dirstruct->dp, 
				       dirstruct->max * sizeof(struct dir_s));
			if (!dnew)
				goto nomem;
			dirstruct->dp = dnew;
		}
		ds = &dirstruct->dp[dirstruct->num++];
		ds->d_ino = d->d_ino;
		ds->d_off = d->d_off;
		ds->d_reclen = d->d_reclen;
		ds->d_type = d->d_type;
		if ((ds->d_name = malloc(strlen(d->d_name)+1)) == NULL) {
			dirstruct->num--;
			goto nomem;
		}
		strcpy(ds->d_name, d->d_name);
		DEBUG_DIR(printf("readdir: %lu %s\n", 
				 (unsigned long) d->d_ino, d->d_name));
	}
	dirstruct->fd = dup((*real_dirfd)(dir));
	(*real_closedir)(dir);
	qsort(dirstruct->dp, dirstruct->num, sizeof(struct dirent_s), ino_cmp);
	return ((DIR *) dirstruct);
nomem:
	DEBUG_DIR(printf("No memory, backing off to direct readdir\n"));
	free_cached_dir(dirstruct);
	dirstruct->dir = dir;
	return ((DIR *) dirstruct);
}

int closedir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	DEBUG_DIR(printf("Closedir (%d open)\n", --num_open));
	if (dirstruct->dir)
		(*real_closedir)(dirstruct->dir);

	if (dirstruct->fd >= 0)
		close(dirstruct->fd);
	free_cached_dir(dirstruct);
	free(dirstruct);
	return 0;
}

struct dirent *readdir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s *ds;

	if (dirstruct->dir)
		return (*real_readdir)(dirstruct->dir);

	if (dirstruct->pos >= dirstruct->num)
		return NULL;

	ds = &dirstruct->dp[dirstruct->pos++];
	dirstruct->ret_dir.d_ino = ds->d_ino;
	dirstruct->ret_dir.d_off = ds->d_off;
	dirstruct->ret_dir.d_reclen = ds->d_reclen;
	dirstruct->ret_dir.d_type = ds->d_type;
	strncpy(dirstruct->ret_dir.d_name, ds->d_name,
		sizeof(dirstruct->ret_dir.d_name));

	return (&dirstruct->ret_dir);
}

struct dirent64 *readdir64(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s *ds;

	if (dirstruct->dir)
		return (*real_readdir64)(dirstruct->dir);

	if (dirstruct->pos >= dirstruct->num)
		return NULL;

	ds = &dirstruct->dp[dirstruct->pos++];
	dirstruct->ret_dir64.d_ino = ds->d_ino;
	dirstruct->ret_dir64.d_off = ds->d_off;
	dirstruct->ret_dir64.d_reclen = ds->d_reclen;
	dirstruct->ret_dir64.d_type = ds->d_type;
	strncpy(dirstruct->ret_dir64.d_name, ds->d_name,
		sizeof(dirstruct->ret_dir64.d_name));

	return (&dirstruct->ret_dir64);
}

off_t telldir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir)
		return (*real_telldir)(dirstruct->dir);

	return ((off_t) dirstruct->pos);
}

void seekdir(DIR *dir, off_t offset)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir) {
		(*real_seekdir)(dirstruct->dir, offset);
		return;
	}

	dirstruct->pos = offset;
}

int dirfd(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->dir)
		return (*real_dirfd)(dirstruct->dir);

	return (dirstruct->fd);
}

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 14:16   ` Theodore Tso
@ 2008-02-18 15:02     ` Tomasz Chmielewski
  2008-02-18 15:16       ` Theodore Tso
  2008-02-18 15:18     ` Andi Kleen
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Tomasz Chmielewski @ 2008-02-18 15:02 UTC (permalink / raw)
  To: Theodore Tso, Andi Kleen, Tomasz Chmielewski, LKML, LKML

Theodore Tso schrieb:

(...)

>> What has helped a bit was to recreate the file system with -O^dir_index
>> dir_index seems to cause more seeks.
> 
> Part of it may have simply been recreating the filesystem, not
> necessarily removing the dir_index feature.

You mean, copy data somewhere else, mkfs a new filesystem, and copy data 
back?

Unfortunately, doing it on the file level is not possible within a 
reasonable amount of time.

I tried to copy that filesystem once (when it was much smaller) with 
"rsync -a -H", but after 3 days rsync was still building an index and 
hadn't copied a single file.


Also, as files/hardlinks come and go, it would degrade again.


Are there better choices than ext3 for a filesystem with lots of 
hardlinks? ext4, once it's ready? xfs?


-- 
Tomasz Chmielewski
http://wpkg.org

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 15:18     ` Andi Kleen
@ 2008-02-18 15:03       ` Theodore Tso
  0 siblings, 0 replies; 28+ messages in thread
From: Theodore Tso @ 2008-02-18 15:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tomasz Chmielewski, LKML, LKML

On Mon, Feb 18, 2008 at 04:18:23PM +0100, Andi Kleen wrote:
> On Mon, Feb 18, 2008 at 09:16:41AM -0500, Theodore Tso wrote:
> > ext3 tries to keep inodes in the same block group as their containing
> > directory.  If you have lots of hard links, obviously it can't really
> > do that, especially since we don't have a good way at mkdir time to
> > tell the filesystem, "Psst!  This is going to be a hard link clone of
> > that directory over there, put it in the same block group".
> 
> Hmm, you think such a hint interface would be worth it?

It would definitely help ext2/3/4.  An interesting question is whether
it would help enough other filesystems that it's worth adding.  

> > necessarily removing the dir_index feature.  Dir_index speeds up
> > individual lookups, but it slows down workloads that do a readdir
> 
> But only for large directories right? For kernel source like
> directory sizes it seems to be a general loss.

On my todo list is a hack which does the sorting of directory inodes
by inode number inside the kernel for smallish directories (say, less
than 2-3 blocks) where using the kernel memory space to store the
directory entries is acceptable, and which would speed up dir_index
performance for kernel source-like directory sizes --- without needing
to use the spd_readdir LD_PRELOAD hack.

But yes, right now, if you know that your directories are almost
always going to be kernel-source-like in size, then omitting dir_index
is probably going to be a good idea.  

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 15:02     ` Tomasz Chmielewski
@ 2008-02-18 15:16       ` Theodore Tso
  2008-02-18 15:57         ` Andi Kleen
  2008-02-18 16:16         ` Tomasz Chmielewski
  0 siblings, 2 replies; 28+ messages in thread
From: Theodore Tso @ 2008-02-18 15:16 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: Andi Kleen, LKML, LKML

On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote:
> I tried to copy that filesystem once (when it was much smaller) with "rsync 
> -a -H", but after 3 days, rsync was still building an index and didn't copy 
> any file.

If you're going to copy the whole filesystem, don't use rsync!  Use cp
or a tar pipeline to move the files.
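
For example, something like this preserves hard links within the tree
being copied (the source and destination paths are placeholders):

(cd /mnt/old && tar cf - .) | (cd /mnt/new && tar xpf -)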

> Also, as files/hardlinks come and go, it would degrade again.

Yes...

> Are there better choices than ext3 for a filesystem with lots of hardlinks? 
> ext4, once it's ready? xfs?

All filesystems are going to have problems keeping inodes close to
directories when you have huge numbers of hard links.

I'd really need to know exactly what kind of operations you were
trying to do that were causing problems before I could say for sure.
Yes, you said you were removing unneeded files, but how were you doing
it?  With rm -r of old hard-linked directories?  How big are the
average files involved?  Etc.

					- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 14:16   ` Theodore Tso
  2008-02-18 15:02     ` Tomasz Chmielewski
@ 2008-02-18 15:18     ` Andi Kleen
  2008-02-18 15:03       ` Theodore Tso
  2008-02-19 14:54     ` Tomasz Chmielewski
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2008-02-18 15:18 UTC (permalink / raw)
  To: Theodore Tso, Andi Kleen, Tomasz Chmielewski, LKML, LKML

On Mon, Feb 18, 2008 at 09:16:41AM -0500, Theodore Tso wrote:
> ext3 tries to keep inodes in the same block group as their containing
> directory.  If you have lots of hard links, obviously it can't really
> do that, especially since we don't have a good way at mkdir time to
> tell the filesystem, "Psst!  This is going to be a hard link clone of
> that directory over there, put it in the same block group".

Hmm, you think such a hint interface would be worth it?

> 
> > What has helped a bit was to recreate the file system with -O^dir_index
> > dir_index seems to cause more seeks.
> 
> Part of it may have simply been recreating the filesystem, not

Undoubtedly.

> necessarily removing the dir_index feature.  Dir_index speeds up
> individual lookups, but it slows down workloads that do a readdir

But only for large directories, right? For kernel-source-like
directory sizes it seems to be a general loss.

-Andi


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 15:57         ` Andi Kleen
@ 2008-02-18 15:35           ` Theodore Tso
  2008-02-20 10:57             ` Jan Engelhardt
  0 siblings, 1 reply; 28+ messages in thread
From: Theodore Tso @ 2008-02-18 15:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tomasz Chmielewski, LKML, LKML

On Mon, Feb 18, 2008 at 04:57:25PM +0100, Andi Kleen wrote:
> > Use cp
> > or a tar pipeline to move the files.
> 
> Are you sure cp handles hardlinks correctly? I know tar does,
> but I have my doubts about cp.

I *think* GNU cp does the right thing with --preserve=links.  I'm not
100% sure, though --- like you, probably, I always use tar for moving
or copying directory hierarchies.
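
In other words something like the following, though again I haven't
verified it on a heavily hard-linked tree (paths are placeholders, and
-a already implies --preserve=all):

cp -ax --preserve=links /mnt/old/. /mnt/new/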

  	     	     	       	       - Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 15:16       ` Theodore Tso
@ 2008-02-18 15:57         ` Andi Kleen
  2008-02-18 15:35           ` Theodore Tso
  2008-02-18 16:16         ` Tomasz Chmielewski
  1 sibling, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2008-02-18 15:57 UTC (permalink / raw)
  To: Theodore Tso, Tomasz Chmielewski, Andi Kleen, LKML, LKML

On Mon, Feb 18, 2008 at 10:16:32AM -0500, Theodore Tso wrote:
> On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote:
> > I tried to copy that filesystem once (when it was much smaller) with "rsync 
> > -a -H", but after 3 days, rsync was still building an index and didn't copy 
> > any file.
> 
> If you're going to copy the whole filesystem don't use rsync! 

Yes, I have managed to kill systems (drive them really badly into OOM and
get very long swap storms) with rsync -H in the past too. Something is very 
wrong with rsync's implementation of this.

> Use cp
> or a tar pipeline to move the files.

Are you sure cp handles hardlinks correctly? I know tar does,
but I have my doubts about cp.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 15:16       ` Theodore Tso
  2008-02-18 15:57         ` Andi Kleen
@ 2008-02-18 16:16         ` Tomasz Chmielewski
  2008-02-18 18:45           ` Theodore Tso
  1 sibling, 1 reply; 28+ messages in thread
From: Tomasz Chmielewski @ 2008-02-18 16:16 UTC (permalink / raw)
  To: Theodore Tso, Tomasz Chmielewski, Andi Kleen, LKML, LKML

Theodore Tso schrieb:

>> Are there better choices than ext3 for a filesystem with lots of hardlinks? 
>> ext4, once it's ready? xfs?
> 
> All filesystems are going to have problems keeping inodes close to
> directories when you have huge numbers of hard links.
> 
> I'd really need to know exactly what kind of operations you were
> trying to do that were causing problems before I could say for sure.
> Yes, you said you were removing unneeded files, but how were you doing
> it?  With rm -r of old hard-linked directories?

Yes, with rm -r.


> How big are the
> average files involved?  Etc.

It's hard to estimate the average size of a file. I'd say there are not 
many files bigger than 50 MB.

Basically, it's a filesystem where backups are kept. Backups are made 
with BackupPC [1].

Imagine a full rootfs backup of 100 Linux systems.

Instead of compressing and writing "/bin/bash" 100 times, once for each 
separate system, we store it once and hardlink. Then keep 40 backup 
generations, and you have 4000 hardlinks.

For individual or user files, the number of hardlinks will be smaller of 
course.

The directories I want to remove usually have the structure of a "normal" 
Linux rootfs, nothing special there (other than that most of the files 
have multiple hardlinks).


I noticed that using write-back helps a tiny bit, but as dm and md don't 
support write barriers, I'm not very eager to use it.


[1] http://backuppc.sf.net
http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues



-- 
Tomasz Chmielewski
http://wpkg.org


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 16:16         ` Tomasz Chmielewski
@ 2008-02-18 18:45           ` Theodore Tso
  0 siblings, 0 replies; 28+ messages in thread
From: Theodore Tso @ 2008-02-18 18:45 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: Andi Kleen, LKML, LKML

On Mon, Feb 18, 2008 at 05:16:55PM +0100, Tomasz Chmielewski wrote:
> Theodore Tso schrieb:
>
>> I'd really need to know exactly what kind of operations you were
>> trying to do that were causing problems before I could say for sure.
>> Yes, you said you were removing unneeded files, but how were you doing
>> it?  With rm -r of old hard-linked directories?
>
> Yes, with rm -r.

You should definitely try the spd_readdir hack; that will help reduce
the seek times.  This will probably help on any block-group-oriented
filesystem, including XFS, etc.

>> How big are the
>> average files involved?  Etc.
>
> It's hard to estimate the average size of a file. I'd say there are not 
> many files bigger than 50 MB.

Well, Ext4 will help for files bigger than 48k.

The other thing that might help for you is using an external journal
on a separate hard drive (either for ext3 or ext4).  That will help
alleviate some of the seek storms going on, since the journal is
written to only sequentially, and putting it on a separate hard drive
will help remove some of the contention on the hard drive.  
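
Setting that up looks roughly like this (device names are placeholders;
the journal device's block size has to match the filesystem's, and the
filesystem has to be unmounted for the tune2fs steps):

mke2fs -O journal_dev -b 4096 /dev/sdb1   # create the external journal
tune2fs -O ^has_journal /dev/sda          # drop the internal journal
tune2fs -j -J device=/dev/sdb1 /dev/sda   # attach the external one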

I assume that your 1.2 TB filesystem is located on a RAID array; did
you use the mke2fs -E stride option to make sure all of the bitmaps
don't get concentrated on one hard drive spindle?  One failure mode
that can happen with a 4+1 RAID-5 setup is that all of the block and
inode bitmaps end up laid out on a single hard drive, so it becomes a
bottleneck for bitmap-intensive workloads --- including "rm -rf".  So
that's another thing that might be going on.  If you run "dumpe2fs"
and look at the block numbers for the block and inode allocation
bitmaps, and you find that they are all landing on the same physical
hard drive, then that's very clearly the biggest problem given an
"rm -rf" workload.  You should be able to see this visually as well;
if one hard drive has its activity light almost constantly on, and the
other ones don't have much activity, that's probably what is happening.
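
For example, for 4 data disks with 64k chunks and 4k blocks the
filesystem would have been created with something like the first
command below, and the second one is a quick way to eyeball where the
bitmaps ended up (the numbers and device name are illustrative only):

mke2fs -j -E stride=16 /dev/md0            # 64k chunk / 4k block = 16
dumpe2fs /dev/md0 | grep -i bitmap | head  # where did the bitmaps land?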

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 12:57 very poor ext3 write performance on big filesystems? Tomasz Chmielewski
  2008-02-18 14:03 ` Andi Kleen
@ 2008-02-19  9:24 ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 28+ messages in thread
From: Vladislav Bolkhovitin @ 2008-02-19  9:24 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: LKML, LKML

Tomasz Chmielewski wrote:
> I have a 1.2 TB (of which 750 GB is used) filesystem which holds
> almost 200 millions of files.
> 1.2 TB doesn't make this filesystem that big, but 200 millions of files 
> is a decent number.
> 
> 
> Most of the files are hardlinked multiple times, some of them are
> hardlinked thousands of times.
> 
> 
> Recently I began removing some of unneeded files (or hardlinks) and to 
> my surprise, it takes longer than I initially expected.
> 
> 
> After cache is emptied (echo 3 > /proc/sys/vm/drop_caches) I can usually 
> remove about 50000-200000 files with moderate performance. I see up to 
> 5000 kB read/write from/to the disk, wa reported by top is usually 20-70%.
> 
> 
> After that, waiting for IO grows to 99%, and disk write speed is down to 
> 50 kB/s - 200 kB/s (fifty - two hundred kilobytes/s).
> 
> 
> Is it normal to expect the write speed go down to only few dozens of 
> kilobytes/s? Is it because of that many seeks? Can it be somehow 
> optimized? The machine has loads of free memory, perhaps it could be 
> uses better?
> 
> 
> Also, writing big files is very slow - it takes more than 4 minutes to 
> write and sync a 655 MB file (so, a little bit more than 1 MB/s) - 
> fragmentation perhaps?

It would be really interesting if you tried your workload with XFS. In my 
experience, XFS considerably outperforms ext3 on big (more than a few 
hundred MB) disks.

Vlad

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 14:16   ` Theodore Tso
  2008-02-18 15:02     ` Tomasz Chmielewski
  2008-02-18 15:18     ` Andi Kleen
@ 2008-02-19 14:54     ` Tomasz Chmielewski
  2008-02-19 15:06       ` Chris Mason
  2008-02-19 18:29     ` Mark Lord
  2008-02-27 11:20     ` Tomasz Chmielewski
  4 siblings, 1 reply; 28+ messages in thread
From: Tomasz Chmielewski @ 2008-02-19 14:54 UTC (permalink / raw)
  To: Theodore Tso, Andi Kleen, Tomasz Chmielewski, LKML, LKML

Theodore Tso schrieb:

(...)

> The following ld_preload can help in some cases.  Mutt has this hack
> encoded in for maildir directories, which helps.

It doesn't work very reliably for me.

For some reason, it hangs for me sometimes (doesn't remove any files, rm 
-rf just stalls), or segfaults.


As most of the ideas in this thread assume (re)creating the filesystem 
from scratch - would perhaps playing with /proc/sys/vm/dirty_ratio and 
/proc/sys/vm/dirty_background_ratio help a bit?
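
What I have in mind is something like this (the values are picked more
or less at random, just to make the kernel start writeback earlier):

echo 5  > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio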


-- 
Tomasz Chmielewski
http://wpkg.org

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-19 14:54     ` Tomasz Chmielewski
@ 2008-02-19 15:06       ` Chris Mason
  2008-02-19 15:21         ` Tomasz Chmielewski
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Mason @ 2008-02-19 15:06 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: Theodore Tso, Andi Kleen, LKML, LKML

On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
> Theodore Tso schrieb:
>
> (...)
>
> > The following ld_preload can help in some cases.  Mutt has this hack
> > encoded in for maildir directories, which helps.
>
> It doesn't work very reliable for me.
>
> For some reason, it hangs for me sometimes (doesn't remove any files, rm
> -rf just stalls), or segfaults.

You can go the low-tech route (assuming your file names don't have spaces in 
them)

find . -printf "%i %p\n" | sort -n | awk '{print $2}' | xargs rm
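
If the names do contain spaces, a slightly uglier variant should still 
work (newlines in file names would still break it):

find . -type f -printf "%i %p\n" | sort -n | sed 's/^[0-9]* //' | \
	tr '\n' '\0' | xargs -0 rm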

>
>
> As most of the ideas here in this thread assume (re)creating a new
> filesystem from scratch - would perhaps playing with
> /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio help a
> bit?

Probably not.  You're seeking between all the inodes on the box, and probably 
not bound by the memory used.

-chris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-19 15:06       ` Chris Mason
@ 2008-02-19 15:21         ` Tomasz Chmielewski
  2008-02-19 16:04           ` Chris Mason
  0 siblings, 1 reply; 28+ messages in thread
From: Tomasz Chmielewski @ 2008-02-19 15:21 UTC (permalink / raw)
  To: Chris Mason; +Cc: Theodore Tso, Andi Kleen, LKML, LKML

Chris Mason schrieb:
> On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
>> Theodore Tso schrieb:
>>
>> (...)
>>
>>> The following ld_preload can help in some cases.  Mutt has this hack
>>> encoded in for maildir directories, which helps.
>> It doesn't work very reliable for me.
>>
>> For some reason, it hangs for me sometimes (doesn't remove any files, rm
>> -rf just stalls), or segfaults.
> 
> You can go the low-tech route (assuming your file names don't have spaces in 
> them)
> 
> find . -printf "%i %p\n" | sort -n | awk '{print $2}' | xargs rm

Why should it make a difference?

Does "find" find filenames/paths faster than "rm -r"?

Or is "find once/remove once" faster than "find files/rm files/find 
files/rm files/...", which I suppose "rm -r" does?


-- 
Tomasz Chmielewski
http://wpkg.org

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-19 15:21         ` Tomasz Chmielewski
@ 2008-02-19 16:04           ` Chris Mason
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Mason @ 2008-02-19 16:04 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: Theodore Tso, Andi Kleen, LKML, LKML

On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
> Chris Mason schrieb:
> > On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
> >> Theodore Tso schrieb:
> >>
> >> (...)
> >>
> >>> The following ld_preload can help in some cases.  Mutt has this hack
> >>> encoded in for maildir directories, which helps.
> >>
> >> It doesn't work very reliable for me.
> >>
> >> For some reason, it hangs for me sometimes (doesn't remove any files, rm
> >> -rf just stalls), or segfaults.
> >
> > You can go the low-tech route (assuming your file names don't have spaces
> > in them)
> >
> > find . -printf "%i %p\n" | sort -n | awk '{print $2}' | xargs rm
>
> Why should it make a difference?

It does something similar to Ted's ld preload, sorting the results from 
readdir by inode number before using them.  You will still seek quite a lot 
between the directory entries, but operations on the files themselves will go 
in a much more optimal order.  It might help.

>
> Does "find" find filenames/paths faster than "rm -r"?
>
> Or is "find once/remove once" faster than "find files/rm files/find
> files/rm files/...", which I suppose "rm -r" does?

rm -r removes things in the order that readdir returns them.  In your hard 
linked tree (on almost any FS), this will be very random.  The sorting is 
probably the best you can do from userland to optimize the ordering.

-chris


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 14:16   ` Theodore Tso
                       ` (2 preceding siblings ...)
  2008-02-19 14:54     ` Tomasz Chmielewski
@ 2008-02-19 18:29     ` Mark Lord
  2008-02-19 18:41       ` Mark Lord
  2008-02-19 18:58       ` Paulo Marques
  2008-02-27 11:20     ` Tomasz Chmielewski
  4 siblings, 2 replies; 28+ messages in thread
From: Mark Lord @ 2008-02-19 18:29 UTC (permalink / raw)
  To: Theodore Tso, Andi Kleen, Tomasz Chmielewski, LKML, LKML

Theodore Tso wrote:
..
> The following ld_preload can help in some cases.  Mutt has this hack
> encoded in for maildir directories, which helps.
..

Oddly enough, that same spd_readdir() preload craps out here too
when used with "rm -r" on largish directories.

I added a bit more debugging to it, and it always craps out like this:
     
     opendir dir=0x805ad10((nil))
     Readdir64 dir=0x805ad10 pos=0/289/290
     Readdir64 dir=0x805ad10 pos=1/289/290
     Readdir64 dir=0x805ad10 pos=2/289/290
     Readdir64 dir=0x805ad10 pos=3/289/290
     Readdir64 dir=0x805ad10 pos=4/289/290
     ...
     Readdir64 dir=0x805ad10 pos=287/289/290
     Readdir64 dir=0x805ad10 pos=288/289/290
     Readdir64 dir=0x805ad10 pos=289/289/290
     Readdir64 dir=0x805ad10 pos=0/289/290
     Readdir64: dirstruct->dp=(nil)
     Readdir64: ds=(nil)
     Segmentation fault (core dumped)
     

Always.  The "rm -r" loops over the directory, as shown above,
and then tries to re-access entry 0 somehow, at which point
it discovers that it's been NULLed out.

Which is weird, because the local seekdir() was never called,
and the code never zeroed/freed that memory itself
(I've got printfs in there..).

Nulling out the qsort has no effect, and smaller/larger
ALLOC_STEPSIZE values don't seem to matter.

But.. when the entire tree is in RAM (freshly unpacked .tar),
it seems to have no problems with it.  As opposed to an uncached tree.

Peculiar.. I wonder where the bug is ?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-19 18:29     ` Mark Lord
@ 2008-02-19 18:41       ` Mark Lord
  2008-02-19 18:58       ` Paulo Marques
  1 sibling, 0 replies; 28+ messages in thread
From: Mark Lord @ 2008-02-19 18:41 UTC (permalink / raw)
  To: Theodore Tso, Andi Kleen, Tomasz Chmielewski, LKML, LKML

Mark Lord wrote:
> Theodore Tso wrote:
> ..
>> The following ld_preload can help in some cases.  Mutt has this hack
>> encoded in for maildir directories, which helps.
> ..
> 
> Oddly enough, that same spd_readdir() preload craps out here too
> when used with "rm -r" on largish directories.
> 
> I added a bit more debugging to it, and it always craps out like this:
>         opendir dir=0x805ad10((nil))
>     Readdir64 dir=0x805ad10 pos=0/289/290
>     Readdir64 dir=0x805ad10 pos=1/289/290
>     Readdir64 dir=0x805ad10 pos=2/289/290
>     Readdir64 dir=0x805ad10 pos=3/289/290
>     Readdir64 dir=0x805ad10 pos=4/289/290
>     ...
>     Readdir64 dir=0x805ad10 pos=287/289/290
>     Readdir64 dir=0x805ad10 pos=288/289/290
>     Readdir64 dir=0x805ad10 pos=289/289/290
>     Readdir64 dir=0x805ad10 pos=0/289/290
>     Readdir64: dirstruct->dp=(nil)
>     Readdir64: ds=(nil)
>     Segmentation fault (core dumped)
>    
> Always.  The "rm -r" loops over the directory, as show above,
> and then tries to re-access entry 0 somehow, at which point
> it discovers that it's been NULLed out.
> 
> Which is weird, because the local seekdir() was never called,
> and the code never zeroed/freed that memory itself
> (I've got printfs in there..).
> 
> Nulling out the qsort has no effect, and smaller/larger
> ALLOC_STEPSIZE values don't seem to matter.
> 
> But.. when the entire tree is in RAM (freshly unpacked .tar),
> it seems to have no problems with it.  As opposed to an uncached tree.
..

I take back that last point -- it also fails even when the tree *is* cached.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-19 18:29     ` Mark Lord
  2008-02-19 18:41       ` Mark Lord
@ 2008-02-19 18:58       ` Paulo Marques
  2008-02-19 22:33         ` Mark Lord
  1 sibling, 1 reply; 28+ messages in thread
From: Paulo Marques @ 2008-02-19 18:58 UTC (permalink / raw)
  To: Mark Lord; +Cc: Theodore Tso, Andi Kleen, Tomasz Chmielewski, LKML, LKML

Mark Lord wrote:
> Theodore Tso wrote:
> ..
>> The following ld_preload can help in some cases.  Mutt has this hack
>> encoded in for maildir directories, which helps.
> ..
> 
> Oddly enough, that same spd_readdir() preload craps out here too
> when used with "rm -r" on largish directories.

From looking at the code, I think I've found at least one bug in opendir:
...
> 			dnew = realloc(dirstruct->dp, 
> 				       dirstruct->max * sizeof(struct dir_s));
...

Shouldn't this be: "...*sizeof(struct dirent_s));"?

-- 
Paulo Marques - www.grupopie.com

"Nostalgia isn't what it used to be."

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-19 18:58       ` Paulo Marques
@ 2008-02-19 22:33         ` Mark Lord
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Lord @ 2008-02-19 22:33 UTC (permalink / raw)
  To: Paulo Marques; +Cc: Theodore Tso, Andi Kleen, Tomasz Chmielewski, LKML, LKML

Paulo Marques wrote:
> Mark Lord wrote:
>> Theodore Tso wrote:
>> ..
>>> The following ld_preload can help in some cases.  Mutt has this hack
>>> encoded in for maildir directories, which helps.
>> ..
>>
>> Oddly enough, that same spd_readdir() preload craps out here too
>> when used with "rm -r" on largish directories.
> 
>  From looking at the code, I think I've found at least one bug in opendir:
> ...
>>             dnew = realloc(dirstruct->dp,                        
>> dirstruct->max * sizeof(struct dir_s));
> ...
> 
> Shouldn't this be: "...*sizeof(struct dirent_s));"?
..

Yeah, that's one bug.
Another is that ->fd is frequently left uninitialized, yet later used.

Fixing those didn't change the null pointer deaths, though.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 15:35           ` Theodore Tso
@ 2008-02-20 10:57             ` Jan Engelhardt
  2008-02-20 17:44               ` David Rees
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Engelhardt @ 2008-02-20 10:57 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Andi Kleen, Tomasz Chmielewski, LKML, LKML


On Feb 18 2008 10:35, Theodore Tso wrote:
>On Mon, Feb 18, 2008 at 04:57:25PM +0100, Andi Kleen wrote:
>> > Use cp
>> > or a tar pipeline to move the files.
>> 
>> Are you sure cp handles hardlinks correctly? I know tar does,
>> but I have my doubts about cp.
>
>I *think* GNU cp does the right thing with --preserve=links.  I'm not
>100% sure, though --- like you, probably, I always use tar for moving
>or copying directory hierarchies.

But GNU tar does not handle ACLs and xattrs. So back to rsync/cp/mv.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-20 10:57             ` Jan Engelhardt
@ 2008-02-20 17:44               ` David Rees
  2008-02-20 18:08                 ` Jan Engelhardt
  0 siblings, 1 reply; 28+ messages in thread
From: David Rees @ 2008-02-20 17:44 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Theodore Tso, Andi Kleen, Tomasz Chmielewski, LKML, LKML

On Wed, Feb 20, 2008 at 2:57 AM, Jan Engelhardt <jengelh@computergmbh.de> wrote:
>  But GNU tar does not handle acls and xattrs. So back to rsync/cp/mv.

Huh? The version of tar on my Fedora 8 desktop (tar-1.17-7) does. Just
add the --xattrs option (which turns on --acls and --selinux).
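
With one of those patched tars, the copy pipeline mentioned earlier in
the thread just grows a couple of options, e.g. (paths are placeholders,
and this assumes the option is accepted on both the create and extract
side):

(cd /mnt/old && tar --xattrs -cf - .) | (cd /mnt/new && tar --xattrs -xpf -)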

-Dave

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-20 17:44               ` David Rees
@ 2008-02-20 18:08                 ` Jan Engelhardt
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Engelhardt @ 2008-02-20 18:08 UTC (permalink / raw)
  To: David Rees; +Cc: Theodore Tso, Andi Kleen, Tomasz Chmielewski, LKML, LKML


On Feb 20 2008 09:44, David Rees wrote:
>On Wed, Feb 20, 2008 at 2:57 AM, Jan Engelhardt <jengelh@computergmbh.de> wrote:
>>  But GNU tar does not handle acls and xattrs. So back to rsync/cp/mv.
>
>Huh? The version of tar on my Fedora 8 desktop (tar-1.17-7) does. Just
>add the --xattrs option (which turns on --acls and --selinux).

Yeah they probably whipped it up with some patches.

$ tar --xattrs
tar: unrecognized option `--xattrs'
Try `tar --help' or `tar --usage' for more information.
$ tar --acl
tar: unrecognized option `--acl'
Try `tar --help' or `tar --usage' for more information.
$ rpm -q tar
tar-1.17-21
(Not everything that runs rpm is a fedorahat, though)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-18 14:16   ` Theodore Tso
                       ` (3 preceding siblings ...)
  2008-02-19 18:29     ` Mark Lord
@ 2008-02-27 11:20     ` Tomasz Chmielewski
  2008-02-27 20:03       ` Andreas Dilger
  4 siblings, 1 reply; 28+ messages in thread
From: Tomasz Chmielewski @ 2008-02-27 11:20 UTC (permalink / raw)
  To: Theodore Tso, Andi Kleen, LKML, LKML

Theodore Tso schrieb:
> On Mon, Feb 18, 2008 at 03:03:44PM +0100, Andi Kleen wrote:
>> Tomasz Chmielewski <mangoo@wpkg.org> writes:
>>> Is it normal to expect the write speed go down to only few dozens of
>>> kilobytes/s? Is it because of that many seeks? Can it be somehow
>>> optimized? 
>> I have similar problems on my linux source partition which also
>> has a lot of hard linked files (although probably not quite
>> as many as you do). It seems like hard linking prevents
>> some of the heuristics ext* uses to generate non fragmented
>> disk layouts and the resulting seeking makes things slow.

A follow-up to this thread.
Small optimizations like playing with /proc/sys/vm/* didn't help much; 
increasing the "commit=" ext3 mount option helped only a tiny bit.
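
For the record, the commit change was just something like this, with
progressively larger values:

mount -o remount,commit=120 /mnt/iscsi_backup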

What *did* help a lot was... disabling the internal bitmap of the RAID-5 
array. "rm -rf" doesn't "pause" for several seconds any more.

If md and dm supported barriers, it would be even better I guess (I 
could enable write cache with some degree of confidence).


This is "iostat sda -d 10" output without the internal bitmap.
The system mostly tries to read (Blk_read/s), and once in a while it
does a big commit (Blk_wrtn/s):

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             164,67      2088,62         0,00      20928          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             180,12      1999,60         0,00      20016          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             172,63      2587,01         0,00      25896          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             156,53      2054,64         0,00      20608          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             170,20      3013,60         0,00      30136          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             119,46      1377,25      5264,67      13800      52752

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             154,05      1897,10         0,00      18952          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             197,70      2177,02         0,00      21792          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             166,47      1805,19         0,00      18088          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             150,95      1552,05         0,00      15536          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             158,44      1792,61         0,00      17944          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             132,47      1399,40      3781,82      14008      37856



With the bitmap enabled, it sometimes behaves similarly, but mostly I
can see reads competing with writes, and both have very low numbers then:

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             112,57       946,11      5837,13       9480      58488

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             157,24      1858,94         0,00      18608          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             116,90      1173,60        44,00      11736        440

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              24,05        85,43       172,46        856       1728

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              25,60        90,40       165,60        904       1656

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              25,05       276,25       180,44       2768       1808

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              22,70        65,60       229,60        656       2296

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              21,66       202,79       786,43       2032       7880

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              20,90        83,20      1800,00        832      18000

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              51,75       237,36       479,52       2376       4800

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              35,43       129,34       245,91       1296       2464

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              34,50        88,00       270,40        880       2704


Now, let's disable the bitmap in the RAID-5 array:

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             110,59       536,26       973,43       5368       9744

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             119,68       533,07      1574,43       5336      15760

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             123,78       368,43      2335,26       3688      23376

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             122,48       315,68      1990,01       3160      19920

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             117,08       580,22      1009,39       5808      10104

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             119,50       324,00      1080,80       3240      10808

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             118,36       353,69      1926,55       3544      19304


And let's enable it again - after a while, it degrades again, and I can
see "rm -rf" stopping for longer periods:

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             162,70      2213,60         0,00      22136          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             165,73      1639,16         0,00      16408          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             119,76      1192,81      3722,16      11952      37296

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             178,70      1855,20         0,00      18552          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             162,64      1528,07         0,80      15296          8

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             182,87      2082,07         0,00      20904          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             168,93      1692,71         0,00      16944          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             177,45      1572,06         0,00      15752          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             123,10      1436,00      4941,60      14360      49416

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             201,30      1984,03         0,00      19880          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             165,50      1555,20        22,40      15552        224

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              25,35       273,05       189,22       2736       1896

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              22,58        63,94       165,43        640       1656

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              69,40       435,20       262,40       4352       2624



There is a related thread (although not much kernel-related) on a 
BackupPC mailing list:

http://thread.gmane.org/gmane.comp.sysutils.backup.backuppc.general/14009

It's the BackupPC software which creates this number of hardlinks (but hey, 
I can keep ~14 TB of data on a 1.2 TB filesystem which is not even 65% 
full).


-- 
Tomasz Chmielewski
http://wpkg.org

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-27 11:20     ` Tomasz Chmielewski
@ 2008-02-27 20:03       ` Andreas Dilger
  2008-02-27 20:25         ` Tomasz Chmielewski
  2008-03-01 20:04         ` Bill Davidsen
  0 siblings, 2 replies; 28+ messages in thread
From: Andreas Dilger @ 2008-02-27 20:03 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: Theodore Tso, Andi Kleen, LKML, LKML, linux-raid

I'm CCing the linux-raid mailing list, since I suspect they will be
interested in this result.

I suspect that the "journal-guided RAID recovery" mechanism developed
at U. Wisconsin may significantly benefit this workload, because the
filesystem journal is already recording all of these block numbers
and the MD bitmap mechanism is pure overhead.

On Feb 27, 2008  12:20 +0100, Tomasz Chmielewski wrote:
> Theodore Tso schrieb:
>> On Mon, Feb 18, 2008 at 03:03:44PM +0100, Andi Kleen wrote:
>>> Tomasz Chmielewski <mangoo@wpkg.org> writes:
>>>> Is it normal to expect the write speed go down to only few dozens of
>>>> kilobytes/s? Is it because of that many seeks? Can it be somehow
>>>> optimized? 
>>>
>>> I have similar problems on my linux source partition which also
>>> has a lot of hard linked files (although probably not quite
>>> as many as you do). It seems like hard linking prevents
>>> some of the heuristics ext* uses to generate non fragmented
>>> disk layouts and the resulting seeking makes things slow.
>
> A follow-up to this thread.
> Using small optimizations like playing with /proc/sys/vm/* didn't help 
> much, increasing "commit=" ext3 mount option helped only a tiny bit.
>
> What *did* help a lot was... disabling the internal bitmap of the RAID-5 
> array. "rm -rf" doesn't "pause" for several seconds any more.
>
> If md and dm supported barriers, it would be even better I guess (I could 
> enable write cache with some degree of confidence).
>
> This is "iostat sda -d 10" output without the internal bitmap.
> The system mostly tries to read (Blk_read/s), and once in a while it
> does a big commit (Blk_wrtn/s):
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             164,67      2088,62         0,00      20928          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             180,12      1999,60         0,00      20016          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             172,63      2587,01         0,00      25896          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             156,53      2054,64         0,00      20608          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             170,20      3013,60         0,00      30136          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             119,46      1377,25      5264,67      13800      52752
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             154,05      1897,10         0,00      18952          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             197,70      2177,02         0,00      21792          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             166,47      1805,19         0,00      18088          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             150,95      1552,05         0,00      15536          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             158,44      1792,61         0,00      17944          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             132,47      1399,40      3781,82      14008      37856
>
>
>
> With the bitmap enabled, it sometimes behave similarly, but mostly, I
> can see as reads compete with writes, and both have very low numbers then:
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             112,57       946,11      5837,13       9480      58488
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             157,24      1858,94         0,00      18608          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             116,90      1173,60        44,00      11736        440
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              24,05        85,43       172,46        856       1728
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              25,60        90,40       165,60        904       1656
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              25,05       276,25       180,44       2768       1808
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              22,70        65,60       229,60        656       2296
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              21,66       202,79       786,43       2032       7880
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              20,90        83,20      1800,00        832      18000
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              51,75       237,36       479,52       2376       4800
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              35,43       129,34       245,91       1296       2464
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              34,50        88,00       270,40        880       2704
>
>
> Now, let's disable the bitmap in the RAID-5 array:
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             110,59       536,26       973,43       5368       9744
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             119,68       533,07      1574,43       5336      15760
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             123,78       368,43      2335,26       3688      23376
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             122,48       315,68      1990,01       3160      19920
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             117,08       580,22      1009,39       5808      10104
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             119,50       324,00      1080,80       3240      10808
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             118,36       353,69      1926,55       3544      19304
>
>
> And let's enable it again - after a while, it degrades again, and I can
> see "rm -rf" stall for longer periods:
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             162,70      2213,60         0,00      22136          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             165,73      1639,16         0,00      16408          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             119,76      1192,81      3722,16      11952      37296
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             178,70      1855,20         0,00      18552          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             162,64      1528,07         0,80      15296          8
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             182,87      2082,07         0,00      20904          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             168,93      1692,71         0,00      16944          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             177,45      1572,06         0,00      15752          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             123,10      1436,00      4941,60      14360      49416
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             201,30      1984,03         0,00      19880          0
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             165,50      1555,20        22,40      15552        224
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              25,35       273,05       189,22       2736       1896
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              22,58        63,94       165,43        640       1656
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda              69,40       435,20       262,40       4352       2624
>
>
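A minimal sketch of how such an on/off comparison can be driven from the
shell (sda and /dev/md0 are placeholder names, and these are not the
exact commands used in this thread):

  iostat sda 10 &                          # one sample every 10 seconds
  mdadm --grow /dev/md0 --bitmap=none      # rm run with the bitmap off
  mdadm --grow /dev/md0 --bitmap=internal  # and again with it back on

Both --bitmap operations work on a live array, so the two runs can be
compared without recreating the array or the filesystem.
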
>
> There is a related thread (although not very kernel-related) on a BackupPC 
> mailing list:
>
> http://thread.gmane.org/gmane.comp.sysutils.backup.backuppc.general/14009
>
> It's the BackupPC software which creates this many hardlinks (but hey, I 
> can keep ~14 TB of data on a 1.2 TB filesystem which is not even 65% full).
>
>
> -- 
> Tomasz Chmielewski
> http://wpkg.org

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-27 20:03       ` Andreas Dilger
@ 2008-02-27 20:25         ` Tomasz Chmielewski
  2008-03-01 20:04         ` Bill Davidsen
  1 sibling, 0 replies; 28+ messages in thread
From: Tomasz Chmielewski @ 2008-02-27 20:25 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Theodore Tso, Andi Kleen, LKML, LKML, linux-raid

Andreas Dilger schrieb:
> I'm CCing the linux-raid mailing list, since I suspect they will be
> interested in this result.
> 
> I would suspect that the "journal guided RAID recovery" mechanism
> developed by U.Wisconsin may significantly benefit this workload
> because the filesystem journal is already recording all of these
> block numbers and the MD bitmap mechanism is pure overhead.

Also, using the anticipatory IO scheduler seems to be the best option for 
an array with lots of seeks and random reads and writes (quite 
surprisingly, closely followed by NOOP - both behaved much better than 
deadline or CFQ).
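
A quick sketch of how the scheduler can be inspected and switched per
disk through sysfs on 2.6 kernels of this era (sda is just an example
device name):

  cat /sys/block/sda/queue/scheduler    # active scheduler shown in [brackets]
  echo anticipatory > /sys/block/sda/queue/scheduler
  # or set a default for all disks at boot time: elevator=anticipatory

The runtime change is lost at reboot, so a boot parameter or init script
is needed to make it stick.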

Here are some numbers I posted to the BackupPC mailing list:

http://thread.gmane.org/gmane.comp.sysutils.backup.backuppc.general/14009



-- 
Tomasz Chmielewski
http://wpkg.org

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
  2008-02-27 20:03       ` Andreas Dilger
  2008-02-27 20:25         ` Tomasz Chmielewski
@ 2008-03-01 20:04         ` Bill Davidsen
  1 sibling, 0 replies; 28+ messages in thread
From: Bill Davidsen @ 2008-03-01 20:04 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Tomasz Chmielewski, Theodore Tso, Andi Kleen, LKML, LKML, linux-raid

Andreas Dilger wrote:
> I'm CCing the linux-raid mailing list, since I suspect they will be
> interested in this result.
>
> I would suspect that the "journal guided RAID recovery" mechanism
> developed by U.Wisconsin may significantly benefit this workload
> because the filesystem journal is already recording all of these
> block numbers and the MD bitmap mechanism is pure overhead.
>   

Thanks for sharing these numbers. I think use of a bitmap is one of 
those things which people have to configure to match their use: a 
larger bitmap chunk seems to reduce the delays, and an external bitmap 
certainly helps, especially on an SSD. But on a large array without a 
bitmap, performance can be compromised for hours during recovery, so 
the administrator must decide whether normal-case performance is more 
important than worst-case performance.
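
A rough sketch of that kind of tuning with mdadm (the array name
/dev/md0, the chunk size and the bitmap file path are only examples,
not values from this thread):

  mdadm --grow /dev/md0 --bitmap=none              # drop the current bitmap
  mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=131072
      # a larger chunk means far fewer bitmap updates during normal IO,
      # at the cost of more resync work after an unclean shutdown
  # or keep the bitmap in a file on a separate, faster device:
  mdadm --grow /dev/md0 --bitmap=/var/lib/md0.bitmap

An external bitmap file must live on a filesystem that is not itself on
the array, and must be available whenever the array is assembled.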

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismarck 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: very poor ext3 write performance on big filesystems?
       [not found]         ` <9Yg6H-2DJ-23@gated-at.bofh.it>
@ 2008-02-19 13:14           ` Paul Slootman
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Slootman @ 2008-02-19 13:14 UTC (permalink / raw)
  To: linux-kernel

On Mon 18 Feb 2008, Andi Kleen wrote:
> On Mon, Feb 18, 2008 at 10:16:32AM -0500, Theodore Tso wrote:
> > On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote:
> > > I tried to copy that filesystem once (when it was much smaller) with "rsync 
> > > -a -H", but after 3 days, rsync was still building an index and didn't copy 
> > > any file.
> > 
> > If you're going to copy the whole filesystem don't use rsync! 
> 
> Yes, I managed to kill systems (drive them really badly into oom and
> get very long swap storms) with rsync -H in the past too. Something is very 
> wrong with the rsync implementation of this.

Note that the soon-to-be-released version 3.0.0 of rsync has much
improved performance here, both in speed and memory usage, thanks to a
new incremental transfer protocol (previously it read in the complete
file list, first on the source and then on the target, and only then
started the actual work).
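
A small usage sketch (paths are made up; both ends need rsync 3.0.0 or
newer, otherwise it falls back to the old whole-list behaviour):

  rsync -aH --delete /mnt/backup/ remotehost:/mnt/backup-copy/
  # add --no-inc-recursive to force the old behaviour, e.g. to compare
  # memory usage between the two modes

Hard links passed with -H are still preserved correctly; the main win
is that rsync no longer has to hold the complete file list in memory
before the first file is transferred.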


Paul Slootman

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2008-03-01 20:00 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-18 12:57 very poor ext3 write performance on big filesystems? Tomasz Chmielewski
2008-02-18 14:03 ` Andi Kleen
2008-02-18 14:16   ` Theodore Tso
2008-02-18 15:02     ` Tomasz Chmielewski
2008-02-18 15:16       ` Theodore Tso
2008-02-18 15:57         ` Andi Kleen
2008-02-18 15:35           ` Theodore Tso
2008-02-20 10:57             ` Jan Engelhardt
2008-02-20 17:44               ` David Rees
2008-02-20 18:08                 ` Jan Engelhardt
2008-02-18 16:16         ` Tomasz Chmielewski
2008-02-18 18:45           ` Theodore Tso
2008-02-18 15:18     ` Andi Kleen
2008-02-18 15:03       ` Theodore Tso
2008-02-19 14:54     ` Tomasz Chmielewski
2008-02-19 15:06       ` Chris Mason
2008-02-19 15:21         ` Tomasz Chmielewski
2008-02-19 16:04           ` Chris Mason
2008-02-19 18:29     ` Mark Lord
2008-02-19 18:41       ` Mark Lord
2008-02-19 18:58       ` Paulo Marques
2008-02-19 22:33         ` Mark Lord
2008-02-27 11:20     ` Tomasz Chmielewski
2008-02-27 20:03       ` Andreas Dilger
2008-02-27 20:25         ` Tomasz Chmielewski
2008-03-01 20:04         ` Bill Davidsen
2008-02-19  9:24 ` Vladislav Bolkhovitin
     [not found] <9YdLC-75W-51@gated-at.bofh.it>
     [not found] ` <9YeRh-Gq-39@gated-at.bofh.it>
     [not found]   ` <9Yf0W-SX-19@gated-at.bofh.it>
     [not found]     ` <9YfNi-2da-23@gated-at.bofh.it>
     [not found]       ` <9YfWL-2pZ-1@gated-at.bofh.it>
     [not found]         ` <9Yg6H-2DJ-23@gated-at.bofh.it>
2008-02-19 13:14           ` Paul Slootman
