* [PATCH] Re: idr in Samba4
       [not found] <16759.16648.459393.752417@samba.org>
@ 2004-10-21 18:32 ` Jim Houston
  2004-10-22  6:17   ` tridge
  2004-11-19  7:38   ` performance of filesystem xattrs with Samba4 tridge
  0 siblings, 2 replies; 41+ messages in thread
From: Jim Houston @ 2004-10-21 18:32 UTC (permalink / raw)
To: tridge, Andrew Morton; +Cc: linux-kernel

On Thu, 2004-10-21 at 00:54, tridge@samba.org wrote:

> Apart from converting idr to use our pool allocator, and some other
> minor user-space tweaks, the only significant change I've made is to
> add a idr_find() call at the top of idr_remove() to catch possible
> errors where idr_remove() is called multiple times. Obviously this is
> programmer error if it happens, but I didn't like the default
> behaviour (I saw corruption in the tree without this check).

Hi Tridge, Andrew,

Tridge, thanks for your note.  I'm glad to hear you are using idr.c.
I agree with your concerns about idr_remove().  It really should fail
gracefully and warn if the id being removed is not valid.  The
attached patch against linux-2.6.9 should do the job without
additional overhead.  Andrew, I hope you will add this patch to your
tree.

With the existing code, removing an id which was not allocated could
remove a valid id which shares the same lowest layer of the radix
tree.  I ran a kernel with this patch but have not done any tests to
force a failure.

Jim Houston - Concurrent Computer Corp.
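The hazard being fixed can be seen with a much smaller model: a single-layer bitmap pool. What follows is a userspace sketch, not the idr code itself (the pool names are invented here); it shows the same test-before-clear guard the patch adds to sub_remove():

```c
#include <stdio.h>

/* Toy single-layer id pool: one 32-bit bitmap, ids 0..31.  Mirrors
 * the shape of the fix: remove must check that the bit is actually
 * set before clearing, instead of blindly clearing it. */
static unsigned int pool_bitmap;

int pool_get(void)
{
    for (int id = 0; id < 32; id++) {
        if (!(pool_bitmap & (1u << id))) {
            pool_bitmap |= 1u << id;
            return id;
        }
    }
    return -1;              /* pool exhausted */
}

/* Guarded remove, as in the patch: warn and fail instead of
 * corrupting state when the id was never allocated. */
int pool_remove(int id)
{
    if (id < 0 || id >= 32 || !(pool_bitmap & (1u << id))) {
        fprintf(stderr, "remove of unallocated id=%d\n", id);
        return -1;
    }
    pool_bitmap &= ~(1u << id);
    return 0;
}
```

A double pool_remove() of the same id now fails loudly on the second call rather than silently succeeding; in the real idr code the unguarded __clear_bit() walk could additionally drop a layer still holding live ids, which is the corruption described above.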
--- linux-2.6.9/lib/idr.c.orig	2004-10-21 12:57:24.547106092 -0400
+++ linux-2.6.9/lib/idr.c	2004-10-21 13:09:28.984974796 -0400
@@ -277,24 +277,31 @@
 }
 EXPORT_SYMBOL(idr_get_new);
 
+static void idr_remove_warning(int id)
+{
+	printk("idr_remove called for id=%d which is not allocated.\n", id);
+	dump_stack();
+}
+
 static void sub_remove(struct idr *idp, int shift, int id)
 {
 	struct idr_layer *p = idp->top;
 	struct idr_layer **pa[MAX_LEVEL];
 	struct idr_layer ***paa = &pa[0];
+	int n;
 
 	*paa = NULL;
 	*++paa = &idp->top;
 
 	while ((shift > 0) && p) {
-		int n = (id >> shift) & IDR_MASK;
+		n = (id >> shift) & IDR_MASK;
 		__clear_bit(n, &p->bitmap);
 		*++paa = &p->ary[n];
 		p = p->ary[n];
 		shift -= IDR_BITS;
 	}
-	if (likely(p != NULL)){
-		int n = id & IDR_MASK;
+	n = id & IDR_MASK;
+	if (likely(p != NULL && test_bit(n, &p->bitmap))){
 		__clear_bit(n, &p->bitmap);
 		p->ary[n] = NULL;
 		while(*paa && ! --((**paa)->count)){
@@ -303,6 +310,8 @@
 		}
 		if ( ! *paa )
 			idp->layers = 0;
+	} else {
+		idr_remove_warning(id);
 	}
 }
* Re: [PATCH] Re: idr in Samba4
  2004-10-21 18:32 ` [PATCH] Re: idr in Samba4 Jim Houston
@ 2004-10-22  6:17   ` tridge
  2004-11-19  7:38   ` performance of filesystem xattrs with Samba4 tridge
  1 sibling, 0 replies; 41+ messages in thread
From: tridge @ 2004-10-22  6:17 UTC (permalink / raw)
To: Jim Houston; +Cc: Andrew Morton, linux-kernel

Jim,

> The attached patch against linux-2.6.9 should do the job without
> additional overhead.  Andrew, I hope you will add this patch to
> your tree.

Thanks, that looks good, and it now passes my randomized testsuite.
If you are interested, my test code is at:

  http://samba.org/ftp/unpacked/junkcode/idtree/

Note that I made idr_remove() and sub_remove() return an int for
success/failure, as that was more useful for my code, and it also
means we skip the layer free logic on remove failure (not that it
does any harm, just seems a bit of a loose end).

Cheers, Tridge
* performance of filesystem xattrs with Samba4
  2004-10-21 18:32 ` [PATCH] Re: idr in Samba4 Jim Houston
  2004-10-22  6:17   ` tridge
@ 2004-11-19  7:38   ` tridge
  2004-11-19  8:08     ` James Morris
                       ` (4 more replies)
  1 sibling, 5 replies; 41+ messages in thread
From: tridge @ 2004-11-19  7:38 UTC (permalink / raw)
To: linux-kernel

I've been developing the posix backend for Samba4 over the last few
months.  It has now reached the stage where it is passing most of the
test suites, so it's time to start some performance testing.

The biggest change from the kernel's point of view is that Samba4
makes extensive use of filesystem xattrs.  Almost every file will have
a user.DosAttrib xattr containing file attributes and additional
timestamp fields.  A lot of files will also have a system.NTACL
attribute containing an NT ACL, and many files will have a
user.DosStreams xattr for NT alternate data streams.  Some rare files
will have a user.DosEAs xattr for DOS extended attribute support.
Files with streams will also have separate xattrs for each NT stream.

I started some simple benchmarking today using the BENCH-NBENCH
smbtorture benchmark, with 10 simulated clients and loopback
networking on a dual Xeon server with 2G ram and a 50G scsi partition.
I used a 2.6.10-rc2 kernel.  This benchmark only involves a
user.DosAttrib xattr of size 44 on every file (that will be the most
common situation in production use).

  ext2               68 MB/sec
  ext2+xattr         64 MB/sec

  ext3               67 MB/sec
  ext3+xattr         58 MB/sec

  xfs                62 MB/sec
  xfs+xattr          40 MB/sec
  xfs+2Kinode        63 MB/sec
  xfs+xattr+2Kinode  58 MB/sec

  tmpfs              69 MB/sec
  tmpfs+xattr        ?? MB/sec  (failed)

  jfs                36 MB/sec
  jfs+xattr          29 MB/sec

  reiser             58 MB/sec
  reiser+xattr       44 MB/sec

To get the ext2/ext3 results I needed to add "return NULL;" at the
start of ext3_xattr_cache_find() to avoid a bug in the xattr sharing
code that causes an oops (I've reported the oops separately).

The tmpfs+xattr failure above is because tmpfs didn't seem to allow
user xattrs, despite having CONFIG_TMPFS_XATTR=y.
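The per-file tagging this workload exercises is just a setxattr(2) call on each created file.  A minimal Linux-only sketch (the helper name and the zeroed 44-byte payload are placeholders, not Samba4's real NDR-encoded DosAttrib blob):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/xattr.h>

/* Tag a file the way Samba4 tags every file it creates: one small
 * user.DosAttrib xattr.  Returns 0 on success, -1 on error. */
int tag_file(const char *path)
{
    char blob[44];
    memset(blob, 0, sizeof(blob));      /* placeholder payload */

    if (setxattr(path, "user.DosAttrib", blob, sizeof(blob), 0) < 0) {
        if (errno == ENOTSUP)
            fprintf(stderr, "%s: filesystem lacks user xattrs "
                            "(the tmpfs case above)\n", path);
        return -1;
    }
    return 0;
}
```

On a filesystem without a 'user' xattr handler the call fails with ENOTSUP, which is exactly the tmpfs+xattr failure in the table above.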
I'm very impressed that ext3 has improved so much since I last did
Samba benchmarks.  It used to always be the slowest in my tests, but
now it is the fastest journaled filesystem for Samba4, almost matching
tmpfs.

The XFS results with default options are rather disappointing, as XFS
has usually been a good performer for Samba workloads.  Increasing the
inode size to 2k brought it back to a more reasonable level.

The high cost of xattr support is a bit of a problem.  In the above,
xattrs were enabled in the filesystems for all runs, the difference
being whether I told Samba4 to use them or not.  I hope we can reduce
the cost of xattrs, as otherwise Samba4 is going to be seriously
disadvantaged when full windows compatibility is needed.  I'm guessing
that nearly all Samba installs will be using xattrs by this time next
year, as we can't do basic security features like WinXP security zones
without them, so making them perform well will be important.

To make it easier to benchmark with xattrs, I'm planning on doing a
new version of dbench with optional xattr support.  That will allow
others to play with xattr performance for the above workload without
having to delve into the esoteric world of Samba4 development.

Apart from the 2k inode with XFS, I haven't tried any filesystem
tuning options.  I'll probably wait till I have xattr support in
dbench for that, to make large numbers of runs with different options
easier.

If anyone wants to see in detail what we are sticking in these
xattrs, then look at:

  http://samba.org/ftp/unpacked/samba4/source/librpc/idl/xattr.idl

for an IDL specification of the xattr format we are using.

Soon we'll be starting to integrate the xattr support with an LSM
module, to allow the kernel to interpret the NT ACLs directly to
avoid races, make things a little more efficient (using an xattr
cache holding unpacked ACLs), and allow for the possibility of
non-Samba file access obeying the NT ACLs.
Cheers, Tridge
* Re: performance of filesystem xattrs with Samba4
  2004-11-19  7:38 ` performance of filesystem xattrs with Samba4 tridge
@ 2004-11-19  8:08   ` James Morris
  2004-11-19 10:16   ` Andreas Dilger
                      ` (3 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: James Morris @ 2004-11-19  8:08 UTC (permalink / raw)
To: tridge; +Cc: linux-kernel

On Fri, 19 Nov 2004 tridge@samba.org wrote:

> The tmpfs+xattr failure above is because tmpfs didn't seem to allow
> user xattrs, despite having CONFIG_TMPFS_XATTR=y.

tmpfs does not have a 'user' xattr handler.  xattr support was added
to tmpfs only to provide a 'security' xattr handler which calls out
to LSM modules such as SELinux.

- James
-- 
James Morris
<jmorris@redhat.com>
* Re: performance of filesystem xattrs with Samba4
  2004-11-19  7:38 ` performance of filesystem xattrs with Samba4 tridge
  2004-11-19  8:08   ` James Morris
@ 2004-11-19 10:16   ` Andreas Dilger
  2004-11-19 11:43     ` tridge
  2004-11-22 13:02     ` tridge
  2004-11-19 12:03   ` Anton Altaparmakov
                      ` (2 subsequent siblings)
  4 siblings, 2 replies; 41+ messages in thread
From: Andreas Dilger @ 2004-11-19 10:16 UTC (permalink / raw)
To: tridge; +Cc: linux-kernel

[-- Attachment #1.1: Type: text/plain, Size: 2917 bytes --]

On Nov 19, 2004 18:38 +1100, tridge@samba.org wrote:
> I started some simple benchmarking today using the BENCH-NBENCH
> smbtorture benchmark, with 10 simulated clients and loopback
> networking on a dual Xeon server with 2G ram and a 50G scsi partition.
> I used a 2.6.10-rc2 kernel.  This benchmark only involves a
> user.DosAttrib xattr of size 44 on every file (that will be the most
> common situation in production use).
>
>   ext3               67 MB/sec
>   ext3+xattr         58 MB/sec
>
>   xfs                62 MB/sec
>   xfs+xattr          40 MB/sec
>   xfs+2Kinode        63 MB/sec
>   xfs+xattr+2Kinode  58 MB/sec

Also, we (CFS) have developed patches for ext3 + e2fsprogs to support
"fast" EAs stored in larger inodes on disk, and this can improve
performance dramatically in the case where you are accessing a large
number of inodes with EAs.  Otherwise you are storing the EAs in an
external block which requires another seek + read to access, while
the large inode EA is already in cache after you read the inode.
Also, the fact that you have to read a 4kB EA block into memory for
(in our case) a relatively small amount of data really kills the
cache.  You can select inode sizes from 128..4096 in power-of-two
sizes.

This patch also provides the infrastructure on disk for storing
e.g. nsecond and create timestamps in the ext3 large inodes, but the
actual implementation to save/load these isn't there yet.  If that
were available, would you use it instead of explicitly storing the
NTTIME in an EA?
I believe the 2.6 stat interface will support nsecond timestamps, but
I don't think there is any API to get the create time to userspace,
though we could hook this up to a pseudo EA.  The benefit of storing
these common fields in the inode instead of EAs is less overhead.

> To get the ext2/ext3 results I needed to add "return NULL;" at the
> start of ext3_xattr_cache_find() to avoid a bug in the xattr sharing
> code that causes an oops (I've reported the oops separately).

I would just configure out the xattr sharing code entirely since it
will likely do nothing but increase overhead if any of the EAs on an
inode are unique (this is the most common case, except for
POSIX-ACL-only setups).  I've attached this patch here.  I believe
all of the ext3 developers agree it should go into the kernel, just
nobody has made a push to do so.  If this helps your performance (or
even if not ;-) we'd be happy to get it into the kernel proper.

The e2fsprogs support for same can be found at
http://cvs.lustre.org:5000/ though it is mixed in with a lot of other
changesets you probably don't care about.  Relevant ones are
1.1347.1.2 (Apr 23, 2004) and 1.1421 (Sept 03, 2004).
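The in-inode EA accounting in the attached patch reduces to simple arithmetic.  A userspace sketch of it (the constants mirror the patch: a 128-byte old-style inode, a 4-byte magic word, a 4-byte zeroed terminator for the entry list, and entries padded to 4 bytes; the function names here are mine, not the patch's):

```c
#include <string.h>

#define GOOD_OLD_INODE_SIZE 128  /* original ext3 on-disk inode */
#define XATTR_MAGIC_SIZE      4  /* EXT3_XATTR_MAGIC word */
#define XATTR_PAD             4  /* entries padded to 4 bytes */
#define XATTR_ROUND  (XATTR_PAD - 1)

/* Fixed part of one EA entry (name_len, name_index, value_offs,
 * value_block, value_size, hash = 16 bytes) plus the padded name --
 * the same accounting as EXT3_XATTR_LEN() in the patch. */
static int xattr_entry_len(int name_len)
{
    return (16 + name_len + XATTR_ROUND) & ~XATTR_ROUND;
}

/* Bytes usable for EAs inside a large inode, after the old 128-byte
 * inode body, the i_extra_isize fields, the magic word and the
 * 4-byte list terminator. */
static int ibody_ea_space(int inode_size, int extra_isize)
{
    int space = inode_size - GOOD_OLD_INODE_SIZE - extra_isize
                - XATTR_MAGIC_SIZE;
    return space > 4 ? space - 4 : 0;
}

/* Would one EA of this name/value size fit in-inode? */
static int ea_fits(int inode_size, int extra_isize,
                   const char *name, int value_len)
{
    return xattr_entry_len(strlen(name)) + value_len
           <= ibody_ea_space(inode_size, extra_isize);
}
```

By this accounting a 256-byte inode with a 4-byte i_extra_isize leaves 116 bytes for in-inode EAs, comfortably holding the 44-byte DosAttrib case from the benchmark without touching an external EA block.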
Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/ [-- Attachment #1.2: ext3-ea-in-inode-2.6-suse.patch --] [-- Type: text/plain, Size: 24837 bytes --] %patch Index: linux-2.6.0/fs/ext3/ialloc.c =================================================================== --- linux-2.6.0.orig/fs/ext3/ialloc.c 2004-01-14 18:54:11.000000000 +0300 +++ linux-2.6.0/fs/ext3/ialloc.c 2004-01-14 18:54:12.000000000 +0300 @@ -627,6 +627,11 @@ inode->i_generation = EXT3_SB(sb)->s_next_generation++; ei->i_state = EXT3_STATE_NEW; + if (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) { + ei->i_extra_isize = sizeof(__u16) /* i_extra_isize */ + + sizeof(__u16); /* i_pad1 */ + } else + ei->i_extra_isize = 0; ret = inode; if(DQUOT_ALLOC_INODE(inode)) { Index: linux-2.6.0/fs/ext3/inode.c =================================================================== --- linux-2.6.0.orig/fs/ext3/inode.c 2004-01-14 18:54:12.000000000 +0300 +++ linux-2.6.0/fs/ext3/inode.c 2004-01-14 19:09:46.000000000 +0300 @@ -2339,7 +2339,7 @@ * trying to determine the inode's location on-disk and no read need be * performed. 
*/ -static int ext3_get_inode_loc(struct inode *inode, +int ext3_get_inode_loc(struct inode *inode, struct ext3_iloc *iloc, int in_mem) { unsigned long block; @@ -2547,6 +2547,11 @@ ei->i_data[block] = raw_inode->i_block[block]; INIT_LIST_HEAD(&ei->i_orphan); + if (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) + ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize); + else + ei->i_extra_isize = 0; + if (S_ISREG(inode->i_mode)) { inode->i_op = &ext3_file_inode_operations; inode->i_fop = &ext3_file_operations; @@ -2682,6 +2687,9 @@ } else for (block = 0; block < EXT3_N_BLOCKS; block++) raw_inode->i_block[block] = ei->i_data[block]; + if (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) + raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize); + BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata"); rc = ext3_journal_dirty_metadata(handle, bh); if (!err) Index: linux-2.6.0/fs/ext3/xattr.c =================================================================== --- linux-2.6.0.orig/fs/ext3/xattr.c 2003-12-30 08:33:13.000000000 +0300 +++ linux-2.6.0/fs/ext3/xattr.c 2004-01-14 18:54:12.000000000 +0300 @@ -246,17 +246,12 @@ } /* - * ext3_xattr_get() - * - * Copy an extended attribute into the buffer - * provided, or compute the buffer size required. - * Buffer is NULL to compute the size of the buffer required. + * ext3_xattr_block_get() * - * Returns a negative error number on failure, or the number of bytes - * used / required on success. 
+ * routine looks for attribute in EA block and returns it's value and size */ int -ext3_xattr_get(struct inode *inode, int name_index, const char *name, +ext3_xattr_block_get(struct inode *inode, int name_index, const char *name, void *buffer, size_t buffer_size) { struct buffer_head *bh = NULL; @@ -270,7 +265,6 @@ if (name == NULL) return -EINVAL; - down_read(&EXT3_I(inode)->xattr_sem); error = -ENODATA; if (!EXT3_I(inode)->i_file_acl) goto cleanup; @@ -343,15 +337,87 @@ cleanup: brelse(bh); - up_read(&EXT3_I(inode)->xattr_sem); return error; } /* - * ext3_xattr_list() + * ext3_xattr_ibode_get() * - * Copy a list of attribute names into the buffer + * routine looks for attribute in inode body and returns it's value and size + */ +int +ext3_xattr_ibody_get(struct inode *inode, int name_index, const char *name, + void *buffer, size_t buffer_size) +{ + int size, name_len = strlen(name), storage_size; + struct ext3_xattr_entry *last; + struct ext3_inode *raw_inode; + struct ext3_iloc iloc; + char *start, *end; + int ret = -ENOENT; + + if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE) + return -ENOENT; + + ret = ext3_get_inode_loc(inode, &iloc, 1); + if (ret) + return ret; + raw_inode = ext3_raw_inode(&iloc); + + storage_size = EXT3_SB(inode->i_sb)->s_inode_size - + EXT3_GOOD_OLD_INODE_SIZE - + EXT3_I(inode)->i_extra_isize - + sizeof(__u32); + start = (char *) raw_inode + EXT3_GOOD_OLD_INODE_SIZE + + EXT3_I(inode)->i_extra_isize; + if (le32_to_cpu((*(__u32*) start)) != EXT3_XATTR_MAGIC) { + brelse(iloc.bh); + return -ENOENT; + } + start += sizeof(__u32); + end = (char *) raw_inode + EXT3_SB(inode->i_sb)->s_inode_size; + + last = (struct ext3_xattr_entry *) start; + while (!IS_LAST_ENTRY(last)) { + struct ext3_xattr_entry *next = EXT3_XATTR_NEXT(last); + if (le32_to_cpu(last->e_value_size) > storage_size || + (char *) next >= end) { + ext3_error(inode->i_sb, "ext3_xattr_ibody_get", + "inode %ld", inode->i_ino); + brelse(iloc.bh); + return -EIO; + } + 
if (name_index == last->e_name_index && + name_len == last->e_name_len && + !memcmp(name, last->e_name, name_len)) + goto found; + last = next; + } + + /* can't find EA */ + brelse(iloc.bh); + return -ENOENT; + +found: + size = le32_to_cpu(last->e_value_size); + if (buffer) { + ret = -ERANGE; + if (buffer_size >= size) { + memcpy(buffer, start + le16_to_cpu(last->e_value_offs), + size); + ret = size; + } + } else + ret = size; + brelse(iloc.bh); + return ret; +} + +/* + * ext3_xattr_get() + * + * Copy an extended attribute into the buffer * provided, or compute the buffer size required. * Buffer is NULL to compute the size of the buffer required. * @@ -359,7 +425,31 @@ * used / required on success. */ int -ext3_xattr_list(struct inode *inode, char *buffer, size_t buffer_size) +ext3_xattr_get(struct inode *inode, int name_index, const char *name, + void *buffer, size_t buffer_size) +{ + int err; + + down_read(&EXT3_I(inode)->xattr_sem); + + /* try to find attribute in inode body */ + err = ext3_xattr_ibody_get(inode, name_index, name, + buffer, buffer_size); + if (err < 0) + /* search was unsuccessful, try to find EA in dedicated block */ + err = ext3_xattr_block_get(inode, name_index, name, + buffer, buffer_size); + up_read(&EXT3_I(inode)->xattr_sem); + + return err; +} + +/* ext3_xattr_ibody_list() + * + * generate list of attributes stored in EA block + */ +int +ext3_xattr_block_list(struct inode *inode, char *buffer, size_t buffer_size) { struct buffer_head *bh = NULL; struct ext3_xattr_entry *entry; @@ -370,7 +460,6 @@ ea_idebug(inode, "buffer=%p, buffer_size=%ld", buffer, (long)buffer_size); - down_read(&EXT3_I(inode)->xattr_sem); error = 0; if (!EXT3_I(inode)->i_file_acl) goto cleanup; @@ -431,11 +520,138 @@ cleanup: brelse(bh); - up_read(&EXT3_I(inode)->xattr_sem); return error; } +/* ext3_xattr_ibody_list() + * + * generate list of attributes stored in inode body + */ +int +ext3_xattr_ibody_list(struct inode *inode, char *buffer, size_t buffer_size) +{ + 
struct ext3_xattr_entry *last; + struct ext3_inode *raw_inode; + char *start, *end, *buf; + struct ext3_iloc iloc; + int storage_size; + int ret; + int size = 0; + + if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE) + return 0; + + ret = ext3_get_inode_loc(inode, &iloc, 1); + if (ret) + return ret; + raw_inode = ext3_raw_inode(&iloc); + + storage_size = EXT3_SB(inode->i_sb)->s_inode_size - + EXT3_GOOD_OLD_INODE_SIZE - + EXT3_I(inode)->i_extra_isize - + sizeof(__u32); + start = (char *) raw_inode + EXT3_GOOD_OLD_INODE_SIZE + + EXT3_I(inode)->i_extra_isize; + if (le32_to_cpu((*(__u32*) start)) != EXT3_XATTR_MAGIC) { + brelse(iloc.bh); + return 0; + } + start += sizeof(__u32); + end = (char *) raw_inode + EXT3_SB(inode->i_sb)->s_inode_size; + + last = (struct ext3_xattr_entry *) start; + while (!IS_LAST_ENTRY(last)) { + struct ext3_xattr_entry *next = EXT3_XATTR_NEXT(last); + struct ext3_xattr_handler *handler; + if (le32_to_cpu(last->e_value_size) > storage_size || + (char *) next >= end) { + ext3_error(inode->i_sb, "ext3_xattr_ibody_list", + "inode %ld", inode->i_ino); + brelse(iloc.bh); + return -EIO; + } + handler = ext3_xattr_handler(last->e_name_index); + if (handler) + size += handler->list(NULL, inode, last->e_name, + last->e_name_len); + last = next; + } + + if (!buffer) { + ret = size; + goto cleanup; + } else { + ret = -ERANGE; + if (size > buffer_size) + goto cleanup; + } + + last = (struct ext3_xattr_entry *) start; + buf = buffer; + while (!IS_LAST_ENTRY(last)) { + struct ext3_xattr_entry *next = EXT3_XATTR_NEXT(last); + struct ext3_xattr_handler *handler; + handler = ext3_xattr_handler(last->e_name_index); + if (handler) + buf += handler->list(buf, inode, last->e_name, + last->e_name_len); + last = next; + } + ret = size; +cleanup: + brelse(iloc.bh); + return ret; +} + +/* + * ext3_xattr_list() + * + * Copy a list of attribute names into the buffer + * provided, or compute the buffer size required. 
+ * Buffer is NULL to compute the size of the buffer required. + * + * Returns a negative error number on failure, or the number of bytes + * used / required on success. + */ +int +ext3_xattr_list(struct inode *inode, char *buffer, size_t buffer_size) +{ + int error; + int size = buffer_size; + + down_read(&EXT3_I(inode)->xattr_sem); + + /* get list of attributes stored in inode body */ + error = ext3_xattr_ibody_list(inode, buffer, buffer_size); + if (error < 0) { + /* some error occured while collecting + * attributes in inode body */ + size = 0; + goto cleanup; + } + size = error; + + /* get list of attributes stored in dedicated block */ + if (buffer) { + buffer_size -= error; + if (buffer_size <= 0) { + buffer = NULL; + buffer_size = 0; + } else + buffer += error; + } + + error = ext3_xattr_block_list(inode, buffer, buffer_size); + if (error < 0) + /* listing was successful, so we return len */ + size = 0; + +cleanup: + up_read(&EXT3_I(inode)->xattr_sem); + return error + size; +} + /* * If the EXT3_FEATURE_COMPAT_EXT_ATTR feature of this file system is * not set, set it. 
@@ -457,6 +673,279 @@ } /* + * ext3_xattr_ibody_find() + * + * search attribute and calculate free space in inode body + * NOTE: free space includes space our attribute hold + */ +int +ext3_xattr_ibody_find(struct inode *inode, int name_index, + const char *name, struct ext3_xattr_entry *rentry, int *free) +{ + struct ext3_xattr_entry *last; + struct ext3_inode *raw_inode; + int name_len = strlen(name); + int err, storage_size; + struct ext3_iloc iloc; + char *start, *end; + int ret = -ENOENT; + + if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE) + return ret; + + err = ext3_get_inode_loc(inode, &iloc, 1); + if (err) + return -EIO; + raw_inode = ext3_raw_inode(&iloc); + + storage_size = EXT3_SB(inode->i_sb)->s_inode_size - + EXT3_GOOD_OLD_INODE_SIZE - + EXT3_I(inode)->i_extra_isize - + sizeof(__u32); + *free = storage_size - sizeof(__u32); + start = (char *) raw_inode + EXT3_GOOD_OLD_INODE_SIZE + + EXT3_I(inode)->i_extra_isize; + if (le32_to_cpu((*(__u32*) start)) != EXT3_XATTR_MAGIC) { + brelse(iloc.bh); + return -ENOENT; + } + start += sizeof(__u32); + end = (char *) raw_inode + EXT3_SB(inode->i_sb)->s_inode_size; + + last = (struct ext3_xattr_entry *) start; + while (!IS_LAST_ENTRY(last)) { + struct ext3_xattr_entry *next = EXT3_XATTR_NEXT(last); + if (le32_to_cpu(last->e_value_size) > storage_size || + (char *) next >= end) { + ext3_error(inode->i_sb, "ext3_xattr_ibody_find", + "inode %ld", inode->i_ino); + brelse(iloc.bh); + return -EIO; + } + + if (name_index == last->e_name_index && + name_len == last->e_name_len && + !memcmp(name, last->e_name, name_len)) { + memcpy(rentry, last, sizeof(struct ext3_xattr_entry)); + ret = 0; + } else { + *free -= EXT3_XATTR_LEN(last->e_name_len); + *free -= le32_to_cpu(last->e_value_size); + } + last = next; + } + + brelse(iloc.bh); + return ret; +} + +/* + * ext3_xattr_block_find() + * + * search attribute and calculate free space in EA block (if it allocated) + * NOTE: free space includes space our 
attribute hold + */ +int +ext3_xattr_block_find(struct inode *inode, int name_index, const char *name, + struct ext3_xattr_entry *rentry, int *free) +{ + struct buffer_head *bh = NULL; + struct ext3_xattr_entry *entry; + char *end; + int name_len, error = -ENOENT; + + if (!EXT3_I(inode)->i_file_acl) { + *free = inode->i_sb->s_blocksize - + sizeof(struct ext3_xattr_header) - + sizeof(__u32); + return -ENOENT; + } + ea_idebug(inode, "reading block %d", EXT3_I(inode)->i_file_acl); + bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl); + if (!bh) + return -EIO; + ea_bdebug(bh, "b_count=%d, refcount=%d", + atomic_read(&(bh->b_count)), le32_to_cpu(HDR(bh)->h_refcount)); + end = bh->b_data + bh->b_size; + if (HDR(bh)->h_magic != cpu_to_le32(EXT3_XATTR_MAGIC) || + HDR(bh)->h_blocks != cpu_to_le32(1)) { +bad_block: ext3_error(inode->i_sb, "ext3_xattr_get", + "inode %ld: bad block %d", inode->i_ino, + EXT3_I(inode)->i_file_acl); + brelse(bh); + return -EIO; + } + /* find named attribute */ + name_len = strlen(name); + *free = bh->b_size - sizeof(__u32); + + entry = FIRST_ENTRY(bh); + while (!IS_LAST_ENTRY(entry)) { + struct ext3_xattr_entry *next = + EXT3_XATTR_NEXT(entry); + if ((char *)next >= end) + goto bad_block; + if (name_index == entry->e_name_index && + name_len == entry->e_name_len && + memcmp(name, entry->e_name, name_len) == 0) { + memcpy(rentry, entry, sizeof(struct ext3_xattr_entry)); + error = 0; + } else { + *free -= EXT3_XATTR_LEN(entry->e_name_len); + *free -= le32_to_cpu(entry->e_value_size); + } + entry = next; + } + brelse(bh); + + return error; +} + +/* + * ext3_xattr_inode_set() + * + * this routine add/remove/replace attribute in inode body + */ +int +ext3_xattr_ibody_set(handle_t *handle, struct inode *inode, int name_index, + const char *name, const void *value, size_t value_len, + int flags) +{ + struct ext3_xattr_entry *last, *next, *here = NULL; + struct ext3_inode *raw_inode; + int name_len = strlen(name); + int esize = 
EXT3_XATTR_LEN(name_len); + struct buffer_head *bh; + int err, storage_size; + struct ext3_iloc iloc; + int free, min_offs; + char *start, *end; + + if (EXT3_SB(inode->i_sb)->s_inode_size <= EXT3_GOOD_OLD_INODE_SIZE) + return -ENOSPC; + + err = ext3_get_inode_loc(inode, &iloc, 1); + if (err) + return err; + raw_inode = ext3_raw_inode(&iloc); + bh = iloc.bh; + + storage_size = EXT3_SB(inode->i_sb)->s_inode_size - + EXT3_GOOD_OLD_INODE_SIZE - + EXT3_I(inode)->i_extra_isize - + sizeof(__u32); + start = (char *) raw_inode + EXT3_GOOD_OLD_INODE_SIZE + + EXT3_I(inode)->i_extra_isize; + if ((*(__u32*) start) != EXT3_XATTR_MAGIC) { + /* inode had no attributes before */ + *((__u32*) start) = cpu_to_le32(EXT3_XATTR_MAGIC); + } + start += sizeof(__u32); + end = (char *) raw_inode + EXT3_SB(inode->i_sb)->s_inode_size; + min_offs = storage_size; + free = storage_size - sizeof(__u32); + + last = (struct ext3_xattr_entry *) start; + while (!IS_LAST_ENTRY(last)) { + next = EXT3_XATTR_NEXT(last); + if (le32_to_cpu(last->e_value_size) > storage_size || + (char *) next >= end) { + ext3_error(inode->i_sb, "ext3_xattr_ibody_set", + "inode %ld", inode->i_ino); + brelse(bh); + return -EIO; + } + + if (last->e_value_size) { + int offs = le16_to_cpu(last->e_value_offs); + if (offs < min_offs) + min_offs = offs; + } + if (name_index == last->e_name_index && + name_len == last->e_name_len && + !memcmp(name, last->e_name, name_len)) + here = last; + else { + /* we calculate all but our attribute + * because it will be removed before changing */ + free -= EXT3_XATTR_LEN(last->e_name_len); + free -= le32_to_cpu(last->e_value_size); + } + last = next; + } + + if (value && (esize + value_len > free)) { + brelse(bh); + return -ENOSPC; + } + + err = ext3_reserve_inode_write(handle, inode, &iloc); + if (err) { + brelse(bh); + return err; + } + + if (here) { + /* time to remove old value */ + struct ext3_xattr_entry *e; + int size = le32_to_cpu(here->e_value_size); + int border = 
le16_to_cpu(here->e_value_offs); + char *src; + + /* move tail */ + memmove(start + min_offs + size, start + min_offs, + border - min_offs); + + /* recalculate offsets */ + e = (struct ext3_xattr_entry *) start; + while (!IS_LAST_ENTRY(e)) { + struct ext3_xattr_entry *next = EXT3_XATTR_NEXT(e); + int offs = le16_to_cpu(e->e_value_offs); + if (offs < border) + e->e_value_offs = + cpu_to_le16(offs + size); + e = next; + } + min_offs += size; + + /* remove entry */ + border = EXT3_XATTR_LEN(here->e_name_len); + src = (char *) here + EXT3_XATTR_LEN(here->e_name_len); + size = (char *) last - src; + if ((char *) here + size > end) + printk("ALERT at %s:%d: 0x%p + %d > 0x%p\n", + __FILE__, __LINE__, here, size, end); + memmove(here, src, size); + last = (struct ext3_xattr_entry *) ((char *) last - border); + *((__u32 *) last) = 0; + } + + if (value) { + int offs = min_offs - value_len; + /* use last to create new entry */ + last->e_name_len = strlen(name); + last->e_name_index = name_index; + last->e_value_offs = cpu_to_le16(offs); + last->e_value_size = cpu_to_le32(value_len); + last->e_hash = last->e_value_block = 0; + memset(last->e_name, 0, esize); + memcpy(last->e_name, name, last->e_name_len); + if (start + offs + value_len > end) + printk("ALERT at %s:%d: 0x%p + %d + %d > 0x%p\n", + __FILE__, __LINE__, start, offs, + value_len, end); + memcpy(start + offs, value, value_len); + last = EXT3_XATTR_NEXT(last); + *((__u32 *) last) = 0; + } + + ext3_mark_iloc_dirty(handle, inode, &iloc); + brelse(bh); + + return 0; +} + +/* * ext3_xattr_set_handle() * * Create, replace or remove an extended attribute for this inode. 
Buffer @@ -470,6 +959,104 @@ */ int ext3_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index, + const char *name, const void *value, size_t value_len, + int flags) +{ + struct ext3_xattr_entry entry; + int err, where = 0, found = 0, total; + int free1 = -1, free2 = -1; + int name_len; + + ea_idebug(inode, "name=%d.%s, value=%p, value_len=%ld", + name_index, name, value, (long)value_len); + + if (IS_RDONLY(inode)) + return -EROFS; + if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) + return -EPERM; + if (value == NULL) + value_len = 0; + if (name == NULL) + return -EINVAL; + name_len = strlen(name); + if (name_len > 255 || value_len > inode->i_sb->s_blocksize) + return -ERANGE; + down_write(&EXT3_I(inode)->xattr_sem); + + /* try to find attribute in inode body */ + err = ext3_xattr_ibody_find(inode, name_index, name, &entry, &free1); + if (err == 0) { + /* found EA in inode */ + found = 1; + where = 0; + } else if (err == -ENOENT) { + /* there is no such attribute in inode body */ + /* try to find attribute in dedicated block */ + err = ext3_xattr_block_find(inode, name_index, name, + &entry, &free2); + if (err != 0 && err != -ENOENT) { + /* not found EA in block */ + goto finish; + } else if (err == 0) { + /* found EA in block */ + where = 1; + found = 1; + } + } else + goto finish; + + /* check flags: may replace? may create ? 
*/ + if (found && (flags & XATTR_CREATE)) { + err = -EEXIST; + goto finish; + } else if (!found && (flags & XATTR_REPLACE)) { + err = -ENODATA; + goto finish; + } + + /* check if we have enough space to store attribute */ + total = EXT3_XATTR_LEN(strlen(name)) + value_len; + if (free1 >= 0 && total > free1 && free2 >= 0 && total > free2) { + /* have no enough space */ + err = -ENOSPC; + goto finish; + } + + /* time to remove attribute */ + if (found) { + if (where == 0) { + /* EA is stored in inode body */ + ext3_xattr_ibody_set(handle, inode, name_index, name, + NULL, 0, flags); + } else { + /* EA is stored in separated block */ + ext3_xattr_block_set(handle, inode, name_index, name, + NULL, 0, flags); + } + } + + /* try to store EA in inode body */ + err = ext3_xattr_ibody_set(handle, inode, name_index, name, + value, value_len, flags); + if (err) { + /* can't store EA in inode body */ + /* try to store in block */ + err = ext3_xattr_block_set(handle, inode, name_index, + name, value, value_len, flags); + } + +finish: + up_write(&EXT3_I(inode)->xattr_sem); + return err; +} + +/* + * ext3_xattr_block_set() + * + * this routine add/remove/replace attribute in EA block + */ +int +ext3_xattr_block_set(handle_t *handle, struct inode *inode, int name_index, const char *name, const void *value, size_t value_len, int flags) { @@ -492,22 +1078,7 @@ * towards the end of the block). * end -- Points right after the block pointed to by header. */ - - ea_idebug(inode, "name=%d.%s, value=%p, value_len=%ld", - name_index, name, value, (long)value_len); - - if (IS_RDONLY(inode)) - return -EROFS; - if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - return -EPERM; - if (value == NULL) - value_len = 0; - if (name == NULL) - return -EINVAL; name_len = strlen(name); - if (name_len > 255 || value_len > sb->s_blocksize) - return -ERANGE; - down_write(&EXT3_I(inode)->xattr_sem); if (EXT3_I(inode)->i_file_acl) { /* The inode already has an extended attribute block. 
*/ bh = sb_bread(sb, EXT3_I(inode)->i_file_acl); @@ -733,7 +1304,6 @@ brelse(bh); if (!(bh && header == HDR(bh))) kfree(header); - up_write(&EXT3_I(inode)->xattr_sem); return error; } Index: linux-2.6.0/fs/ext3/xattr.h =================================================================== --- linux-2.6.0.orig/fs/ext3/xattr.h 2003-06-24 18:04:43.000000000 +0400 +++ linux-2.6.0/fs/ext3/xattr.h 2004-01-14 18:54:12.000000000 +0300 @@ -77,7 +77,8 @@ extern int ext3_xattr_get(struct inode *, int, const char *, void *, size_t); extern int ext3_xattr_list(struct inode *, char *, size_t); extern int ext3_xattr_set(struct inode *, int, const char *, const void *, size_t, int); -extern int ext3_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int); +extern int ext3_xattr_set_handle(handle_t *, struct inode *, int, const char *,const void *,size_t,int); +extern int ext3_xattr_block_set(handle_t *, struct inode *, int, const char *,const void *,size_t,int); extern void ext3_xattr_delete_inode(handle_t *, struct inode *); extern void ext3_xattr_put_super(struct super_block *); Index: linux-2.6.0/include/linux/ext3_fs.h =================================================================== --- linux-2.6.0.orig/include/linux/ext3_fs.h 2004-01-14 18:54:11.000000000 +0300 +++ linux-2.6.0/include/linux/ext3_fs.h 2004-01-14 18:54:12.000000000 +0300 @@ -265,6 +265,8 @@ __u32 m_i_reserved2[2]; } masix2; } osd2; /* OS dependent 2 */ + __u16 i_extra_isize; + __u16 i_pad1; }; #define i_size_high i_dir_acl @@ -721,6 +723,7 @@ extern int ext3_forget(handle_t *, int, struct inode *, struct buffer_head *, int); extern struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *); extern struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *); +int ext3_get_inode_loc(struct inode *inode, struct ext3_iloc *iloc, int in_mem); extern void ext3_read_inode (struct inode *); extern void ext3_write_inode (struct inode *, int); 
Index: linux-2.6.0/include/linux/ext3_fs_i.h =================================================================== --- linux-2.6.0.orig/include/linux/ext3_fs_i.h 2003-12-30 08:32:44.000000000 +0300 +++ linux-2.6.0/include/linux/ext3_fs_i.h 2004-01-14 18:54:12.000000000 +0300 @@ -96,6 +96,9 @@ */ loff_t i_disksize; + /* on-disk additional length */ + __u16 i_extra_isize; + /* * truncate_sem is for serialising ext3_truncate() against * ext3_getblock(). In the 2.4 ext2 design, great chunks of inode's %diffstat fs/ext3/ialloc.c | 5 fs/ext3/inode.c | 10 fs/ext3/xattr.c | 634 +++++++++++++++++++++++++++++++++++++++++++--- fs/ext3/xattr.h | 3 include/linux/ext3_fs.h | 2 include/linux/ext3_fs_i.h | 3 6 files changed, 623 insertions(+), 34 deletions(-) [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-19 10:16 ` Andreas Dilger @ 2004-11-19 11:43 ` tridge 2004-11-19 22:28 ` Andreas Dilger 2004-11-22 13:02 ` tridge 1 sibling, 1 reply; 41+ messages in thread From: tridge @ 2004-11-19 11:43 UTC (permalink / raw) To: Andreas Dilger; +Cc: linux-kernel Andreas, > Also, we (CFS) have developed patches for ext3 + e2fsprogs to support > "fast" EAs stored in larger inodes on disk, and this can improve > performance dramatically in the case where you are just accessing a > large number of inodes with EAs. yep, that could help a lot. I imagine it will provide a similar benefit to the option to expand the inode size in XFS, which certainly made a huge difference. > This patch also provides the infrastructure on disk for storing e.g. > nsecond and create timestamps in the ext3 large inodes, but the actual > implementation to save/load these isn't there yet. If that were > available, would you use it instead of explicitly storing the NTTIME in > an EA? certainly! For Samba4 we need 4 timestamps (create/change/write/access), preferably all with 100ns resolution or better. All 4 timestamps need to be settable (unlike st_ctime in posix). The strategy I've adopted is this: - use st_atime and st_mtime for the access and write time fields, with nanosecond resolution if available, otherwise with 1 second resolution. It's just too expensive to update an EA on every read/write, so I didn't put these in the DosAttrib EA. - store create_time and change_time in the user.DosAttrib xattr, as 64 bit 100ns resolution times (same format as NT uses and Samba uses internally). I store change_time there as its definition is a little different from the posix ctime field (plus it's settable). If we had a settable create_time field in the inode then I'd certainly want to use it in Samba4. A non-settable one wouldn't be nearly as useful. Some win32 applications care about being able to set all the time fields (such as Excel 2003). 
This wouldn't allow us to get rid of the user.DosAttrib xattr completely though, as we stick a bunch of other stuff in there and will be expanding it soon to help with the case-insensitive speed problem. > I believe the 2.6 stat interface will support nsecond timestamps, yep, we are already using st.st_atim.tv_nsec when configure detects it. It's very useful, but the fact that ext3 doesn't store this on disk leads to potential problems when timestamps regress if inodes are ejected from the cache under memory pressure. That needs fixing. > but I don't think there is any API to get the create time to userspace > though we could hook this up to a pseudo EA. The benefit of storing > these common fields in the inode instead of EAs is less overhead. I think it would make more sense to have a new variant of utime() for setting all available timestamps, and expose all timestamps in stat. A separate API for create time seems a bit hackish. > I would just configure out the xattr sharing code entirely since it will > likely do nothing but increase overhead if any of the EAs on an inode > are unique (this is the most common case, except for POSIX-ACL-only setups). I didn't know it was configurable. I can't see any CONFIG option for it - is there some trick I've missed? > I've attached this patch here. I'll give it a go and let you know how it changes the NBENCH results. Cheers, Tridge ^ permalink raw reply [flat|nested] 41+ messages in thread
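[Editorial note: the 64-bit 100ns NT time format tridge refers to counts 100ns intervals since 1601-01-01 and converts to Unix sec/nsec timestamps with a fixed epoch offset. A minimal sketch of the conversion; the helper names are illustrative, not Samba's actual API:]

```c
#include <stdint.h>

/* Seconds between the NT epoch (1601-01-01) and the Unix epoch (1970-01-01). */
#define NTTIME_EPOCH_OFFSET 11644473600ULL

/* Pack a Unix timestamp into a 64-bit count of 100ns intervals since 1601. */
static uint64_t unix_to_nttime(uint64_t sec, uint32_t nsec)
{
	return (sec + NTTIME_EPOCH_OFFSET) * 10000000ULL + nsec / 100;
}

/* Unpack an NT time back to Unix sec/nsec, truncated to 100ns resolution. */
static void nttime_to_unix(uint64_t nt, uint64_t *sec, uint32_t *nsec)
{
	*sec  = nt / 10000000ULL - NTTIME_EPOCH_OFFSET;
	*nsec = (uint32_t)(nt % 10000000ULL) * 100;
}
```

[Values round-trip at full 100ns resolution, which is why create_time/change_time survive storage in the EA unchanged while st_atime/st_mtime are limited to whatever resolution the filesystem stores.]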
* Re: performance of filesystem xattrs with Samba4 2004-11-19 11:43 ` tridge @ 2004-11-19 22:28 ` Andreas Dilger 0 siblings, 0 replies; 41+ messages in thread From: Andreas Dilger @ 2004-11-19 22:28 UTC (permalink / raw) To: tridge; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 4551 bytes --] On Nov 19, 2004 22:43 +1100, tridge@samba.org wrote: > > This patch also provides the infrastructure on disk for storing e.g. > > nsecond and create timestamps in the ext3 large inodes, but the actual > > implementation to save/load these isn't there yet. If that were > > available, would you use it instead of explicitly storing the NTTIME in > > an EA? > > certainly! I can describe the "infrastructure" here, but there needs to be some smallish coding work in order to get this working and I don't really have much time to do that myself. The basic premise is that for ext3 filesystems formatted with large on-disk inodes we have reserved the first word in the extra space to describe the "extra size" of the fixed fields in struct ext3_inode stored in each inode on disk. This allows us to add permanent fields to the end of struct ext3_inode, and any remaining space is used for the fast EAs before falling back to an external block. This space was always intended to store extra timestamp fields ala: struct ext3_inode { : : } osd2; __u16 i_extra_isize; __u16 i_pad1; __u32 i_ctime_hilow; /* do we need nsecond atimes? */ __u32 i_mtime_hilow; __u32 i_crtime; __u32 i_crtime_hilow; }; Since the i_[mac]time fields are in seconds, I would like to store: _hilow = nseconds >> 6 | (([mac]time64 >> 32) << 26) [mac]time64 = [mac]time | (__u64)((_hilow & 0xfc000000) << 6); nseconds = _hilow << 6; so we get about 60ns resolution but also increase our dynamic range by a factor of 64 (year 8704 problem here we come ;-). Since crtime is new we _could_ store it in the 100ns 64-bit format that NT uses. 
Consistency is good on the one hand and we only need to do shift and OR, while with straight 100ns times we also get a 6x larger dynamic range (y58000) but also have to do a 64-bit divide by 10^7 for each access. As we read an inode from disk we check i_extra_isize to determine which fields, if any, are valid and when writing the inode we fill in the fields and update i_extra_isize (taking care to push any existing EAs out a bit, though that should be a rare case). This avoids the EA speed/size overhead to parse/read/write these fields, and allows us to add new "fixed" fields into the large inode as necessary. We don't touch any fields that we don't understand (normal ext3 compat flags will tell us if there are incompatible features there). So, in summary, the "i_extra_isize" handling is already there for inodes (currently always set to '4') but we don't do anything with that space yet. > - store create_time and change_time in the user.DosAttrib xattr, as > 64 bit 100ns resolution times (same format as NT uses and Samba > uses internally). I store change_time there as its definition is a > little different from the posix ctime field (plus it's settable). > > If we had a settable create_time field in the inode then I'd certainly > want to use it in Samba4. A non-settable one wouldn't be nearly as > useful. Some win32 applications care about being able to set all the > time fields (such as Excel 2003). Hmm, seems kind of counter-productive to allow a crtime that is settable... > I think it would make more sense to have a new variant of utime() for > setting all available timestamps, and expose all timestamps in stat. A > separate API for create time seems a bit hackish. By all means go for it ;-). I'm not particularly fond of the proposed pseudo-EA interface. You are probably more likely than anyone to get support for it. 
> > I would just configure out the xattr sharing code entirely since it will > > likely do nothing but increase overhead if any of the EAs on an inode > > are unique (this is the most common case, except for POSIX-ACL-only setups). > > I didn't know it was configurable. I can't see any CONFIG option for > it - is there some trick I've missed? It's CONFIG_FS_MBCACHE and/or CONFIG_EXT[23]_FS_XATTR_SHARING in the original 2.4 xattr patches, not sure if they've disappeared in 2.6 kernels. Hmm, seems that the CONFIG_FS_MBCACHE option doesn't allow you to turn it off completely, which is a shame since both are completely useless for any EAs which are different for each inode and just introduce overhead. The CONFIG_EXT[23]_FS_XATTR_SHARING options don't exist at all anymore. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/ [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
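[Editorial note: the bit layout Andreas sketches above — the low 26 bits of the extra word holding nseconds >> 6, the high 6 bits extending the seconds field to 38 bits — can be written out in plain C to check that it round-trips. This is only an illustration of the proposed encoding, not code from the actual patch:]

```c
#include <stdint.h>

/* Pack ~60ns-granularity nanoseconds and bits 32..37 of an extended
 * seconds value into one 32-bit on-disk "hilow" word, per the scheme
 * described above. */
static uint32_t pack_hilow(uint64_t sec64, uint32_t nsec)
{
	return (nsec >> 6) | (uint32_t)((sec64 >> 32) << 26);
}

/* Recover the 64-bit seconds and (truncated) nanoseconds; sec32 is the
 * existing 32-bit i_[mac]time field. */
static void unpack_hilow(uint32_t hilow, uint32_t sec32,
			 uint64_t *sec64, uint32_t *nsec)
{
	*sec64 = (uint64_t)sec32 | ((uint64_t)(hilow & 0xfc000000U) << 6);
	*nsec  = hilow << 6;	/* the 6 epoch bits shift out of the u32 */
}
```

[nsec comes back as a multiple of 64, i.e. the ~60ns resolution mentioned, and the 6 spare bits extend the time range by a factor of 64.]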
* Re: performance of filesystem xattrs with Samba4 2004-11-19 10:16 ` Andreas Dilger 2004-11-19 11:43 ` tridge @ 2004-11-22 13:02 ` tridge 2004-11-22 21:40 ` Andreas Dilger 1 sibling, 1 reply; 41+ messages in thread From: tridge @ 2004-11-22 13:02 UTC (permalink / raw) To: linux-kernel; +Cc: Andreas Dilger, linux-fsdevel I've put up graphs of the first set of dbench3 results for various filesystems at: http://samba.org/~tridge/xattr_results/ All the tests were run on a 2.6.10-rc2 kernel with the patch from Andreas to add support to ext3 for large inodes. I needed to tweak the patch for 2.6.10-rc2, but not by much. Full details on the setup are in the README, and the scripts for reproducing the results yourself (and the graphs) are in the same directory. The results show that the ext3 large inode patch is extremely worthwhile. Using a 256 byte inode on ext3 gained a factor of up to 7x in performance, and only lost a very small amount when xattrs were not used. It took ext3 from very mediocre performance to being the clear winner among current Linux journaled filesystems for performance when xattrs are used. Eventually I think that larger inodes should become the default. Similarly on xfs, using the large inode option (512 bytes this time) made a huge difference, gaining a factor of 6x in the best case. If all versions of the xfs code can handle large inodes then I think it would be good to change the default, especially as it seems to have almost no cost when xattrs are not used. Without xattrs reiser3 did extremely well under heavier load, where it is less of an in-memory test, just as Hans thought it would. Unfortunately I wasn't able to try reiser4 in these runs due to the lockups I reported earlier, but I look forward to trying it once those are fixed. Reiser3 was also the best "out of the box" journaled filesystem with xattrs, but it was easily beaten by xfs and ext3 once large inodes were enabled in those. jfs wins the award for consistency. 
As I watched the results develop I was tempted to just disable the jfs tests as it was so slow, but eventually it overtook xfs at very large loads. Maybe if I run large enough loads it will be the overall winner :) The massive gap between ext2 and the other filesystems really shows clearly how much we are paying for journaling. I haven't tried any journal on external device or journal on nvram card tricks yet, but it looks like those will be worth pursuing. I'll leave the test script running overnight generating some more results for even higher loads. I'll update the graphs in the morning. Cheers, Tridge ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-22 13:02 ` tridge @ 2004-11-22 21:40 ` Andreas Dilger 0 siblings, 0 replies; 41+ messages in thread From: Andreas Dilger @ 2004-11-22 21:40 UTC (permalink / raw) To: tridge; +Cc: linux-kernel, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 2828 bytes --] On Nov 23, 2004 00:02 +1100, tridge@samba.org wrote: > I've put up graphs of the first set of dbench3 results for various > filesystems at: > > http://samba.org/~tridge/xattr_results/ > > The results show that the ext3 large inode patch is extremely > worthwhile. Using a 256 byte inode on ext3 gained a factor of up to 7x > in performance, and only lost a very small amount when xattrs were not > used. It took ext3 from a very mediocre performance to being the clear > winner among current Linux journaled filesystems for performance when > xattrs are used. Eventually I think that larger inodes should become > the default. For Lustre we tune the inode size at format time to allow the storing of the "default" EA data within the larger inode. Is this the case with samba and 256-byte inodes (i.e. is your EA data all going to fit within the extra 124 bytes of space for storing EAs)? If you have to put any of the commonly-used EA data into an external block the benefits are lost. > The massive gap between ext2 and the other filesystems really shows > clearly how much we are paying for journaling. I haven't tried any > journal on external device or journal on nvram card tricks yet, but it > looks like those will be worth pursuing. One of the other things we do for Lustre right away is create the ext3 filesystem with larger journal sizes so that for the many-client cases we do not get synchronous journal flushing if there are lots of active threads. This can make a huge difference in overall performance at high loads. Use "mke2fs -J size=400 ..." 
to create a 400MB journal (assuming you have at least that much RAM and a large enough block device, at least 4x the journal size just from a "don't waste space" point of view). One factor is that you don't necessarily need to write so much data at one time, but also that ext3 needs to reserve journal space for the worst-case usage, so you get 40-100 threads allocating "worst case" then "filling" the journal (causing new operations to block) and finally completing with only a small fraction of those reserved journal blocks actually used. Having an external journal device also generally gives you a large journal (by default it is the full size of the block device specified) so sometimes the effects of the large journal are confused with the fact that it is external. I haven't seen any perf numbers recently on what kind of effect having an external journal has. I highly doubt that NVRAM cards are any better than a dedicated disk for the journal, since journal IO is write-only (except during recovery) and virtually seek-free. Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/ [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
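[Editorial note: concretely, the journal tuning Andreas describes might look like the following; device names are placeholders and sizes follow the discussion above, so treat this as a sketch rather than a recommendation:]

```shell
# Large internal journal (needs roughly that much RAM, and a device at
# least ~4x the journal size so the journal doesn't dominate the disk):
mke2fs -j -J size=400 /dev/sdb1

# Or create a dedicated external journal device and attach the data
# filesystem to it (the journal defaults to the full device size):
mke2fs -O journal_dev /dev/sdc1
mke2fs -j -J device=/dev/sdc1 /dev/sdb1
```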
* Re: performance of filesystem xattrs with Samba4 2004-11-19 7:38 ` performance of filesystem xattrs with Samba4 tridge 2004-11-19 8:08 ` James Morris 2004-11-19 10:16 ` Andreas Dilger @ 2004-11-19 12:03 ` Anton Altaparmakov 2004-11-19 12:43 ` tridge 2004-11-19 15:34 ` Hans Reiser 2004-11-21 22:21 ` Nathan Scott 4 siblings, 1 reply; 41+ messages in thread From: Anton Altaparmakov @ 2004-11-19 12:03 UTC (permalink / raw) To: tridge; +Cc: lkml On Fri, 2004-11-19 at 18:38 +1100, tridge@samba.org wrote: > I've been developing the posix backend for Samba4 over the last few > months. It has now reached the stage where it is passing most of the > test suites, so it's time to start some performance testing. > > The biggest change from the kernel's point of view is that Samba4 makes > extensive use of filesystem xattrs. Almost every file will have a > user.DosAttrib xattr containing file attributes and additional > timestamp fields. A lot of files will also have a system.NTACL > attribute containing a NT ACL, and many files will have a > user.DosStreams xattr for NT alternate data streams. Some rare files > will have a user.DosEAs xattr for DOS extended attribute > support. Files with streams will also have separate xattrs for each NT > stream. [snip] > Soon we'll be starting to integrate the xattr support with an LSM > module, to allow the kernel to interpret the NT ACLs directly to avoid > races, make things a little more efficient (using an xattr cache > holding unpacked ACLs), and allowing for the possibility of non-Samba > file access to obey the NT ACLs. Note that NTFS supports all those things natively on the file system, so it may be worth keeping in mind when designing your APIs. It would be nice if, one day when ntfs write support is finished and Samba is running on an NTFS partition on Linux, Samba could access all those things directly from NTFS. 
I guess a good way would be if your interface is sufficiently abstracted so that it can use xattrs as a backend or a native backend which NTFS could provide for you or Samba could provide for NTFS. For example NTFS stores the 4 different times in NT format in each inode (base Mft record) so you would not have to take an xattr performance hit there. Anyway, just thought I would mention this, I am not expecting you to do anything about it, especially since full NTFS read-write support is still a long way away... Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/, http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 41+ messages in thread
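[Editorial note: the user.* attributes tridge lists above are managed through the Linux xattr syscalls. A minimal sketch of the calls involved; the attribute name matches the user.DosAttrib xattr quoted above, but the payload here is an arbitrary placeholder, not Samba's real DosAttrib encoding:]

```c
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>	/* on 2.6-era userland this was <attr/xattr.h> */

/* Store an opaque metadata blob in the user.DosAttrib xattr of a file. */
static int store_dos_attrib(const char *path, const void *blob, size_t len)
{
	return setxattr(path, "user.DosAttrib", blob, len, 0);
}

/* Read it back; returns the attribute length or -1 (e.g. ENOTSUP on a
 * filesystem without user xattr support). */
static ssize_t load_dos_attrib(const char *path, void *buf, size_t buflen)
{
	return getxattr(path, "user.DosAttrib", buf, buflen);
}
```

[Every per-file metadata update is an extra syscall plus, on most filesystems of the time, an extra block read/write, which is what makes the in-inode "fast EA" storage discussed in this thread so attractive.]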
* Re: performance of filesystem xattrs with Samba4 2004-11-19 12:03 ` Anton Altaparmakov @ 2004-11-19 12:43 ` tridge 2004-11-19 14:11 ` Anton Altaparmakov 0 siblings, 1 reply; 41+ messages in thread From: tridge @ 2004-11-19 12:43 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: lkml Anton, > Note, that NTFS supports all those things natively on the file system, > so it may be worth keeping in mind when designing your APIs. It would > be nice if one day when ntfs write support is finished, when running > Samba on an NTFS partition on Linux, Samba can directly access all those > things directly from NTFS. yes, I have certainly thought about this, and at the core of Samba4 is a "ntvfs" layer that allows for backends that can take full advantage of whatever the filesystem can offer. The ntvfs/posix/ code in Samba4 is quite small (currently 7k lines of code) and I'm hoping that more specialised backends will be written that talk to other types of filesystems. To get things started I've also written a "cifs" backend for Samba4, that uses another CIFS file server as a storage backend, turning Samba4 into a proxy server. That backend uses the full capabilities of the ntvfs layer, and implements nearly all of the detailed stuff that a NTFS can do. > I guess a good way would be if your interface is sufficiently > abstracted so that it can use xattrs as a backend or a native > backend which NTFS could provide for you or Samba could provide for > NTFS. For example NTFS stores the 4 different times in NT format > in each inode (base Mft record) so you would not have to take an > xattr performance hit there. The big question is what sort of API would you envisage between user space and this filesystem? Are you imagining that Samba mmap the raw disk and use a libntfs library? That would be possible, but would lose one of the big advantages of Samba, which is that the filesystem is available to both posix and windows apps. 
Or are you thinking that we add a new syscall interface, a bit like the IRP stuff in the NT IFS? I imagine there would be quite a bit of resistance to that in the Linux kernel community :-) Realistically, I think that in the vast majority of cases Samba is going to be running on top of "mostly posix" filesystems for the foreseeable future, unless you manage to do something pretty magical with the ntfs code. But if you do manage to get ntfs in Linux to the stage where it's a viable alternative then I'd be delighted to help write the Samba4 backend to match. Cheers, Tridge ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-19 12:43 ` tridge @ 2004-11-19 14:11 ` Anton Altaparmakov 2004-11-20 10:44 ` tridge 0 siblings, 1 reply; 41+ messages in thread From: Anton Altaparmakov @ 2004-11-19 14:11 UTC (permalink / raw) To: tridge; +Cc: lkml Hi Tridge, On Fri, 19 Nov 2004 tridge@samba.org wrote: > > Note, that NTFS supports all those things natively on the file system, > > so it may be worth keeping in mind when designing your APIs. It would > > be nice if one day when ntfs write support is finished, when running > > Samba on an NTFS partition on Linux, Samba can directly access all those > > things directly from NTFS. > > yes, I have certainly thought about this, and at the core of Samba4 is > a "ntvfs" layer that allows for backends that can take full advantage > of whatever the filesystem can offer. The ntvfs/posix/ code in Samba4 > is quite small (currently 7k lines of code) and I'm hoping that more > specialised backends will be written that talk to other types of > filesystems. Sounds great! > To get things started I've also written a "cifs" backend for Samba4, > that uses another CIFS file server as a storage backend, turning > Samba4 into a proxy server. That backend uses the full capabilities of > the ntvfs layer, and implements nearly all of the detailed stuff that > a NTFS can do. > > > I guess a good way would be if your interface is sufficiently > > abstracted so that it can use xattrs as a backend or a native > > backend which NTFS could provide for you or Samba could provide for > > NTFS. For example NTFS stores the 4 different times in NT format > > in each inode (base Mft record) so you would not have to take an > > xattr performance hit there. > > The big question is what sort of API would you envisage between user > space and this filesystem? Are you imagining that Samba mmap the raw > disk and use a libntfs library? 
That would be possible, but would lose > one of the big advantages of Samba, which is that the filesystem is > available to both posix and windows apps. > > Or are you thinking that we add a new syscall interface, a bit like > the IRP stuff in the NT IFS? I imagine there would be quite a bit of > resistance to that in the Linux kernel community :-) > > Realistically, I think that in the vast majority of cases Samba is > going to be running on top of "mostly posix" filesystems for the > foreseeable future, unless you manage to do something pretty magical > with the ntfs code. But if you do manage to get ntfs in Linux to the > stage where it's a viable alternative then I'd be delighted to help > write the Samba4 backend to match. I don't know. I have been mulling over in my head for quite a while what to do about an interface for "advanced ntfs features" but so far I have always pushed this to the back of my mind. After all, no point in providing advanced features considering we don't even provide full read-write access yet. I just thought I would mention NTFS when I saw your post. But to answer your question I definitely would envisage an interface to the kernel driver rather than to libntfs. It is 'just' a matter of deciding how that would look... Partially we will see what happens with Reiser4 as it faces the same or at least very similar interface problems. Maybe we need a sys_ntfs() or maybe we need to hitchhike the ioctl() interface or maybe the VFS can start providing all required functionality in some to-be-determined manner that we can use... Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-19 14:11 ` Anton Altaparmakov @ 2004-11-20 10:44 ` tridge 2004-11-20 16:20 ` Hans Reiser 0 siblings, 1 reply; 41+ messages in thread From: tridge @ 2004-11-20 10:44 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: lkml Anton, > But to answer your question I definitely would envisage an interface to > the kernel driver rather than to libntfs. It is 'just' a matter of > deciding how that would look... How about prototyping the API in user space, using a "mmap the block device" based filesystem library? You might also like to take a peek at http://samba.org/ftp/unpacked/samba4/source/include/smb_interfaces.h and http://samba.org/ftp/unpacked/samba4/source/ntvfs/ntvfs.h those two files define the NTFS-like interfaces in Samba4. The interface has proved to be quite flexible. > Partially we will see what happens with Reiser4 as it faces the same or at > least very simillar interface problems. yep, I'm looking forward to experimenting with the "file is a directory" stuff in reiser4 to see how well it can be made to match what is needed for Samba4. Cheers, Tridge ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 10:44 ` tridge @ 2004-11-20 16:20 ` Hans Reiser 2004-11-20 23:29 ` tridge 0 siblings, 1 reply; 41+ messages in thread From: Hans Reiser @ 2004-11-20 16:20 UTC (permalink / raw) To: tridge; +Cc: Anton Altaparmakov, lkml tridge@samba.org wrote: > >yep, I'm looking forward to experimenting with the "file is a >directory" stuff in reiser4 to see how well it can be made to match >what is needed for Samba4. > > > There are still bugs with it that have us turning it off for now, but I think we will fix those in the next year. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 16:20 ` Hans Reiser @ 2004-11-20 23:29 ` tridge 0 siblings, 0 replies; 41+ messages in thread From: tridge @ 2004-11-20 23:29 UTC (permalink / raw) To: Hans Reiser; +Cc: Anton Altaparmakov, lkml Hans, > There are still bugs with it that have us turning it off for now, but I > think we will fix those in the next year. Do you plan to add user xattr support before then? The reason I ask is that without either xattr support or named streams Samba4 has no way to store the additional file meta data it needs. Maybe xattr support could be a reiser4 plugin? Cheers, Tridge ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-19 7:38 ` performance of filesystem xattrs with Samba4 tridge ` (2 preceding siblings ...) 2004-11-19 12:03 ` Anton Altaparmakov @ 2004-11-19 15:34 ` Hans Reiser 2004-11-19 15:58 ` Jan Engelhardt ` (2 more replies) 2004-11-21 22:21 ` Nathan Scott 4 siblings, 3 replies; 41+ messages in thread From: Hans Reiser @ 2004-11-19 15:34 UTC (permalink / raw) To: tridge; +Cc: linux-kernel Is this an fsync intensive benchmark? If no, could you try with reiser4? If yes, you might as well wait for us to optimize fsync first in reiser4. Hans ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-19 15:34 ` Hans Reiser @ 2004-11-19 15:58 ` Jan Engelhardt 2004-11-19 22:03 ` tridge 2004-11-19 23:01 ` tridge 2 siblings, 0 replies; 41+ messages in thread From: Jan Engelhardt @ 2004-11-19 15:58 UTC (permalink / raw) To: Hans Reiser; +Cc: tridge, linux-kernel >Is this an fsync intensive benchmark? If no, could you try with >reiser4? If yes, you might as well wait for us to optimize fsync first >in reiser4. Do I sense an attempt to get more users from non-reiser*fs to reiser4? ;-) Jan Engelhardt -- Gesellschaft für Wissenschaftliche Datenverarbeitung Am Fassberg, 37077 Göttingen, www.gwdg.de ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-19 15:34 ` Hans Reiser 2004-11-19 15:58 ` Jan Engelhardt @ 2004-11-19 22:03 ` tridge 2004-11-20 4:51 ` Hans Reiser 2004-11-19 23:01 ` tridge 2 siblings, 1 reply; 41+ messages in thread From: tridge @ 2004-11-19 22:03 UTC (permalink / raw) To: Hans Reiser; +Cc: linux-kernel Hans, > Is this an fsync intensive benchmark? If no, could you try with > reiser4? If yes, you might as well wait for us to optimize fsync first > in reiser4. In the configuration I was running there are no fsync calls. I'll have a go with reiser4 soon and let you know how it goes. I'm also working on a new version of dbench that will better simulate the filesystem access patterns of Samba4. Cheers, Tridge ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-19 22:03 ` tridge @ 2004-11-20 4:51 ` Hans Reiser 0 siblings, 0 replies; 41+ messages in thread From: Hans Reiser @ 2004-11-20 4:51 UTC (permalink / raw) To: tridge; +Cc: linux-kernel, Reiserfs developers mail-list tridge@samba.org wrote: >Hans, > > > Is this an fsync intensive benchmark? If no, could you try with > > reiser4? If yes, you might as well wait for us to optimize fsync first > > in reiser4. > >In the configuration I was running there are no fsync calls. > >I'll have a go with reiser4 soon and let you know how it goes. I'm >also working on a new version of dbench that will better simulate the >filesystem access patterns of Samba4. > > If you can describe what those are, it would do me a lot of good in regards to my understanding what it means about an fs to get a certain result on the benchmark, and what needs to be better optimized. Cheers, Hans ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-19 15:34 ` Hans Reiser 2004-11-19 15:58 ` Jan Engelhardt 2004-11-19 22:03 ` tridge @ 2004-11-19 23:01 ` tridge 2004-11-20 0:26 ` Andrew Morton 2004-11-20 4:40 ` Hans Reiser 2 siblings, 2 replies; 41+ messages in thread From: tridge @ 2004-11-19 23:01 UTC (permalink / raw) To: Hans Reiser; +Cc: linux-kernel Hans, I did some testing with reiser4 from 2.6.10-rc2-mm2. As far as I can tell it doesn't seem to support the xattr calls (fsetxattr, fgetxattr etc). Is that right, or did I miss a patch somewhere? The code seems to set the xattr methods to NULL and has the prototypes #if'd out. The result without xattr support was 52 MB/sec, which is a bit slower than the reiser3 I tested in 2.6.10-rc2. For easy comparison, here are the non-xattr results for the various filesystems I've tested: tmpfs 69 MB/sec ext2 68 MB/sec ext3 67 MB/sec xfs+2Kinode 63 MB/sec xfs 62 MB/sec reiser 58 MB/sec reiser4 52 MB/sec (on a -mm2 kernel) jfs 36 MB/sec I used default options for mkreiser4, and default mount options. Can you suggest some options to try or would you prefer to wait till I've done the new dbench so you can try this more easily yourself? (you can of course try installing Samba4 to test now, but its a fast moving target and involves a lot more than just filesystem calls). To make sure the problem wasn't some of the other patches in -mm2, I reran the ext3 results on -mm2, and was surprised to find quite a large improvement! ext3 got 73 MB/sec without xattr support. It oopsed when I enabled xattr (I'm working with sct on fixing those oopses). Once the oopses are fixed I'll rerun all the various filesystems with -mm2 and see if it only improves ext3 or if it improves all of them. Would anyone care to hazard a guess as to what aspect of -mm2 is gaining us 10% in overall Samba4 performance? Cheers, Tridge ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-19 23:01 ` tridge @ 2004-11-20 0:26 ` Andrew Morton 2004-11-21 1:14 ` tridge ` (4 more replies) 2004-11-20 4:40 ` Hans Reiser 1 sibling, 5 replies; 41+ messages in thread From: Andrew Morton @ 2004-11-20 0:26 UTC (permalink / raw) To: tridge; +Cc: reiser, linux-kernel tridge@samba.org wrote: > > Would anyone care to hazard a guess as to what aspect of -mm2 is > gaining us 10% in overall Samba4 performance? Is it reproducible with your tricked-up dbench? If so, please send me a machine description and the relevant command line and I'll do a bsearch. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 0:26 ` Andrew Morton @ 2004-11-21 1:14 ` tridge 2004-11-21 2:12 ` tridge ` (3 subsequent siblings) 4 siblings, 0 replies; 41+ messages in thread From: tridge @ 2004-11-21 1:14 UTC (permalink / raw) To: Andrew Morton; +Cc: reiser, linux-kernel Andrew, > Is it reproducible with your tricked-up dbench? The xattr enabled dbench I did for Stephen was just a quick hack to demonstrate the oops in ext3. I'll do a more complete version over the next few days. Cheers, Tridge ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 0:26 ` Andrew Morton 2004-11-21 1:14 ` tridge @ 2004-11-21 2:12 ` tridge 2004-11-21 23:53 ` tridge ` (2 subsequent siblings) 4 siblings, 0 replies; 41+ messages in thread From: tridge @ 2004-11-21 2:12 UTC (permalink / raw) To: Andrew Morton; +Cc: reiser, linux-kernel

Andrew,

> Is it reproducible with your tricked-up dbench?
>
> If so, please send me a machine description and the relevant command line
> and I'll do a bsearch.

I should explain a little more .... The current dbench is showing way too much variance on this test to be really useful. Here are the numbers for 5 runs of dbench 10 on 2.6.10-rc2 and 2.6.10-rc2-mm2:

  2.6.10-rc2    325  320  364  360  347
  -mm2          347  371  411  322  384

I've solved this variance problem in nbench by making the runs fixed time rather than a fixed number of operations, and adding a warmup phase. I need to do the same to dbench in order to get sane numbers out that would be at all useful for a binary patch search. The current dbench worked OK when computers were slower, but now it is completing its runs so fast that the noise is just silly.

Cheers, Tridge

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 0:26 ` Andrew Morton 2004-11-21 1:14 ` tridge 2004-11-21 2:12 ` tridge @ 2004-11-21 23:53 ` tridge 2004-11-23 9:37 ` tridge 2004-11-24 7:53 ` tridge 4 siblings, 0 replies; 41+ messages in thread From: tridge @ 2004-11-21 23:53 UTC (permalink / raw) To: Andrew Morton; +Cc: reiser, linux-kernel Andrew, > Is it reproducible with your tricked-up dbench? > > If so, please send me a machine description and the relevant command line > and I'll do a bsearch. The new dbench is finished (see my reply to Nathan for details). I've done some initial runs comparing 2.6.10-rc2 and 2.6.10-rc2-mm2 and I am not seeing the performance gain with mm2 that I reported earlier. I don't yet know if this is because I screwed up previously, or there is some other factor that I haven't taken account of. I'm now doing a larger set of runs comparing the two kernels with a range of filesystems and much longer run times, plus more repeats per run. I'm also using a script that reformats the filesystem before each run in case that was a factor (as it was for reiser4). I'll get you the results later today. Cheers, Tridge ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 0:26 ` Andrew Morton ` (2 preceding siblings ...) 2004-11-21 23:53 ` tridge @ 2004-11-23 9:37 ` tridge 2004-11-23 17:55 ` Andreas Dilger 2004-11-24 7:53 ` tridge 4 siblings, 1 reply; 41+ messages in thread From: tridge @ 2004-11-23 9:37 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel

Andrew,

> > Would anyone care to hazard a guess as to what aspect of -mm2 is
> > gaining us 10% in overall Samba4 performance?
>
> Is it reproducible with your tricked-up dbench?
>
> If so, please send me a machine description and the relevant command line
> and I'll do a bsearch.

Sorry for the delay in getting back to you on this. The full set of runs for the data I posted last night took 12 hours to produce, so the machine was a bit busy.

I've now confirmed that the new dbench does indeed show a significant improvement in 2.6.10-rc2-mm2 as compared to 2.6.10-rc2. Interestingly, the improvement seems to be only in ext3, which confused me for a while. The difference is also much more dramatic (as a percentage) when xattrs are enabled in the test.

Here are the results for dbench3 runs with varying numbers of clients, and with rc2 and rc2-mm2 for ext3. First the non-xattr results:

  clients   -rc2   rc2-mm2
  -------------------------
     10      362     376
     20      328     357
     30      249     270
     40      169     199
     50      128     155
     60      107     143

now the xattr results (using the -x option to dbench):

  clients   -rc2   rc2-mm2
  -------------------------
     10       58     125
     20       44      64
     30       43      54
     40       42      52
     50       49      49
     60       40      47

I don't know why there was no improvement at size 50.

for comparison, there is very little difference for xfs (or the other filesystems I tested, which were jfs, reiser and ext2). Here are the non-xattr xfs results:

  clients   -rc2   rc2-mm2
  -------------------------
     10      365     368
     20      324     328
     30      254     257
     40      194     212
     50      128     139
     60       58      59

The script I used to run dbench is at

  http://samba.org/~tridge/xattr_results/

the details on the machine config are there too.
For your bsearch, it's probably best to choose one of the clearest and least noisy results (like the xattr result for size 20) and just run the search for that one. That will take a bit under 5 minutes per test if you use the same runtime I did. You could do it quicker, but you risk getting more noise in the results.

Cheers, Tridge

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-23 9:37 ` tridge @ 2004-11-23 17:55 ` Andreas Dilger 0 siblings, 0 replies; 41+ messages in thread From: Andreas Dilger @ 2004-11-23 17:55 UTC (permalink / raw) To: tridge; +Cc: Andrew Morton, linux-kernel [-- Attachment #1: Type: text/plain, Size: 790 bytes --] On Nov 23, 2004 20:37 +1100, tridge@samba.org wrote: > > > Would anyone care to hazard a guess as to what aspect of -mm2 is > > > gaining us 10% in overall Samba4 performance? > > > > Is it reproducible with your tricked-up dbench? > > > > If so, please send me a machine description and the relevant command line > > and I'll do a bsearch. > > I've now confirmed that the new dbench does indeed show a significant > improvement in 2.6.10-rc2-mm2 as compared to 2.6.10-rc2. Interestingly, > the improvement seems to be only in ext3, which confused me for a while. Might it be the reservation patches? Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/ [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 0:26 ` Andrew Morton ` (3 preceding siblings ...) 2004-11-23 9:37 ` tridge @ 2004-11-24 7:53 ` tridge 4 siblings, 0 replies; 41+ messages in thread From: tridge @ 2004-11-24 7:53 UTC (permalink / raw) To: Andrew Morton; +Cc: adilger, linux-kernel

Andrew,

You can call off your bsearch - I found the culprit.

For the 2.6.10-rc2 tests I was running with the patch from Andreas that added large ext3 inode support (in order to also test the ext3-256 case). For the -mm2 test I wasn't.

This patch was supposed to have no effect if large inodes were not set up at mkfs time. Unfortunately it does have an effect, as it also removes the in-place xattr modification logic from ext3_xattr_set_handle(), so every xattr set becomes the same as a delete+create pair. In plain -rc2 and in -mm2 an xattr set of the same size will be done in-place. As every xattr set is of the same size in dbench3 this made a huge difference.

Sorry for the false alarm.

Cheers, Tridge

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-19 23:01 ` tridge 2004-11-20 0:26 ` Andrew Morton @ 2004-11-20 4:40 ` Hans Reiser 2004-11-20 6:47 ` tridge 1 sibling, 1 reply; 41+ messages in thread From: Hans Reiser @ 2004-11-20 4:40 UTC (permalink / raw) To: tridge; +Cc: linux-kernel, Reiserfs developers mail-list can you describe qualitatively what your test does? You didn't answer whether it does fsyncs, etc. It might be worth testing it with the extents only mount option for reiser4. Hans ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 4:40 ` Hans Reiser @ 2004-11-20 6:47 ` tridge 2004-11-20 16:13 ` Hans Reiser 0 siblings, 1 reply; 41+ messages in thread From: tridge @ 2004-11-20 6:47 UTC (permalink / raw) To: Hans Reiser; +Cc: linux-kernel, Reiserfs developers mail-list

Hans,

> can you describe qualitatively what your test does?

The access patterns are very similar to dbench, which I believe you are already familiar with. Let me know if you'd like an explanation of dbench.

For the test I ran, the basic load file is almost the same as dbench, but the interpretation of the load file is a little bit different.

For example, when the load file says "open a file", Samba4 needs to first stat() the file, and if xattrs are being used then it needs to do an fgetxattr() to grab the extended DOS attributes. Additionally, if the open has the effect of changing any of those attributes then Samba4 needs to use fsetxattr() to write back the extended attributes, and sometimes fchmod() and utime() as well depending on the open parameters.

When dbench interprets one of these load files it would just call open(), skipping all the extra system calls.

The full load file I used is at:

  http://samba.org/ftp/tridge/dbench/client_enterprise.txt

and is based on a capture of an "Enterprise Disk Mix" NetBench run, captured using the "nbench" load capturing proxy module in Samba4, using a Win2003 server backend and WinXP client.

The working set size is approximately 20 MByte per client, and I was testing with 10 simulated clients. That means it's very much an "in memory" test, as the machine has 2G of ram.

> You didn't answer whether it does fsyncs, etc.

I think I did mention that the test does no fsync calls in the configuration I used. The reason I qualify the answer is that the load file actually contains approximately 1% Flush calls, but in its default configuration these are noops for Samba4.
This is due to the confusion in Win32 between a "flush" operation and an "fsync" operation. Microsoft programmers use "flush" like a unix programmer would use fflush() on stdio, which is a noop for Samba. You can also configure Samba to treat flush as an "fsync", which is quite a different operation.

The operation mix is as follows, listed with the approximate posix equivalent operation.

  (27%) ReadX                   (==pread)
  (17%) NTCreateX               (==open)
  (16%) QUERY_PATH_INFORMATION  (==stat)
  (13%) Close                   (==close)
  ( 9%) WriteX                  (==pwrite)
  ( 6%) FIND_FIRST              (==opendir/readdir/closedir)
  ( 3%) Unlink                  (==unlink)
  ( 3%) QUERY_FS_INFORMATION    (==statfs)
  ( 3%) QUERY_FILE_INFORMATION  (==fstat)
  ( 1%) SET_FILE_INFORMATION    (==fchmod/utime)
  ( 1%) Flush                   (==noop)
  ( 1%) Rename                  (==rename)
  ( 0%) UnlockX                 (==fcntl unlock)
  ( 0%) LockX                   (==fcntl lock)

but the above can be a little misleading, as (for example) NTCreateX is a very complex call, and can be used to create directories, create files, open files or even delete files or directories (using the delete-on-close semantics).

> It might be worth testing it with the extents only mount option for
> reiser4.

My apologies if I have just missed it, but I can't see an option that looks like "extents only" in either reiser4_parse_options() or in Documentation/filesystems/reiser4.txt in 2.6.10-rc2-mm2. Can you let me know the exact option name?
* Re: performance of filesystem xattrs with Samba4 2004-11-20 6:47 ` tridge @ 2004-11-20 16:13 ` Hans Reiser 2004-11-20 23:16 ` tridge ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Hans Reiser @ 2004-11-20 16:13 UTC (permalink / raw) To: tridge; +Cc: linux-kernel, Reiserfs developers mail-list tridge@samba.org wrote: >Hans, > > > can you describe qualitatively what your test does? > >The access patterns are very similar to dbench, which I believe you >are already familiar with. Let me know if you'd like an explanation of >dbench. > > Actually, I would, because I have never read its code, and have been at a loss for years to understand its meaning as a result of that. >For the test I ran, the basic load file is almost the same as dbench, >but the interpretation of the load file is a little bit different. > >For example, when the load file says "open a file", Samba4 needs to >first stat() the file, and if xattrs are being used then it needs to >do a fgetattr() to grab the extended DOS attributes. Additionally, if >the open has the effect of changing any of those attributes then >Samba4 needs to use fsetxattr() to write back the extended attributes, >and sometimes fchmod() and utime() as well depending on the open >parameters. > >When dbench interprets one of these load files it would just call >open(), skipping all the extra system calls. > >The full load file I used is at: > > http://samba.org/ftp/tridge/dbench/client_enterprise.txt > >and is based on a capture of a "Enterprise Disk Mix" Netbench run, >captured using the "nbench" load capturing proxy module in Samba4, >using a Win2003 server backend and WinXP client. > >The working set size is approximately 20 MByte per client, and I was >testing with 10 simulated clients. That means its very much a "in >memory" test, as the machine has 2G of ram. > > Ah, that explains a lot. 
For that kind of workload, the simpler the fs the better, because really all you are doing is adding overhead to copy_to_user and copy_from_user. All of reiser4's advanced features will add little or no value if you are staying in ram. > > You didn't answer whether it does fsyncs, etc. > >I think I did mention that the test does no fsync calls in the >configuration I used. The reason I qualify the answer is that the load >file actually contains approximately 1% Flush calls, but in its >default configuration these are noops for Samba4. This is due to the >confusion in Win32 between a "flush" operation and a "fsync" >operation. Microsoft programmers use "flush" like a unix programmer >would use fflush() on stdio, which is a noop for Samba. You can also >configure Samba to treat flush as a "fsync", which is quite a >different operation. > >The operation mix is as follows, listed with the approximate posix >equivalent operation. > >(27%) ReadX (==pread) >(17%) NTCreateX (==open) >(16%) QUERY_PATH_INFORMATION (==stat) >(13%) Close (==close) >(9%) WriteX (==pwrite) >(6%) FIND_FIRST (==opendir/readdir/closedir) >(3%) Unlink (==unlink) >(3%) QUERY_FS_INFORMATION (==statfs) >(3%) QUERY_FILE_INFORMATION (==fstat) >(1%) SET_FILE_INFORMATION (==fchmod/utime) >(1%) Flush (==noop) >(1%) Rename (==rename) >(0%) UnlockX (==fcntl unlock) >(0%) LockX (==fcntl lock) > >but the above can be a little misleading, as (for example) NTCreateX >is a very complex call, and can be used to create directories, create >files, open files or even delete files or directories (using the >delete on close semantics). > > > It might be worth testing it with the extents only mount option for > > reiser4. > >My apologies if I have just missed it, but I can't see an option that >looks like "extents only" in either reiser4_parse_options() or in >Documentation/filesystems/reiser4.txt in 2.6.10-rc2-mm2. Can you let >me know the exact option name? 
> >Cheers, Tridge > > > > mkfs.reiser4 -o extent=extent40 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 16:13 ` Hans Reiser @ 2004-11-20 23:16 ` tridge 2004-11-21 2:36 ` Hans Reiser 2004-11-21 0:21 ` tridge 2004-11-21 1:53 ` tridge 2 siblings, 1 reply; 41+ messages in thread From: tridge @ 2004-11-20 23:16 UTC (permalink / raw) To: Hans Reiser; +Cc: linux-kernel, Reiserfs developers mail-list

Hans,

> mkfs.reiser4 -o extent=extent40

This lowered the performance by a small amount (from 52 MB/sec to 50 MB/sec).

It also revealed a bug. I have been doing my tests on a cleanly formatted filesystem each time, but this time I re-ran the test a few times in a row to determine just how consistent the results are. The results I got were:

  mkfs.reiser4 -o extent=extent40    50 MB/sec
                                     48
                                     43
                                     41
                                     37 (stuck)

the "stuck" result meant that smbd locked into a permanent D state at the end of the fifth run. Unfortunately ps showed the wait-channel as '-' so I don't have any more information about the bug. I needed to power cycle the machine to recover.

To check if this is reproducible I tried it again and got the following:

  reboot, mkfs again                 50 MB/sec
                                     48
                                     44
                                     42
                                     40
                                     (failed)

the "failed" on the sixth run was smbd stuck in D state again, this time before the run completed so I didn't get a performance number.

I should note that the test completely wipes the directory tree between runs, and the server processes restart, so the only way there can be any state remaining that explains the slowdown between runs is a filesystem bug. Do you think reiser4 could be "leaking" some on-disk structures?

To determine if this problem is specific to the extent=extent40 option, I ran the same series of tests against reiser4 without the extent option:

  reboot, mkfs.reiser4 without options   52 MB/sec
                                         52
                                         45
                                         41
                                         (failed)

The failure on the fifth run showed the same symptoms as above.
To determine if the bug is specific to reiser4, I then ran the same series of tests against ext3, using the same kernel:

  reboot, mke2fs -j    70 MB/sec
                       70
                       69
                       70
                       71
                       70

So it looks like the gradual slowdown and eventual lockup are specific to reiser4. What can I do to help you track this down? Would you like me to write a "howto" for running this test, or would you prefer to wait till I have an emulation of the test in dbench?

To give you an idea of the scales involved, each run lasts 100 seconds, and does approximately 1 million filesystem operations (the exact number of operations completed in the 100 seconds is roughly proportional to the performance result).

> Ah, that explains a lot. For that kind of workload, the simpler the fs
> the better, because really all you are doing is adding overhead to
> copy_to_user and copy_from_user. All of reiser4's advanced features
> will add little or no value if you are staying in ram.

I'll do some runs with larger numbers of simulated clients and send you those results shortly. Do you think a working set size of about double the total machine memory would be a good size to start showing the reiser4 features?

Cheers, Tridge

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 23:16 ` tridge @ 2004-11-21 2:36 ` Hans Reiser 0 siblings, 0 replies; 41+ messages in thread From: Hans Reiser @ 2004-11-21 2:36 UTC (permalink / raw) To: tridge, vs; +Cc: linux-kernel, Reiserfs developers mail-list New benchmarks seem to be especially good at finding bugs. vs, please find the bug and fix it. Hans tridge@samba.org wrote: >Hans, > > > mkfs.reiser4 -o extent=extent40 > >This lowered the performance by a small amount (from 52 MB/sec to 50 >MB/sec). > >It also revealed a bug. I have been doing my tests on a cleanly >formatted filesystem each time, but this time I re-ran the test a few >times in a row to determine just how consistent the results are. The >results I got were: > > mkfs.reiser4 -o extent=extent40 50 MB/sec > 48 > 43 > 41 > 37 (stuck) > >the "stuck" result meant that smbd locked into a permanent D state at >the end of the fifth run. Unfortunately ps showed the wait-channel as >'-' so I don't have any more information about the bug. I needed to >power cycle the machine to recover. > >To check if this is reproducable I tried it again and got the following: > >reboot, mkfs again 50 MB/sec > 48 > 44 > 42 > 40 > (failed) > >the "failed" on the sixth run was smbd stuck in D state again, this >time before the run completed so I didn't get a performance number. > >I should note that the test completely wipes the directory tree >between runs, and the server processes restart, so the only way there >can be any state remaining that explains the slowdown between runs is >a filesystem bug. Do you think reiser4 could be "leaking" some on-disk >structures? > >To determine if this problem is specific to the extent=extent40 >option, I ran the same series of tests against reiser4 without the >extent option: > >reboot, mkfs.reiser4 without options 52 MB/sec > 52 > 45 > 41 > (failed) > >The failure on the fifth run showed the same symptoms as above. 
> >To determine if the bug is specific to reiser4, I then ran the same >series of tests against ext3, using the same kernel: > > reboot, mke2fs -j 70 MB/sec > 70 > 69 > 70 > 71 > 70 > >So it looks like the gradual slowdown and eventual lockup is specific >to reiser4. What can I do to help you track this down? Would you like >me to write a "howto" for running this test, or would you prefer to >wait till I have an emulation of the test in dbench? > >To give you an idea of the scales involved, each run lasts 100 >seconds, and does approximately 1 million filesystem operations (the >exact number of operations completed in the 100 seconds is roughly >proportional to the performance result). > > > >>Ah, that explains a lot. For that kind of workload, the simpler the fs >>the better, because really all you are doing is adding overhead to >>copy_to_user and copy_from_user. All of reiser4's advanced features >>will add little or no value if you are staying in ram. >> >> > >I'll do some runs with larger numbers of simulated clients and send >you those results shortly. Do you think a working set size of about >double the total machine memory would be a good size to start showing >the reiser4 features? > >Cheers, Tridge > > > > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 16:13 ` Hans Reiser 2004-11-20 23:16 ` tridge @ 2004-11-21 0:21 ` tridge 2004-11-21 2:41 ` Hans Reiser 2004-11-21 1:53 ` tridge 2 siblings, 1 reply; 41+ messages in thread From: tridge @ 2004-11-21 0:21 UTC (permalink / raw) To: Hans Reiser; +Cc: linux-kernel, Reiserfs developers mail-list

Hans,

A bit more information about the slowdown between runs (and eventual lockup) with reiser4 that I reported in my last email.

I found that a umount/mount between runs solved the problem, leading to a fairly consistent result and no lockup. I also found that running a simple /bin/sync between runs solved the problem.

This implies to me that it is some in-memory structure that is the culprit. I can't see anything obvious in /proc/slabinfo, but it's been a while since I've done any serious kernel development so maybe I just don't know what to look for.

I also tried enabling the "strict sync" option in Samba4. This makes the 1% flush operations in the load file map to fsync() instead of a noop. This caused reiser4 to lock up almost immediately, with the same symptoms as the previous lockups I reported (all smbd processes stuck in D state). No oops messages or anything unusual in dmesg.

Cheers, Tridge

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-21 0:21 ` tridge @ 2004-11-21 2:41 ` Hans Reiser 0 siblings, 0 replies; 41+ messages in thread From: Hans Reiser @ 2004-11-21 2:41 UTC (permalink / raw) To: tridge; +Cc: linux-kernel, Reiserfs developers mail-list, vs

tridge@samba.org wrote:

>Hans,
>
>A bit more information about the slowdown between runs (and eventual
>lockup) with reiser4 that I reported in my last email.
>
>I found that a umount/mount between runs solved the problem, leading
>to a fairly consistent result and no lockup. I also found that running
>a simple /bin/sync between runs solved the problem.
>
>This implies to me that it is some in-memory structure that is the
>culprit. I can't see anything obvious in /proc/slabinfo, but its been
>a while since I've done any serious kernel development so maybe I just
>don't know what to look for.
>
>I also tried enabling the "strict sync" option in Samba4. This makes
>the 1% flush operations in the load file map to fsync() instead of a
>noop. This caused reiser4 to lockup almost immediately, with the same
>symptoms as the previous lockups I reported (all smbd processes stuck
>in D state). No oops messages or anything unusual in dmesg.
>
>Cheers, Tridge

Thanks much tridge. vs, please respond in detail.

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-20 16:13 ` Hans Reiser 2004-11-20 23:16 ` tridge 2004-11-21 0:21 ` tridge @ 2004-11-21 1:53 ` tridge 2004-11-21 2:48 ` Hans Reiser 2 siblings, 1 reply; 41+ messages in thread From: tridge @ 2004-11-21 1:53 UTC (permalink / raw) To: Hans Reiser; +Cc: linux-kernel, Reiserfs developers mail-list

Hans,

> Actually, I would, because I have never read its code, and have been at
> a loss for years to understand its meaning as a result of that.

ok, first a bit of history.

In 1992 Ziff-Davis developed a benchmark called "NetBench" for benchmarking file serving in PC client environments. NetBench was freely downloadable, but without source code.

Over the years NetBench became the main benchmark used in the Windows file serving world. In the Network Attached Storage market, good NetBench numbers are absolutely essential, and companies tend to put a lot of effort into building large "NetBench labs" for testing NetBench performance. A couple of companies I have worked at have had people working almost full time on running NetBench results with various configurations.

NetBench is quite different from Bonnie and other similar benchmarks, as it is based on "replay of captured load". The load files for NetBench come from common real-world scenarios where PC clients run popular applications like MS Word, Excel, Corel Draw, MS Access, Paradox, MS PowerPoint etc while storing their files on a remote PC file server.

The usual output of NetBench is an Excel spreadsheet showing fairly detailed performance numbers for different numbers of clients, plus min, max and standard deviation numbers for the response time of each type of operation.

NetBench came to prominence in the Linux world when Microsoft paid a company called MindCraft to run some benchmarks comparing Windows file server performance to Samba on Linux.
It was initially difficult for the Linux community to respond to this as we had no easy access to a NetBench lab, and setting one up could easily be a million-dollar effort. To fix this, I wrote a suite of three benchmark tools, called "nbench", "dbench" and "tbench". These tools were designed to provide a fairly close emulation of NetBench, and to be extremely simple to use (much simpler than NetBench). I also wanted them to be able to be run on the typical hardware available to many home Linux developers. They don't give output that is nearly as detailed as NetBench, but when combined with common profiling tools this usually isn't a problem. The three tools are: - nbench. This completely emulates a NetBench run. The current versions produce almost identical sets of CIFS network packets to a run of NetBench on WinXP. You need to have a CIFS file server (like Samba) installed to run nbench. - dbench. This emulates just the file system calls that a Samba server would have to perform in order to complete a NetBench run. It doesn't need Samba installed. - tbench. This emulates just the TCP traffic that a Samba server would have to send/receive in order to complete a NetBench run. It doesn't need Samba installed. Over the years I have improved these tools to give better and better emulation of NetBench. Unfortunately this means that you can't meaningfully compare results between versions. All 3 tools use a load file to tell them what operations to perform. This load file is written in terms of CIFS file sharing operations, which are then interpreted by the benchmark tools into either CIFS requests, filesystem requests or TCP traffic. There are a number of ways to generate these load files. You can write one yourself (good for measuring just write speed for example), or you can capture a load file from any CIFS network activity, either by post-processing a tcpdump or by using a Samba. The load files I provide come from capturing real NetBench runs. 
Note that in all of the above I never claimed that these tools are "good" benchmarks. I merely try to make them produce results that closely predict the results of real NetBench runs. Whether NetBench is actually a "good" benchmark is another topic entirely.

Finally, I should note that SPEC is considering adding CIFS benchmarking to their suite of benchmarks. Interestingly, they are looking at using something based on my nbench tool, or something close to it, so eventually nbench might become the more "official" benchmark. That would certainly be an interesting turn of events :)

Cheers, Tridge

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-21 1:53 ` tridge @ 2004-11-21 2:48 ` Hans Reiser 2004-11-21 3:19 ` tridge 0 siblings, 1 reply; 41+ messages in thread From: Hans Reiser @ 2004-11-21 2:48 UTC (permalink / raw) To: tridge; +Cc: linux-kernel, Reiserfs developers mail-list Would you be willing to do some variation on it that scaled itself to the size of the machine, and generated disk load rather than fitting in ram? I hope you understand my reluctance to optimize for tests that fit into ram..... Thanks, Hans > >Finally, I should note that Spec is considering adding CIFS >benchmarking to their suite of benchmarks. Interestingly, they are >looking at using something based on my nbench tool, or something close >to it, so eventually nbench might become the more "official" >benchmark. That would certainly be an interesting turn of events :) > >Cheers, Tridge > > > > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4 2004-11-21 2:48 ` Hans Reiser @ 2004-11-21 3:19 ` tridge 2004-11-21 6:11 ` Hans Reiser 0 siblings, 1 reply; 41+ messages in thread From: tridge @ 2004-11-21 3:19 UTC (permalink / raw) To: Hans Reiser; +Cc: linux-kernel, Reiserfs developers mail-list

Hans,

> Would you be willing to do some variation on it that scaled itself to
> the size of the machine, and generated disk load rather than fitting in ram?

You can do that now by varying the number of simulated clients, or by varying the load file.

> I hope you understand my reluctance to optimize for tests that fit into
> ram.....

to some extent, yes, but "in memory" tests are actually pretty important for file serving.

In a typical large office environment with one or two thousand users you will only have between 20 and 100 of those users really actively using the file server at any one time. The others are taking a nap, in meetings or staring out the window. Or maybe (being generous), they are all working furiously with cached data. I haven't actually gone into the cubes to check - I just see the server-side stats.

Of those that are active, they rarely have a working set size of over 100MB, and usually much less, so it is not uncommon for the whole workload over a period of 5 minutes to fit in memory on typical file servers. This is especially so on the modern big file servers that might have 16G of ram or more, with modern clients that do aggressive lease-based caching.

There are exceptions of course. Big print shops, rendering farms and high performance computing sites are all examples of sites that have active working sets much larger than typical system memory. The point is that you need to test a wide range of working set sizes.
You also might like to notice that in the published commercial NetBench runs paid for by the big players (like Microsoft, NetApp, EMC etc), you tend to find that the graph only extends to a number of clients equal to the total machine memory divided by 25MB. That is perhaps not a coincidence given that the working set size per client of NetBench is about 22MB. The people who pay for the benchmarks want their customers to see a graph that doesn't have a big cliff at the right hand side. Also, with journaled filesystems running in-memory benchmarks isn't as silly as it first seems, as there are in fact big differences between how the filesystems cope. It isn't just a memory bandwidth test. Windows clients do huge numbers of meta-data operations, and nearly all of those cause journal writes which hit the metal. So while I sympathise with you wanting reiser4 to be tuned for "big" storage, please remember that a good proportion of the installs are likely to be running "in-memory" workloads. Cheers, Tridge ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: performance of filesystem xattrs with Samba4
  2004-11-21  3:19 ` tridge
@ 2004-11-21  6:11 ` Hans Reiser
  0 siblings, 0 replies; 41+ messages in thread
From: Hans Reiser @ 2004-11-21 6:11 UTC (permalink / raw)
To: tridge; +Cc: linux-kernel, Reiserfs developers mail-list

tridge@samba.org wrote:

> So while I sympathise with you wanting reiser4 to be tuned for "big"
> storage, please remember that a good proportion of the installs are
> likely to be running "in-memory" workloads.

I agree that in-memory workloads are important, and that is why we
compress on flush rather than compressing on write for our compression
plugin, and it is why we should spend some time optimizing reiser4 to
make its code paths more lightweight for the in-memory case.

At the same time, I think that the workloads where the filesystem
matters the most are the ones that access the disk. With computers, in
a large percentage of the time that people notice themselves waiting,
it is the disk drive they are waiting on. Sigh, there are so many
things we should optimize for, and it will be years before we have hit
all the important ones.
* Re: performance of filesystem xattrs with Samba4
  2004-11-19  7:38 ` performance of filesystem xattrs with Samba4 tridge
                     ` (3 preceding siblings ...)
  2004-11-19 15:34 ` Hans Reiser
@ 2004-11-21 22:21 ` Nathan Scott
  2004-11-21 23:43 ` tridge
  4 siblings, 1 reply; 41+ messages in thread
From: Nathan Scott @ 2004-11-21 22:21 UTC (permalink / raw)
To: tridge; +Cc: linux-kernel, linux-xfs

Hi Andrew,

On Fri, Nov 19, 2004 at 06:38:40PM +1100, tridge@samba.org wrote:
> ...
> The biggest change from the kernel's point of view is that Samba4
> makes extensive use of filesystem xattrs. Almost every file will have
> a
> ...
> I started some simple benchmarking today using the BENCH-NBENCH
> smbtorture benchmark, with 10 simulated clients and loopback
> networking on a dual Xeon server with 2G ram and a 50G scsi partition.
> I used a 2.6.10-rc2 kernel. This benchmark only involves a
> user.DosAttrib xattr of size 44 on every file (that will be the most
> common situation in production use).
> ...
> xfs                 62 MB/sec
> xfs+xattr           40 MB/sec
> xfs+2Kinode         63 MB/sec
> xfs+xattr+2Kinode   58 MB/sec
> ...
> The XFS results with default options are rather disappointing, as XFS
> has usually been a good performer for Samba workloads. Increasing the
> inode size to 2k brought it back to a more reasonable level.

Interesting. There's been on-and-off discussion for some time as to
whether the default mkfs parameters should be changed; this will add
more fuel to that debate, I expect.

I'm curious why you went to 2K inodes instead of 512 - I guess because
that's the largest inode size with a 4K blocksize? If the defaults were
changed, I expect it would be to switch over to 512 byte inodes - do
you have numbers for that?

> To make it easier to benchmark with xattrs, I'm planning on doing a
> new version of dbench with optional xattr support. That will allow
> others to play with xattr performance for the above workload without

Ah great, thanks, I'll be keen to try that when it's available.

cheers.

-- 
Nathan
* Re: performance of filesystem xattrs with Samba4
  2004-11-21 22:21 ` Nathan Scott
@ 2004-11-21 23:43 ` tridge
  0 siblings, 0 replies; 41+ messages in thread
From: tridge @ 2004-11-21 23:43 UTC (permalink / raw)
To: Nathan Scott; +Cc: linux-kernel, linux-xfs

Nathan,

> I'm curious why you went to 2K inodes instead of 512 - I guess
> because that's the largest inode size with a 4K blocksize? If
> the defaults were changed, I expect it would be to switch over
> to 512 byte inodes - do you have numbers for that?

It was a fairly arbitrary choice. For the test I was running the xattrs
were small (44 bytes), so 512 would have been fine, but some other
tests I run use larger xattrs (for NT ACLs, streams, DOS EAs etc).

> Ah great, thanks, I'll be keen to try that when it's available.

It's now released. You can grab it at:

  http://samba.org/ftp/tridge/dbench/dbench-3.0.tar.gz

It should produce much more consistent results than previous versions
of dbench, plus it has a -x option to enable xattr support. Other
changes include:

 - the runs are now time limited, rather than being a fixed number of
   operations. This gives much more consistent results, especially for
   fast machines.

 - I've changed the mapping of the filesystem operations to be much
   closer to what Samba4 does, including the directory scans for case
   insensitivity, the stat() calls in name resolution and things like
   statfs() calls. The modelling could still be improved, but it's much
   better than it was.

 - the load file is now compatible with the smbtorture NBENCH test
   again (the two diverged a while back).

 - the default load file has been updated to be based on NetBench
   7.0.3, running an enterprise disk mix.

 - the warmup/execute/cleanup phases are now better separated.

Cheers, Tridge
* Re: performance of filesystem xattrs with Samba4
@ 2004-12-03 17:49 Steve French
0 siblings, 0 replies; 41+ messages in thread
From: Steve French @ 2004-12-03 17:49 UTC (permalink / raw)
To: aia21; +Cc: linux-kernel
Anton wrote:
> I have been mulling over in my head for quite a while what
> to do about an interface for "advanced ntfs features" but so far I have
> always pushed this to the back of my mind. After all no point in
> providing advanced features considering we don't even provide full
> read-write access yet. I just thought I would mention NTFS when I saw
>
>But to answer your question I definitely would envisage an interface to
>the kernel driver
The same issue has been on my mind for other filesystems too - since I
can return similar information to NTFS. The "easy" things
to return that could be useful to apps (including Samba4, but also
backup apps etc.) include:
1) file creation time
2) "dos" attribute bits
3) perhaps ACL mapping into "POSIX ACL" (getfacl/setfacl's Linux xattr)
format from the CIFS/NTFS style.
4) streams (which could in a few cases be mapped to xattrs, but are
getting increasingly used and therefore important for certain types of
apps - e.g. network backup - to be able to access)
The first two are already in the on-disk format of various filesystems
(NTFS, VFAT, even JFS), and would be trivial for me to export in the
cifs vfs. I suspect NFSv4, which is similar to CIFS in many ways, would
also have an easy time exporting a few of those. The first two could of
course simply be special casings of the reserved xattr name
"User.DosAttribute" or equivalent used by Samba4. This has a few
advantages - local apps work, and migrations from Windows to Linux are
easier (as more data is preserved) :)
Note that NTFS now has a form of symlink stored in "OS/2 EAs" on disk
(I see them show up on test systems when the Unix Services are loaded),
as well as Unix-like devices - very strange, but these could
potentially be mapped into something that makes sense to Linux.