* i_version, NFSv4 change attribute @ 2009-11-22 22:20 J. Bruce Fields 2009-11-23 11:48 ` tytso 0 siblings, 1 reply; 10+ messages in thread From: J. Bruce Fields @ 2009-11-22 22:20 UTC (permalink / raw) To: linux-ext4, linux-fsdevel Since c654b8a9cba6002aad1c01919e4928a79a4a6dcf, the NFS server has made use of ext4's i_version when available. However, the new i_version support is available only when the filesystem is mounted with the i_version mount option. And the change attribute is required for completely correct NFSv4 operation, which we'd prefer to offer by default! I recall having a conversation with Ted Ts'o about ways to do this without affecting non-NFS-exported filesystems: maybe by providing some sort of persistent flag on the superblock which the nfsd code could turn on automatically if desired? But first it would be useful to know whether there is in fact any disadvantage to just having the i_version on all the time. Jean Noel Cordenner did some tests a couple years ago: http://www.bullopensource.org/ext4/change_attribute/index.html and didn't find any significant difference. But I don't know if those results were convincing (I don't understand what we're looking for here); any suggestions for workloads that would exercise the worst cases? --b. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: i_version, NFSv4 change attribute 2009-11-22 22:20 i_version, NFSv4 change attribute J. Bruce Fields @ 2009-11-23 11:48 ` tytso 2009-11-23 16:44 ` J. Bruce Fields 0 siblings, 1 reply; 10+ messages in thread From: tytso @ 2009-11-23 11:48 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-ext4, linux-fsdevel On Sun, Nov 22, 2009 at 05:20:47PM -0500, J. Bruce Fields wrote: > However, the new i_version support is available only when the filesystem > is mounted with the i_version mount option. And the change attribute is > required for completely correct NFSv4 operation, which we'd prefer to > offer by default! > > I recall having a conversation with Ted Ts'o about ways to do this > without affecting non-NFS-exported filesystems: maybe by providing some > sort of persistent flag on the superblock which the nfsd code could turn > on automatically if desired? > > But first it would be useful to know whether there is in fact any > disadvantage to just having the i_version on all the time. Jean Noel > Cordenner did some tests a couple years ago: > > http://www.bullopensource.org/ext4/change_attribute/index.html > > and didn't find any significant difference. But I don't know if those > results were convincing (I don't understand what we're looking for > here); any suggestions for workloads that would exercise the worst > cases? Hmmm.... the workload that would probably see the biggest hit would be one where we have multiple processes/threads running on different CPUs that are modifying the inode in parallel. (i.e., the sort of workload that a database with multiple clients would naturally see). The test which Bull did above used a workload which was very heavy on creating and destroying files, and it was only a two processor system (not that it mattered; it looks like the fileop benchmark was single-threaded anyway). 
The test I would do is something like a 4 or 8 processor test, with lots of parallel I/O to the same file (at which point we would probably end up bottlenecking on inode->i_lock). It would seem to me that a simple way of fixing this would be to use an atomic64 type for inode->i_version, so we don't have to take the spinlock and bounce cache lines each time i_version gets updated. What we might decide, at the end of the day, is that for common fs workloads no one is going to notice, and for the parallel intensive workloads (i.e., databases), people will be tuning for this anyway, so we can make i_version the default, and noi_version an option people can use to turn off i_version if they are optimizing for the database workload. A similar tuning knob that I should add is one that allows us to set a custom value for sb->s_time_gran, so that we don't have to dirty the inode and engage the jbd2 machinery after *every* single write. Once I add that, or if you use i_version on a file system with a 128-byte inode so the mtime update granularity is a second, I suspect the cost of i_version will be especially magnified, and the database people will very much want to turn off i_version. And that brings up another potential compromise --- what if we only update i_version every n milliseconds? That way if the file is being modified to the tune of hundreds of thousands of updates a second, NFSv4 clients will see a change fairly quickly, within n milliseconds, but we won't be incrementing i_version at incredibly high rates. I suspect that would violate NFSv4 protocol specs somewhere, but would it cause seriously noticeable breakage that would be user visible? If not, maybe that's something we should allow the user to set, perhaps not as the default? Just a thought. Regards, - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: i_version, NFSv4 change attribute 2009-11-23 11:48 ` tytso @ 2009-11-23 16:44 ` J. Bruce Fields 2009-11-23 16:59 ` J. Bruce Fields ` (2 more replies) 0 siblings, 3 replies; 10+ messages in thread From: J. Bruce Fields @ 2009-11-23 16:44 UTC (permalink / raw) To: tytso; +Cc: linux-ext4, linux-fsdevel On Mon, Nov 23, 2009 at 06:48:31AM -0500, tytso@mit.edu wrote: > On Sun, Nov 22, 2009 at 05:20:47PM -0500, J. Bruce Fields wrote: > > However, the new i_version support is available only when the filesystem > > is mounted with the i_version mount option. And the change attribute is > > required for completely correct NFSv4 operation, which we'd prefer to > > offer by default! > > > > I recall having a conversation with Ted Ts'o about ways to do this > > without affecting non-NFS-exported filesystems: maybe by providing some > > sort of persistent flag on the superblock which the nfsd code could turn > > on automatically if desired? > > > > But first it would be useful to know whether there is in fact any > > disadvantage to just having the i_version on all the time. Jean Noel > > Cordenner did some tests a couple years ago: > > > > http://www.bullopensource.org/ext4/change_attribute/index.html > > > > and didn't find any significant difference. But I don't know if those > > results were convincing (I don't understand what we're looking for > > here); any suggestions for workloads that would exercise the worst > > cases? > > Hmmm.... the workload that would probably see the biggest hit would be > one where we have multiple processes/threads running on different CPUs > that are modifying the inode in parallel. (i.e., the sort of workload > that a database with multiple clients would naturally see). Got it, thanks. Is there an existing easy-to-setup workload I could start with, or would it be sufficient to try the simplest possible code that met the above description? 
(E.g., fork a process for each cpu, each just overwriting byte 0 as fast as possible, and count total writes performed per second?) > The test which Bull did above used a workload which was very heavy on > creating and destroying files, and it was only a two processor system > (not that it mattered; it looks like the fileop benchmark was > single-threaded anyway). The test I would do is something like a 4 or > 8 processor test, with lots of parallel I/O to the same file (at which > point we would probably end up bottlenecking on inode->i_lock). > > It would seem to me that a simple way of fixing this would be to use > an atomic64 type for inode->i_version, so we don't have to take the > spinlock and bounce cache lines each time i_version gets updated. The current locking does seem like overkill. > What we might decide, at the end of the day, is that for common fs > workloads no one is going to notice, and for the parallel intensive > workloads (i.e., databases), people will be tuning for this anyway, so > we can make i_version the default, and noi_version an option people > can use to turn off i_version if they are optimizing for the database > workload. > > A similar tuning knob that I should add is one that allows us to set > a custom value for sb->s_time_gran, so that we don't have to dirty the > inode and engage the jbd2 machinery after *every* single write. Once > I add that, or if you use i_version on a file system with a 128-byte > inode so the mtime update granularity is a second, I suspect the cost > of i_version will be especially magnified, and the database people > will very much want to turn off i_version. > > And that brings up another potential compromise --- what if we only > update i_version every n milliseconds? That way if the file is being > modified to the tune of hundreds of thousands of updates a second, > NFSv4 clients will see a change fairly quickly, within n milliseconds, > but we won't be incrementing i_version at incredibly high rates. 
I > suspect that would violate NFSv4 protocol specs somewhere, but would > it cause seriously noticeable breakage that would be user visible? If > not, maybe that's something we should allow the user to set, perhaps > not as the default? Just a thought. That knob would affect the probability of the breakage, but not necessarily the seriousness. The race is: - clock tick write read and check i_version write - clock tick If the second write doesn't modify the i_version, the client will never see it. (Unless a later write bumps i_version again.) If the side we want to optimize is the modifications, I wonder if we could do all the i_version increments on *read* of i_version?: - writes (and other inode modifications) set an "i_version_dirty" flag. - reads of i_version clear the i_version_dirty flag, increment i_version, and return the result. As long as the reader sees the i_version_dirty flag set only after it sees the write that caused it, I think it all works? --b. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: i_version, NFSv4 change attribute 2009-11-23 16:44 ` J. Bruce Fields @ 2009-11-23 16:59 ` J. Bruce Fields 2009-11-23 18:11 ` Trond Myklebust 2009-11-23 18:35 ` tytso 2 siblings, 0 replies; 10+ messages in thread From: J. Bruce Fields @ 2009-11-23 16:59 UTC (permalink / raw) To: tytso; +Cc: linux-ext4, linux-fsdevel On Mon, Nov 23, 2009 at 11:44:45AM -0500, J. Bruce Fields wrote: > If the side we want to optimize is the modifications, I wonder if we > could do all the i_version increments on *read* of i_version?: > > - writes (and other inode modifications) set an "i_version_dirty" > flag. > - reads of i_version clear the i_version_dirty flag, increment > i_version, and return the result. > > As long as the reader sees i_version_flag set only after it sees the > write that caused it, I think it all works? Except I'm a little confused about how i_version behaves (and should behave) on reboot. Assuming it's currently correct, I think it's enough to also increment i_version as above when writing out the inode. --b. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: i_version, NFSv4 change attribute 2009-11-23 16:44 ` J. Bruce Fields 2009-11-23 16:59 ` J. Bruce Fields @ 2009-11-23 18:11 ` Trond Myklebust 2009-11-23 18:19 ` J. Bruce Fields 2009-11-23 18:35 ` tytso 2 siblings, 1 reply; 10+ messages in thread From: Trond Myklebust @ 2009-11-23 18:11 UTC (permalink / raw) To: J. Bruce Fields; +Cc: tytso, linux-ext4, linux-fsdevel On Mon, 2009-11-23 at 11:44 -0500, J. Bruce Fields wrote: > If the side we want to optimize is the modifications, I wonder if we > could do all the i_version increments on *read* of i_version?: > > - writes (and other inode modifications) set an "i_version_dirty" > flag. > - reads of i_version clear the i_version_dirty flag, increment > i_version, and return the result. > > As long as the reader sees i_version_flag set only after it sees the > write that caused it, I think it all works? That probably won't make much of a difference to performance. Most NFSv4 clients will have every WRITE followed by a GETATTR operation in the same compound, so your i_version_dirty flag will always immediately get cleared. The question is, though, why does the jbd2 machinery need to be engaged on _every_ write? The NFS clients don't care if we lose an i_version count due to a sudden server reboot, since that will trigger a rewrite of the dirty data anyway once the server comes back up again. As long as the i_version is guaranteed to be written to stable storage on a successful call to fsync(), then the NFS data integrity requirements are fully satisfied. Trond ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: i_version, NFSv4 change attribute 2009-11-23 18:11 ` Trond Myklebust @ 2009-11-23 18:19 ` J. Bruce Fields 2009-11-23 18:37 ` Trond Myklebust 2009-11-23 18:51 ` tytso 0 siblings, 2 replies; 10+ messages in thread From: J. Bruce Fields @ 2009-11-23 18:19 UTC (permalink / raw) To: Trond Myklebust; +Cc: tytso, linux-ext4, linux-fsdevel On Mon, Nov 23, 2009 at 01:11:19PM -0500, Trond Myklebust wrote: > On Mon, 2009-11-23 at 11:44 -0500, J. Bruce Fields wrote: > > If the side we want to optimize is the modifications, I wonder if we > > could do all the i_version increments on *read* of i_version?: > > > > - writes (and other inode modifications) set an "i_version_dirty" > > flag. > > - reads of i_version clear the i_version_dirty flag, increment > > i_version, and return the result. > > > > As long as the reader sees i_version_flag set only after it sees the > > write that caused it, I think it all works? > > That probably won't make much of a difference to performance. Most NFSv4 > clients will have every WRITE followed by a GETATTR operation in the > same compound, so your i_version_dirty flag will always immediately get > cleared. I was only thinking about non-NFS performance. > The question is, though, why does the jbd2 machinery need to be engaged > on _every_ write? Is it? I thought I remembered a journaling issue from previous discussions, but Ted seemed concerned just about the overhead of an additional spinlock, and looking at the code, the only test of I_VERSION that I can see indeed is in ext4_mark_iloc_dirty(), and indeed just takes a spinlock and updates the i_version. --b. > The NFS clients don't care if we lose an i_version count due to a > sudden server reboot, since that will trigger a rewrite of the dirty > data anyway once the server comes back up again. As long as the > i_version is guaranteed to be written to stable storage on a > successful call to fsync(), then the NFS data integrity requirements > are fully satisfied. 
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: i_version, NFSv4 change attribute 2009-11-23 18:19 ` J. Bruce Fields @ 2009-11-23 18:37 ` Trond Myklebust 2009-11-23 18:51 ` tytso 1 sibling, 0 replies; 10+ messages in thread From: Trond Myklebust @ 2009-11-23 18:37 UTC (permalink / raw) To: J. Bruce Fields; +Cc: tytso, linux-ext4, linux-fsdevel On Mon, 2009-11-23 at 13:19 -0500, J. Bruce Fields wrote: > On Mon, Nov 23, 2009 at 01:11:19PM -0500, Trond Myklebust wrote: > > On Mon, 2009-11-23 at 11:44 -0500, J. Bruce Fields wrote: > > > If the side we want to optimize is the modifications, I wonder if we > > > could do all the i_version increments on *read* of i_version?: > > > > > > - writes (and other inode modifications) set an "i_version_dirty" > > > flag. > > > - reads of i_version clear the i_version_dirty flag, increment > > > i_version, and return the result. > > > > > > As long as the reader sees i_version_flag set only after it sees the > > > write that caused it, I think it all works? > > > > That probably won't make much of a difference to performance. Most NFSv4 > > clients will have every WRITE followed by a GETATTR operation in the > > same compound, so your i_version_dirty flag will always immediately get > > cleared. > > I was only thinking about non-NFS performance. I would think that running a high performance database _and_ NFS server on the same machine would tend to be very much of a corner case anyway. In most setups I'm aware of, the database and NFS server tend to be completely separate machines. > > The question is, though, why does the jbd2 machinery need to be engaged > > on _every_ write? > > Is it? See Ted's email. As I read it, his concern was that if they allow people to reduce the a/m/c/time resolution, then the i_version would still force them to dirty the inode on every write... Trond ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: i_version, NFSv4 change attribute 2009-11-23 18:19 ` J. Bruce Fields 2009-11-23 18:37 ` Trond Myklebust @ 2009-11-23 18:51 ` tytso 2009-11-25 20:48 ` J. Bruce Fields 1 sibling, 1 reply; 10+ messages in thread From: tytso @ 2009-11-23 18:51 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Trond Myklebust, linux-ext4, linux-fsdevel On Mon, Nov 23, 2009 at 01:19:51PM -0500, J. Bruce Fields wrote: > > The question is, though, why does the jbd2 machinery need to be engaged > > on _every_ write? > > Is it? > > I thought I remembered a journaling issue from previous discussions, but > Ted seemed concerned just about the overhead of an additional > spinlock, and looking at the code, the only test of I_VERSION that I can > see indeed is in ext4_mark_iloc_dirty(), and indeed just takes a > spinlock and updates the i_version. There are two concerns. One is the inode->i_lock overhead: at the time we added i_version, the atomic64 type did not yet exist, so the only simple way to implement it was by taking the spinlock. This we can fix, and I think it's a no-brainer that we switch it to be an atomic64, especially for the most common Intel platforms. The second problem is the jbd2 machinery, which gets engaged whenever the inode changes; in the case of sys_write(), that means whenever i_version or i_mtime changes. At the moment, if we are using a 256-byte inode with ext4, we will be updating i_mtime on every single write, and so when ext4_setattr(), which is called from notify_change(), notices that i_mtime is changed, we are engaging the entire jbd2 machinery for every single write. This is not true for a 128-byte inode, since in that case sb->s_time_gran is set to one second, so we would only be updating the inode and engaging the jbd2 machinery once a second. This is true for ext3 and ext4 with 128-byte inodes. 
Now, all of this having been said, Fedora 11 and 12 have been using ext4 as the default filesystem, and for generic desktop usage, people haven't been screaming about the increased CPU overhead implied by engaging the jbd2 machinery on every sys_write(). However, we have had a report that some enterprise database developers have noticed the increased overhead in ext4, and this is on our list of things that require some performance tuning. Hence my comments about a mount option to adjust s_time_gran for the benefit of database workloads, and once we have that mount option, since enabling i_version would mean once again needing to update the inode at every single write(2) call, we would be back with the same problem. Maybe we can find a way to be more clever about doing some (but not all) of the jbd2 work on each sys_write(), and deferring as much as possible to the commit handling. We need to do some investigating to see if that's possible. Even if it isn't, though, my gut tells me that we will probably be able to enable i_version by default for desktop workloads, and tell database server folks that they should mount with the mount options "noi_version,time_gran=1s", or some such. I'd like to do some testing to confirm my intuition first, of course, but that's how I'm currently leaning. Does that make sense? Regards, - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: i_version, NFSv4 change attribute 2009-11-23 18:51 ` tytso @ 2009-11-25 20:48 ` J. Bruce Fields 0 siblings, 0 replies; 10+ messages in thread From: J. Bruce Fields @ 2009-11-25 20:48 UTC (permalink / raw) To: tytso; +Cc: Trond Myklebust, linux-ext4, linux-fsdevel On Mon, Nov 23, 2009 at 01:51:05PM -0500, tytso@mit.edu wrote: > Now, all of this having been said, Fedora 11 and 12 have been using > ext4 as the default filesystem, and for generic desktop usage, people > haven't been screaming about the increased CPU overhead implied by > engaging the jbd2 machinery on every sys_write(). > > However, we have had a report that some enterprise database developers > have noticed the increased overhead in ext4, and this is on our list > of things that require some performance tuning. Hence my comments > about a mount option to adjust s_time_gran for the benefit of database > workloads, and once we have that mount option, since enabling i_version > would mean once again needing to update the inode at every single > write(2) call, we would be back with the same problem. > > Maybe we can find a way to be more clever about doing some (but not > all) of the jbd2 work on each sys_write(), and deferring as much as > possible to the commit handling. We need to do some investigating to > see if that's possible. Even if it isn't, though, my gut tells me > that we will probably be able to enable i_version by default for > desktop workloads, and tell database server folks that they should > mount with the mount options "noi_version,time_gran=1s", or some such. > > I'd like to do some testing to confirm my intuition first, of course, > but that's how I'm currently leaning. Does that make sense? I think so, thanks. So do I have this todo list approximately right?: 1. Use an atomic type instead of a spinlock for i_version, and do some before-and-after benchmarking of writes (following your suggestions in http://marc.info/?l=linux-ext4&m=125900130605891&w=2) 2. 
Turn on i_version by default. (At this point it shouldn't be making things any worse than the high-resolution timestamps are.) 3. Find someone to run database benchmarks, and work on noi_version,time_gran=1s (or whatever) options for their case. I wish I could volunteer at least for #1, but embarrassingly don't have much more than dual-core machines lying around right now to test with. --b. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: i_version, NFSv4 change attribute 2009-11-23 16:44 ` J. Bruce Fields 2009-11-23 16:59 ` J. Bruce Fields 2009-11-23 18:11 ` Trond Myklebust @ 2009-11-23 18:35 ` tytso 2 siblings, 0 replies; 10+ messages in thread From: tytso @ 2009-11-23 18:35 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-ext4, linux-fsdevel On Mon, Nov 23, 2009 at 11:44:45AM -0500, J. Bruce Fields wrote: > > Got it, thanks. Is there an existing easy-to-setup workload I could > start with, or would it be sufficient to try the simplest possible code > that met the above description? (E.g., fork a process for each cpu, > each just overwriting byte 0 as fast as possible, and count total writes > performed per second?) We were actually talking about this on the ext4 call today. The problem is that there isn't a ready-made benchmark that will easily measure this. A database benchmark will show it up (and we may have some results from the DB2 folks indicating the cost of upgrading the timestamps with a nanosecond granularity), but these of course aren't easy to run. The simplest possible workload that you have proposed is the worst case, and I have no doubt that will show the contention on inode->i_lock from inode_inc_version(), and I bet we'll see a big improvement when we change inode->i_version to be an atomic64 type. It will probably also show the overhead of ext4_mark_inode_dirty() being called all the time. Perhaps a slightly fairer and more realistic benchmark would do a write to byte zero followed by an fsync(), and measure both the CPU time per write as well as the writes per second. Either will do the job, though, and I'd recommend grabbing oprofile and lockstat measurements to see what bottlenecks we are hitting with that workload. > If the side we want to optimize is the modifications, I wonder if we > could do all the i_version increments on *read* of i_version?: > > - writes (and other inode modifications) set an "i_version_dirty" > flag. 
> - reads of i_version clear the i_version_dirty flag, increment > i_version, and return the result. > > As long as the reader sees i_version_flag set only after it sees the > write that caused it, I think it all works? I can see two potential problems with that. One is that this implies that the read needs to kick off a journal operation, which means that the act of reading i_version might cause the caller to sleep (not all the time, but in some cases, such as if we get unlucky and need to do a journal commit or checkpoint before we can update i_version). I don't know if the NFSv4 server code would be happy with that! The second problem is what happens if we crash before a read happens. On the ext4 call, Andreas suggested trying to do this work at commit time. This would either mean that i_version would only get updated at the commit interval (default: 5 seconds), or that i_version might be updated more frequently than that, but we would defer as much as possible to the commit time, since it's already the case that if we crash before the commit happens, i_version could end up going backwards (since we may have returned i_version numbers that were never committed). I'm not entirely convinced how much this will actually help, since we have to reserve space in the transaction for the inode update, even if we don't do the copy to the journaled bufferheads on every sys_write(), since we will end up having to take various journal locks on every sys_write() anyway. We'll have to code it up to see whether or not it helps, or how painful it is to actually implement. What I'm hoping we'll find is that for a typical desktop workload i_version updates don't really hurt, and we can enable it by default for desktop workloads. My concern is that really with the database workloads, i_version updates may be especially hurtful especially for certain high dollar value (but rare) benchmarks, such as TPC-C/TPC-H. - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2009-11-25 20:48 UTC | newest] Thread overview: 10+ messages -- 2009-11-22 22:20 i_version, NFSv4 change attribute J. Bruce Fields 2009-11-23 11:48 ` tytso 2009-11-23 16:44 ` J. Bruce Fields 2009-11-23 16:59 ` J. Bruce Fields 2009-11-23 18:11 ` Trond Myklebust 2009-11-23 18:19 ` J. Bruce Fields 2009-11-23 18:37 ` Trond Myklebust 2009-11-23 18:51 ` tytso 2009-11-25 20:48 ` J. Bruce Fields 2009-11-23 18:35 ` tytso