* [PATCH 0/3 v3] dcache: make it more scalable on large system
@ 2013-05-23  1:37 Waiman Long
  2013-05-23  9:42 ` Dave Chinner

From: Waiman Long @ 2013-05-23  1:37 UTC
  To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
      Steve French, Trond Myklebust, Eric Paris
  Cc: Waiman Long, linux-fsdevel, linux-kernel, autofs, ceph-devel,
      linux-cifs, samba-technical, linux-nfs, Chandramouleeswaran, Aswin,
      Norton, Scott J, Andi Kleen, Dave Chinner

Change log:

v2->v3
 - Fix the RCU lock problem found by Al Viro.
 - Rebase the code to the latest v3.10-rc1 linux mainline.
 - Remove patch 4 which may be problematic if the dentry is deleted.
 - Rerun performance measurement using 3.10-rc1 kernel.

v1->v2
 - Include performance improvement in the AIM7 benchmark results because
   of this patch.
 - Modify dget_parent() to avoid taking the lock, if possible, to further
   improve AIM7 benchmark results.

During some perf-record sessions of the kernel running the high_systime
workload of the AIM7 benchmark, it was found that quite a large portion
of the spinlock contention was due to the perf_event_mmap_event()
function itself. This perf kernel function calls d_path() which, in
turn, calls path_get() and dput() indirectly. These 3 functions were
the hottest functions shown in the perf-report output of the
_raw_spin_lock() function on an 8-socket system with 80 cores
(hyperthreading off) running a 3.10-rc1 kernel:

  - 13.91%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock
     - _raw_spin_lock
        + 35.54% path_get
        + 34.85% dput
        + 19.49% d_path

In fact, the output of "perf record -s -a" (without call-graph) showed:

  13.37%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock
   7.61%     ls  [kernel.kallsyms]  [k] _raw_spin_lock
   3.54%   true  [kernel.kallsyms]  [k] _raw_spin_lock

Without the perf monitoring tool running, the actual execution profile
would be quite different. In fact, with this patch set applied, the
output of the same "perf record -s -a" command became:

   2.82%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock
   1.11%     ls  [kernel.kallsyms]  [k] _raw_spin_lock
   0.26%   true  [kernel.kallsyms]  [k] _raw_spin_lock

So the time spent in the _raw_spin_lock() function went down from
24.52% to 4.19%. This shows that the performance data collected by
perf-record can be heavily skewed in some cases on a system with a
large number of CPUs. This set of patches enables perf to give a more
accurate and reliable picture of what is really happening in the
system. At the same time, the patches can also improve the general
performance of systems, especially those with a large number of CPUs.

The d_path() function takes the following two locks:

 1. dentry->d_lock [spinlock] from dget()/dget_parent()/dput()
 2. rename_lock    [seqlock]  from d_path()

This set of patches was designed to minimize the locking overhead of
these code paths.

The current kernel takes dentry->d_lock whenever it wants to increment
or decrement the d_count reference count. However, nothing significant
happens until the reference count goes all the way down to 1 or 0, so
the lock is not actually needed while the count stays above 1. Instead,
an atomic cmpxchg() can be used to increment or decrement the count in
those situations. For safety, the other reference count update
operations have to be changed to use atomic instructions as well. The
idea looks roughly like the sketch below.
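(In this sketch the helper names and exact fallback rules are
illustrative only, not necessarily those used in patch 1.)

```c
/*
 * Illustrative sketch of the lockless d_count fast path. Both
 * directions of the count update can use atomic_cmpxchg() as long as
 * the count stays above the critical 0/1 boundary; only transitions
 * across that boundary still need dentry->d_lock.
 */
#include <linux/atomic.h>

/* Take a reference without d_lock if the count is already positive. */
static inline int d_count_inc_fastpath(atomic_t *count)
{
	int old = atomic_read(count);

	while (old > 0) {
		int prev = atomic_cmpxchg(count, old, old + 1);
		if (prev == old)
			return 1;	/* got the reference lock-free */
		old = prev;		/* lost a race, retry */
	}
	return 0;	/* count is 0: caller must take d_lock and retry */
}

/* Drop a reference without d_lock as long as the count stays above 1. */
static inline int d_count_dec_fastpath(atomic_t *count)
{
	int old = atomic_read(count);

	while (old > 1) {
		int prev = atomic_cmpxchg(count, old, old - 1);
		if (prev == old)
			return 1;	/* dropped the reference lock-free */
		old = prev;
	}
	return 0;	/* would reach 1 or 0: fall back to the locked path */
}
```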
The rename_lock is a sequence lock. The d_path() function takes the
writer lock because it needs to traverse different dentries through
pointers to get the full path name, and hence cannot tolerate changes
in those pointers while it runs. But taking the writer lock also
prevents multiple d_path() calls from proceeding concurrently. A
solution is to introduce a new lock type with a second kind of reader
that can block writers - the sequence read/write lock (seqrwlock). The
d_path() and related functions are then changed to take the reader
lock instead of the writer lock, which allows multiple d_path()
operations to proceed concurrently. (One plausible shape of this lock
is sketched after the diffstat below.)

Additional performance testing was conducted using the AIM7 benchmark.
It is mainly the first patch that has an impact on AIM7; please see
that patch's description for more information about the benchmark
results. Incidentally, these patches also have a favorable impact on
Oracle database performance when measured by the Oracle SLOB benchmark.

The following multi-threaded tests were also run on kernels with and
without the patches, on an 8-socket 80-core system and on a PC with a
4-core i5 processor:

 1. find $HOME -size 0b
 2. cat /proc/*/maps /proc/*/numa_maps
 3. git diff

For both the find-size and cat-maps tests, the performance difference
with hot cache was within a few percentage points and hence within the
margin of error. Single-thread performance was slightly worse, but
multi-thread performance was generally a bit better. Apparently,
reference count updates aren't a significant factor in those tests.
Their perf traces indicate that there was less spinlock contention in
functions like dput(), but the functions themselves ran a little bit
longer on average. The git-diff test showed no difference in
performance; a slight increase in system time was compensated by a
slight decrease in user time.

Of the 3 patches, patch 3 depends on patch 2. The first patch is
independent and can be applied individually.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>

Waiman Long (3):
  dcache: Don't take unnecessary lock in d_count update
  dcache: introduce a new sequence read/write lock type
  dcache: change rename_lock to a sequence read/write lock

 fs/autofs4/waitq.c        |    6 +-
 fs/ceph/mds_client.c      |    4 +-
 fs/cifs/dir.c             |    4 +-
 fs/dcache.c               |  120 ++++++++++++++++++++--------------------
 fs/namei.c                |    2 +-
 fs/nfs/namespace.c        |    6 +-
 include/linux/dcache.h    |  105 +++++++++++++++++++++++++++++++++--
 include/linux/seqrwlock.h |  137 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/auditsc.c          |    4 +-
 9 files changed, 310 insertions(+), 78 deletions(-)
 create mode 100644 include/linux/seqrwlock.h
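One plausible shape for the seqrwlock referenced above pairs a seqcount
with an rwlock. The names mirror the seqlock API but are illustrative;
they need not match what include/linux/seqrwlock.h in patch 2 actually
provides:

```c
/*
 * Sketch of a sequence read/write lock: lockless readers keep using
 * the sequence number with the usual retry pattern, while blocking
 * readers (e.g. d_path()) take the rwlock in read mode, which excludes
 * writers but not each other.
 */
#include <linux/seqlock.h>
#include <linux/spinlock.h>

typedef struct {
	seqcount_t seqcount;	/* for retry-based lockless readers */
	rwlock_t lock;		/* shared by writers and blocking readers */
} seqrwlock_t;

/* Blocking reader: excludes writers, allows other readers. */
static inline void read_seqrwlock(seqrwlock_t *sl)
{
	read_lock(&sl->lock);
}

static inline void read_seqrwunlock(seqrwlock_t *sl)
{
	read_unlock(&sl->lock);
}

/* Writer, e.g. a rename: excludes everyone and bumps the sequence. */
static inline void write_seqrwlock(seqrwlock_t *sl)
{
	write_lock(&sl->lock);
	write_seqcount_begin(&sl->seqcount);
}

static inline void write_seqrwunlock(seqrwlock_t *sl)
{
	write_seqcount_end(&sl->seqcount);
	write_unlock(&sl->lock);
}

/* Lockless readers work exactly as with an ordinary seqlock. */
static inline unsigned read_seqrwbegin(const seqrwlock_t *sl)
{
	return read_seqcount_begin(&sl->seqcount);
}

static inline int read_seqrwretry(const seqrwlock_t *sl, unsigned start)
{
	return read_seqcount_retry(&sl->seqcount, start);
}
```

Because a writer still bumps the sequence count, existing lockless
readers that use the begin/retry pattern keep working unchanged;
d_path() simply becomes a blocking reader that excludes writers but
not other readers.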
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
  2013-05-23  1:37 Waiman Long
@ 2013-05-23  9:42 ` Dave Chinner
  2013-05-23 21:34   ` Waiman Long

From: Dave Chinner @ 2013-05-23  9:42 UTC
  To: Waiman Long
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
      Steve French, Trond Myklebust, Eric Paris, linux-fsdevel,
      linux-kernel, autofs, ceph-devel, linux-cifs, samba-technical,
      linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen

On Wed, May 22, 2013 at 09:37:25PM -0400, Waiman Long wrote:
> Change log:
>
> v2->v3
>  - Fix the RCU lock problem found by Al Viro.
>  - Rebase the code to the latest v3.10-rc1 linux mainline.
>  - Remove patch 4 which may be problematic if the dentry is deleted.
>  - Rerun performance measurement using 3.10-rc1 kernel.
>
> v1->v2
>  - Include performance improvement in the AIM7 benchmark results because
>    of this patch.
>  - Modify dget_parent() to avoid taking the lock, if possible, to further
>    improve AIM7 benchmark results.
>
> During some perf-record sessions of the kernel running the high_systime
> workload of the AIM7 benchmark, it was found that quite a large portion
> of the spinlock contention was due to the perf_event_mmap_event()
> function itself. This perf kernel function calls d_path() which, in
> turn, calls path_get() and dput() indirectly. These 3 functions were
> the hottest functions shown in the perf-report output of the
> _raw_spin_lock() function on an 8-socket system with 80 cores
> (hyperthreading off) running a 3.10-rc1 kernel:

<sigh>

What was it I said about this patchset when you posted it to speed up
an Oracle benchmark back in February? I'll repeat:

"Nobody should be doing reverse dentry-to-name lookups in a quantity
sufficient for it to become a performance limiting factor."

And that makes whatever that tracepoint is doing utterly stupid.
Dumping out full paths in a tracepoint is hardly "low overhead", and
that tracepoint should be stomped on from a great height. Sure, print
the filename straight out of the dentry into a tracepoint, but
repeatedly calculating the full path (probably for the same set of
dentries) is just a dumb thing to do.

Anyway, your numbers show that a single AIM7 microbenchmark goes better
under tracing the specific mmap event that uses d_path(), but the
others are on average a little bit slower. That doesn't convince me
that this is worth doing. Indeed, what happens to performance when you
aren't tracing?

Indeed, have you analysed what makes that microbenchmark contend so
much on the dentry lock while reverse lookups are occurring? Dentry
lock contention does not necessarily indicate a problem with the
dentry locks, and without knowing why there is contention occurring
(esp. compared to the other benchmarks) we can't really determine if
this is a good solution or not...

IOWs, you need more than one microbenchmark that interacts with some
naive monitoring code to justify the complexity these locking changes
introduce....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
  2013-05-23  9:42 ` Dave Chinner
@ 2013-05-23 21:34   ` Waiman Long

From: Waiman Long @ 2013-05-23 21:34 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
      Steve French, Trond Myklebust, Eric Paris, linux-fsdevel,
      linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs,
      Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen

On 05/23/2013 05:42 AM, Dave Chinner wrote:
> On Wed, May 22, 2013 at 09:37:25PM -0400, Waiman Long wrote:
>> During some perf-record sessions of the kernel running the high_systime
>> workload of the AIM7 benchmark, it was found that quite a large portion
>> of the spinlock contention was due to the perf_event_mmap_event()
>> function itself. This perf kernel function calls d_path() which, in
>> turn, calls path_get() and dput() indirectly. These 3 functions were
>> the hottest functions shown in the perf-report output of the
>> _raw_spin_lock() function on an 8-socket system with 80 cores
>> (hyperthreading off) running a 3.10-rc1 kernel:
>
> What was it I said about this patchset when you posted it to speed up
> an Oracle benchmark back in February? I'll repeat:
>
> "Nobody should be doing reverse dentry-to-name lookups in a quantity
> sufficient for it to become a performance limiting factor."

Thanks for the comment, but my point is that the d_lock contention is
skewing the data about how much spinlock contention actually happened
in the workload, and it makes it harder to pinpoint problem areas to
look at. This is not about performance, it is about an accurate
representation of performance data. Ideally, we want the overhead of
turning on perf instrumentation to be as low as possible.

> And that makes whatever that tracepoint is doing utterly stupid.
> Dumping out full paths in a tracepoint is hardly "low overhead", and
> that tracepoint should be stomped on from a great height. Sure, print
> the filename straight out of the dentry into a tracepoint, but
> repeatedly calculating the full path (probably for the same set of
> dentries) is just a dumb thing to do.
>
> Anyway, your numbers show that a single AIM7 microbenchmark goes better
> under tracing the specific mmap event that uses d_path(), but the
> others are on average a little bit slower. That doesn't convince me
> that this is worth doing. Indeed, what happens to performance when you
> aren't tracing?
>
> Indeed, have you analysed what makes that microbenchmark contend so
> much on the dentry lock while reverse lookups are occurring? Dentry
> lock contention does not necessarily indicate a problem with the
> dentry locks, and without knowing why there is contention occurring
> (esp. compared to the other benchmarks) we can't really determine if
> this is a good solution or not...

What made it contend so much was the large number of CPUs available in
my test system, which is an 8-socket Westmere-EX machine with 80 cores.
As perf was collecting data from every core, the threads unavoidably
bumped into each other when translating dentries back to full paths.
The current code only allows one CPU in the d_path() critical path; my
patch allows all of them to be in the critical path concurrently (see
the sketch below).

The upcoming Ivy Bridge-EX can have up to 15 cores per socket, so even
a 4-socket machine will have up to 60 cores, or 120 virtual CPUs if
hyperthreading is turned on.

> IOWs, you need more than one microbenchmark that interacts with some
> naive monitoring code to justify the complexity these locking changes
> introduce....

The first patch can also independently improve the performance of a
number of AIM7 workloads, including an almost 7X improvement in the
short workload. More detailed information about these performance
benefits is in the description of the first patch. I will try to
collect more performance improvement data on other workloads too.

Thanks for the review.

Longman
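As referenced above, the shape of the d_path() change is roughly the
following. It is a simplified sketch: error handling is omitted,
build_path() is a hypothetical stand-in for the real prepend-path
helpers, and the seqrwlock calls use the illustrative names sketched
earlier in this thread, not necessarily those of patch 2.

```c
/* Before (v3.10-era): every d_path() caller serializes on the writer lock. */
char *d_path_before(const struct path *path, char *buf, int buflen)
{
	char *res;

	write_seqlock(&rename_lock);		/* one CPU at a time */
	res = build_path(path, buf, buflen);	/* walks d_parent pointers */
	write_sequnlock(&rename_lock);
	return res;
}

/* After: path walkers become blocking readers and can run in parallel. */
char *d_path_after(const struct path *path, char *buf, int buflen)
{
	char *res;

	read_seqrwlock(&rename_lock);		/* still excludes renames */
	res = build_path(path, buf, buflen);	/* same pointer walk */
	read_seqrwunlock(&rename_lock);
	return res;
}
```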
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
  2013-05-23 21:34   ` Waiman Long
@ 2013-05-27  2:09     ` Dave Chinner
  2013-05-29 15:55       ` Waiman Long

From: Dave Chinner @ 2013-05-27  2:09 UTC
  To: Waiman Long
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
      Steve French, Trond Myklebust, Eric Paris, linux-fsdevel,
      linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs,
      Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen

On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote:
> On 05/23/2013 05:42 AM, Dave Chinner wrote:
> > What was it I said about this patchset when you posted it to speed
> > up an Oracle benchmark back in February? I'll repeat:
> >
> > "Nobody should be doing reverse dentry-to-name lookups in a quantity
> > sufficient for it to become a performance limiting factor."
>
> Thanks for the comment, but my point is that the d_lock contention is
> skewing the data about how much spinlock contention actually happened
> in the workload, and it makes it harder to pinpoint problem areas to
> look at. This is not about performance, it is about an accurate
> representation of performance data. Ideally, we want the overhead of
> turning on perf instrumentation to be as low as possible.

Right. But d_path will never be "low overhead", and as such it
shouldn't be used by perf.

> > And that makes whatever that tracepoint is doing utterly stupid.
> > Dumping out full paths in a tracepoint is hardly "low overhead", and
> > that tracepoint should be stomped on from a great height. Sure,
> > print the filename straight out of the dentry into a tracepoint, but
> > repeatedly calculating the full path (probably for the same set of
> > dentries) is just a dumb thing to do.
> >
> > Anyway, your numbers show that a single AIM7 microbenchmark goes
> > better under tracing the specific mmap event that uses d_path(), but
> > the others are on average a little bit slower. That doesn't convince
> > me that this is worth doing. Indeed, what happens to performance
> > when you aren't tracing?
> >
> > Indeed, have you analysed what makes that microbenchmark contend so
> > much on the dentry lock while reverse lookups are occurring? Dentry
> > lock contention does not necessarily indicate a problem with the
> > dentry locks, and without knowing why there is contention occurring
> > (esp. compared to the other benchmarks) we can't really determine if
> > this is a good solution or not...
>
> What made it contend so much was the large number of CPUs available in
> my test system, which is an 8-socket Westmere-EX machine with 80 cores.
> As perf was collecting data from every core, the threads unavoidably
> bumped into each other when translating dentries back to full paths.
> The current code only allows one CPU in the d_path() critical path; my
> patch allows all of them to be in the critical path concurrently.

Yes, I know *how* contention occurs and what your patch does. I'm
asking you to explain *why* we need to fix this specific workload, and
why the tracepoint that calls d_path() actually needs to do that.

Your numbers indicate that your patchset decreases performance in the
common, non-d_path intensive workloads, so why should we add all this
complexity to optimise a slow path at the expense of significant
complexity and lower performance in the fast path?

> The upcoming Ivy Bridge-EX can have up to 15 cores per socket, so even
> a 4-socket machine will have up to 60 cores, or 120 virtual CPUs if
> hyperthreading is turned on.

I don't see why that matters. I've been dealing with systems much
larger than that since early 2002, and the process for dealing with
lock contention hasn't changed much in the last 10 years. First we
need to determine that what you are doing is sane, whether there's a
simpler fix than changing locking, and whether it's actually any
faster in the common case we care about....

> > IOWs, you need more than one microbenchmark that interacts with some
> > naive monitoring code to justify the complexity these locking
> > changes introduce....
>
> The first patch can also independently improve the performance of a
> number of AIM7 workloads, including an almost 7X improvement in the
> short workload. More detailed information about these performance
> benefits is in the description of the first patch. I will try to
> collect more performance improvement data on other workloads too.
>
> Thanks for the review.

I haven't reviewed anything. All I've done is comment on the need to
justify the complexity of the change.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
  2013-05-27  2:09     ` Dave Chinner
@ 2013-05-29 15:55       ` Waiman Long
  2013-05-29 16:13         ` Andi Kleen
  2013-05-29 16:18         ` Simo Sorce

From: Waiman Long @ 2013-05-29 15:55 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
      Steve French, Trond Myklebust, Eric Paris, linux-fsdevel,
      linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs,
      Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen

On 05/26/2013 10:09 PM, Dave Chinner wrote:
> On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote:
>> Thanks for the comment, but my point is that the d_lock contention is
>> skewing the data about how much spinlock contention actually happened
>> in the workload, and it makes it harder to pinpoint problem areas to
>> look at. This is not about performance, it is about an accurate
>> representation of performance data. Ideally, we want the overhead of
>> turning on perf instrumentation to be as low as possible.
> Right. But d_path will never be "low overhead", and as such it
> shouldn't be used by perf.

The d_path() function is called by perf_event_mmap_event(), which
translates a VMA to its file path for memory segments backed by files.
As perf is not just for sampling data within the kernel, it can also
be used for checking access patterns in user space. As a result, it
needs to map VMAs back to the backing files to access their symbol
information. If d_path() is not the right function to call for this
purpose, what other alternatives do we have? There may be ways to
reduce calls to d_path() by doing some kind of caching, but that would
certainly increase the complexity of the perf code.

> Yes, I know *how* contention occurs and what your patch does. I'm
> asking you to explain *why* we need to fix this specific workload,
> and why the tracepoint that calls d_path() actually needs to do that.

My patch set consists of 2 different changes. The first one avoids
taking the d_lock lock when updating the reference count in the
dentries. This change also benefits other workloads that are
filesystem intensive. One particular example is the short workload in
the AIM7 benchmark. One of the job types in the short workload is
"misc_rtns_1", which calls security functions like getpwnam(),
getpwuid() and getgrgid() a couple of times. These functions open the
/etc/passwd or /etc/group files, read their content and close the
files. It is this intensive open/read/close sequence from multiple
threads that is causing 80%+ contention in the d_lock on a system with
a large number of cores (a minimal reproducer of the pattern is
sketched below). The MIT MOSBench paper also outlined dentry reference
counting as a scalability roadblock for its Apache and Exim tests. So
that change will certainly help workloads that are dcache reference
counting intensive.

The second change, to rename_lock, is mainly for reducing the d_path()
lock contention for perf and other applications that may need to
access /proc files like maps or numa_maps frequently. If you think the
rename_lock change is not worth the effort just to benefit perf, I
would still like to see the first one go in, as it can certainly
benefit other workloads.

> Your numbers indicate that your patchset decreases performance in the
> common, non-d_path intensive workloads, so why should we add all this
> complexity to optimise a slow path at the expense of significant
> complexity and lower performance in the fast path?

My numbers didn't actually indicate a performance regression in other
common workloads. They just indicate that there wasn't a significant
change in performance, as plus or minus a few percentage points here
and there is within the margin of error for the tests that I used.

>> The upcoming Ivy Bridge-EX can have up to 15 cores per socket, so
>> even a 4-socket machine will have up to 60 cores, or 120 virtual
>> CPUs if hyperthreading is turned on.
> I don't see why that matters. I've been dealing with systems much
> larger than that since early 2002, and the process for dealing with
> lock contention hasn't changed much in the last 10 years. First we
> need to determine that what you are doing is sane, whether there's a
> simpler fix than changing locking, and whether it's actually any
> faster in the common case we care about....

I certainly agree with that. My testing indicated no change in
performance for the common case, but significant speed-ups in some
scenarios with a large number of CPUs.

Regards,
Longman
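As referenced above, the misc_rtns_1-style contention pattern boils
down to something like the following userspace loop. This is an
illustrative stand-in, not code from AIM7, and it assumes the "files"
nsswitch backend with no nscd/sssd caching in front of it:

```c
/*
 * Minimal multi-threaded reproducer of the open/read/close pattern:
 * each thread repeatedly re-resolves the same user/group entries,
 * which opens, reads and closes /etc/passwd and /etc/group and so
 * hammers the d_count of the same few shared dentries on every core.
 *
 * Build: cc -O2 -pthread -o dlock-repro dlock-repro.c
 */
#include <grp.h>
#include <pthread.h>
#include <pwd.h>
#include <stdio.h>

#define NTHREADS 80		/* one thread per core on the test box */
#define ITERS    100000

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		getpwnam("root");	/* open/read/close /etc/passwd */
		getpwuid(0);		/* ditto */
		getgrgid(0);		/* open/read/close /etc/group */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	puts("done");
	return 0;
}
```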
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
  2013-05-29 15:55       ` Waiman Long
@ 2013-05-29 16:13         ` Andi Kleen
  2013-05-29 20:23           ` Waiman Long

From: Andi Kleen @ 2013-05-29 16:13 UTC
  To: Waiman Long
  Cc: Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi,
      Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris,
      linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs,
      linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen

> The d_path() function is called by perf_event_mmap_event(), which
> translates a VMA to its file path for memory segments backed by files.
> As perf is not just for sampling data within the kernel, it can also
> be used for checking access patterns in user space. As a result, it
> needs to map VMAs back to the backing files to access their symbol
> information. If d_path() is not the right function to call for this
> purpose, what other alternatives do we have?

In principle it should be only called for new file mappings getting
mapped. Do you really have that many new file mappings all the time?
Or is this related to program startup?

> My patch set consists of 2 different changes. The first one avoids
> taking the d_lock lock when updating the reference count in the
> dentries. This change also benefits other workloads that are
> filesystem intensive. One particular example is the short workload in
> the AIM7 benchmark. One of the job types in the short workload is
> "misc_rtns_1", which calls security functions like getpwnam(),
> getpwuid() and getgrgid() a couple of times. These functions open the
> /etc/passwd or /etc/group files, read their content and close the
> files. It is this intensive open/read/close sequence from multiple
> threads that is causing 80%+ contention in the d_lock on a system
> with a large number of cores. The MIT MOSBench paper also outlined
> dentry reference counting as a

The paper was before Nick Piggin's RCU (and our) work on this. Modern
kernels do not have dcache problems with MOSBench, unless you run
weird security modules like SMACK that effectively disable dcache RCU.

BTW lock elision may fix these problems anyway, in a much simpler way.

-Andi

ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
  2013-05-29 16:13         ` Andi Kleen
@ 2013-05-29 20:23           ` Waiman Long

From: Waiman Long @ 2013-05-29 20:23 UTC
  To: Andi Kleen
  Cc: Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi,
      Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris,
      linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs,
      linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J

On 05/29/2013 12:13 PM, Andi Kleen wrote:
>> The d_path() function is called by perf_event_mmap_event(), which
>> translates a VMA to its file path for memory segments backed by
>> files. As perf is not just for sampling data within the kernel, it
>> can also be used for checking access patterns in user space. As a
>> result, it needs to map VMAs back to the backing files to access
>> their symbol information. If d_path() is not the right function to
>> call for this purpose, what other alternatives do we have?
> In principle it should be only called for new file mappings getting
> mapped. Do you really have that many new file mappings all the time?
> Or is this related to program startup?

The AIM7 benchmark that I used runs a large number of relatively short
jobs. I think each time a new job is spawned, the file mappings have
to be redone again. It is probably not a big problem for long-running
processes.

>> My patch set consists of 2 different changes. The first one avoids
>> taking the d_lock lock when updating the reference count in the
>> dentries. This change also benefits other workloads that are
>> filesystem intensive. One particular example is the short workload
>> in the AIM7 benchmark. One of the job types in the short workload is
>> "misc_rtns_1", which calls security functions like getpwnam(),
>> getpwuid() and getgrgid() a couple of times. These functions open
>> the /etc/passwd or /etc/group files, read their content and close
>> the files. It is this intensive open/read/close sequence from
>> multiple threads that is causing 80%+ contention in the d_lock on a
>> system with a large number of cores. The MIT MOSBench paper also
>> outlined dentry reference counting as a
> The paper was before Nick Piggin's RCU (and our) work on this. Modern
> kernels do not have dcache problems with MOSBench, unless you run
> weird security modules like SMACK that effectively disable dcache RCU.

I had tried, but have not yet been able to run MOSBench myself. Thanks
for letting me know that the dcache problem wrt MOSBench was fixed.

> BTW lock elision may fix these problems anyway, in a much simpler way.

I certainly hope so. However, there will still be a lot of computers
out there running pre-Haswell Intel chips. For them, locking is still
a problem that needs to be solved.

Regards,
Longman
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
  2013-05-29 15:55       ` Waiman Long
@ 2013-05-29 16:18         ` Simo Sorce

From: Simo Sorce @ 2013-05-29 16:18 UTC
  To: Waiman Long
  Cc: Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi,
      Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris,
      linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs,
      linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen

On Wed, 2013-05-29 at 11:55 -0400, Waiman Long wrote:
> My patch set consists of 2 different changes. The first one avoids
> taking the d_lock lock when updating the reference count in the
> dentries. This change also benefits other workloads that are
> filesystem intensive. One particular example is the short workload in
> the AIM7 benchmark. One of the job types in the short workload is
> "misc_rtns_1", which calls security functions like getpwnam(),
> getpwuid() and getgrgid() a couple of times. These functions open the
> /etc/passwd or /etc/group files, read their content and close the
> files. It is this intensive open/read/close sequence from multiple
> threads that is causing 80%+ contention in the d_lock on a system
> with a large number of cores.

To be honest, a workload based on /etc/passwd or /etc/group is
completely artificial; in actual usage, if you really have that kind of
access you use nscd or sssd with their shared-memory caches, which
remove most of the file access entirely.

I have no beef with the rest, but repeated access to nsswitch
information is not something you need to optimize at the file system
layer, and it should not be brought up as a point in favor.

Simo.

-- 
Simo Sorce * Red Hat, Inc * New York
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system 2013-05-29 16:18 ` Simo Sorce (?) @ 2013-05-29 16:56 ` Andi Kleen 2013-05-29 17:03 ` Simo Sorce 2013-05-29 20:37 ` Waiman Long -1 siblings, 2 replies; 33+ messages in thread From: Andi Kleen @ 2013-05-29 16:56 UTC (permalink / raw) To: Simo Sorce Cc: Waiman Long, Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen On Wed, May 29, 2013 at 12:18:09PM -0400, Simo Sorce wrote: > On Wed, 2013-05-29 at 11:55 -0400, Waiman Long wrote: > > > My patch set consists of 2 different changes. The first one is to avoid > > taking the d_lock lock when updating the reference count in the > > dentries. This particular change also benefit some other workloads that > > are filesystem intensive. One particular example is the short workload > > in the AIM7 benchmark. One of the job type in the short workload is > > "misc_rtns_1" which calls security functions like getpwnam(), > > getpwuid(), getgrgid() a couple of times. These functions open the > > /etc/passwd or /etc/group files, read their content and close the files. > > It is the intensive open/read/close sequence from multiple threads that > > is causing 80%+ contention in the d_lock on a system with large number > > of cores. > > To be honest a workload base on /etc/passwd or /etc/group is completely > artificial, in actual usage, if you really have such access you use > nscd or sssd with their shared memory caches to completely remove most > of the file access. I don't fully agree at this point. A lot of things can be tuned away, but in practice we want things to perform well out of the box without needing all kinds of magic tuning that only Also this is just normal file access, nothing special about it. It simply has to scale. For all kinds of workloads. And it does, just d_path messes it up. > I have no beef on the rest but repeated access to Nsswitch information > is not something you need to optimize at the file system layer and > should not be brought up as a point in favor. This is about repeated access to arbitrary files. -Andi ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system 2013-05-29 16:56 ` Andi Kleen @ 2013-05-29 17:03 ` Simo Sorce 2013-05-29 20:37 ` Waiman Long 1 sibling, 0 replies; 33+ messages in thread From: Simo Sorce @ 2013-05-29 17:03 UTC (permalink / raw) To: Andi Kleen Cc: Waiman Long, Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J On Wed, 2013-05-29 at 18:56 +0200, Andi Kleen wrote: > On Wed, May 29, 2013 at 12:18:09PM -0400, Simo Sorce wrote: > > On Wed, 2013-05-29 at 11:55 -0400, Waiman Long wrote: > > > > > My patch set consists of 2 different changes. The first one is to avoid > > > taking the d_lock lock when updating the reference count in the > > > dentries. This particular change also benefit some other workloads that > > > are filesystem intensive. One particular example is the short workload > > > in the AIM7 benchmark. One of the job type in the short workload is > > > "misc_rtns_1" which calls security functions like getpwnam(), > > > getpwuid(), getgrgid() a couple of times. These functions open the > > > /etc/passwd or /etc/group files, read their content and close the files. > > > It is the intensive open/read/close sequence from multiple threads that > > > is causing 80%+ contention in the d_lock on a system with large number > > > of cores. > > > > To be honest a workload base on /etc/passwd or /etc/group is completely > > artificial, in actual usage, if you really have such access you use > > nscd or sssd with their shared memory caches to completely remove most > > of the file access. > > I don't fully agree at this point. A lot of things can be tuned away, > but in practice we want things to perform well out of the box without > needing all kinds of magic tuning that only Phrase seems cut mid-sentence? > Also this is just normal file access, nothing special about it. > It simply has to scale. For all kinds of workloads. > > And it does, just d_path messes it up. Well, there are reasonable workloads and artificial ones; I am just warning not to use 'that' specific test as a good indicator. If you have other reasonable workloads that show a similar flow, feel free to bring them up. > > I have no beef on the rest but repeated access to Nsswitch information > > is not something you need to optimize at the file system layer and > > should not be brought up as a point in favor. > > This is about repeated access to arbitrary files. Ok. I do not want to start a discussion on this; I just pointed out that this specific case was not really a good example. Hopefully there are others that justify the patchset. Simo. -- Simo Sorce * Red Hat, Inc * New York ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system 2013-05-29 16:56 ` Andi Kleen 2013-05-29 17:03 ` Simo Sorce @ 2013-05-29 20:37 ` Waiman Long 1 sibling, 0 replies; 33+ messages in thread From: Waiman Long @ 2013-05-29 20:37 UTC (permalink / raw) To: Andi Kleen Cc: Simo Sorce, Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J On 05/29/2013 12:56 PM, Andi Kleen wrote: > On Wed, May 29, 2013 at 12:18:09PM -0400, Simo Sorce wrote: >> To be honest a workload base on /etc/passwd or /etc/group is completely >> artificial, in actual usage, if you really have such access you use >> nscd or sssd with their shared memory caches to completely remove most >> of the file access. > I don't fully agree at this point. A lot of things can be tuned away, > but in practice we want things to perform well out of the box without > needing all kinds of magic tuning that only > > Also this is just normal file access, nothing special about it. > It simply has to scale. For all kinds of workloads. > > And it does, just d_path messes it up. Just for clarification, the AIM7 workloads are not affected by the current d_path() code; they are speed-limited by the lock contention in the dcache reference counting code. However, both the d_path() change and the dentry reference counting change are needed to eliminate the overhead introduced by the use of the perf-record command. Regards, Longman ^ permalink raw reply [flat|nested] 33+ messages in thread
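Aside for readers following the thread: one plausible shape of the reference-counting change being discussed, assuming a hypothetical atomic d_count (in the mainline kernel of the time d_count was a plain integer protected by d_lock, so the field type itself is part of the assumption, and the real patch is in the series):

    /* Hypothetical sketch, not the actual patch: bump the count with
     * cmpxchg while it stays above 1, and fall back to d_lock only
     * near the boundary where the dentry may need real work. */
    static void dget_sketch(struct dentry *dentry)
    {
            int count;

            while ((count = atomic_read(&dentry->d_count)) > 1) {
                    /* succeeds only if nobody changed the count under us */
                    if (atomic_cmpxchg(&dentry->d_count, count,
                                       count + 1) == count)
                            return;
            }
            spin_lock(&dentry->d_lock);
            atomic_inc(&dentry->d_count);
            spin_unlock(&dentry->d_lock);
    }

Under heavy open/read/close traffic the count of a hot dentry can stay above 1 most of the time, making the locked fallback the rare path.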
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system @ 2013-05-29 20:32 ` Waiman Long 0 siblings, 0 replies; 33+ messages in thread From: Waiman Long @ 2013-05-29 20:32 UTC (permalink / raw) To: Simo Sorce Cc: Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen On 05/29/2013 12:18 PM, Simo Sorce wrote: > On Wed, 2013-05-29 at 11:55 -0400, Waiman Long wrote: > >> My patch set consists of 2 different changes. The first one is to avoid >> taking the d_lock lock when updating the reference count in the >> dentries. This particular change also benefit some other workloads that >> are filesystem intensive. One particular example is the short workload >> in the AIM7 benchmark. One of the job type in the short workload is >> "misc_rtns_1" which calls security functions like getpwnam(), >> getpwuid(), getgrgid() a couple of times. These functions open the >> /etc/passwd or /etc/group files, read their content and close the files. >> It is the intensive open/read/close sequence from multiple threads that >> is causing 80%+ contention in the d_lock on a system with large number >> of cores. > To be honest a workload base on /etc/passwd or /etc/group is completely > artificial, in actual usage, if you really have such access you use > nscd or sssd with their shared memory caches to completely remove most > of the file access. > I have no beef on the rest but repeated access to Nsswitch information > is not something you need to optimize at the file system layer and > should not be brought up as a point in favor. The misc_rtns_1 workload that I described here is just part of a larger workload involving other activities; it represents just 1/17 of the total jobs that were spawned. This particular job type, however, dominates the time because of the lock contention that it creates. I agree that it is an artificial workload, as most benchmarks are. It is certainly an exaggeration of what a real workload may be, but that doesn't mean similar contention will not happen in the real world, especially when the trend is to pack more and more CPU cores into the same machine. Regards, Longman ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system @ 2013-05-29 18:46 ` J. Bruce Fields 0 siblings, 0 replies; 33+ messages in thread From: J. Bruce Fields @ 2013-05-29 18:46 UTC (permalink / raw) To: Waiman Long Cc: Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen On Wed, May 29, 2013 at 11:55:14AM -0400, Waiman Long wrote: > On 05/26/2013 10:09 PM, Dave Chinner wrote: > >On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote: > >>On 05/23/2013 05:42 AM, Dave Chinner wrote: > >>> > >>>What was it I said about this patchset when you posted it to speed > >>>up an Oracle benchmark back in february? I'll repeat: > >>> > >>>"Nobody should be doing reverse dentry-to-name lookups in a quantity > >>>sufficient for it to become a performance limiting factor." > >>Thank for the comment, but my point is that it is the d_lock > >>contention is skewing the data about how much spin lock contention > >>had actually happened in the workload and it makes it harder to > >>pinpoint problem areas to look at. This is not about performance, it > >>is about accurate representation of performance data. Ideally, we > >>want the overhead of turning on perf instrumentation to be as low as > >>possible. > >Right. But d_path will never be "low overhead", and as such it > >shouldn't be used by perf. > > The d_path() is called by perf_event_mmap_event() which translates > VMA to its file path for memory segments backed by files. As perf is > not just for sampling data within the kernel, it can also be used > for checking access pattern in the user space. As a result, it needs > to map VMAs back to the backing files to access their symbols > information. If d_path() is not the right function to call for this > purpose, what other alternatives do we have? As Dave said before, is the last path component sufficient? Or how about an inode number? --b. ^ permalink raw reply [flat|nested] 33+ messages in thread
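For context, the translation in question looks roughly like the following: a simplified sketch modeled on perf_event_mmap_event(), with buffer management and error handling trimmed.

    /* Sketch: resolve a file-backed VMA to a pathname for the mmap
     * event. The d_path() call is where the rename_lock and d_lock
     * traffic discussed in this thread comes from. */
    static const char *mmap_event_name(struct vm_area_struct *vma,
                                       char *buf, int buflen)
    {
            struct file *file = vma->vm_file;
            char *name;

            if (!file)
                    return NULL;    /* anonymous mapping, no path */

            name = d_path(&file->f_path, buf, buflen);
            if (IS_ERR(name))
                    return NULL;    /* e.g. path longer than buf */
            return name;
    }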
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system 2013-05-29 18:46 ` J. Bruce Fields (?) @ 2013-05-29 20:37 ` Andi Kleen [not found] ` <20130529203700.GM6123-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org> ` (2 more replies) -1 siblings, 3 replies; 33+ messages in thread From: Andi Kleen @ 2013-05-29 20:37 UTC (permalink / raw) To: J. Bruce Fields Cc: Waiman Long, Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen > As Dave said before, is the last path component sufficient? Or how > about an inode number? Neither works; the profiler needs to find the file and read it. Inode searching would be incredibly expensive, unless the file system provided an "open-by-inode" primitive. In normal operation you only have that event when starting up the program and when each shared library gets mapped in. -Andi ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system @ 2013-05-29 20:43 ` J. Bruce Fields 0 siblings, 0 replies; 33+ messages in thread From: J. Bruce Fields @ 2013-05-29 20:43 UTC (permalink / raw) To: Andi Kleen Cc: Waiman Long, Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J On Wed, May 29, 2013 at 10:37:00PM +0200, Andi Kleen wrote: > > As Dave said before, is the last path component sufficient? Or how > > about an inode number? > > Neither works, the profiler needs to find the file and read it. > > inode searching would be incredible expensive, How fast does it need to be? Actually analyzing the profile is something you only do once after everything's over, right? > unless the file system provided a "open-by-inode" primitive Well, I suppose there is open_by_handle_at. --b. > In normal operation you only have that event when starting up the > program and when each shared library gets mapped in. > > -Andi ^ permalink raw reply [flat|nested] 33+ messages in thread
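For reference, the syscall pair behind that suggestion has been available since Linux 2.6.39. A user-space sketch (error handling trimmed; open_by_handle_at() requires CAP_DAC_READ_SEARCH, and mount_fd is any open descriptor on the target filesystem, such as its mount point):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>

    /* Sketch: capture an opaque handle for a file once, then reopen
     * the file later without ever materializing its full pathname. */
    int reopen_by_handle(const char *path, int mount_fd)
    {
            struct file_handle *fh;
            int mount_id;

            fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
            if (!fh)
                    return -1;
            fh->handle_bytes = MAX_HANDLE_SZ;

            if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0)
                    return -1;

            /* ... later: no pathname needed, only the handle ... */
            return open_by_handle_at(mount_fd, fh, O_RDONLY);
    }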
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system 2013-05-29 20:43 ` J. Bruce Fields (?) @ 2013-05-29 21:01 ` Andi Kleen -1 siblings, 0 replies; 33+ messages in thread From: Andi Kleen @ 2013-05-29 21:01 UTC (permalink / raw) To: J. Bruce Fields Cc: Andi Kleen, Waiman Long, Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J On Wed, May 29, 2013 at 04:43:23PM -0400, J. Bruce Fields wrote: > On Wed, May 29, 2013 at 10:37:00PM +0200, Andi Kleen wrote: > > > As Dave said before, is the last path component sufficient? Or how > > > about an inode number? > > > > Neither works, the profiler needs to find the file and read it. > > > > inode searching would be incredible expensive, > > How fast does it need to be? Actually analyzing the profile is > something you only do once after everything's over, right? "perf top" does it live. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system @ 2013-05-29 21:19 ` Jörn Engel 0 siblings, 0 replies; 33+ messages in thread From: Jörn Engel @ 2013-05-29 21:19 UTC (permalink / raw) To: Andi Kleen Cc: J. Bruce Fields, Waiman Long, Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J On Wed, 29 May 2013 22:37:00 +0200, Andi Kleen wrote: > > > As Dave said before, is the last path component sufficient? Or how > > about an inode number? > > Neither works, the profiler needs to find the file and read it. Ignoring all the complexity this would cause downstream, you could do the path lookup just once, attach some cookie to it and return the cookie ever-after. Maybe some combination of i_sb and i_ino would be good enough as a cookie. Jörn -- Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming. -- Rob Pike ^ permalink raw reply [flat|nested] 33+ messages in thread
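A minimal sketch of the cookie Jörn suggests (field choices are illustrative; the invalidation question, i.e. what happens when the file is renamed or unlinked, is exactly where the follow-ups go):

    /* Illustrative only: identify a mapped file by (superblock device,
     * inode number) instead of re-deriving its full path on every
     * mmap event. */
    struct file_cookie {
            dev_t           dev;    /* from inode->i_sb->s_dev */
            unsigned long   ino;    /* from inode->i_ino */
    };

    static void file_cookie_fill(struct file_cookie *c,
                                 const struct inode *inode)
    {
            c->dev = inode->i_sb->s_dev;
            c->ino = inode->i_ino;
    }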
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system @ 2013-05-30 15:48 ` Waiman Long 0 siblings, 0 replies; 33+ messages in thread From: Waiman Long @ 2013-05-30 15:48 UTC (permalink / raw) To: Jörn Engel Cc: Andi Kleen, J. Bruce Fields, Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J On 05/29/2013 05:19 PM, Jörn Engel wrote: > On Wed, 29 May 2013 22:37:00 +0200, Andi Kleen wrote: >>> As Dave said before, is the last path component sufficient? Or how >>> about an inode number? >> Neither works, the profiler needs to find the file and read it. > Ignoring all the complexity this would cause downstream, you could do > the path lookup just once, attach some cookie to it and return the > cookie ever-after. Maybe some combination of i_sb and i_ino would be > good enough as a cookie. Still, this just shifts the complexity from the d_path() code to the perf kernel subsystem, as it would need to keep track of which paths have been sent up before. It also has complications when the tracked files are deleted or moved around in the filesystem; some kind of notification mechanism would have to be implemented in the dentry layer to notify the perf subsystem. Regards, Longman ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system @ 2013-05-30 15:11 ` Jörn Engel 0 siblings, 0 replies; 33+ messages in thread From: Jörn Engel @ 2013-05-30 15:11 UTC (permalink / raw) To: Waiman Long Cc: Andi Kleen, J. Bruce Fields, Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J On Thu, 30 May 2013 11:48:50 -0400, Waiman Long wrote: > On 05/29/2013 05:19 PM, Jörn Engel wrote: > >On Wed, 29 May 2013 22:37:00 +0200, Andi Kleen wrote: > >>>As Dave said before, is the last path component sufficient? Or how > >>>about an inode number? > >>Neither works, the profiler needs to find the file and read it. > >Ignoring all the complexity this would cause downstream, you could do > >the path lookup just once, attach some cookie to it and return the > >cookie ever-after. Maybe some combination of i_sb and i_ino would be > >good enough as a cookie. > > Still, it is just shifting the complexity from the d_path code to > the perf kernel subsystem as it needs to keep track of what paths > have been sent up before. That sounds like a good thing to have. Every single linux user depends on the dcache. Only a relatively small subset cares about perf. Having dcache pay the cost for perf's special needs is a classical externality. > It also have complications in case the > tracked files are being deleted or moved around in the filesystem. > Some kind of notification mechanism has to be implemented in the > dentry layer to notify the perf subsystem. Agreed. The whole approach is based on getting the 99% case right and occasionally being wrong. For perf this may be acceptable, not sure. Jörn -- Without a major sea change, nothing that is under copyright today will ever come out from under it and fall into the public domain. -- Jake Edge ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system 2013-05-29 20:37 ` Andi Kleen [not found] ` <20130529203700.GM6123-1g7Xle2YJi4/4alezvVtWx2eb7JE58TQ@public.gmane.org> 2013-05-29 21:19 ` Jörn Engel @ 2013-06-06 3:48 ` Dave Chinner 2 siblings, 0 replies; 33+ messages in thread From: Dave Chinner @ 2013-06-06 3:48 UTC (permalink / raw) To: Andi Kleen Cc: J. Bruce Fields, Waiman Long, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J On Wed, May 29, 2013 at 10:37:00PM +0200, Andi Kleen wrote: > > As Dave said before, is the last path component sufficient? Or how > > about an inode number? > > Neither works, the profiler needs to find the file and read it. > > inode searching would be incredible expensive, unless the file system > provided a "open-by-inode" primitive That's effectively what fs/fhandle.c gives you. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system 2013-05-29 18:46 ` J. Bruce Fields (?) (?) @ 2013-05-29 20:40 ` Waiman Long -1 siblings, 0 replies; 33+ messages in thread From: Waiman Long @ 2013-05-29 20:40 UTC (permalink / raw) To: J. Bruce Fields Cc: Dave Chinner, Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil, Steve French, Trond Myklebust, Eric Paris, linux-fsdevel, linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen On 05/29/2013 02:46 PM, J. Bruce Fields wrote: > On Wed, May 29, 2013 at 11:55:14AM -0400, Waiman Long wrote: >> On 05/26/2013 10:09 PM, Dave Chinner wrote: >>> On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote: >>>> On 05/23/2013 05:42 AM, Dave Chinner wrote: >>>>> What was it I said about this patchset when you posted it to speed >>>>> up an Oracle benchmark back in february? I'll repeat: >>>>> >>>>> "Nobody should be doing reverse dentry-to-name lookups in a quantity >>>>> sufficient for it to become a performance limiting factor." >>>> Thank for the comment, but my point is that it is the d_lock >>>> contention is skewing the data about how much spin lock contention >>>> had actually happened in the workload and it makes it harder to >>>> pinpoint problem areas to look at. This is not about performance, it >>>> is about accurate representation of performance data. Ideally, we >>>> want the overhead of turning on perf instrumentation to be as low as >>>> possible. >>> Right. But d_path will never be "low overhead", and as such it >>> shouldn't be used by perf. >> The d_path() is called by perf_event_mmap_event() which translates >> VMA to its file path for memory segments backed by files. As perf is >> not just for sampling data within the kernel, it can also be used >> for checking access pattern in the user space. As a result, it needs >> to map VMAs back to the backing files to access their symbols >> information. If d_path() is not the right function to call for this >> purpose, what other alternatives do we have? > As Dave said before, is the last path component sufficient? Or how > about an inode number? I don't think so. The user-space perf command will need the full pathname to locate the binaries and read their debug information. Just returning the last path component or the inode number will not cut it. Regards, Longman ^ permalink raw reply [flat|nested] 33+ messages in thread