* 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27
@ 2008-11-16 17:38 Rafael J. Wysocki
  2008-11-16 17:38 ` [Bug #11207] VolanoMark regression with 2.6.27-rc1 Rafael J. Wysocki
  ` (17 more replies)
  0 siblings, 18 replies; 191+ messages in thread

From: Rafael J. Wysocki @ 2008-11-16 17:38 UTC (permalink / raw)
To: Linux Kernel Mailing List
Cc: Andrew Morton, Natalie Protasevich, Kernel Testers List

[NOTE: I closed a number of Bugzilla entries dedicated to regressions
introduced between 2.6.26 and 2.6.27 that appeared to me to have been
fixed, or where the reporters had been totally unresponsive for extended
periods of time (given that they are notified every week ...).]

This message contains a list of some regressions introduced between 2.6.26
and 2.6.27, for which there are no fixes in the mainline that I know of.
If any of them have been fixed already, please let me know.

If you know of any other unresolved regressions introduced between 2.6.26
and 2.6.27, please let me know as well and I'll add them to the list.
Also, please let me know if any of the entries below are invalid.

Each entry from the list will additionally be sent in an automatic reply
to this message, with CCs to the people involved in reporting and handling
the issue.
Listed regressions statistics:

  Date         Total  Pending  Unresolved
  ----------------------------------------
  2008-11-16     199       18          14
  2008-11-09     196       28          23
  2008-11-02     195       34          28
  2008-10-26     190       34          29
  2008-10-04     181       41          33
  2008-09-27     173       35          28
  2008-09-21     169       45          36
  2008-09-15     163       46          32
  2008-09-12     163       51          38
  2008-09-07     150       43          33
  2008-08-30     135       48          36
  2008-08-23     122       48          40
  2008-08-16     103       47          37
  2008-08-10      80       52          31
  2008-08-02      47       31          20


Unresolved regressions
----------------------

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=12048
Subject    : Regression in bonding between 2.6.26.8 and 2.6.27.6
Submitter  : Jesper Krogh <jesper@krogh.cc>
Date       : 2008-11-16 9:41 (1 day old)
References : http://marc.info/?l=linux-kernel&m=122682977001048&w=4


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=12039
Subject    : Regression: USB/DVB 2.6.26.8 --> 2.6.27.6
Submitter  : David <david@unsolicited.net>
Date       : 2008-11-14 20:20 (3 days old)
References : http://marc.info/?l=linux-kernel&m=122669568022274&w=4


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11983
Subject    : iwlagn: wrong command queue 31, command id 0x0
Submitter  : Matt Mackall <mpm@selenic.com>
Date       : 2008-11-06 4:16 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122598672815803&w=4
             http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1703
Handled-By : reinette chatre <reinette.chatre@intel.com>


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11886
Subject    : without serial console system doesn't poweroff
Submitter  : Daniel Smolik <marvin@mydatex.cz>
Date       : 2008-10-29 04:06 (19 days old)


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11876
Subject    : RCU hang on cpu re-hotplug with 2.6.27rc8
Submitter  : Andi Kleen <andi@firstfloor.org>
Date       : 2008-10-06 23:28 (42 days old)
References : http://marc.info/?l=linux-kernel&m=122333610602399&w=2
Handled-By : Paul E. McKenney <paulmck@linux.vnet.ibm.com>


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11836
Subject    : Scheduler on C2D CPU and latest 2.6.27 kernel
Submitter  : Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date       : 2008-10-21 9:59 (27 days old)
References : http://marc.info/?l=linux-kernel&m=122458320502371&w=4
Handled-By : Chris Snook <csnook@redhat.com>


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11698
Subject    : 2.6.27-rc7, freezes with > 1 s2ram cycle
Submitter  : Soeren Sonnenburg <kernel@nn7.de>
Date       : 2008-09-29 11:29 (49 days old)
References : http://marc.info/?l=linux-kernel&m=122268780926859&w=4
Handled-By : Rafael J. Wysocki <rjw@sisk.pl>


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11664
Subject    : acpi errors and random freeze on sony vaio sr
Submitter  : Giovanni Pellerano <giovanni.pellerano@gmail.com>
Date       : 2008-09-28 03:48 (50 days old)


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11569
Subject    : Panic stop CPUs regression
Submitter  : Andi Kleen <andi@firstfloor.org>
Date       : 2008-09-02 13:49 (76 days old)
References : http://marc.info/?l=linux-kernel&m=122036356127282&w=4


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11543
Subject    : kernel panic: softlockup in tick_periodic() ???
Submitter  : Joshua Hoblitt <j_kernel@hoblitt.com>
Date       : 2008-09-11 16:46 (67 days old)
References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4
Handled-By : Thomas Gleixner <tglx@linutronix.de>
             Cyrill Gorcunov <gorcunov@gmail.com>
             Ingo Molnar <mingo@elte.hu>


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11404
Subject    : BUG: in 2.6.23-rc3-git7 in do_cciss_intr
Submitter  : rdunlap <randy.dunlap@oracle.com>
Date       : 2008-08-21 5:52 (88 days old)
References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4
             http://marc.info/?l=linux-kernel&m=121932889105368&w=4
Handled-By : Miller, Mike (OS Dev) <Mike.Miller@hp.com>
             James Bottomley <James.Bottomley@hansenpartnership.com>


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject    : tbench regression on each kernel release from 2.6.22 -> 2.6.28
Submitter  : Christoph Lameter <cl@linux-foundation.org>
Date       : 2008-08-11 18:36 (98 days old)
References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
             http://marc.info/?l=linux-kernel&m=122125737421332&w=4


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11215
Subject    : INFO: possible recursive locking detected ps2_command
Submitter  : Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date       : 2008-07-31 9:41 (109 days old)
References : http://marc.info/?l=linux-kernel&m=121749737011637&w=4


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11207
Subject    : VolanoMark regression with 2.6.27-rc1
Submitter  : Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Date       : 2008-07-31 3:20 (109 days old)
References : http://marc.info/?l=linux-kernel&m=121747464114335&w=4
Handled-By : Zhang, Yanmin <yanmin_zhang@linux.intel.com>
             Peter Zijlstra <a.p.zijlstra@chello.nl>
             Dhaval Giani <dhaval@linux.vnet.ibm.com>
             Miao Xie <miaox@cn.fujitsu.com>


Regressions with patches
------------------------

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11865
Subject    : WOL for E100 Doesn't Work Anymore
Submitter  : roger <rogerx@sdf.lonestar.org>
Date       : 2008-10-26 21:56 (22 days old)
Handled-By : Rafael J. Wysocki <rjw@sisk.pl>
Patch      : http://bugzilla.kernel.org/attachment.cgi?id=18646&action=view


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11843
Subject    : usb hdd problems with 2.6.27.2
Submitter  : Luciano Rocha <luciano@eurotux.com>
Date       : 2008-10-22 16:22 (26 days old)
References : http://marc.info/?l=linux-kernel&m=122469318102679&w=4
Handled-By : Luciano Rocha <luciano@eurotux.com>
Patch      : http://bugzilla.kernel.org/show_bug.cgi?id=11843#c26


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11805
Subject    : mounting XFS produces a segfault
Submitter  : Tiago Maluta <maluta_tiago@yahoo.com.br>
Date       : 2008-10-21 18:00 (27 days old)
Handled-By : Dave Chinner <dgc@sgi.com>
Patch      : http://bugzilla.kernel.org/attachment.cgi?id=18397&action=view


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11795
Subject    : ks959-sir dongle no longer works under 2.6.27 (REGRESSION)
Submitter  : Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Date       : 2008-10-20 10:49 (28 days old)
Handled-By : Samuel Ortiz <samuel@sortiz.org>
Patch      : http://bugzilla.kernel.org/show_bug.cgi?id=11795#c22


For details, please visit the bug entries and follow the links given in
the references.

As you can see, there is a Bugzilla entry for each of the listed
regressions.  There is also a Bugzilla entry used for tracking the
regressions introduced between 2.6.26 and 2.6.27, unresolved as well as
resolved, at:

http://bugzilla.kernel.org/show_bug.cgi?id=11167

Please let me know if there are any Bugzilla entries that should be added
to the list there.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 191+ messages in thread
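The statistics table in the report above can be read as a resolution rate: "Pending" entries are still open, so the regressions already resolved at each weekly snapshot are total minus pending. A minimal sketch of that reading (data copied from three rows of the table; the presentation is mine, not part of the original report tooling):

```python
# Resolved regressions per weekly snapshot = Total - Pending.
# Rows copied from the "Listed regressions statistics" table above.
snapshots = [
    ("2008-11-16", 199, 18),
    ("2008-11-09", 196, 28),
    ("2008-08-02", 47, 31),
]

resolved = {date: total - pending for date, total, pending in snapshots}

for date, total, pending in snapshots:
    share = 100.0 * resolved[date] / total
    print(f"{date}: {resolved[date]:3d}/{total} resolved ({share:.1f}%)")
```

The trend the table shows is that the total keeps growing while pending shrinks, i.e. most of the listed regressions eventually get fixed.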
* [Bug #11207] VolanoMark regression with 2.6.27-rc1
  2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki
@ 2008-11-16 17:38 ` Rafael J. Wysocki
  2008-11-16 17:40 ` [Bug #11215] INFO: possible recursive locking detected ps2_command Rafael J. Wysocki
  ` (16 subsequent siblings)
  17 siblings, 0 replies; 191+ messages in thread

From: Rafael J. Wysocki @ 2008-11-16 17:38 UTC (permalink / raw)
To: Linux Kernel Mailing List
Cc: Kernel Testers List, Dhaval Giani, Miao Xie, Peter Zijlstra, Zhang, Yanmin

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11207
Subject    : VolanoMark regression with 2.6.27-rc1
Submitter  : Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Date       : 2008-07-31 3:20 (109 days old)
References : http://marc.info/?l=linux-kernel&m=121747464114335&w=4
Handled-By : Zhang, Yanmin <yanmin_zhang@linux.intel.com>
             Peter Zijlstra <a.p.zijlstra@chello.nl>
             Dhaval Giani <dhaval@linux.vnet.ibm.com>
             Miao Xie <miaox@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 191+ messages in thread
* [Bug #11215] INFO: possible recursive locking detected ps2_command
  2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki
  2008-11-16 17:38 ` [Bug #11207] VolanoMark regression with 2.6.27-rc1 Rafael J. Wysocki
@ 2008-11-16 17:40 ` Rafael J. Wysocki
  2008-11-16 17:40 ` [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 Rafael J. Wysocki
  ` (15 subsequent siblings)
  17 siblings, 0 replies; 191+ messages in thread

From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Zdenek Kabelac

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11215
Subject    : INFO: possible recursive locking detected ps2_command
Submitter  : Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date       : 2008-07-31 9:41 (109 days old)
References : http://marc.info/?l=linux-kernel&m=121749737011637&w=4

^ permalink raw reply	[flat|nested] 191+ messages in thread
* [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki
  2008-11-16 17:38 ` [Bug #11207] VolanoMark regression with 2.6.27-rc1 Rafael J. Wysocki
  2008-11-16 17:40 ` [Bug #11215] INFO: possible recursive locking detected ps2_command Rafael J. Wysocki
@ 2008-11-16 17:40 ` Rafael J. Wysocki
  2008-11-17  9:06   ` Ingo Molnar
  2008-11-16 17:40 ` [Bug #11664] acpi errors and random freeze on sony vaio sr Rafael J. Wysocki
  ` (14 subsequent siblings)
  17 siblings, 1 reply; 191+ messages in thread

From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Christoph Lameter

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject    : tbench regression on each kernel release from 2.6.22 -> 2.6.28
Submitter  : Christoph Lameter <cl@linux-foundation.org>
Date       : 2008-08-11 18:36 (98 days old)
References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
             http://marc.info/?l=linux-kernel&m=122125737421332&w=4

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-16 17:40 ` [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 Rafael J. Wysocki
@ 2008-11-17  9:06 ` Ingo Molnar
  2008-11-17  9:14   ` David Miller
  2008-11-19 19:43   ` Christoph Lameter
  0 siblings, 2 replies; 191+ messages in thread

From: Ingo Molnar @ 2008-11-17 9:06 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linux Kernel Mailing List, Kernel Testers List, Christoph Lameter,
    Mike Galbraith, Peter Zijlstra

* Rafael J. Wysocki <rjw@sisk.pl> wrote:

> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.26 and 2.6.27.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.26 and 2.6.27. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11308
> Subject    : tbench regression on each kernel release from 2.6.22 -> 2.6.28
> Submitter  : Christoph Lameter <cl@linux-foundation.org>
> Date       : 2008-08-11 18:36 (98 days old)
> References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
>              http://marc.info/?l=linux-kernel&m=122125737421332&w=4

Christoph, as per the recent analysis of Mike:

  http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html

all scheduler components of this regression have been eliminated.

In fact his numbers show that scheduler speedups since 2.6.22 have
offset and hidden most other sources of tbench regression. (i.e. the
scheduler portion got 5% faster, hence it was able to offset a
slowdown of 5% in other areas of the kernel that tbench triggers)

	Ingo

^ permalink raw reply	[flat|nested] 191+ messages in thread
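The offset argument in the message above can be sketched numerically. This is an illustrative model with made-up cost units, not measured data: if the scheduler's share of per-transaction CPU cost shrinks by the same amount that other kernel paths grow, aggregate tbench throughput looks flat even though the non-scheduler paths regressed.

```python
# Illustrative sketch of "speedup in one subsystem hides a regression in
# another".  Units are arbitrary CPU cost per tbench transaction; the
# 10/90 split and the 5-unit shifts are assumptions for the example.

def total_cost(sched_cost, other_cost):
    # Total CPU cost per transaction: scheduler work plus everything else.
    return sched_cost + other_cost

cost_old = total_cost(sched_cost=10.0, other_cost=90.0)   # hypothetical 2.6.22
cost_new = total_cost(sched_cost=5.0, other_cost=95.0)    # sched faster, rest slower

# Net cost (and hence throughput) is unchanged: the regression is hidden.
assert cost_old == cost_new
```

This is why per-subsystem profiling, rather than end-to-end throughput alone, is needed to localize where the cycles actually went.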
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-17  9:06 ` Ingo Molnar
@ 2008-11-17  9:14 ` David Miller
  2008-11-17 11:01   ` Ingo Molnar
  2008-11-19 19:43 ` Christoph Lameter
  1 sibling, 1 reply; 191+ messages in thread

From: David Miller @ 2008-11-17 9:14 UTC (permalink / raw)
To: mingo
Cc: rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 10:06:48 +0100

> * Rafael J. Wysocki <rjw@sisk.pl> wrote:
>
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.26 and 2.6.27.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.26 and 2.6.27. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11308
> > Subject    : tbench regression on each kernel release from 2.6.22 -> 2.6.28
> > Submitter  : Christoph Lameter <cl@linux-foundation.org>
> > Date       : 2008-08-11 18:36 (98 days old)
> > References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> >              http://marc.info/?l=linux-kernel&m=122125737421332&w=4
>
> Christoph, as per the recent analysis of Mike:
>
>   http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
>
> all scheduler components of this regression have been eliminated.
>
> In fact his numbers show that scheduler speedups since 2.6.22 have
> offset and hidden most other sources of tbench regression. (i.e. the
> scheduler portion got 5% faster, hence it was able to offset a
> slowdown of 5% in other areas of the kernel that tbench triggers)

Although I respect the improvements, wake_up() is still several orders
of magnitude slower than it was in 2.6.22, and wake_up() is at the top
of the profiles in tbench runs.

It really is premature to close this regression at this time.  I am
working with every spare moment I have to try and nail this stuff, but
unless someone else helps me, people need to be patient.

^ permalink raw reply	[flat|nested] 191+ messages in thread
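Claims like "wake_up() is at the top of the profiles" are settled by bucketing raw NMI-profiler hit counts by subsystem and comparing shares of total samples. A minimal sketch of that bookkeeping, using a few hit counts of the kind posted in this thread; the function-to-subsystem mapping here is an illustrative assumption, not the exact bucketing anyone in the thread used:

```python
# Bucket raw profiler hits by subsystem and report each bucket's share of
# total samples.  The counts are a small excerpt of the tbench NMI profile
# posted in this thread; the mapping below is an assumed classification.
profile_hits = {
    "copy_user_generic_string": 1494803,
    "sock_rfree": 998232,
    "tcp_ack": 491471,
    "avc_has_perm_noaudit": 375469,
    "schedule": 235631,
}
subsystem_of = {
    "copy_user_generic_string": "usercopy",
    "sock_rfree": "NET",
    "tcp_ack": "NET",
    "avc_has_perm_noaudit": "security",
    "schedule": "sched",
}

total = sum(profile_hits.values())
buckets = {}
for fn, hits in profile_hits.items():
    sub = subsystem_of[fn]
    buckets[sub] = buckets.get(sub, 0) + hits

for sub, hits in sorted(buckets.items(), key=lambda kv: -kv[1]):
    print(f"{sub:9s} overhead ({hits:8d}/{total}): {100.0 * hits / total:5.2f}%")
```

Within this excerpt the networking functions dominate the scheduler entry by a wide margin, which is the shape of argument the follow-up in the thread makes with the full profile.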
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-17  9:14 ` David Miller
@ 2008-11-17 11:01 ` Ingo Molnar
  2008-11-17 11:20   ` Eric Dumazet
  2008-11-17 19:21   ` David Miller
  0 siblings, 2 replies; 191+ messages in thread

From: Ingo Molnar @ 2008-11-17 11:01 UTC (permalink / raw)
To: David Miller
Cc: rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra,
    Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 13750 bytes --]

* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 17 Nov 2008 10:06:48 +0100
>
> > * Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >
> > > This message has been generated automatically as a part of a report
> > > of regressions introduced between 2.6.26 and 2.6.27.
> > >
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.26 and 2.6.27. Please verify if it still should
> > > be listed and let me know (either way).
> > >
> > >
> > > Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11308
> > > Subject    : tbench regression on each kernel release from 2.6.22 -> 2.6.28
> > > Submitter  : Christoph Lameter <cl@linux-foundation.org>
> > > Date       : 2008-08-11 18:36 (98 days old)
> > > References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> > >              http://marc.info/?l=linux-kernel&m=122125737421332&w=4
> >
> > Christoph, as per the recent analysis of Mike:
> >
> >   http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
> >
> > all scheduler components of this regression have been eliminated.
> >
> > In fact his numbers show that scheduler speedups since 2.6.22 have
> > offset and hidden most other sources of tbench regression. (i.e. the
> > scheduler portion got 5% faster, hence it was able to offset a
> > slowdown of 5% in other areas of the kernel that tbench triggers)
>
> Although I respect the improvements, wake_up() is still several
> orders of magnitude slower than it was in 2.6.22 and wake_up() is at
> the top of the profiles in tbench runs.

hm, several orders of magnitude slower?  That contradicts Mike's numbers
and my own numbers and profiles as well: see below.  The scheduler's
overhead barely even registers on a 16-way x86 system i'm running tbench
on.

Here's the NMI profile during 64-thread tbench on a 16-way x86 box with
a v2.6.28-rc5 kernel [config attached]:

 Throughput 3437.65 MB/sec 64 procs
 ==================================

 21570252 total
 ........
  1494803 copy_user_generic_string
   998232 sock_rfree
   491471 tcp_ack
   482405 ip_dont_fragment
   470685 ip_local_deliver
   436325 constant_test_bit        [ called by napi_disable_pending() ]
   375469 avc_has_perm_noaudit
   347663 tcp_sendmsg
   310383 tcp_recvmsg
   300412 __inet_lookup_established
   294377 system_call
   286603 tcp_transmit_skb
   251782 selinux_ip_postroute
   236028 tcp_current_mss
   235631 schedule
   234013 netif_rx
   229854 _local_bh_enable_ip
   219501 tcp_v4_rcv
   [ etc. - see full profile attached further below ]

Note that the scheduler does not even show up in the profile up to
entry #15!

I've also summarized the NMI profiler output by major subsystems:

  NET overhead       (12603450/21570252): 58.43%
  security overhead  ( 1903598/21570252):  8.83%
  usercopy overhead  ( 1753617/21570252):  8.13%
  sched overhead     ( 1599406/21570252):  7.41%
  syscall overhead   (  560487/21570252):  2.60%
  IRQ overhead       (  555439/21570252):  2.58%
  slab overhead      (  492421/21570252):  2.28%
  timer overhead     (  226573/21570252):  1.05%
  pagealloc overhead (  192681/21570252):  0.89%
  PID overhead       (  115123/21570252):  0.53%
  VFS overhead       (  107926/21570252):  0.50%
  pagecache overhead (   62552/21570252):  0.29%
  gtod overhead      (   38651/21570252):  0.18%
  IDLE overhead      (       0/21570252):  0.00%
  ---------------------------------------------------
  left               ( 1349494/21570252):  6.26%

The scheduler's functions are absolutely flat, and consistent with an
extreme context-switching rate of 1.35 million per second.  The
scheduler can go up to about 20 million context switches per second on
this system:

 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
  r  b   swpd     free   buff  cache   si   so    bi    bo     in       cs us sy id wa st
 32  0      0 32229696  29308 649880    0    0     0     0 164135 20026853 24 76  0  0  0
 32  0      0 32229752  29308 649880    0    0     0     0 164203 20032770 24 76  0  0  0
 32  0      0 32229752  29308 649880    0    0     0     0 164201 20036492 25 75  0  0  0

... and 7% scheduling overhead is roughly consistent with 1.35/20.0.

Wake-up affinities and data-flow caching are just fine in this workload
- we've got scheduler statistics for that and they look good too.  It
all looks like pure old-fashioned straight overhead in the networking
layer to me.  Do we still touch the same global cacheline for every
localhost packet we process?  Anything like that would show up big time.

Anyway, in terms of scheduling there's absolutely nothing anomalous i
can see about this workload.  Scheduling looks healthy throughout - and
the few things we noticed causing unnecessary overhead are now fixed in
-rc5.  (but it's all in the <5% range of impact of total scheduling
overhead - i.e. in the 0.4% absolute range in this workload)

And the thing is, the scheduler's task in this workload is by far the
most difficult one conceptually: it has to manage and optimize
concurrency of _future_ processing, with an event frequency that is
_WAY_ out of the normal patterns: more than 1.3 million context switches
per second (!).  It also switches to/from completely independent
contexts of computing, with all the implications that this brings.

Networking and VFS "just" have to shuffle around bits in memory along a
very specific plan given to them by user-space.  That plan is
well-specified and goes along the lines of: "copy this (already cached)
file content to that socket" and back.

By the raw throughput figures the system is pushing a couple of million
data packets per second.  Still we spend 7 times more CPU time in the
networking code than in the scheduler or in the user-copy code.  Why?

	Ingo

------------------------->

21570252 total ........ 1494803 copy_user_generic_string 998232 sock_rfree 491471 tcp_ack 482405 ip_dont_fragment 470685 ip_local_deliver 436325 constant_test_bit 375469 avc_has_perm_noaudit 347663 tcp_sendmsg 310383 tcp_recvmsg 300412 __inet_lookup_established 294377 system_call 286603 tcp_transmit_skb 251782 selinux_ip_postroute 236028 tcp_current_mss 235631 schedule 234013 netif_rx 229854 _local_bh_enable_ip 219501 tcp_v4_rcv 210046 netlbl_enabled 205022 constant_test_bit 199598 skb_release_head_state 187952 ip_queue_xmit 178779 tcp_established_options 175955 dev_queue_xmit 169904 netif_receive_skb 166629 ip_finish_output2 162291 sysret_check 151262 __switch_to 143355 audit_syscall_entry 142694 load_cr3 136571 memset_c 136115 nf_hook_slow 130825 ip_local_deliver_finish 128795 ip_rcv 125995 selinux_socket_sock_rcv_skb 123944 net_rx_action 123100 __copy_skb_header 122052 __inet_lookup 121744 constant_test_bit 119444 get_page_from_freelist 116486 avc_has_perm 115643 audit_syscall_exit 115123 find_pid_ns 114483 tcp_cleanup_rbuf 111350
tcp_rcv_established 109853 __mod_timer 107891 lock_sock_nested 107316 napi_disable_pending 106581 release_sock 104402 skb_copy_datagram_iovec 101591 __tcp_push_pending_frames 101206 tcp_event_data_recv 98046 kmem_cache_alloc_node 97982 tcp_v4_do_rcv 92714 sys_recvfrom 91551 rb_erase 89730 kfree 87979 ip_rcv_finish 87166 compare_ether_addr 86982 selinux_parse_skb 86731 nf_iterate 79690 selinux_ipv4_output 79347 __cache_free 78992 audit_free_names 78127 skb_release_data 77501 mod_timer 77241 __sock_recvmsg 77228 sock_recvmsg 77211 ____cache_alloc 76495 tcp_rcv_space_adjust 75283 sk_wait_data 71772 sys_sendto 71594 sched_clock 70880 eth_type_trans 70238 memcpy_toiovec 69193 do_softirq 68341 __update_sched_clock 67597 tcp_v4_md5_lookup 67424 try_to_wake_up 64465 sock_common_recvmsg 64116 put_prev_task_fair 63964 process_backlog 62216 __do_softirq 62093 tcp_cwnd_validate 61128 __alloc_skb 60588 put_page 59536 dput 58411 __ip_local_out 56349 avc_audit 55626 __napi_schedule 55525 selinux_ipv4_postroute 54499 __enqueue_entity 53599 local_bh_disable 53418 unroll_tree_refs 53162 __unlazy_fpu 53084 cfs_rq_of 52475 set_next_entity 51108 thread_return 50458 ip_output 50268 sched_clock_cpu 49974 tcp_send_delayed_ack 49736 ip_finish_output 49670 finish_task_switch 49070 ___swab16 48499 audit_get_context 48347 raw_local_deliver 47824 tcp_rtt_estimator 46707 tcp_push 46405 constant_test_bit 45859 select_task_rq_fair 45188 math_state_restore 44889 check_preempt_wakeup 44449 task_rq_lock 43704 sel_netif_sid 43377 sock_sendmsg 42612 sk_reset_timer 42606 __skb_clone 42223 __find_general_cachep 41950 selinux_socket_sendmsg 41716 constant_test_bit 41097 skb_push 40723 lock_sock 40715 system_call_after_swapgs 40399 selinux_netlbl_inode_permission 40179 rb_insert_color 40021 __kfree_skb 40015 sockfd_lookup_light 39216 internal_add_timer 39024 skb_can_coalesce 38838 __tcp_select_window 38651 current_kernel_time 38533 tcp_v4_md5_do_lookup 38372 __sock_sendmsg 38162 selinux_socket_recvmsg 
37812 sel_netport_sid 37727 account_group_exec_runtime 37695 switch_mm 36247 nf_hook_thresh 36057 auditsys 35266 pick_next_task_fair 35064 __tcp_ack_snd_check 35052 sock_def_readable 34826 sysret_careful 34578 _local_bh_enable 34498 free_hot_cold_page 34338 kmap 34028 loopback_xmit 33320 sk_stream_alloc_skb 33269 test_ti_thread_flag 33219 skb_fill_page_desc 33049 tcp_is_cwnd_limited 33012 update_min_vruntime 32431 native_read_tsc 32398 dst_release 31661 get_pageblock_flags_group 31652 path_put 31516 tcp_push_pending_frames 31265 netif_needs_gso 31175 constant_test_bit 31077 __cycles_2_ns 30971 socket_has_perm 30893 __phys_addr 30867 lock_timer_base 30585 __wake_up 30456 ret_from_sys_call 30147 skb_release_all 29356 local_bh_enable 29334 __skb_insert 28681 tcp_cwnd_test 28652 __skb_dequeue 28612 prepare_to_wait 28268 kmem_cache_free 28193 set_bit 28149 dequeue_task_fair 27906 skb_header_pointer 27861 sys_kill 27803 selinux_task_kill 27627 audit_free_aux 27600 selinux_netlbl_sock_rcv_skb 26794 update_curr 26777 __alloc_pages_internal 26469 skb_entail 26458 pskb_may_pull 26216 inet_ehashfn 26075 call_softirq 26033 copy_from_user 25933 __local_bh_disable 25666 fget_light 25270 inet_csk_reset_xmit_timer 25071 signal_pending_state 24117 tcp_init_tso_segs 24109 TCP_ECN_check_ce 23702 nf_hook_thresh 23558 copy_to_user 23426 sysret_audit 23267 sk_wake_async 22627 tcp_options_write 22174 netif_tx_queue_stopped 21795 tcp_prequeue_process 21757 tcp_set_skb_tso_segs 21579 avc_hash 21565 ___swab16 21560 ip_local_out 21445 sk_wmem_schedule 21234 get_page 21200 __wake_up_common 21042 sel_netnode_find 20772 sock_put 20625 schedule_timeout 20613 __napi_complete 20563 fput_light 20532 tcp_bound_to_half_wnd 19912 cap_task_kill 19773 sysret_signal 19374 compound_head 19121 get_seconds 19048 PageLRU 18893 zone_watermark_ok 18635 tcp_snd_wnd_test 18634 enqueue_task_fair 18603 rb_next 18598 next_zones_zonelist 18534 resched_task 17820 hash_64 17801 autoremove_wake_function 17451 
__skb_queue_before 17283 native_load_tls 17227 __skb_dequeue 17149 xfrm4_policy_check 16942 zone_statistics 16886 skb_reset_network_header 16824 ___swab16 16725 pskb_may_pull 16645 dev_hard_start_xmit 16580 sk_filter 16523 tcp_ca_event 16479 tcp_win_from_space 16408 tcp_parse_aligned_timestamp 16204 finish_wait 16124 virt_to_slab 15965 tcp_v4_send_check 15920 skb_reset_transport_header 15867 tcp_data_snd_check 15819 security_sock_rcv_skb 15665 tcp_ack_saw_tstamp 15621 skb_network_offset 15568 virt_to_head_page 15553 dst_confirm 15320 skb_pull 15277 clear_bit 15179 alloc_pages_current 14991 bictcp_acked 14743 tcp_store_ts_recent 14660 sel_netnode_sid 14650 __xchg 14573 task_has_perm 14561 tcp_v4_check 14492 net_invalid_timestamp 14485 security_socket_recvmsg 14363 __dequeue_entity 14318 pid_nr_ns 14311 device_not_available 14212 local_bh_enable_ip 14092 virt_to_cache 13804 netpoll_rx 13781 fcheck_files 13724 tcp_adjust_fackets_out 13717 net_timestamp 13638 ___swab16 13576 sel_netport_find 13563 __kmalloc_node 13530 __inc_zone_state 13215 pid_vnr 13208 free_pages_check 13008 security_socket_sendmsg 12971 ip_skb_dst_mtu 12827 __cpu_set 12782 bictcp_cong_avoid 12779 test_tsk_thread_flag 12734 wakeup_preempt_entity 12651 sel_netif_find 12545 skb_set_owner_r 12534 skb_headroom 12348 tcp_event_new_data_sent 12251 place_entity 12047 set_bit 11805 update_rq_clock 11788 detach_timer 11659 policy_zonelist 11423 skb_clone 11380 __skb_queue_tail 11249 dequeue_task 10823 init_rootdomain 10690 __cpu_clear 10558 default_wake_function 10556 tcp_rcv_rtt_measure_ts 10451 PageSlab 10427 sock_wfree 10277 calc_delta_fair 10237 tcp_validate_incoming 10218 task_rq_unlock 10023 page_get_cache [-- Attachment #2: config --] [-- Type: text/plain, Size: 72924 bytes --] # # Automatically generated make config: don't edit # Linux kernel version: 2.6.28-rc5 # Mon Nov 17 11:59:36 2008 # CONFIG_64BIT=y # CONFIG_X86_32 is not set CONFIG_X86_64=y CONFIG_X86=y 
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=20
# CONFIG_CGROUPS is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
# CONFIG_GROUP_SCHED is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
# CONFIG_MARKERS is not set
CONFIG_OPROFILE=m
CONFIG_OPROFILE_IBS=y
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_BLK_DEV_BSG is not set
# CONFIG_BLK_DEV_INTEGRITY is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_CLASSIC_RCU=y
CONFIG_FREEZER=y

#
# Processor type and features
#
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_VSMP is not set
# CONFIG_PARAVIRT_GUEST is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=128
CONFIG_X86_INTERNODE_CACHE_BYTES=128
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR_64=y
CONFIG_X86_DS=y
CONFIG_X86_PTRACE_BTS=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
# CONFIG_AMD_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_NR_CPUS=255
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=m
CONFIG_MICROCODE_INTEL=y
# CONFIG_MICROCODE_AMD is not set
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_MIGRATION=y
CONFIG_RESOURCES_64BIT=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
CONFIG_MMU_NOTIFIER=y
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
# CONFIG_X86_PAT is not set
# CONFIG_EFI is not set
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
# CONFIG_SCHED_HRTICK is not set
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y

#
# Power management and ACPI options
#
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_SYSFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=m
CONFIG_ACPI_BATTERY=m
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_WMI is not set
CONFIG_ACPI_ASUS=m
CONFIG_ACPI_TOSHIBA=m
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_SBS=m

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_POWERNOW_K8_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
# CONFIG_X86_SPEEDSTEP_LIB is not set
# CONFIG_CPU_IDLE is not set

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=m
CONFIG_PCIEAER=y
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
# CONFIG_PCI_MSI is not set
CONFIG_PCI_LEGACY=y
# CONFIG_PCI_DEBUG is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_PCMCIA_IOCTL=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
CONFIG_PD6729=m
CONFIG_I82092=m
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_FAKE=m
CONFIG_HOTPLUG_PCI_ACPI=m
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
# CONFIG_HOTPLUG_PCI_CPCI is not set
CONFIG_HOTPLUG_PCI_SHPC=m

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
# CONFIG_XFRM_SUB_POLICY is not set
CONFIG_XFRM_MIGRATE=y
# CONFIG_XFRM_STATISTICS is not set
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_NET_KEY_MIGRATE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE=m
CONFIG_NET_IPGRE_BROADCAST=y
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET_LRO=m
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=y
CONFIG_TCP_CONG_CUBIC=m
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
CONFIG_TCP_CONG_VEGAS=m
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
CONFIG_TCP_CONG_VENO=m
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_DEFAULT_BIC=y
# CONFIG_DEFAULT_CUBIC is not set
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="bic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION=m
CONFIG_IPV6_SIT=m
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_DEBUG=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK=y
CONFIG_NF_CT_ACCT=y
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_EVENTS=y
CONFIG_NF_CT_PROTO_DCCP=m
CONFIG_NF_CT_PROTO_GRE=m
CONFIG_NF_CT_PROTO_SCTP=m
# CONFIG_NF_CT_PROTO_UDPLITE is not set
CONFIG_NF_CONNTRACK_AMANDA=m
CONFIG_NF_CONNTRACK_FTP=m
CONFIG_NF_CONNTRACK_H323=m
CONFIG_NF_CONNTRACK_IRC=m
CONFIG_NF_CONNTRACK_NETBIOS_NS=m
CONFIG_NF_CONNTRACK_PPTP=m
CONFIG_NF_CONNTRACK_SANE=m
CONFIG_NF_CONNTRACK_SIP=m
CONFIG_NF_CONNTRACK_TFTP=m
# CONFIG_NF_CT_NETLINK is not set
# CONFIG_NETFILTER_TPROXY is not set
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_NOTRACK=m
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_TRACE is not set
CONFIG_NETFILTER_XT_TARGET_SECMARK=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
# CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP is not set
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
# CONFIG_NETFILTER_XT_MATCH_CONNLIMIT is not set
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
CONFIG_NETFILTER_XT_MATCH_REALM=m
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
CONFIG_IP_VS=m
# CONFIG_IP_VS_IPV6 is not set
# CONFIG_IP_VS_DEBUG is not set
CONFIG_IP_VS_TAB_BITS=12

#
# IPVS transport protocol load balancing support
#
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y
CONFIG_IP_VS_PROTO_AH=y

#
# IPVS scheduler
#
CONFIG_IP_VS_RR=m
CONFIG_IP_VS_WRR=m
CONFIG_IP_VS_LC=m
CONFIG_IP_VS_WLC=m
CONFIG_IP_VS_LBLC=m
CONFIG_IP_VS_LBLCR=m
CONFIG_IP_VS_DH=m
CONFIG_IP_VS_SH=m
CONFIG_IP_VS_SED=m
CONFIG_IP_VS_NQ=m

#
# IPVS application helper
#
CONFIG_IP_VS_FTP=m

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
CONFIG_NF_CONNTRACK_IPV4=m
CONFIG_NF_CONNTRACK_PROC_COMPAT=y
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_NF_NAT=m
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_NF_NAT_SNMP_BASIC=m
CONFIG_NF_NAT_PROTO_DCCP=m
CONFIG_NF_NAT_PROTO_GRE=m
CONFIG_NF_NAT_PROTO_SCTP=m
CONFIG_NF_NAT_FTP=m
CONFIG_NF_NAT_IRC=m
CONFIG_NF_NAT_TFTP=m
CONFIG_NF_NAT_AMANDA=m
CONFIG_NF_NAT_PPTP=m
CONFIG_NF_NAT_H323=m
CONFIG_NF_NAT_SIP=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
# CONFIG_IP_NF_SECURITY is not set
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_MH=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_RAW=m
# CONFIG_IP6_NF_SECURITY is not set

#
# DECnet: Netfilter Configuration
#
# CONFIG_DECNET_NF_GRABULATOR is not set
CONFIG_BRIDGE_NF_EBTABLES=m
CONFIG_BRIDGE_EBT_BROUTE=m
CONFIG_BRIDGE_EBT_T_FILTER=m
CONFIG_BRIDGE_EBT_T_NAT=m
CONFIG_BRIDGE_EBT_802_3=m
CONFIG_BRIDGE_EBT_AMONG=m
CONFIG_BRIDGE_EBT_ARP=m
CONFIG_BRIDGE_EBT_IP=m
# CONFIG_BRIDGE_EBT_IP6 is not set
CONFIG_BRIDGE_EBT_LIMIT=m
CONFIG_BRIDGE_EBT_MARK=m
CONFIG_BRIDGE_EBT_PKTTYPE=m
CONFIG_BRIDGE_EBT_STP=m
CONFIG_BRIDGE_EBT_VLAN=m
CONFIG_BRIDGE_EBT_ARPREPLY=m
CONFIG_BRIDGE_EBT_DNAT=m
CONFIG_BRIDGE_EBT_MARK_T=m
CONFIG_BRIDGE_EBT_REDIRECT=m
CONFIG_BRIDGE_EBT_SNAT=m
CONFIG_BRIDGE_EBT_LOG=m
CONFIG_BRIDGE_EBT_ULOG=m
# CONFIG_BRIDGE_EBT_NFLOG is not set
CONFIG_IP_DCCP=m
CONFIG_INET_DCCP_DIAG=m
CONFIG_IP_DCCP_ACKVEC=y

#
# DCCP CCIDs Configuration (EXPERIMENTAL)
#
CONFIG_IP_DCCP_CCID2=m
# CONFIG_IP_DCCP_CCID2_DEBUG is not set
CONFIG_IP_DCCP_CCID3=m
# CONFIG_IP_DCCP_CCID3_DEBUG is not set
CONFIG_IP_DCCP_CCID3_RTO=100
CONFIG_IP_DCCP_TFRC_LIB=m

#
# DCCP Kernel Hacking
#
# CONFIG_IP_DCCP_DEBUG is not set
# CONFIG_NET_DCCPPROBE is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
CONFIG_TIPC=m
# CONFIG_TIPC_ADVANCED is not set
# CONFIG_TIPC_DEBUG is not set
CONFIG_ATM=m
CONFIG_ATM_CLIP=m
# CONFIG_ATM_CLIP_NO_ICMP is not set
CONFIG_ATM_LANE=m
# CONFIG_ATM_MPOA is not set
CONFIG_ATM_BR2684=m
# CONFIG_ATM_BR2684_IPFILTER is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
# CONFIG_NET_DSA is not set
CONFIG_VLAN_8021Q=m
# CONFIG_VLAN_8021Q_GVRP is not set
CONFIG_DECNET=m
CONFIG_DECNET_ROUTER=y
CONFIG_LLC=y
# CONFIG_LLC2 is not set
CONFIG_IPX=m
# CONFIG_IPX_INTERN is not set
CONFIG_ATALK=m
CONFIG_DEV_APPLETALK=m
CONFIG_IPDDP=m
CONFIG_IPDDP_ENCAP=y
CONFIG_IPDDP_DECAP=y
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
CONFIG_WAN_ROUTER=m
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
CONFIG_NET_SCH_PRIO=m
# CONFIG_NET_SCH_MULTIQ is not set
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_INGRESS=m

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_IPT=m
# CONFIG_NET_ACT_NAT is not set
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
# CONFIG_NET_ACT_SKBEDIT is not set
CONFIG_NET_CLS_IND=y
CONFIG_NET_SCH_FIFO=y

#
# Network testing
#
CONFIG_NET_PKTGEN=m
# CONFIG_NET_TCPPROBE is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
CONFIG_IRDA=m

#
# IrDA protocols
#
CONFIG_IRLAN=m
CONFIG_IRNET=m
CONFIG_IRCOMM=m
# CONFIG_IRDA_ULTRA is not set

#
# IrDA options
#
CONFIG_IRDA_CACHE_LAST_LSAP=y
CONFIG_IRDA_FAST_RR=y
# CONFIG_IRDA_DEBUG is not set

#
# Infrared-port device drivers
#

#
# SIR device drivers
#
CONFIG_IRTTY_SIR=m

#
# Dongle support
#
CONFIG_DONGLE=y
CONFIG_ESI_DONGLE=m
CONFIG_ACTISYS_DONGLE=m
CONFIG_TEKRAM_DONGLE=m
CONFIG_TOIM3232_DONGLE=m
CONFIG_LITELINK_DONGLE=m
CONFIG_MA600_DONGLE=m
CONFIG_GIRBIL_DONGLE=m
CONFIG_MCP2120_DONGLE=m
CONFIG_OLD_BELKIN_DONGLE=m
CONFIG_ACT200L_DONGLE=m
# CONFIG_KINGSUN_DONGLE is not set
# CONFIG_KSDAZZLE_DONGLE is not set
# CONFIG_KS959_DONGLE is not set

#
# FIR device drivers
#
CONFIG_USB_IRDA=m
CONFIG_SIGMATEL_FIR=m
CONFIG_NSC_FIR=m
CONFIG_WINBOND_FIR=m
CONFIG_SMC_IRCC_FIR=m
CONFIG_ALI_FIR=m
CONFIG_VLSI_FIR=m
CONFIG_VIA_FIR=m
CONFIG_MCS_FIR=m
CONFIG_BT=m
CONFIG_BT_L2CAP=m
CONFIG_BT_SCO=m
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_HIDP=m

#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=m
CONFIG_BT_HCIUSB_SCO=y
# CONFIG_BT_HCIBTUSB is not set
# CONFIG_BT_HCIBTSDIO is not set
CONFIG_BT_HCIUART=m
CONFIG_BT_HCIUART_H4=y
CONFIG_BT_HCIUART_BCSP=y
# CONFIG_BT_HCIUART_LL is not set
CONFIG_BT_HCIBCM203X=m
CONFIG_BT_HCIBPA10X=m
CONFIG_BT_HCIBFUSB=m
CONFIG_BT_HCIDTL1=m
CONFIG_BT_HCIBT3C=m
CONFIG_BT_HCIBLUECARD=m
CONFIG_BT_HCIBTUART=m
CONFIG_BT_HCIVHCI=m
# CONFIG_AF_RXRPC is not set
# CONFIG_PHONET is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
CONFIG_WIRELESS_OLD_REGULATORY=y
CONFIG_WIRELESS_EXT=y
CONFIG_WIRELESS_EXT_SYSFS=y
# CONFIG_MAC80211 is not set
CONFIG_IEEE80211=m
# CONFIG_IEEE80211_DEBUG is not set
CONFIG_IEEE80211_CRYPT_WEP=m
CONFIG_IEEE80211_CRYPT_CCMP=m
CONFIG_IEEE80211_CRYPT_TKIP=m
CONFIG_RFKILL=m
# CONFIG_RFKILL_INPUT is not set
CONFIG_RFKILL_LEDS=y
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
# CONFIG_PARPORT_PC_FIFO is not set
# CONFIG_PARPORT_PC_SUPERIO is not set
CONFIG_PARPORT_PC_PCMCIA=m
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PARPORT_NOT_PC=y
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=m
# CONFIG_PARIDE is not set
CONFIG_BLK_CPQ_DA=y
CONFIG_BLK_CPQ_CISS_DA=m
CONFIG_CISS_SCSI_TAPE=y
CONFIG_BLK_DEV_DAC960=m
CONFIG_BLK_DEV_UMEM=m
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_SX8=m
CONFIG_BLK_DEV_UB=m
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
CONFIG_ATA_OVER_ETH=m
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_EEPROM_93CX6 is not set
CONFIG_SGI_IOC4=m
CONFIG_TIFM_CORE=m
CONFIG_TIFM_7XX1=m
# CONFIG_ACER_WMI is not set
CONFIG_ASUS_LAPTOP=m
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_ICS932S401 is not set
CONFIG_MSI_LAPTOP=m
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
CONFIG_SONY_LAPTOP=m
# CONFIG_SONYPI_COMPAT is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_SGI_XP is not set
# CONFIG_HP_ILO is not set
# CONFIG_SGI_GRU is not set
# CONFIG_C2PORT is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_RAID_ATTRS=m
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
CONFIG_CHR_DEV_SCH=m

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=m
# CONFIG_SCSI_FC_TGT_ATTRS is not set
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
# CONFIG_SCSI_SAS_LIBSAS is not set
CONFIG_SCSI_SRP_ATTRS=m
# CONFIG_SCSI_SRP_TGT_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_BLK_DEV_3W_XXXX_RAID=m
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_ACARD=m
CONFIG_SCSI_AACRAID=m
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_SCSI_AIC79XX=m
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
CONFIG_SCSI_ARCMSR=m
# CONFIG_SCSI_ARCMSR_AER is not set
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=m
CONFIG_MEGARAID_MAILBOX=m
CONFIG_MEGARAID_LEGACY=m
CONFIG_MEGARAID_SAS=m
CONFIG_SCSI_HPTIOP=m
CONFIG_SCSI_BUSLOGIC=m
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
CONFIG_SCSI_GDTH=m
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
CONFIG_SCSI_PPA=m
CONFIG_SCSI_IMM=m
# CONFIG_SCSI_IZIP_EPP16 is not set
# CONFIG_SCSI_IZIP_SLOW_CTR is not set
# CONFIG_SCSI_MVSAS is not set
CONFIG_SCSI_STEX=m
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_MMIO=y
# CONFIG_SCSI_IPR is not set
CONFIG_SCSI_QLOGIC_1280=m
CONFIG_SCSI_QLA_FC=m
CONFIG_SCSI_QLA_ISCSI=m
CONFIG_SCSI_LPFC=m
CONFIG_SCSI_DC395x=m
CONFIG_SCSI_DC390T=m
# CONFIG_SCSI_DEBUG is not set
CONFIG_SCSI_SRP=m
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
# CONFIG_SCSI_DH is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y
CONFIG_SATA_SVW=m
CONFIG_ATA_PIIX=y
CONFIG_SATA_MV=m
CONFIG_SATA_NV=y
CONFIG_PDC_ADMA=m
CONFIG_SATA_QSTOR=m
CONFIG_SATA_PROMISE=m
CONFIG_SATA_SX4=m
CONFIG_SATA_SIL=m
CONFIG_SATA_SIS=m
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
CONFIG_SATA_VITESSE=m
CONFIG_SATA_INIC162X=m
# CONFIG_PATA_ACPI is not set
CONFIG_PATA_ALI=m
CONFIG_PATA_AMD=y
CONFIG_PATA_ARTOP=m
CONFIG_PATA_ATIIXP=m
# CONFIG_PATA_CMD640_PCI is not set
CONFIG_PATA_CMD64X=m
CONFIG_PATA_CS5520=m
CONFIG_PATA_CS5530=m
CONFIG_PATA_CYPRESS=m
CONFIG_PATA_EFAR=m
CONFIG_ATA_GENERIC=m
CONFIG_PATA_HPT366=m
CONFIG_PATA_HPT37X=m
CONFIG_PATA_HPT3X2N=m
CONFIG_PATA_HPT3X3=m
# CONFIG_PATA_HPT3X3_DMA is not set
CONFIG_PATA_IT821X=m
CONFIG_PATA_IT8213=m
CONFIG_PATA_JMICRON=m
CONFIG_PATA_TRIFLEX=m
CONFIG_PATA_MARVELL=m
CONFIG_PATA_MPIIX=m
CONFIG_PATA_OLDPIIX=y
CONFIG_PATA_NETCELL=m
# CONFIG_PATA_NINJA32 is not set
CONFIG_PATA_NS87410=m
# CONFIG_PATA_NS87415 is not set
CONFIG_PATA_OPTI=m
CONFIG_PATA_OPTIDMA=m
CONFIG_PATA_PCMCIA=m
CONFIG_PATA_PDC_OLD=m
CONFIG_PATA_RADISYS=m
CONFIG_PATA_RZ1000=m
CONFIG_PATA_SC1200=m
CONFIG_PATA_SERVERWORKS=m
CONFIG_PATA_PDC2027X=m
CONFIG_PATA_SIL680=m
CONFIG_PATA_SIS=m
CONFIG_PATA_VIA=m
CONFIG_PATA_WINBOND=m
# CONFIG_PATA_SCH is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
CONFIG_MD_RAID5_RESHAPE=y
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_BLK_DEV_DM=m
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_MIRROR=m
CONFIG_DM_ZERO=m
CONFIG_DM_MULTIPATH=m
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
CONFIG_FUSION=y
CONFIG_FUSION_SPI=m
CONFIG_FUSION_FC=m
CONFIG_FUSION_SAS=m
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=m
CONFIG_FUSION_LAN=m
# CONFIG_FUSION_LOGGING is not set

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
# CONFIG_FIREWIRE is not set
CONFIG_IEEE1394=m
CONFIG_IEEE1394_OHCI1394=m
CONFIG_IEEE1394_PCILYNX=m
CONFIG_IEEE1394_SBP2=m
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
CONFIG_IEEE1394_ETH1394_ROM_ENTRY=y
CONFIG_IEEE1394_ETH1394=m
CONFIG_IEEE1394_RAWIO=m
CONFIG_IEEE1394_VIDEO1394=m
CONFIG_IEEE1394_DV1394=m
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
CONFIG_I2O=m
# CONFIG_I2O_LCT_NOTIFY_ON_CHANGES is not set
CONFIG_I2O_EXT_ADAPTEC=y
CONFIG_I2O_EXT_ADAPTEC_DMA64=y
# CONFIG_I2O_CONFIG is not set
CONFIG_I2O_BUS=m
CONFIG_I2O_BLOCK=m
CONFIG_I2O_SCSI=m
CONFIG_I2O_PROC=m
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_IFB=m
CONFIG_DUMMY=m
CONFIG_BONDING=m
# CONFIG_MACVLAN is not set
CONFIG_EQUALIZER=m
CONFIG_TUN=m
# CONFIG_VETH is not set
CONFIG_NET_SB1000=m
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_BROADCOM_PHY=m
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_FIXED_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
CONFIG_HAPPYMEAL=m
CONFIG_SUNGEM=m
CONFIG_CASSINI=m
CONFIG_NET_VENDOR_3COM=y
CONFIG_VORTEX=y
CONFIG_TYPHOON=m
CONFIG_NET_TULIP=y
CONFIG_DE2104X=m
CONFIG_TULIP=m
# CONFIG_TULIP_MWI is not set
CONFIG_TULIP_MMIO=y
# CONFIG_TULIP_NAPI is not set
CONFIG_DE4X5=m
CONFIG_WINBOND_840=m
CONFIG_DM9102=m
CONFIG_ULI526X=m
CONFIG_PCMCIA_XIRCOM=m
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
CONFIG_AMD8111_ETH=m
CONFIG_ADAPTEC_STARFIRE=m
CONFIG_B44=m
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
CONFIG_FORCEDETH=y
CONFIG_FORCEDETH_NAPI=y
# CONFIG_EEPRO100 is not set
CONFIG_E100=y
CONFIG_FEALNX=m
CONFIG_NATSEMI=m
CONFIG_NE2K_PCI=m
CONFIG_8139CP=m
CONFIG_8139TOO=y
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R6040 is not set
CONFIG_SIS900=m
CONFIG_EPIC100=m
CONFIG_SUNDANCE=m
# CONFIG_SUNDANCE_MMIO is not set
# CONFIG_TLAN is not set
CONFIG_VIA_RHINE=m
CONFIG_VIA_RHINE_MMIO=y
CONFIG_SC92031=m
CONFIG_NET_POCKET=y
CONFIG_ATP=m
CONFIG_DE600=m
CONFIG_DE620=m
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
CONFIG_ACENIC=m
# CONFIG_ACENIC_OMIT_TIGON_I is not set
CONFIG_DL2K=m
CONFIG_E1000=y
CONFIG_E1000E=y
# CONFIG_IP1000 is not set
# CONFIG_IGB is not set
CONFIG_NS83820=m
CONFIG_HAMACHI=m
CONFIG_YELLOWFIN=m
CONFIG_R8169=m
CONFIG_R8169_VLAN=y
# CONFIG_SIS190 is not set
CONFIG_SKGE=m
# CONFIG_SKGE_DEBUG is not set
CONFIG_SKY2=m
# CONFIG_SKY2_DEBUG is not set
CONFIG_VIA_VELOCITY=m
CONFIG_TIGON3=y
CONFIG_BNX2=m
CONFIG_QLA3XXX=m
CONFIG_ATL1=m
# CONFIG_ATL1E is not set
# CONFIG_JME is not set
CONFIG_NETDEV_10000=y
CONFIG_CHELSIO_T1=m
CONFIG_CHELSIO_T1_1G=y
CONFIG_CHELSIO_T3=m
# CONFIG_ENIC is not set
# CONFIG_IXGBE is not set
CONFIG_IXGB=m
CONFIG_S2IO=m
CONFIG_MYRI10GE=m
CONFIG_NETXEN_NIC=m
# CONFIG_NIU is not set
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_TEHUTI is not set
# CONFIG_BNX2X is not set
# CONFIG_QLGE is not set
# CONFIG_SFC is not set
CONFIG_TR=y
CONFIG_IBMOL=m
CONFIG_3C359=m
# CONFIG_TMS380TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set
# CONFIG_IWLWIFI_LEDS is not set

#
# USB Network Adapters
#
CONFIG_USB_CATC=m
CONFIG_USB_KAWETH=m
CONFIG_USB_PEGASUS=m
CONFIG_USB_RTL8150=m
CONFIG_USB_USBNET=m
CONFIG_USB_NET_AX8817X=m
CONFIG_USB_NET_CDCETHER=m
CONFIG_USB_NET_DM9601=m
# CONFIG_USB_NET_SMSC95XX is not set
CONFIG_USB_NET_GL620A=m
CONFIG_USB_NET_NET1080=m
CONFIG_USB_NET_PLUSB=m
CONFIG_USB_NET_MCS7830=m
CONFIG_USB_NET_RNDIS_HOST=m
CONFIG_USB_NET_CDC_SUBSET=m
CONFIG_USB_ALI_M5632=y
CONFIG_USB_AN2720=y
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
CONFIG_USB_EPSON2888=y
CONFIG_USB_KC2190=y
CONFIG_USB_NET_ZAURUS=m
# CONFIG_USB_HSO is not set
CONFIG_NET_PCMCIA=y
CONFIG_PCMCIA_3C589=m
CONFIG_PCMCIA_3C574=m
CONFIG_PCMCIA_FMVJ18X=m
CONFIG_PCMCIA_PCNET=m
CONFIG_PCMCIA_NMCLAN=m
CONFIG_PCMCIA_SMC91C92=m
CONFIG_PCMCIA_XIRC2PS=m
CONFIG_PCMCIA_AXNET=m
# CONFIG_PCMCIA_IBMTR is not set
# CONFIG_WAN is not set
CONFIG_ATM_DRIVERS=y
# CONFIG_ATM_DUMMY is not set
CONFIG_ATM_TCP=m
CONFIG_ATM_LANAI=m
CONFIG_ATM_ENI=m
# CONFIG_ATM_ENI_DEBUG is not set
# CONFIG_ATM_ENI_TUNE_BURST is not set
CONFIG_ATM_FIRESTREAM=m
# CONFIG_ATM_ZATM is not set
CONFIG_ATM_IDT77252=m
# CONFIG_ATM_IDT77252_DEBUG is not set
# CONFIG_ATM_IDT77252_RCV_ALL is not set
CONFIG_ATM_IDT77252_USE_SUNI=y
CONFIG_ATM_AMBASSADOR=m
# CONFIG_ATM_AMBASSADOR_DEBUG is not set
CONFIG_ATM_HORIZON=m
# CONFIG_ATM_HORIZON_DEBUG is not set
# CONFIG_ATM_IA is not set
# CONFIG_ATM_FORE200E is not set
CONFIG_ATM_HE=m
# CONFIG_ATM_HE_USE_SUNI is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
CONFIG_SKFP=m
# CONFIG_HIPPI is not set
CONFIG_PLIP=m
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
# CONFIG_PPP_BSDCOMP is not set
CONFIG_PPP_MPPE=m
CONFIG_PPPOE=m
CONFIG_PPPOATM=m
# CONFIG_PPPOL2TP is not set
CONFIG_SLIP=m
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLHC=m
CONFIG_SLIP_SMART=y
# CONFIG_SLIP_MODE_SLIP6 is not set
CONFIG_NET_FC=y
CONFIG_NETCONSOLE=y
# CONFIG_NETCONSOLE_DYNAMIC is not set
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
CONFIG_KEYBOARD_STOWAWAY=m
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_SERIAL=m
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
CONFIG_MOUSE_VSXXXAA=m
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_ANALOG=m
CONFIG_JOYSTICK_A3D=m
CONFIG_JOYSTICK_ADI=m
CONFIG_JOYSTICK_COBRA=m
CONFIG_JOYSTICK_GF2K=m
CONFIG_JOYSTICK_GRIP=m
CONFIG_JOYSTICK_GRIP_MP=m
CONFIG_JOYSTICK_GUILLEMOT=m
CONFIG_JOYSTICK_INTERACT=m
CONFIG_JOYSTICK_SIDEWINDER=m
CONFIG_JOYSTICK_TMDC=m
CONFIG_JOYSTICK_IFORCE=m
CONFIG_JOYSTICK_IFORCE_USB=y
CONFIG_JOYSTICK_IFORCE_232=y
CONFIG_JOYSTICK_WARRIOR=m
CONFIG_JOYSTICK_MAGELLAN=m
CONFIG_JOYSTICK_SPACEORB=m
CONFIG_JOYSTICK_SPACEBALL=m
CONFIG_JOYSTICK_STINGER=m
CONFIG_JOYSTICK_TWIDJOY=m
# CONFIG_JOYSTICK_ZHENHUA is not set
CONFIG_JOYSTICK_DB9=m
CONFIG_JOYSTICK_GAMECON=m
CONFIG_JOYSTICK_TURBOGRAFX=m
CONFIG_JOYSTICK_JOYDUMP=m
# CONFIG_JOYSTICK_XPAD is not set
# CONFIG_INPUT_TABLET is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_FUJITSU is not set
CONFIG_TOUCHSCREEN_GUNZE=m
CONFIG_TOUCHSCREEN_ELO=m
CONFIG_TOUCHSCREEN_MTOUCH=m
# CONFIG_TOUCHSCREEN_INEXIO is not set
CONFIG_TOUCHSCREEN_MK712=m
CONFIG_TOUCHSCREEN_PENMOUNT=m
CONFIG_TOUCHSCREEN_TOUCHRIGHT=m
CONFIG_TOUCHSCREEN_TOUCHWIN=m
# CONFIG_TOUCHSCREEN_WM97XX is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_APANEL is not set
CONFIG_INPUT_ATLAS_BTNS=m
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=m

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_GAMEPORT=m
CONFIG_GAMEPORT_NS558=m
CONFIG_GAMEPORT_L4=m
CONFIG_GAMEPORT_EMU10K1=m
CONFIG_GAMEPORT_FM801=m

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
# CONFIG_ROCKETPORT is not set
CONFIG_CYCLADES=m
# CONFIG_CYZ_INTR is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
CONFIG_SYNCLINK=m
CONFIG_SYNCLINKMP=m
CONFIG_SYNCLINK_GT=m
CONFIG_N_HDLC=m
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
# CONFIG_SX is not set
# CONFIG_RIO is not set
# CONFIG_STALDRV is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CS=m
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_JSM=m
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
CONFIG_IPMI_HANDLER=m
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_NVRAM=y
CONFIG_R3964=m
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
# CONFIG_IPWIRELESS is not set
CONFIG_MWAVE=m
CONFIG_PC8736x_GPIO=m
CONFIG_NSC_GPIO=m
# CONFIG_RAW_DRIVER is not set
# CONFIG_HPET is not set
CONFIG_HANGCHECK_TIMER=m
CONFIG_TCG_TPM=m
CONFIG_TCG_TIS=m
CONFIG_TCG_NSC=m
CONFIG_TCG_ATMEL=m
CONFIG_TCG_INFINEON=m
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
CONFIG_I2C_AMD756=m
# CONFIG_I2C_AMD756_S4882 is not set
CONFIG_I2C_AMD8111=m
CONFIG_I2C_I801=m
# CONFIG_I2C_ISCH is not set
CONFIG_I2C_PIIX4=y
CONFIG_I2C_NFORCE2=y
# CONFIG_I2C_NFORCE2_S4985 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
CONFIG_I2C_SIS96X=m
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_SIMTEC is not set

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Graphics adapter I2C/DDC channel drivers
#
CONFIG_I2C_VOODOO3=m

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_PCA_PLATFORM is not set
CONFIG_I2C_STUB=m

#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
# CONFIG_AT24 is not set
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
# CONFIG_PCF8575 is not set
# CONFIG_SENSORS_PCA9539 is not set
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_MAX6875=m
#
CONFIG_SENSORS_TSL2550 is not set # CONFIG_I2C_DEBUG_CORE is not set # CONFIG_I2C_DEBUG_ALGO is not set # CONFIG_I2C_DEBUG_BUS is not set # CONFIG_I2C_DEBUG_CHIP is not set # CONFIG_SPI is not set CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y # CONFIG_GPIOLIB is not set CONFIG_W1=m CONFIG_W1_CON=y # # 1-wire Bus Masters # CONFIG_W1_MASTER_MATROX=m CONFIG_W1_MASTER_DS2490=m CONFIG_W1_MASTER_DS2482=m # # 1-wire Slaves # CONFIG_W1_SLAVE_THERM=m CONFIG_W1_SLAVE_SMEM=m CONFIG_W1_SLAVE_DS2433=m CONFIG_W1_SLAVE_DS2433_CRC=y # CONFIG_W1_SLAVE_DS2760 is not set # CONFIG_W1_SLAVE_BQ27000 is not set CONFIG_POWER_SUPPLY=y # CONFIG_POWER_SUPPLY_DEBUG is not set # CONFIG_PDA_POWER is not set # CONFIG_BATTERY_DS2760 is not set # CONFIG_BATTERY_BQ27x00 is not set CONFIG_HWMON=y CONFIG_HWMON_VID=m CONFIG_SENSORS_ABITUGURU=m # CONFIG_SENSORS_ABITUGURU3 is not set # CONFIG_SENSORS_AD7414 is not set # CONFIG_SENSORS_AD7418 is not set CONFIG_SENSORS_ADM1021=m CONFIG_SENSORS_ADM1025=m CONFIG_SENSORS_ADM1026=m CONFIG_SENSORS_ADM1029=m CONFIG_SENSORS_ADM1031=m CONFIG_SENSORS_ADM9240=m # CONFIG_SENSORS_ADT7462 is not set # CONFIG_SENSORS_ADT7470 is not set # CONFIG_SENSORS_ADT7473 is not set CONFIG_SENSORS_K8TEMP=m CONFIG_SENSORS_ASB100=m CONFIG_SENSORS_ATXP1=m CONFIG_SENSORS_DS1621=m # CONFIG_SENSORS_I5K_AMB is not set CONFIG_SENSORS_F71805F=m # CONFIG_SENSORS_F71882FG is not set # CONFIG_SENSORS_F75375S is not set CONFIG_SENSORS_FSCHER=m CONFIG_SENSORS_FSCPOS=m # CONFIG_SENSORS_FSCHMD is not set CONFIG_SENSORS_GL518SM=m CONFIG_SENSORS_GL520SM=m # CONFIG_SENSORS_CORETEMP is not set # CONFIG_SENSORS_IBMAEM is not set # CONFIG_SENSORS_IBMPEX is not set CONFIG_SENSORS_IT87=m CONFIG_SENSORS_LM63=m CONFIG_SENSORS_LM75=m CONFIG_SENSORS_LM77=m CONFIG_SENSORS_LM78=m CONFIG_SENSORS_LM80=m CONFIG_SENSORS_LM83=m CONFIG_SENSORS_LM85=m CONFIG_SENSORS_LM87=m CONFIG_SENSORS_LM90=m CONFIG_SENSORS_LM92=m # CONFIG_SENSORS_LM93 is not set CONFIG_SENSORS_MAX1619=m # CONFIG_SENSORS_MAX6650 is not set 
CONFIG_SENSORS_PC87360=m CONFIG_SENSORS_PC87427=m CONFIG_SENSORS_SIS5595=m # CONFIG_SENSORS_DME1737 is not set CONFIG_SENSORS_SMSC47M1=m CONFIG_SENSORS_SMSC47M192=m CONFIG_SENSORS_SMSC47B397=m # CONFIG_SENSORS_ADS7828 is not set # CONFIG_SENSORS_THMC50 is not set CONFIG_SENSORS_VIA686A=m CONFIG_SENSORS_VT1211=m CONFIG_SENSORS_VT8231=m CONFIG_SENSORS_W83781D=m CONFIG_SENSORS_W83791D=m CONFIG_SENSORS_W83792D=m CONFIG_SENSORS_W83793=m CONFIG_SENSORS_W83L785TS=m # CONFIG_SENSORS_W83L786NG is not set CONFIG_SENSORS_W83627HF=m CONFIG_SENSORS_W83627EHF=m CONFIG_SENSORS_HDAPS=m # CONFIG_SENSORS_LIS3LV02D is not set # CONFIG_SENSORS_APPLESMC is not set # CONFIG_HWMON_DEBUG_CHIP is not set CONFIG_THERMAL=y # CONFIG_THERMAL_HWMON is not set CONFIG_WATCHDOG=y # CONFIG_WATCHDOG_NOWAYOUT is not set # # Watchdog Device Drivers # CONFIG_SOFT_WATCHDOG=m # CONFIG_ACQUIRE_WDT is not set # CONFIG_ADVANTECH_WDT is not set CONFIG_ALIM1535_WDT=m CONFIG_ALIM7101_WDT=m # CONFIG_SC520_WDT is not set # CONFIG_EUROTECH_WDT is not set # CONFIG_IB700_WDT is not set CONFIG_IBMASR=m # CONFIG_WAFER_WDT is not set CONFIG_I6300ESB_WDT=m CONFIG_ITCO_WDT=m CONFIG_ITCO_VENDOR_SUPPORT=y # CONFIG_IT8712F_WDT is not set # CONFIG_IT87_WDT is not set # CONFIG_HP_WATCHDOG is not set # CONFIG_SC1200_WDT is not set CONFIG_PC87413_WDT=m # CONFIG_60XX_WDT is not set # CONFIG_SBC8360_WDT is not set # CONFIG_CPU5_WDT is not set # CONFIG_SMSC37B787_WDT is not set CONFIG_W83627HF_WDT=m CONFIG_W83697HF_WDT=m # CONFIG_W83697UG_WDT is not set CONFIG_W83877F_WDT=m CONFIG_W83977F_WDT=m CONFIG_MACHZ_WDT=m # CONFIG_SBC_EPX_C3_WATCHDOG is not set # # PCI-based Watchdog Cards # CONFIG_PCIPCWATCHDOG=m CONFIG_WDTPCI=m CONFIG_WDT_501_PCI=y # # USB-based Watchdog Cards # CONFIG_USBPCWATCHDOG=m CONFIG_SSB_POSSIBLE=y # # Sonics Silicon Backplane # CONFIG_SSB=m CONFIG_SSB_SPROM=y CONFIG_SSB_PCIHOST_POSSIBLE=y CONFIG_SSB_PCIHOST=y # CONFIG_SSB_B43_PCI_BRIDGE is not set CONFIG_SSB_PCMCIAHOST_POSSIBLE=y # CONFIG_SSB_PCMCIAHOST is not 
set # CONFIG_SSB_DEBUG is not set CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y CONFIG_SSB_DRIVER_PCICORE=y # # Multifunction device drivers # # CONFIG_MFD_CORE is not set CONFIG_MFD_SM501=m # CONFIG_HTC_PASIC3 is not set # CONFIG_MFD_TMIO is not set # CONFIG_PMIC_DA903X is not set # CONFIG_MFD_WM8400 is not set # CONFIG_MFD_WM8350_I2C is not set # CONFIG_REGULATOR is not set # # Multimedia devices # # # Multimedia core support # # CONFIG_VIDEO_DEV is not set CONFIG_DVB_CORE=m CONFIG_VIDEO_MEDIA=m # # Multimedia drivers # # CONFIG_MEDIA_ATTACH is not set CONFIG_MEDIA_TUNER=m # CONFIG_MEDIA_TUNER_CUSTOMIZE is not set CONFIG_MEDIA_TUNER_SIMPLE=m CONFIG_MEDIA_TUNER_TDA8290=m CONFIG_MEDIA_TUNER_TDA9887=m CONFIG_MEDIA_TUNER_TEA5761=m CONFIG_MEDIA_TUNER_TEA5767=m CONFIG_MEDIA_TUNER_MT20XX=m CONFIG_MEDIA_TUNER_XC2028=m CONFIG_MEDIA_TUNER_XC5000=m CONFIG_DVB_CAPTURE_DRIVERS=y # # Supported SAA7146 based PCI Adapters # # CONFIG_TTPCI_EEPROM is not set # CONFIG_DVB_BUDGET_CORE is not set # # Supported USB Adapters # # CONFIG_DVB_USB is not set CONFIG_DVB_TTUSB_BUDGET=m CONFIG_DVB_TTUSB_DEC=m # CONFIG_DVB_SIANO_SMS1XXX is not set # # Supported FlexCopII (B2C2) Adapters # CONFIG_DVB_B2C2_FLEXCOP=m CONFIG_DVB_B2C2_FLEXCOP_PCI=m CONFIG_DVB_B2C2_FLEXCOP_USB=m # CONFIG_DVB_B2C2_FLEXCOP_DEBUG is not set # # Supported BT878 Adapters # # # Supported Pluto2 Adapters # CONFIG_DVB_PLUTO2=m # # Supported SDMC DM1105 Adapters # # CONFIG_DVB_DM1105 is not set # # Supported DVB Frontends # # # Customise DVB Frontends # # CONFIG_DVB_FE_CUSTOMISE is not set # # DVB-S (satellite) frontends # CONFIG_DVB_CX24110=m CONFIG_DVB_CX24123=m CONFIG_DVB_MT312=m CONFIG_DVB_S5H1420=m # CONFIG_DVB_STV0288 is not set # CONFIG_DVB_STB6000 is not set CONFIG_DVB_STV0299=m CONFIG_DVB_TDA8083=m CONFIG_DVB_TDA10086=m CONFIG_DVB_VES1X93=m CONFIG_DVB_TUNER_ITD1000=m CONFIG_DVB_TDA826X=m CONFIG_DVB_TUA6100=m # CONFIG_DVB_CX24116 is not set # CONFIG_DVB_SI21XX is not set # # DVB-T (terrestrial) frontends # CONFIG_DVB_SP8870=m 
CONFIG_DVB_SP887X=m CONFIG_DVB_CX22700=m CONFIG_DVB_CX22702=m # CONFIG_DVB_DRX397XD is not set CONFIG_DVB_L64781=m CONFIG_DVB_TDA1004X=m CONFIG_DVB_NXT6000=m CONFIG_DVB_MT352=m CONFIG_DVB_ZL10353=m CONFIG_DVB_DIB3000MB=m CONFIG_DVB_DIB3000MC=m CONFIG_DVB_DIB7000M=m CONFIG_DVB_DIB7000P=m # CONFIG_DVB_TDA10048 is not set # # DVB-C (cable) frontends # CONFIG_DVB_VES1820=m CONFIG_DVB_TDA10021=m # CONFIG_DVB_TDA10023 is not set CONFIG_DVB_STV0297=m # # ATSC (North American/Korean Terrestrial/Cable DTV) frontends # CONFIG_DVB_NXT200X=m CONFIG_DVB_OR51211=m CONFIG_DVB_OR51132=m CONFIG_DVB_BCM3510=m CONFIG_DVB_LGDT330X=m # CONFIG_DVB_S5H1409 is not set # CONFIG_DVB_AU8522 is not set # CONFIG_DVB_S5H1411 is not set # # Digital terrestrial only tuners/PLL # CONFIG_DVB_PLL=m CONFIG_DVB_TUNER_DIB0070=m # # SEC control devices for DVB-S # CONFIG_DVB_LNBP21=m # CONFIG_DVB_ISL6405 is not set CONFIG_DVB_ISL6421=m # CONFIG_DVB_LGS8GL5 is not set # # Tools to develop new frontends # # CONFIG_DVB_DUMMY_FE is not set # CONFIG_DVB_AF9013 is not set # CONFIG_DAB is not set # # Graphics support # CONFIG_AGP=y CONFIG_AGP_AMD64=y CONFIG_AGP_INTEL=y CONFIG_AGP_SIS=y CONFIG_AGP_VIA=y CONFIG_DRM=m CONFIG_DRM_TDFX=m CONFIG_DRM_R128=m CONFIG_DRM_RADEON=m CONFIG_DRM_I810=m CONFIG_DRM_I830=m CONFIG_DRM_I915=m CONFIG_DRM_MGA=m # CONFIG_DRM_SIS is not set CONFIG_DRM_VIA=m CONFIG_DRM_SAVAGE=m CONFIG_VGASTATE=m # CONFIG_VIDEO_OUTPUT_CONTROL is not set CONFIG_FB=y # CONFIG_FIRMWARE_EDID is not set CONFIG_FB_DDC=m CONFIG_FB_BOOT_VESA_SUPPORT=y CONFIG_FB_CFB_FILLRECT=m CONFIG_FB_CFB_COPYAREA=m CONFIG_FB_CFB_IMAGEBLIT=m # CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set # CONFIG_FB_SYS_FILLRECT is not set # CONFIG_FB_SYS_COPYAREA is not set # CONFIG_FB_SYS_IMAGEBLIT is not set # CONFIG_FB_FOREIGN_ENDIAN is not set # CONFIG_FB_SYS_FOPS is not set CONFIG_FB_SVGALIB=m # CONFIG_FB_MACMODES is not set CONFIG_FB_BACKLIGHT=y CONFIG_FB_MODE_HELPERS=y CONFIG_FB_TILEBLITTING=y # # Frame buffer hardware drivers # # 
CONFIG_FB_CIRRUS is not set # CONFIG_FB_PM2 is not set # CONFIG_FB_CYBER2000 is not set # CONFIG_FB_ARC is not set # CONFIG_FB_ASILIANT is not set # CONFIG_FB_IMSTT is not set # CONFIG_FB_VGA16 is not set # CONFIG_FB_UVESA is not set # CONFIG_FB_VESA is not set # CONFIG_FB_N411 is not set # CONFIG_FB_HGA is not set # CONFIG_FB_S1D13XXX is not set CONFIG_FB_NVIDIA=m CONFIG_FB_NVIDIA_I2C=y # CONFIG_FB_NVIDIA_DEBUG is not set CONFIG_FB_NVIDIA_BACKLIGHT=y CONFIG_FB_RIVA=m # CONFIG_FB_RIVA_I2C is not set # CONFIG_FB_RIVA_DEBUG is not set CONFIG_FB_RIVA_BACKLIGHT=y # CONFIG_FB_LE80578 is not set CONFIG_FB_INTEL=m # CONFIG_FB_INTEL_DEBUG is not set CONFIG_FB_INTEL_I2C=y CONFIG_FB_MATROX=m CONFIG_FB_MATROX_MILLENIUM=y CONFIG_FB_MATROX_MYSTIQUE=y CONFIG_FB_MATROX_G=y CONFIG_FB_MATROX_I2C=m CONFIG_FB_MATROX_MAVEN=m CONFIG_FB_MATROX_MULTIHEAD=y # CONFIG_FB_RADEON is not set CONFIG_FB_ATY128=m CONFIG_FB_ATY128_BACKLIGHT=y CONFIG_FB_ATY=m CONFIG_FB_ATY_CT=y CONFIG_FB_ATY_GENERIC_LCD=y CONFIG_FB_ATY_GX=y CONFIG_FB_ATY_BACKLIGHT=y CONFIG_FB_S3=m CONFIG_FB_SAVAGE=m CONFIG_FB_SAVAGE_I2C=y CONFIG_FB_SAVAGE_ACCEL=y # CONFIG_FB_SIS is not set # CONFIG_FB_VIA is not set CONFIG_FB_NEOMAGIC=m CONFIG_FB_KYRO=m CONFIG_FB_3DFX=m CONFIG_FB_3DFX_ACCEL=y CONFIG_FB_VOODOO1=m # CONFIG_FB_VT8623 is not set CONFIG_FB_TRIDENT=m CONFIG_FB_TRIDENT_ACCEL=y # CONFIG_FB_ARK is not set # CONFIG_FB_PM3 is not set # CONFIG_FB_CARMINE is not set # CONFIG_FB_GEODE is not set CONFIG_FB_SM501=m # CONFIG_FB_VIRTUAL is not set # CONFIG_FB_METRONOME is not set # CONFIG_FB_MB862XX is not set CONFIG_BACKLIGHT_LCD_SUPPORT=y CONFIG_LCD_CLASS_DEVICE=m # CONFIG_LCD_ILI9320 is not set # CONFIG_LCD_PLATFORM is not set CONFIG_BACKLIGHT_CLASS_DEVICE=y # CONFIG_BACKLIGHT_CORGI is not set CONFIG_BACKLIGHT_PROGEAR=m # CONFIG_BACKLIGHT_MBP_NVIDIA is not set # CONFIG_BACKLIGHT_SAHARA is not set # # Display device support # # CONFIG_DISPLAY_SUPPORT is not set # # Console display driver support # CONFIG_VGA_CONSOLE=y 
CONFIG_VGACON_SOFT_SCROLLBACK=y CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64 CONFIG_DUMMY_CONSOLE=y # CONFIG_FRAMEBUFFER_CONSOLE is not set CONFIG_FONT_8x16=y CONFIG_LOGO=y # CONFIG_LOGO_LINUX_MONO is not set # CONFIG_LOGO_LINUX_VGA16 is not set CONFIG_LOGO_LINUX_CLUT224=y CONFIG_SOUND=m CONFIG_SOUND_OSS_CORE=y CONFIG_SND=m CONFIG_SND_TIMER=m CONFIG_SND_PCM=m CONFIG_SND_HWDEP=m CONFIG_SND_RAWMIDI=m CONFIG_SND_SEQUENCER=m CONFIG_SND_SEQ_DUMMY=m CONFIG_SND_OSSEMUL=y CONFIG_SND_MIXER_OSS=m CONFIG_SND_PCM_OSS=m CONFIG_SND_PCM_OSS_PLUGINS=y CONFIG_SND_SEQUENCER_OSS=y CONFIG_SND_DYNAMIC_MINORS=y # CONFIG_SND_SUPPORT_OLD_API is not set CONFIG_SND_VERBOSE_PROCFS=y # CONFIG_SND_VERBOSE_PRINTK is not set # CONFIG_SND_DEBUG is not set CONFIG_SND_VMASTER=y CONFIG_SND_MPU401_UART=m CONFIG_SND_OPL3_LIB=m CONFIG_SND_VX_LIB=m CONFIG_SND_AC97_CODEC=m CONFIG_SND_DRIVERS=y CONFIG_SND_DUMMY=m CONFIG_SND_VIRMIDI=m # CONFIG_SND_MTPAV is not set CONFIG_SND_MTS64=m # CONFIG_SND_SERIAL_U16550 is not set CONFIG_SND_MPU401=m CONFIG_SND_PORTMAN2X4=m CONFIG_SND_AC97_POWER_SAVE=y CONFIG_SND_AC97_POWER_SAVE_DEFAULT=0 CONFIG_SND_SB_COMMON=m CONFIG_SND_PCI=y CONFIG_SND_AD1889=m CONFIG_SND_ALS300=m CONFIG_SND_ALS4000=m CONFIG_SND_ALI5451=m CONFIG_SND_ATIIXP=m CONFIG_SND_ATIIXP_MODEM=m CONFIG_SND_AU8810=m CONFIG_SND_AU8820=m CONFIG_SND_AU8830=m # CONFIG_SND_AW2 is not set CONFIG_SND_AZT3328=m CONFIG_SND_BT87X=m # CONFIG_SND_BT87X_OVERCLOCK is not set CONFIG_SND_CA0106=m CONFIG_SND_CMIPCI=m # CONFIG_SND_OXYGEN is not set CONFIG_SND_CS4281=m CONFIG_SND_CS46XX=m CONFIG_SND_CS46XX_NEW_DSP=y # CONFIG_SND_CS5530 is not set CONFIG_SND_DARLA20=m CONFIG_SND_GINA20=m CONFIG_SND_LAYLA20=m CONFIG_SND_DARLA24=m CONFIG_SND_GINA24=m CONFIG_SND_LAYLA24=m CONFIG_SND_MONA=m CONFIG_SND_MIA=m CONFIG_SND_ECHO3G=m CONFIG_SND_INDIGO=m CONFIG_SND_INDIGOIO=m CONFIG_SND_INDIGODJ=m CONFIG_SND_EMU10K1=m CONFIG_SND_EMU10K1X=m CONFIG_SND_ENS1370=m CONFIG_SND_ENS1371=m CONFIG_SND_ES1938=m CONFIG_SND_ES1968=m CONFIG_SND_FM801=m 
CONFIG_SND_HDA_INTEL=m # CONFIG_SND_HDA_HWDEP is not set # CONFIG_SND_HDA_INPUT_BEEP is not set CONFIG_SND_HDA_CODEC_REALTEK=y CONFIG_SND_HDA_CODEC_ANALOG=y CONFIG_SND_HDA_CODEC_SIGMATEL=y CONFIG_SND_HDA_CODEC_VIA=y CONFIG_SND_HDA_CODEC_ATIHDMI=y CONFIG_SND_HDA_CODEC_NVHDMI=y CONFIG_SND_HDA_CODEC_CONEXANT=y CONFIG_SND_HDA_CODEC_CMEDIA=y CONFIG_SND_HDA_CODEC_SI3054=y CONFIG_SND_HDA_GENERIC=y # CONFIG_SND_HDA_POWER_SAVE is not set CONFIG_SND_HDSP=m CONFIG_SND_HDSPM=m # CONFIG_SND_HIFIER is not set CONFIG_SND_ICE1712=m CONFIG_SND_ICE1724=m CONFIG_SND_INTEL8X0=m CONFIG_SND_INTEL8X0M=m CONFIG_SND_KORG1212=m CONFIG_SND_MAESTRO3=m CONFIG_SND_MIXART=m CONFIG_SND_NM256=m CONFIG_SND_PCXHR=m CONFIG_SND_RIPTIDE=m CONFIG_SND_RME32=m CONFIG_SND_RME96=m CONFIG_SND_RME9652=m CONFIG_SND_SONICVIBES=m CONFIG_SND_TRIDENT=m CONFIG_SND_VIA82XX=m CONFIG_SND_VIA82XX_MODEM=m # CONFIG_SND_VIRTUOSO is not set CONFIG_SND_VX222=m CONFIG_SND_YMFPCI=m CONFIG_SND_USB=y CONFIG_SND_USB_AUDIO=m CONFIG_SND_USB_USX2Y=m # CONFIG_SND_USB_CAIAQ is not set # CONFIG_SND_USB_US122L is not set CONFIG_SND_PCMCIA=y # CONFIG_SND_VXPOCKET is not set # CONFIG_SND_PDAUDIOCF is not set CONFIG_SND_SOC=m # CONFIG_SND_SOC_ALL_CODECS is not set # CONFIG_SOUND_PRIME is not set CONFIG_AC97_BUS=m CONFIG_HID_SUPPORT=y CONFIG_HID=y # CONFIG_HID_DEBUG is not set # CONFIG_HIDRAW is not set # # USB Input Devices # CONFIG_USB_HID=y CONFIG_HID_PID=y CONFIG_USB_HIDDEV=y # # Special HID drivers # CONFIG_HID_COMPAT=y CONFIG_HID_A4TECH=y CONFIG_HID_APPLE=y CONFIG_HID_BELKIN=y CONFIG_HID_BRIGHT=y CONFIG_HID_CHERRY=y CONFIG_HID_CHICONY=y CONFIG_HID_CYPRESS=y CONFIG_HID_DELL=y CONFIG_HID_EZKEY=y CONFIG_HID_GYRATION=y CONFIG_HID_LOGITECH=y CONFIG_LOGITECH_FF=y # CONFIG_LOGIRUMBLEPAD2_FF is not set CONFIG_HID_MICROSOFT=y CONFIG_HID_MONTEREY=y CONFIG_HID_PANTHERLORD=y CONFIG_PANTHERLORD_FF=y CONFIG_HID_PETALYNX=y CONFIG_HID_SAMSUNG=y CONFIG_HID_SONY=y CONFIG_HID_SUNPLUS=y CONFIG_THRUSTMASTER_FF=y CONFIG_ZEROPLUS_FF=y CONFIG_USB_SUPPORT=y 
CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB_ARCH_HAS_OHCI=y CONFIG_USB_ARCH_HAS_EHCI=y CONFIG_USB=y # CONFIG_USB_DEBUG is not set # CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set # # Miscellaneous USB options # CONFIG_USB_DEVICEFS=y CONFIG_USB_DEVICE_CLASS=y # CONFIG_USB_DYNAMIC_MINORS is not set # CONFIG_USB_SUSPEND is not set # CONFIG_USB_OTG is not set CONFIG_USB_MON=y # CONFIG_USB_WUSB is not set # CONFIG_USB_WUSB_CBAF is not set # # USB Host Controller Drivers # # CONFIG_USB_C67X00_HCD is not set CONFIG_USB_EHCI_HCD=y CONFIG_USB_EHCI_ROOT_HUB_TT=y CONFIG_USB_EHCI_TT_NEWSCHED=y CONFIG_USB_ISP116X_HCD=m # CONFIG_USB_ISP1760_HCD is not set CONFIG_USB_OHCI_HCD=y # CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set # CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set CONFIG_USB_OHCI_LITTLE_ENDIAN=y CONFIG_USB_UHCI_HCD=y CONFIG_USB_U132_HCD=m CONFIG_USB_SL811_HCD=m CONFIG_USB_SL811_CS=m # CONFIG_USB_R8A66597_HCD is not set # CONFIG_USB_WHCI_HCD is not set # CONFIG_USB_HWA_HCD is not set # # USB Device Class drivers # CONFIG_USB_ACM=m CONFIG_USB_PRINTER=m # CONFIG_USB_WDM is not set # CONFIG_USB_TMC is not set # # NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may also be needed; # # # see USB_STORAGE Help for more information # CONFIG_USB_STORAGE=m # CONFIG_USB_STORAGE_DEBUG is not set CONFIG_USB_STORAGE_DATAFAB=y CONFIG_USB_STORAGE_FREECOM=y CONFIG_USB_STORAGE_ISD200=y CONFIG_USB_STORAGE_DPCM=y CONFIG_USB_STORAGE_USBAT=y CONFIG_USB_STORAGE_SDDR09=y CONFIG_USB_STORAGE_SDDR55=y CONFIG_USB_STORAGE_JUMPSHOT=y CONFIG_USB_STORAGE_ALAUDA=y # CONFIG_USB_STORAGE_ONETOUCH is not set CONFIG_USB_STORAGE_KARMA=y # CONFIG_USB_STORAGE_CYPRESS_ATACB is not set CONFIG_USB_LIBUSUAL=y # # USB Imaging devices # CONFIG_USB_MDC800=m CONFIG_USB_MICROTEK=m # # USB port drivers # CONFIG_USB_USS720=m CONFIG_USB_SERIAL=m CONFIG_USB_EZUSB=y CONFIG_USB_SERIAL_GENERIC=y CONFIG_USB_SERIAL_AIRCABLE=m CONFIG_USB_SERIAL_ARK3116=m CONFIG_USB_SERIAL_BELKIN=m # CONFIG_USB_SERIAL_CH341 is not set CONFIG_USB_SERIAL_WHITEHEAT=m 
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m CONFIG_USB_SERIAL_CP2101=m CONFIG_USB_SERIAL_CYPRESS_M8=m CONFIG_USB_SERIAL_EMPEG=m CONFIG_USB_SERIAL_FTDI_SIO=m CONFIG_USB_SERIAL_FUNSOFT=m CONFIG_USB_SERIAL_VISOR=m CONFIG_USB_SERIAL_IPAQ=m CONFIG_USB_SERIAL_IR=m CONFIG_USB_SERIAL_EDGEPORT=m CONFIG_USB_SERIAL_EDGEPORT_TI=m CONFIG_USB_SERIAL_GARMIN=m CONFIG_USB_SERIAL_IPW=m # CONFIG_USB_SERIAL_IUU is not set CONFIG_USB_SERIAL_KEYSPAN_PDA=m CONFIG_USB_SERIAL_KEYSPAN=m CONFIG_USB_SERIAL_KEYSPAN_MPR=y CONFIG_USB_SERIAL_KEYSPAN_USA28=y CONFIG_USB_SERIAL_KEYSPAN_USA28X=y CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y CONFIG_USB_SERIAL_KEYSPAN_USA19=y CONFIG_USB_SERIAL_KEYSPAN_USA18X=y CONFIG_USB_SERIAL_KEYSPAN_USA19W=y CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y CONFIG_USB_SERIAL_KEYSPAN_USA49W=y CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y CONFIG_USB_SERIAL_KLSI=m CONFIG_USB_SERIAL_KOBIL_SCT=m CONFIG_USB_SERIAL_MCT_U232=m CONFIG_USB_SERIAL_MOS7720=m CONFIG_USB_SERIAL_MOS7840=m # CONFIG_USB_SERIAL_MOTOROLA is not set CONFIG_USB_SERIAL_NAVMAN=m CONFIG_USB_SERIAL_PL2303=m # CONFIG_USB_SERIAL_OTI6858 is not set # CONFIG_USB_SERIAL_SPCP8X5 is not set CONFIG_USB_SERIAL_HP4X=m CONFIG_USB_SERIAL_SAFE=m CONFIG_USB_SERIAL_SAFE_PADDED=y CONFIG_USB_SERIAL_SIERRAWIRELESS=m CONFIG_USB_SERIAL_TI=m CONFIG_USB_SERIAL_CYBERJACK=m CONFIG_USB_SERIAL_XIRCOM=m CONFIG_USB_SERIAL_OPTION=m CONFIG_USB_SERIAL_OMNINET=m CONFIG_USB_SERIAL_DEBUG=m # # USB Miscellaneous drivers # CONFIG_USB_EMI62=m CONFIG_USB_EMI26=m CONFIG_USB_ADUTUX=m # CONFIG_USB_SEVSEG is not set CONFIG_USB_RIO500=m CONFIG_USB_LEGOTOWER=m CONFIG_USB_LCD=m CONFIG_USB_BERRY_CHARGE=m CONFIG_USB_LED=m # CONFIG_USB_CYPRESS_CY7C63 is not set # CONFIG_USB_CYTHERM is not set CONFIG_USB_PHIDGET=m CONFIG_USB_PHIDGETKIT=m CONFIG_USB_PHIDGETMOTORCONTROL=m CONFIG_USB_PHIDGETSERVO=m CONFIG_USB_IDMOUSE=m CONFIG_USB_FTDI_ELAN=m CONFIG_USB_APPLEDISPLAY=m CONFIG_USB_SISUSBVGA=m CONFIG_USB_SISUSBVGA_CON=y 
CONFIG_USB_LD=m CONFIG_USB_TRANCEVIBRATOR=m CONFIG_USB_IOWARRIOR=m CONFIG_USB_TEST=m # CONFIG_USB_ISIGHTFW is not set # CONFIG_USB_VST is not set CONFIG_USB_ATM=m CONFIG_USB_SPEEDTOUCH=m CONFIG_USB_CXACRU=m CONFIG_USB_UEAGLEATM=m CONFIG_USB_XUSBATM=m # CONFIG_USB_GADGET is not set # CONFIG_UWB is not set CONFIG_MMC=m # CONFIG_MMC_DEBUG is not set # CONFIG_MMC_UNSAFE_RESUME is not set # # MMC/SD/SDIO Card Drivers # CONFIG_MMC_BLOCK=m CONFIG_MMC_BLOCK_BOUNCE=y # CONFIG_SDIO_UART is not set # CONFIG_MMC_TEST is not set # # MMC/SD/SDIO Host Controller Drivers # CONFIG_MMC_SDHCI=m # CONFIG_MMC_SDHCI_PCI is not set CONFIG_MMC_WBSD=m CONFIG_MMC_TIFM_SD=m # CONFIG_MMC_SDRICOH_CS is not set # CONFIG_MEMSTICK is not set CONFIG_NEW_LEDS=y CONFIG_LEDS_CLASS=y # # LED drivers # # CONFIG_LEDS_PCA9532 is not set # CONFIG_LEDS_HP_DISK is not set # CONFIG_LEDS_CLEVO_MAIL is not set # CONFIG_LEDS_PCA955X is not set # # LED Triggers # CONFIG_LEDS_TRIGGERS=y CONFIG_LEDS_TRIGGER_TIMER=m CONFIG_LEDS_TRIGGER_HEARTBEAT=m # CONFIG_LEDS_TRIGGER_BACKLIGHT is not set # CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set # CONFIG_ACCESSIBILITY is not set CONFIG_INFINIBAND=m CONFIG_INFINIBAND_USER_MAD=m CONFIG_INFINIBAND_USER_ACCESS=m CONFIG_INFINIBAND_USER_MEM=y CONFIG_INFINIBAND_ADDR_TRANS=y CONFIG_INFINIBAND_MTHCA=m CONFIG_INFINIBAND_MTHCA_DEBUG=y CONFIG_INFINIBAND_IPATH=m # CONFIG_INFINIBAND_AMSO1100 is not set CONFIG_INFINIBAND_CXGB3=m # CONFIG_INFINIBAND_CXGB3_DEBUG is not set # CONFIG_MLX4_INFINIBAND is not set # CONFIG_INFINIBAND_NES is not set CONFIG_INFINIBAND_IPOIB=m CONFIG_INFINIBAND_IPOIB_CM=y CONFIG_INFINIBAND_IPOIB_DEBUG=y CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y CONFIG_INFINIBAND_SRP=m CONFIG_INFINIBAND_ISER=m CONFIG_EDAC=y # # Reporting subsystems # # CONFIG_EDAC_DEBUG is not set CONFIG_EDAC_MM_EDAC=m CONFIG_EDAC_E752X=m # CONFIG_EDAC_I82975X is not set # CONFIG_EDAC_I3000 is not set # CONFIG_EDAC_X38 is not set # CONFIG_EDAC_I5000 is not set # CONFIG_EDAC_I5100 is not set CONFIG_RTC_LIB=m 
CONFIG_RTC_CLASS=m # # RTC interfaces # CONFIG_RTC_INTF_SYSFS=y CONFIG_RTC_INTF_PROC=y CONFIG_RTC_INTF_DEV=y # CONFIG_RTC_INTF_DEV_UIE_EMUL is not set # CONFIG_RTC_DRV_TEST is not set # # I2C RTC drivers # CONFIG_RTC_DRV_DS1307=m # CONFIG_RTC_DRV_DS1374 is not set CONFIG_RTC_DRV_DS1672=m # CONFIG_RTC_DRV_MAX6900 is not set CONFIG_RTC_DRV_RS5C372=m CONFIG_RTC_DRV_ISL1208=m CONFIG_RTC_DRV_X1205=m CONFIG_RTC_DRV_PCF8563=m # CONFIG_RTC_DRV_PCF8583 is not set # CONFIG_RTC_DRV_M41T80 is not set # CONFIG_RTC_DRV_S35390A is not set # CONFIG_RTC_DRV_FM3130 is not set # CONFIG_RTC_DRV_RX8581 is not set # # SPI RTC drivers # # # Platform RTC drivers # CONFIG_RTC_DRV_CMOS=m # CONFIG_RTC_DRV_DS1286 is not set # CONFIG_RTC_DRV_DS1511 is not set CONFIG_RTC_DRV_DS1553=m CONFIG_RTC_DRV_DS1742=m # CONFIG_RTC_DRV_STK17TA8 is not set # CONFIG_RTC_DRV_M48T86 is not set # CONFIG_RTC_DRV_M48T35 is not set # CONFIG_RTC_DRV_M48T59 is not set # CONFIG_RTC_DRV_BQ4802 is not set CONFIG_RTC_DRV_V3020=m # # on-CPU RTC drivers # # CONFIG_DMADEVICES is not set # CONFIG_AUXDISPLAY is not set # CONFIG_UIO is not set # CONFIG_STAGING is not set CONFIG_STAGING_EXCLUDE_BUILD=y # # Firmware Drivers # CONFIG_EDD=m # CONFIG_EDD_OFF is not set CONFIG_FIRMWARE_MEMMAP=y CONFIG_DELL_RBU=m CONFIG_DCDBAS=m CONFIG_DMIID=y # CONFIG_ISCSI_IBFT_FIND is not set # # File systems # CONFIG_EXT2_FS=y CONFIG_EXT2_FS_XATTR=y CONFIG_EXT2_FS_POSIX_ACL=y CONFIG_EXT2_FS_SECURITY=y CONFIG_EXT2_FS_XIP=y CONFIG_EXT3_FS=y CONFIG_EXT3_FS_XATTR=y CONFIG_EXT3_FS_POSIX_ACL=y CONFIG_EXT3_FS_SECURITY=y # CONFIG_EXT4_FS is not set CONFIG_FS_XIP=y CONFIG_JBD=y # CONFIG_JBD_DEBUG is not set CONFIG_JBD2=m # CONFIG_JBD2_DEBUG is not set CONFIG_FS_MBCACHE=y CONFIG_REISERFS_FS=m # CONFIG_REISERFS_CHECK is not set CONFIG_REISERFS_PROC_INFO=y CONFIG_REISERFS_FS_XATTR=y CONFIG_REISERFS_FS_POSIX_ACL=y CONFIG_REISERFS_FS_SECURITY=y CONFIG_JFS_FS=m CONFIG_JFS_POSIX_ACL=y CONFIG_JFS_SECURITY=y # CONFIG_JFS_DEBUG is not set # CONFIG_JFS_STATISTICS 
is not set CONFIG_FS_POSIX_ACL=y CONFIG_FILE_LOCKING=y CONFIG_XFS_FS=m CONFIG_XFS_QUOTA=y CONFIG_XFS_POSIX_ACL=y # CONFIG_XFS_RT is not set # CONFIG_XFS_DEBUG is not set CONFIG_GFS2_FS=m CONFIG_GFS2_FS_LOCKING_DLM=m CONFIG_OCFS2_FS=m CONFIG_OCFS2_FS_O2CB=m CONFIG_OCFS2_FS_USERSPACE_CLUSTER=m CONFIG_OCFS2_FS_STATS=y # CONFIG_OCFS2_DEBUG_MASKLOG is not set # CONFIG_OCFS2_DEBUG_FS is not set # CONFIG_OCFS2_COMPAT_JBD is not set CONFIG_DNOTIFY=y CONFIG_INOTIFY=y CONFIG_INOTIFY_USER=y CONFIG_QUOTA=y # CONFIG_QUOTA_NETLINK_INTERFACE is not set CONFIG_PRINT_QUOTA_WARNING=y # CONFIG_QFMT_V1 is not set CONFIG_QFMT_V2=y CONFIG_QUOTACTL=y CONFIG_AUTOFS_FS=m CONFIG_AUTOFS4_FS=m CONFIG_FUSE_FS=m CONFIG_GENERIC_ACL=y # # CD-ROM/DVD Filesystems # CONFIG_ISO9660_FS=y CONFIG_JOLIET=y CONFIG_ZISOFS=y CONFIG_UDF_FS=m CONFIG_UDF_NLS=y # # DOS/FAT/NT Filesystems # CONFIG_FAT_FS=m CONFIG_MSDOS_FS=m CONFIG_VFAT_FS=m CONFIG_FAT_DEFAULT_CODEPAGE=437 CONFIG_FAT_DEFAULT_IOCHARSET="ascii" # CONFIG_NTFS_FS is not set # # Pseudo filesystems # CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_SYSCTL=y CONFIG_PROC_PAGE_MONITOR=y CONFIG_SYSFS=y CONFIG_TMPFS=y CONFIG_TMPFS_POSIX_ACL=y CONFIG_HUGETLBFS=y CONFIG_HUGETLB_PAGE=y CONFIG_CONFIGFS_FS=m # # Miscellaneous filesystems # # CONFIG_ADFS_FS is not set CONFIG_AFFS_FS=m # CONFIG_ECRYPT_FS is not set CONFIG_HFS_FS=m CONFIG_HFSPLUS_FS=m CONFIG_BEFS_FS=m # CONFIG_BEFS_DEBUG is not set CONFIG_BFS_FS=m CONFIG_EFS_FS=m CONFIG_CRAMFS=m CONFIG_VXFS_FS=m CONFIG_MINIX_FS=m # CONFIG_OMFS_FS is not set # CONFIG_HPFS_FS is not set CONFIG_QNX4FS_FS=m CONFIG_ROMFS_FS=m CONFIG_SYSV_FS=m CONFIG_UFS_FS=m # CONFIG_UFS_FS_WRITE is not set # CONFIG_UFS_DEBUG is not set CONFIG_NETWORK_FILESYSTEMS=y CONFIG_NFS_FS=m CONFIG_NFS_V3=y CONFIG_NFS_V3_ACL=y CONFIG_NFS_V4=y CONFIG_NFSD=m CONFIG_NFSD_V2_ACL=y CONFIG_NFSD_V3=y CONFIG_NFSD_V3_ACL=y CONFIG_NFSD_V4=y CONFIG_LOCKD=m CONFIG_LOCKD_V4=y CONFIG_EXPORTFS=m CONFIG_NFS_ACL_SUPPORT=m CONFIG_NFS_COMMON=y CONFIG_SUNRPC=m 
CONFIG_SUNRPC_GSS=m CONFIG_SUNRPC_XPRT_RDMA=m # CONFIG_SUNRPC_REGISTER_V4 is not set CONFIG_RPCSEC_GSS_KRB5=m CONFIG_RPCSEC_GSS_SPKM3=m # CONFIG_SMB_FS is not set CONFIG_CIFS=m # CONFIG_CIFS_STATS is not set CONFIG_CIFS_WEAK_PW_HASH=y # CONFIG_CIFS_UPCALL is not set CONFIG_CIFS_XATTR=y CONFIG_CIFS_POSIX=y # CONFIG_CIFS_DEBUG2 is not set # CONFIG_CIFS_EXPERIMENTAL is not set CONFIG_NCP_FS=m CONFIG_NCPFS_PACKET_SIGNING=y CONFIG_NCPFS_IOCTL_LOCKING=y CONFIG_NCPFS_STRONG=y CONFIG_NCPFS_NFS_NS=y CONFIG_NCPFS_OS2_NS=y CONFIG_NCPFS_SMALLDOS=y CONFIG_NCPFS_NLS=y CONFIG_NCPFS_EXTRAS=y CONFIG_CODA_FS=m # CONFIG_AFS_FS is not set # # Partition Types # CONFIG_PARTITION_ADVANCED=y # CONFIG_ACORN_PARTITION is not set CONFIG_OSF_PARTITION=y CONFIG_AMIGA_PARTITION=y # CONFIG_ATARI_PARTITION is not set CONFIG_MAC_PARTITION=y CONFIG_MSDOS_PARTITION=y CONFIG_BSD_DISKLABEL=y CONFIG_MINIX_SUBPARTITION=y CONFIG_SOLARIS_X86_PARTITION=y CONFIG_UNIXWARE_DISKLABEL=y # CONFIG_LDM_PARTITION is not set CONFIG_SGI_PARTITION=y # CONFIG_ULTRIX_PARTITION is not set CONFIG_SUN_PARTITION=y CONFIG_KARMA_PARTITION=y CONFIG_EFI_PARTITION=y # CONFIG_SYSV68_PARTITION is not set CONFIG_NLS=y CONFIG_NLS_DEFAULT="utf8" CONFIG_NLS_CODEPAGE_437=y CONFIG_NLS_CODEPAGE_737=m CONFIG_NLS_CODEPAGE_775=m CONFIG_NLS_CODEPAGE_850=m CONFIG_NLS_CODEPAGE_852=m CONFIG_NLS_CODEPAGE_855=m CONFIG_NLS_CODEPAGE_857=m CONFIG_NLS_CODEPAGE_860=m CONFIG_NLS_CODEPAGE_861=m CONFIG_NLS_CODEPAGE_862=m CONFIG_NLS_CODEPAGE_863=m CONFIG_NLS_CODEPAGE_864=m CONFIG_NLS_CODEPAGE_865=m CONFIG_NLS_CODEPAGE_866=m CONFIG_NLS_CODEPAGE_869=m CONFIG_NLS_CODEPAGE_936=m CONFIG_NLS_CODEPAGE_950=m CONFIG_NLS_CODEPAGE_932=m CONFIG_NLS_CODEPAGE_949=m CONFIG_NLS_CODEPAGE_874=m CONFIG_NLS_ISO8859_8=m CONFIG_NLS_CODEPAGE_1250=m CONFIG_NLS_CODEPAGE_1251=m CONFIG_NLS_ASCII=y CONFIG_NLS_ISO8859_1=m CONFIG_NLS_ISO8859_2=m CONFIG_NLS_ISO8859_3=m CONFIG_NLS_ISO8859_4=m CONFIG_NLS_ISO8859_5=m CONFIG_NLS_ISO8859_6=m CONFIG_NLS_ISO8859_7=m CONFIG_NLS_ISO8859_9=m 
CONFIG_NLS_ISO8859_13=m CONFIG_NLS_ISO8859_14=m CONFIG_NLS_ISO8859_15=m CONFIG_NLS_KOI8_R=m CONFIG_NLS_KOI8_U=m CONFIG_NLS_UTF8=m CONFIG_DLM=m CONFIG_DLM_DEBUG=y # # Kernel hacking # CONFIG_TRACE_IRQFLAGS_SUPPORT=y # CONFIG_PRINTK_TIME is not set CONFIG_ENABLE_WARN_DEPRECATED=y # CONFIG_ENABLE_MUST_CHECK is not set CONFIG_FRAME_WARN=2048 CONFIG_MAGIC_SYSRQ=y # CONFIG_UNUSED_SYMBOLS is not set CONFIG_DEBUG_FS=y # CONFIG_HEADERS_CHECK is not set CONFIG_DEBUG_KERNEL=y # CONFIG_DEBUG_SHIRQ is not set # CONFIG_DETECT_SOFTLOCKUP is not set # CONFIG_SCHED_DEBUG is not set # CONFIG_SCHEDSTATS is not set # CONFIG_TIMER_STATS is not set # CONFIG_DEBUG_OBJECTS is not set # CONFIG_DEBUG_SLAB is not set # CONFIG_DEBUG_RT_MUTEXES is not set CONFIG_RT_MUTEX_TESTER=y # CONFIG_DEBUG_SPINLOCK is not set # CONFIG_DEBUG_MUTEXES is not set # CONFIG_DEBUG_LOCK_ALLOC is not set # CONFIG_PROVE_LOCKING is not set # CONFIG_LOCK_STAT is not set # CONFIG_DEBUG_SPINLOCK_SLEEP is not set # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set # CONFIG_DEBUG_KOBJECT is not set CONFIG_DEBUG_BUGVERBOSE=y # CONFIG_DEBUG_INFO is not set # CONFIG_DEBUG_VM is not set # CONFIG_DEBUG_VIRTUAL is not set # CONFIG_DEBUG_WRITECOUNT is not set CONFIG_DEBUG_MEMORY_INIT=y # CONFIG_DEBUG_LIST is not set # CONFIG_DEBUG_SG is not set CONFIG_FRAME_POINTER=y # CONFIG_BOOT_PRINTK_DELAY is not set # CONFIG_RCU_TORTURE_TEST is not set # CONFIG_RCU_CPU_STALL_DETECTOR is not set # CONFIG_KPROBES_SANITY_TEST is not set # CONFIG_BACKTRACE_SELF_TEST is not set # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set # CONFIG_LKDTM is not set # CONFIG_FAULT_INJECTION is not set # CONFIG_LATENCYTOP is not set CONFIG_SYSCTL_SYSCALL_CHECK=y CONFIG_HAVE_FUNCTION_TRACER=y CONFIG_HAVE_DYNAMIC_FTRACE=y CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y # # Tracers # # CONFIG_FUNCTION_TRACER is not set # CONFIG_IRQSOFF_TRACER is not set # CONFIG_SYSPROF_TRACER is not set # CONFIG_SCHED_TRACER is not set # CONFIG_CONTEXT_SWITCH_TRACER is not set # CONFIG_BOOT_TRACER is 
not set # CONFIG_STACK_TRACER is not set # CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set # CONFIG_DYNAMIC_PRINTK_DEBUG is not set # CONFIG_SAMPLES is not set CONFIG_HAVE_ARCH_KGDB=y # CONFIG_KGDB is not set # CONFIG_STRICT_DEVMEM is not set CONFIG_X86_VERBOSE_BOOTUP=y CONFIG_EARLY_PRINTK=y # CONFIG_EARLY_PRINTK_DBGP is not set # CONFIG_DEBUG_STACKOVERFLOW is not set # CONFIG_DEBUG_STACK_USAGE is not set # CONFIG_DEBUG_PAGEALLOC is not set # CONFIG_DEBUG_PER_CPU_MAPS is not set # CONFIG_X86_PTDUMP is not set CONFIG_DEBUG_RODATA=y CONFIG_DIRECT_GBPAGES=y # CONFIG_DEBUG_RODATA_TEST is not set # CONFIG_DEBUG_NX_TEST is not set # CONFIG_IOMMU_DEBUG is not set # CONFIG_MMIOTRACE is not set CONFIG_IO_DELAY_TYPE_0X80=0 CONFIG_IO_DELAY_TYPE_0XED=1 CONFIG_IO_DELAY_TYPE_UDELAY=2 CONFIG_IO_DELAY_TYPE_NONE=3 CONFIG_IO_DELAY_0X80=y # CONFIG_IO_DELAY_0XED is not set # CONFIG_IO_DELAY_UDELAY is not set # CONFIG_IO_DELAY_NONE is not set CONFIG_DEFAULT_IO_DELAY_TYPE=0 # CONFIG_DEBUG_BOOT_PARAMS is not set # CONFIG_CPA_DEBUG is not set CONFIG_OPTIMIZE_INLINING=y # # Security options # CONFIG_KEYS=y CONFIG_KEYS_DEBUG_PROC_KEYS=y CONFIG_SECURITY=y CONFIG_SECURITYFS=y CONFIG_SECURITY_NETWORK=y # CONFIG_SECURITY_NETWORK_XFRM is not set # CONFIG_SECURITY_FILE_CAPABILITIES is not set # CONFIG_SECURITY_ROOTPLUG is not set CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=0 CONFIG_SECURITY_SELINUX=y CONFIG_SECURITY_SELINUX_BOOTPARAM=y CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1 CONFIG_SECURITY_SELINUX_DISABLE=y CONFIG_SECURITY_SELINUX_DEVELOP=y CONFIG_SECURITY_SELINUX_AVC_STATS=y CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1 # CONFIG_SECURITY_SELINUX_ENABLE_SECMARK_DEFAULT is not set # CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set # CONFIG_SECURITY_SMACK is not set CONFIG_XOR_BLOCKS=m CONFIG_ASYNC_CORE=m CONFIG_ASYNC_MEMCPY=m CONFIG_ASYNC_XOR=m CONFIG_CRYPTO=y # # Crypto core or helper # # CONFIG_CRYPTO_FIPS is not set CONFIG_CRYPTO_ALGAPI=y CONFIG_CRYPTO_AEAD=y CONFIG_CRYPTO_BLKCIPHER=y 
CONFIG_CRYPTO_HASH=y CONFIG_CRYPTO_RNG=y CONFIG_CRYPTO_MANAGER=y CONFIG_CRYPTO_GF128MUL=m CONFIG_CRYPTO_NULL=m # CONFIG_CRYPTO_CRYPTD is not set CONFIG_CRYPTO_AUTHENC=m # CONFIG_CRYPTO_TEST is not set # # Authenticated Encryption with Associated Data # # CONFIG_CRYPTO_CCM is not set # CONFIG_CRYPTO_GCM is not set # CONFIG_CRYPTO_SEQIV is not set # # Block modes # CONFIG_CRYPTO_CBC=m # CONFIG_CRYPTO_CTR is not set # CONFIG_CRYPTO_CTS is not set CONFIG_CRYPTO_ECB=m CONFIG_CRYPTO_LRW=m CONFIG_CRYPTO_PCBC=m # CONFIG_CRYPTO_XTS is not set # # Hash modes # CONFIG_CRYPTO_HMAC=y CONFIG_CRYPTO_XCBC=m # # Digest # CONFIG_CRYPTO_CRC32C=y # CONFIG_CRYPTO_CRC32C_INTEL is not set CONFIG_CRYPTO_MD4=m CONFIG_CRYPTO_MD5=y CONFIG_CRYPTO_MICHAEL_MIC=m # CONFIG_CRYPTO_RMD128 is not set # CONFIG_CRYPTO_RMD160 is not set # CONFIG_CRYPTO_RMD256 is not set # CONFIG_CRYPTO_RMD320 is not set CONFIG_CRYPTO_SHA1=y CONFIG_CRYPTO_SHA256=m CONFIG_CRYPTO_SHA512=m CONFIG_CRYPTO_TGR192=m CONFIG_CRYPTO_WP512=m # # Ciphers # CONFIG_CRYPTO_AES=m CONFIG_CRYPTO_AES_X86_64=m CONFIG_CRYPTO_ANUBIS=m CONFIG_CRYPTO_ARC4=m CONFIG_CRYPTO_BLOWFISH=m CONFIG_CRYPTO_CAMELLIA=m CONFIG_CRYPTO_CAST5=m CONFIG_CRYPTO_CAST6=m CONFIG_CRYPTO_DES=m CONFIG_CRYPTO_FCRYPT=m CONFIG_CRYPTO_KHAZAD=m # CONFIG_CRYPTO_SALSA20 is not set # CONFIG_CRYPTO_SALSA20_X86_64 is not set # CONFIG_CRYPTO_SEED is not set CONFIG_CRYPTO_SERPENT=m CONFIG_CRYPTO_TEA=m CONFIG_CRYPTO_TWOFISH=m CONFIG_CRYPTO_TWOFISH_COMMON=m CONFIG_CRYPTO_TWOFISH_X86_64=m # # Compression # CONFIG_CRYPTO_DEFLATE=m # CONFIG_CRYPTO_LZO is not set # # Random Number Generation # # CONFIG_CRYPTO_ANSI_CPRNG is not set CONFIG_CRYPTO_HW=y # CONFIG_CRYPTO_DEV_HIFN_795X is not set CONFIG_HAVE_KVM=y CONFIG_VIRTUALIZATION=y CONFIG_KVM=m CONFIG_KVM_INTEL=m CONFIG_KVM_AMD=m # CONFIG_VIRTIO_PCI is not set # CONFIG_VIRTIO_BALLOON is not set # # Library routines # CONFIG_BITREVERSE=y CONFIG_GENERIC_FIND_FIRST_BIT=y CONFIG_GENERIC_FIND_NEXT_BIT=y CONFIG_CRC_CCITT=m CONFIG_CRC16=m 
CONFIG_CRC_T10DIF=y CONFIG_CRC_ITU_T=m CONFIG_CRC32=y # CONFIG_CRC7 is not set CONFIG_LIBCRC32C=y CONFIG_ZLIB_INFLATE=y CONFIG_ZLIB_DEFLATE=m CONFIG_GENERIC_ALLOCATOR=y CONFIG_TEXTSEARCH=y CONFIG_TEXTSEARCH_KMP=m CONFIG_TEXTSEARCH_BM=m CONFIG_TEXTSEARCH_FSM=m CONFIG_PLIST=y CONFIG_HAS_IOMEM=y CONFIG_HAS_IOPORT=y CONFIG_HAS_DMA=y ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 11:01 ` Ingo Molnar @ 2008-11-17 11:20 ` Eric Dumazet 2008-11-17 16:11 ` Ingo Molnar 2008-11-17 19:21 ` David Miller 1 sibling, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 11:20 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Linus Torvalds Ingo Molnar a écrit : > * David Miller <davem@davemloft.net> wrote: > >> From: Ingo Molnar <mingo@elte.hu> >> Date: Mon, 17 Nov 2008 10:06:48 +0100 >> >>> * Rafael J. Wysocki <rjw@sisk.pl> wrote: >>> >>>> This message has been generated automatically as a part of a report >>>> of regressions introduced between 2.6.26 and 2.6.27. >>>> >>>> The following bug entry is on the current list of known regressions >>>> introduced between 2.6.26 and 2.6.27. Please verify if it still should >>>> be listed and let me know (either way). >>>> >>>> >>>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308 >>>> Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28 >>>> Submitter : Christoph Lameter <cl@linux-foundation.org> >>>> Date : 2008-08-11 18:36 (98 days old) >>>> References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4 >>>> http://marc.info/?l=linux-kernel&m=122125737421332&w=4 >>> Christoph, as per the recent analysis of Mike: >>> >>> http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html >>> >>> all scheduler components of this regression have been eliminated. >>> >>> In fact his numbers show that scheduler speedups since 2.6.22 have >>> offset and hidden most other sources of tbench regression. (i.e. 
the >>> scheduler portion got 5% faster, hence it was able to offset a >>> slowdown of 5% in other areas of the kernel that tbench triggers) >> Although I respect the improvements, wake_up() is still several >> orders of magnitude slower than it was in 2.6.22 and wake_up() is at >> the top of the profiles in tbench runs. > > hm, several orders of magnitude slower? That contradicts Mike's > numbers and my own numbers and profiles as well: see below. > > The scheduler's overhead barely even registers on a 16-way x86 system > i'm running tbench on. Here's the NMI profile during 64 threads tbench > on a 16-way x86 box with an v2.6.28-rc5 kernel [config attached]: > > Throughput 3437.65 MB/sec 64 procs > ================================== > 21570252 total > ........ > 1494803 copy_user_generic_string > 998232 sock_rfree > 491471 tcp_ack > 482405 ip_dont_fragment > 470685 ip_local_deliver > 436325 constant_test_bit [ called by napi_disable_pending() ] > 375469 avc_has_perm_noaudit > 347663 tcp_sendmsg > 310383 tcp_recvmsg > 300412 __inet_lookup_established > 294377 system_call > 286603 tcp_transmit_skb > 251782 selinux_ip_postroute > 236028 tcp_current_mss > 235631 schedule > 234013 netif_rx > 229854 _local_bh_enable_ip > 219501 tcp_v4_rcv > > [ etc. - see full profile attached further below ] > > Note that the scheduler does not even show up in the profile up to > entry #15! 
> > I've also summarized NMI profiler output by major subsystems: > > NET overhead (12603450/21570252): 58.43% > security overhead ( 1903598/21570252): 8.83% > usercopy overhead ( 1753617/21570252): 8.13% > sched overhead ( 1599406/21570252): 7.41% > syscall overhead ( 560487/21570252): 2.60% > IRQ overhead ( 555439/21570252): 2.58% > slab overhead ( 492421/21570252): 2.28% > timer overhead ( 226573/21570252): 1.05% > pagealloc overhead ( 192681/21570252): 0.89% > PID overhead ( 115123/21570252): 0.53% > VFS overhead ( 107926/21570252): 0.50% > pagecache overhead ( 62552/21570252): 0.29% > gtod overhead ( 38651/21570252): 0.18% > IDLE overhead ( 0/21570252): 0.00% > --------------------------------------------------------- > left ( 1349494/21570252): 6.26% > > The scheduler's functions are absolutely flat, and consistent with an > extreme context-switching rate of 1.35 million per second. The > scheduler can go up to about 20 million context switches per second on > this system: > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ > r b swpd free buff cache si so bi bo in cs us sy id wa st > 32 0 0 32229696 29308 649880 0 0 0 0 164135 20026853 24 76 0 0 0 > 32 0 0 32229752 29308 649880 0 0 0 0 164203 20032770 24 76 0 0 0 > 32 0 0 32229752 29308 649880 0 0 0 0 164201 20036492 25 75 0 0 0 > > ... and 7% scheduling overhead is roughly consistent with 1.35/20.0. > > Wake up affinities and data flow caching is just fine in this workload > - we've got scheduler statistics for that and they look good too. > > It all looks like pure old-fashioned straight overhead in the > networking layer to me. Do we still touch the same global cacheline > for every localhost packet we process? Anything like that would show > up big time. 
Yes we do; I find it strange we don't see dst_release() in your NMI profile. I posted a patch (commit 5635c10d976716ef47ae441998aeae144c7e7387, "net: make sure struct dst_entry refcount is aligned on 64 bytes", in the net-next-2.6 tree) to properly align the struct dst_entry refcounter and got a 4% speedup on tbench on my machine. Small speedups too with commit ef711cf1d156428d4c2911b8c86c6ce90519dc45 ("net: speedup dst_release()"). Also in net-next-2.6, patches avoid dirtying last_rx on netdevices (loopback for example); that helps tbench a lot too. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 11:20 ` Eric Dumazet @ 2008-11-17 16:11 ` Ingo Molnar 2008-11-17 16:35 ` Eric Dumazet 2008-11-17 19:31 ` David Miller 0 siblings, 2 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 16:11 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Linus Torvalds * Eric Dumazet <dada1@cosmosbay.com> wrote: >> It all looks like pure old-fashioned straight overhead in the >> networking layer to me. Do we still touch the same global cacheline >> for every localhost packet we process? Anything like that would >> show up big time. > > Yes we do, I find strange we dont see dst_release() in your NMI > profile > > I posted a patch ( commit 5635c10d976716ef47ae441998aeae144c7e7387 > net: make sure struct dst_entry refcount is aligned on 64 bytes) (in > net-next-2.6 tree) to properly align struct dst_entry refcounter and > got 4% speedup on tbench on my machine. Ouch, +4% from a oneliner networking change? That's a _huge_ speedup compared to the things we were after in scheduler land. A lot of scheduler folks worked hard to squeeze the last 1-2% out of the scheduler fastpath (which was not trivial at all). The _full_ scheduler accounts for only about 7% of the total system overhead here on a 16-way box... So why should we be handling this anything but a plain networking performance regression/weakness? The localhost scalability bottleneck has been reported a _long_ time ago. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 16:11 ` Ingo Molnar @ 2008-11-17 16:35 ` Eric Dumazet 2008-11-17 17:08 ` Ingo Molnar 2008-11-17 19:31 ` David Miller 1 sibling, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 16:35 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Linus Torvalds, Stephen Hemminger Ingo Molnar a écrit : > * Eric Dumazet <dada1@cosmosbay.com> wrote: > >>> It all looks like pure old-fashioned straight overhead in the >>> networking layer to me. Do we still touch the same global cacheline >>> for every localhost packet we process? Anything like that would >>> show up big time. >> Yes we do, I find strange we dont see dst_release() in your NMI >> profile >> >> I posted a patch ( commit 5635c10d976716ef47ae441998aeae144c7e7387 >> net: make sure struct dst_entry refcount is aligned on 64 bytes) (in >> net-next-2.6 tree) to properly align struct dst_entry refcounter and >> got 4% speedup on tbench on my machine. > > Ouch, +4% from a oneliner networking change? That's a _huge_ speedup > compared to the things we were after in scheduler land. A lot of > scheduler folks worked hard to squeeze the last 1-2% out of the > scheduler fastpath (which was not trivial at all). The _full_ > scheduler accounts for only about 7% of the total system overhead here > on a 16-way box... 4% on my machine, but apparently my machine is sooooo special (see oprofile thread), so maybe its cpus have a hard time playing with a contended cache line. It definitly needs more testing on other machines. Maybe you'll discover patch is bad on your machines, this is why it's in net-next-2.6 > > So why should we be handling this anything but a plain networking > performance regression/weakness? The localhost scalability bottleneck > has been reported a _long_ time ago. 
> The struct dst_entry problem was already discovered a _long_ time ago and probably solved at that time (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17, Thu, 13 Mar 2008 05:52:37 +0000 (22:52 -0700), "[NET]: Fix tbench regression in 2.6.25-rc1"). Then a gremlin came and broke the thing. There are many contended cache lines in the system; we can do our best to try to make them disappear, but that's not always possible. Another contended cache line is the rwlock in iptables. I remember Stephen had a patch to make it use RCU. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 16:35 ` Eric Dumazet @ 2008-11-17 17:08 ` Ingo Molnar 2008-11-17 17:25 ` Ingo Molnar 2008-11-17 19:36 ` David Miller 0 siblings, 2 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 17:08 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Linus Torvalds, Stephen Hemminger * Eric Dumazet <dada1@cosmosbay.com> wrote: > Ingo Molnar a écrit : >> * Eric Dumazet <dada1@cosmosbay.com> wrote: >> >>>> It all looks like pure old-fashioned straight overhead in the >>>> networking layer to me. Do we still touch the same global cacheline >>>> for every localhost packet we process? Anything like that would >>>> show up big time. >>> Yes we do, I find strange we dont see dst_release() in your NMI >>> profile >>> >>> I posted a patch ( commit 5635c10d976716ef47ae441998aeae144c7e7387 >>> net: make sure struct dst_entry refcount is aligned on 64 bytes) (in >>> net-next-2.6 tree) to properly align struct dst_entry refcounter and >>> got 4% speedup on tbench on my machine. >> >> Ouch, +4% from a oneliner networking change? That's a _huge_ speedup >> compared to the things we were after in scheduler land. A lot of >> scheduler folks worked hard to squeeze the last 1-2% out of the >> scheduler fastpath (which was not trivial at all). The _full_ >> scheduler accounts for only about 7% of the total system overhead here >> on a 16-way box... > > 4% on my machine, but apparently my machine is sooooo special (see > oprofile thread), so maybe its cpus have a hard time playing with a > contended cache line. > > It definitly needs more testing on other machines. > > Maybe you'll discover patch is bad on your machines, this is why > it's in net-next-2.6 ok, i'll try it on my testbox too, to check whether it has any effect - find below the port to -git. 
tbench _is_ very sensitive to seemingly small details - it seems to be hovering around some sort of CPU cache boundary and penalizing random alignment changes as we drop in and out of the sweet spot. Mike Galbraith has been spending months trying to pin down all the issues. Ingo

------------->

>From 8fbd307d402647b07c3c2662fdac589494d16e5e Mon Sep 17 00:00:00 2001
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sun, 16 Nov 2008 19:46:36 -0800
Subject: [PATCH] net: make sure struct dst_entry refcount is aligned on 64 bytes

As found in the past (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17, "[NET]: Fix tbench regression in 2.6.25-rc1"), it is really important that the struct dst_entry refcount is aligned on a cache line. We cannot use __attribute__((aligned)), so manually pad the structure for 32 and 64 bit arches.

for 32bit : offsetof(struct dst_entry, __refcnt) is 0x80
for 64bit : offsetof(struct dst_entry, __refcnt) is 0xc0

As it is not possible to guess the cache line size at compile time, we use a generic value of 64 bytes, which satisfies many current arches. (Using 128-byte alignment on 64bit arches would waste 64 bytes.)

Add a BUILD_BUG_ON so that future updates to "struct dst_entry" don't break this alignment.

"tbench 8" is 4.4% faster on a dual quad core (HP BL460c G1), Intel E5450 @3.00GHz (2350 MB/s instead of 2250 MB/s).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/dst.h |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 8a8b71e..1b4de18 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -59,7 +59,11 @@ struct dst_entry
 	struct neighbour	*neighbour;
 	struct hh_cache		*hh;
+#ifdef CONFIG_XFRM
 	struct xfrm_state	*xfrm;
+#else
+	void			*__pad1;
+#endif
 	int			(*input)(struct sk_buff*);
 	int			(*output)(struct sk_buff*);
@@ -70,8 +74,20 @@ struct dst_entry
 #ifdef CONFIG_NET_CLS_ROUTE
 	__u32			tclassid;
+#else
+	__u32			__pad2;
 #endif
+
+	/*
+	 * Align __refcnt to a 64 bytes alignment
+	 * (L1_CACHE_SIZE would be too much)
+	 */
+#ifdef CONFIG_64BIT
+	long			__pad_to_align_refcnt[2];
+#else
+	long			__pad_to_align_refcnt[1];
+#endif
 	/*
 	 * __refcnt wants to be on a different cache line from
 	 * input/output/ops or performance tanks badly
@@ -157,6 +173,11 @@ dst_metric_locked(struct dst_entry *dst, int metric)
 static inline void dst_hold(struct dst_entry * dst)
 {
+	/*
+	 * If your kernel compilation stops here, please check
+	 * __pad_to_align_refcnt declaration in struct dst_entry
+	 */
+	BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
 	atomic_inc(&dst->__refcnt);
 }

^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 17:08 ` Ingo Molnar @ 2008-11-17 17:25 ` Ingo Molnar 2008-11-17 17:33 ` Eric Dumazet 2008-11-17 19:36 ` David Miller 1 sibling, 1 reply; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 17:25 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Linus Torvalds, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > > 4% on my machine, but apparently my machine is sooooo special (see > > oprofile thread), so maybe its cpus have a hard time playing with > > a contended cache line. > > > > It definitly needs more testing on other machines. > > > > Maybe you'll discover patch is bad on your machines, this is why > > it's in net-next-2.6 > > ok, i'll try it on my testbox too, to check whether it has any effect > - find below the port to -git. it gives a small speedup of ~1% on my box: before: Throughput 3437.65 MB/sec 64 procs after: Throughput 3473.99 MB/sec 64 procs ... although that's still a bit close to the natural tbench noise range so it's not conclusive and not like a smoking gun IMO. But i think this change might just be papering over the real scalability problem that this workload has in my opinion: that there's a single localhost route/dst/device that millions of packets are squeezed through every second: phoenix:~> ifconfig lo lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:258001524 errors:0 dropped:0 overruns:0 frame:0 TX packets:258001524 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:679809512144 (633.1 GiB) TX bytes:679809512144 (633.1 GiB) There does not seem to be any per CPU ness in localhost networking - it has a globally single-threaded rx/tx queue AFAICS even if both the client and server task is on the same CPU - how is that supposed to perform well? 
(but i might be missing something) What kind of test-system do you have - one with P4 style Xeon CPUs perhaps where dirty-cacheline cachemisses to DRAM were particularly expensive? Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 17:25 ` Ingo Molnar @ 2008-11-17 17:33 ` Eric Dumazet 2008-11-17 17:38 ` Linus Torvalds 0 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 17:33 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Linus Torvalds, Stephen Hemminger Ingo Molnar a écrit : > * Ingo Molnar <mingo@elte.hu> wrote: > >>> 4% on my machine, but apparently my machine is sooooo special (see >>> oprofile thread), so maybe its cpus have a hard time playing with >>> a contended cache line. >>> >>> It definitly needs more testing on other machines. >>> >>> Maybe you'll discover patch is bad on your machines, this is why >>> it's in net-next-2.6 >> ok, i'll try it on my testbox too, to check whether it has any effect >> - find below the port to -git. > > it gives a small speedup of ~1% on my box: > > before: Throughput 3437.65 MB/sec 64 procs > after: Throughput 3473.99 MB/sec 64 procs Strange, I get 2350 MB/sec on my 8 cpus box. "tbench 8" > > ... although that's still a bit close to the natural tbench noise > range so it's not conclusive and not like a smoking gun IMO. > > But i think this change might just be papering over the real > scalability problem that this workload has in my opinion: that there's > a single localhost route/dst/device that millions of packets are > squeezed through every second: Yes, this point was mentioned on netdev a while back. 
> > phoenix:~> ifconfig lo > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:258001524 errors:0 dropped:0 overruns:0 frame:0 > TX packets:258001524 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:679809512144 (633.1 GiB) TX bytes:679809512144 (633.1 GiB) > > There does not seem to be any per CPU ness in localhost networking - > it has a globally single-threaded rx/tx queue AFAICS even if both the > client and server task is on the same CPU - how is that supposed to > perform well? (but i might be missing something) Stephen had a patch for this one too, but we got tbench noise too with this patch http://kerneltrap.org/mailarchive/linux-netdev/2008/11/5/3926034 > > What kind of test-system do you have - one with P4 style Xeon CPUs > perhaps where dirty-cacheline cachemisses to DRAM were particularly > expensive? Its a HP BL460c g1 Dual quad-core cpus Intel E5450 @3.00GHz So 8 logical cpus. My bench was "tbench 8" ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 17:33 ` Eric Dumazet @ 2008-11-17 17:38 ` Linus Torvalds 2008-11-17 17:42 ` Eric Dumazet 2008-11-17 18:23 ` Ingo Molnar 0 siblings, 2 replies; 191+ messages in thread From: Linus Torvalds @ 2008-11-17 17:38 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger On Mon, 17 Nov 2008, Eric Dumazet wrote: > Ingo Molnar wrote: > > it gives a small speedup of ~1% on my box: > > > > before: Throughput 3437.65 MB/sec 64 procs > > after: Throughput 3473.99 MB/sec 64 procs > > Strange, I get 2350 MB/sec on my 8 cpus box. "tbench 8" I think Ingo may have a Nehalem. Let's just say that those things rock, and have rather good memory throughput. Linus ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 17:38 ` Linus Torvalds @ 2008-11-17 17:42 ` Eric Dumazet 2008-11-17 18:23 ` Ingo Molnar 1 sibling, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 17:42 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger Linus Torvalds wrote: > > On Mon, 17 Nov 2008, Eric Dumazet wrote: > >> Ingo Molnar wrote: > >>> it gives a small speedup of ~1% on my box: >>> >>> before: Throughput 3437.65 MB/sec 64 procs >>> after: Throughput 3473.99 MB/sec 64 procs >> Strange, I get 2350 MB/sec on my 8 cpus box. "tbench 8" > > I think Ingo may have a Nehalem. Let's just say that those things rock, > and have rather good memory throughput. > I want one :) Or even two of them :) ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 17:38 ` Linus Torvalds 2008-11-17 17:42 ` Eric Dumazet @ 2008-11-17 18:23 ` Ingo Molnar 2008-11-17 18:33 ` Linus Torvalds 2008-11-17 18:49 ` Ingo Molnar 1 sibling, 2 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 18:23 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 17 Nov 2008, Eric Dumazet wrote: > > > Ingo Molnar a écrit : > > > > it gives a small speedup of ~1% on my box: > > > > > > before: Throughput 3437.65 MB/sec 64 procs > > > after: Throughput 3473.99 MB/sec 64 procs > > > > Strange, I get 2350 MB/sec on my 8 cpus box. "tbench 8" > > I think Ingo may have a Nehalem. Let's just say that those things > rock, and have rather good memory throughput. hm, i'm not sure whether i can post benchmarks from the Nehalem box - but i can confirm it in general terms that it's rather nice ;-) This was run on another testbox (4x4 Barcelona) that rocks similarly well in terms of memory subsystem latencies: which seems to be tbench's main current critical path. For the tbench bragging rights i'd probably turn off CONFIG_SECURITY and a few other options. Plus i'd run with 16 threads only - in this test i ran with 4x overload (64 tbench threads, not 16) to stress the scheduler harder. Although we degrade very gently with overload so the numbers arent all that much different: 16 threads: Throughput 3463.14 MB/sec 16 procs 64 threads: Throughput 3473.99 MB/sec 64 procs 256 threads: Throughput 3457.67 MB/sec 256 procs 1024 threads: Throughput 3448.85 MB/sec 1024 procs [ so it's the same within noise range. ] 1024 threads is already a massive 64x overload so beyond any reasonable limit of workload sanity. 
Which suggests that the main limitation factor is cacheline ping-pong that is already in full effect at 16 threads. Which is supported by the "most expensive instructions" top-10 sorted list: RIP #hits .......................... [ usercopy ] ffffffff80350fcd: 1373300 f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi) ffffffff804a2f33: <sock_rfree>: ffffffff804a2f34: 985253 48 89 e5 mov %rsp,%rbp ffffffff804d2eb7: <ip_local_deliver>: ffffffff804d2eb8: 432659 48 89 e5 mov %rsp,%rbp ffffffff804aa23c: <constant_test_bit>: [ => napi_disable_pending() ] ffffffff804aa24c: 374052 89 d1 mov %edx,%ecx ffffffff804d5076: <ip_dont_fragment>: ffffffff804d5076: 310051 8a 97 56 02 00 00 mov 0x256(%rdi),%dl ffffffff804d9b17: <__inet_lookup_established>: ffffffff804d9bdf: 247224 eb ba jmp ffffffff804d9b9b <__inet_lookup_established+0x84> ffffffff80321529: <selinux_ip_postroute>: ffffffff8032152a: 183700 48 89 e5 mov %rsp,%rbp ffffffff8020c020: <system_call>: ffffffff8020c020: 183600 0f 01 f8 swapgs ffffffff8051884a: <netlbl_enabled>: ffffffff8051884a: 179538 55 push %rbp The usual profiling caveat applies: it's not _these_ instructions that matter, but the surrounding code that calls them. Profiling overhead is delayed by a couple of instructions - the more out-of-order a CPU is, the larger this delay can be. But even a quick look to the list above shows that all of the heavy cachemisses are generated by networking. Beyond the usual suspects of syscall entry and memcpy, it's only networking. We dont even have the mov %cr3 TLB flush overhead in this list, load_cr3() is a distant #30: ffffffff8023049f: 0 0f 22 d8 mov %rax,%cr3 ffffffff802304a2: 126303 c9 leaveq The place for the sock_rfree() hit looks a bit weird, and i'll investigate it now a bit more to place the real overhead point properly. 
(i already mapped the test-bit overhead: that comes from napi_disable_pending()) The first entry is 10x the cost of the last entry in the list so clearly we've got 1-2 brutal cacheline ping-pongs that dominate the overhead of this workload. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:23 ` Ingo Molnar @ 2008-11-17 18:33 ` Linus Torvalds 2008-11-17 18:49 ` Ingo Molnar 1 sibling, 0 replies; 191+ messages in thread From: Linus Torvalds @ 2008-11-17 18:33 UTC (permalink / raw) To: Ingo Molnar Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger On Mon, 17 Nov 2008, Ingo Molnar wrote: > > hm, i'm not sure whether i can post benchmarks from the Nehalem box - > but i can confirm it in general terms that it's rather nice ;-) Intel released the NDA from various web sites a week or two ago, and Intel is now selling it in the US (I think today was in fact the official launch), so I think benchmarks are safe - you can buy the dang things on the street. I don't know what availability is, of course. But I doubt that Intel would mind Nehalem benchmarks even if it were a paper launch - at least from my personal experience, I've not seen any bad behavior (and plenty of good). > This was run on another testbox (4x4 Barcelona) that rocks similarly > well in terms of memory subsystem latencies: which seems to be > tbench's main current critical path. Ahh, ok. Linus ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:23 ` Ingo Molnar 2008-11-17 18:33 ` Linus Torvalds @ 2008-11-17 18:49 ` Ingo Molnar 2008-11-17 19:30 ` Eric Dumazet ` (14 more replies) 1 sibling, 15 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 18:49 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: 4> The place for the sock_rfree() hit looks a bit weird, and i'll > investigate it now a bit more to place the real overhead point > properly. (i already mapped the test-bit overhead: that comes from > napi_disable_pending()) ok, here's a new set of profiles. (again for tbench 64-thread on a 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i posted before.) Here are the per major subsystem percentages: NET overhead ( 5786945/10096751): 57.31% security overhead ( 925933/10096751): 9.17% usercopy overhead ( 837887/10096751): 8.30% sched overhead ( 753662/10096751): 7.46% syscall overhead ( 268809/10096751): 2.66% IRQ overhead ( 266500/10096751): 2.64% slab overhead ( 180258/10096751): 1.79% timer overhead ( 92986/10096751): 0.92% pagealloc overhead ( 87381/10096751): 0.87% VFS overhead ( 53295/10096751): 0.53% PID overhead ( 44469/10096751): 0.44% pagecache overhead ( 33452/10096751): 0.33% gtod overhead ( 11064/10096751): 0.11% IDLE overhead ( 0/10096751): 0.00% --------------------------------------------------------- left ( 753878/10096751): 7.47% The breakdown is very similar to what i sent before, within noise. [ 'left' is random overhead from all around the place - i categorized the 500 most expensive functions in the profile per subsystem. I stopped short of doing it for all 1300+ functions: it's rather laborous manual work even with hefty use of regex patterns. 
It's also less meaningful in practice: the trend in the first 500 functions is present in the remaining 800 functions as well. I watched the breakdown evolve as i increased the coverage - in practice it is the first 100 functions that matter - it just doesnt change after that. ] The readprofile output below seems structured in a more useful way now - i tweaked compiler options to have the profiler hits spread out in a more meaningful way. I collected 10 million NMI profiler hits, and normalized the readprofile output up to 100%. [ I'll post per function analysis as i complete them, as a reply to this mail. ] Ingo 100.000000 total ................ 7.253355 copy_user_generic_string 3.934833 avc_has_perm_noaudit 3.356152 ip_queue_xmit 3.038025 skb_release_data 2.118525 skb_release_head_state 1.997533 tcp_ack 1.833688 tcp_recvmsg 1.717771 eth_type_trans 1.673249 __inet_lookup_established 1.508888 system_call 1.469183 tcp_current_mss 1.431553 tcp_transmit_skb 1.385125 tcp_sendmsg 1.327643 tcp_v4_rcv 1.292328 nf_hook_thresh 1.203205 schedule 1.059501 nf_hook_slow 1.027373 constant_test_bit 0.945183 sock_rfree 0.922748 __switch_to 0.911605 netif_rx 0.876270 register_gifconf 0.788200 ip_local_deliver_finish 0.781467 dev_queue_xmit 0.766530 constant_test_bit 0.758208 _local_bh_enable_ip 0.747184 load_cr3 0.704341 memset_c 0.671260 sysret_check 0.651845 ip_finish_output2 0.620204 audit_free_names 0.617781 audit_syscall_exit 0.615149 skb_copy_datagram_iovec 0.613848 selinux_socket_sock_rcv_skb 0.606995 constant_test_bit 0.593936 __tcp_push_pending_frames 0.592198 tcp_cleanup_rbuf 0.574093 ip_rcv 0.567886 netif_receive_skb 0.563377 get_page_from_freelist 0.557657 tcp_event_data_recv 0.539274 ip_local_deliver 0.534130 sys_recvfrom 0.512321 __tcp_select_window 0.498427 tcp_rcv_established 0.494862 sys_sendto 0.487473 audit_syscall_entry 0.478495 sched_clock_cpu 0.474861 kfree 0.466310 tcp_established_options 0.461384 net_rx_action 0.447162 __mod_timer 0.442078 ip_rcv_finish 
0.441631 find_pid_ns 0.441124 sk_wait_data 0.423943 __sock_recvmsg 0.422126 selinux_parse_skb 0.417975 __napi_schedule 0.414082 __do_softirq 0.403604 task_rq_lock 0.380792 nf_iterate 0.377614 select_task_rq_fair 0.374973 sock_sendmsg 0.374635 kmem_cache_alloc_node 0.368775 avc_has_perm 0.368706 local_bh_disable 0.361834 release_sock 0.346400 sock_common_recvmsg 0.342825 skb_clone 0.338704 __alloc_skb 0.326488 do_softirq 0.323410 lock_sock_nested 0.322129 __copy_skb_header 0.316835 put_page 0.310966 selinux_ip_postroute 0.306229 sel_netport_sid 0.299863 try_to_wake_up 0.296288 process_backlog 0.294818 __inet_lookup 0.294778 thread_return 0.293219 cfs_rq_of 0.292315 internal_add_timer 0.292305 tcp_rcv_space_adjust 0.281053 constant_test_bit 0.278779 local_bh_enable 0.272910 *unknown* 0.269593 schedule_timeout 0.261846 tcp_v4_md5_lookup 0.260992 __ip_local_out 0.255868 __enqueue_entity 0.253931 avc_audit 0.252004 finish_task_switch 0.249263 audit_get_context 0.248290 sockfd_lookup_light 0.247416 virt_to_head_page 0.244149 tcp_options_write 0.243603 memcpy_toiovec 0.243434 sock_recvmsg 0.242599 call_softirq 0.242391 __unlazy_fpu 0.236412 fput_light 0.235628 ret_from_sys_call 0.234933 sk_reset_timer 0.228358 math_state_restore 0.227117 socket_has_perm 0.223492 virt_to_cache 0.219063 __cache_free 0.216401 update_curr 0.216232 tcp_v4_send_check 0.213978 audit_free_aux 0.213223 tcp_v4_do_rcv 0.212975 __kfree_skb 0.211137 dev_hard_start_xmit 0.209052 tcp_rtt_estimator 0.207999 netif_needs_gso 0.207662 __update_sched_clock 0.207284 rb_erase 0.204861 enqueue_task_fair 0.203490 skb_release_all 0.203252 tcp_send_delayed_ack 0.203232 inet_ehashfn 0.199846 sel_netport_find 0.195396 system_call_after_swapgs 0.186756 lock_timer_base 0.186687 pick_next_task_fair 0.183986 mod_timer 0.182982 loopback_xmit 0.182605 native_read_tsc 0.181195 skb_set_owner_r 0.179248 switch_mm 0.175584 set_next_entity 0.173329 raw_local_deliver 0.171641 sys_kill 0.164510 dequeue_task_fair 0.161938 
clear_bit 0.160528 sock_def_readable 0.157628 __tcp_ack_snd_check 0.156893 skb_can_coalesce 0.156556 tcp_snd_wnd_test 0.155662 ip_output 0.150627 sk_stream_alloc_skb 0.150219 cpu_sdc 0.149425 sysret_careful 0.148760 tcp_data_snd_check 0.147816 auditsys 0.147419 pskb_may_pull 0.147151 fget_light 0.143774 tcp_cwnd_test 0.143029 rb_insert_color 0.142265 __wake_up 0.141808 tcp_bound_to_half_wnd 0.138600 __sk_dst_check 0.138431 free_hot_cold_page 0.137954 unroll_tree_refs 0.137080 __skb_unlink 0.135124 __sock_sendmsg 0.135064 get_pageblock_flags_group 0.132701 kmem_cache_free 0.128152 bictcp_cong_avoid 0.127874 __napi_complete 0.127527 ____cache_alloc 0.127368 tcp_is_cwnd_limited 0.127278 find_vpid 0.126941 constant_test_bit 0.126504 sk_mem_charge 0.126255 __alloc_pages_internal 0.125977 dst_release 0.125521 hash_64 0.124895 put_prev_task_fair 0.123802 netlbl_enabled 0.122829 sched_clock 0.122640 skb_push 0.122035 __phys_addr 0.121161 dput 0.120515 tcp_prequeue_process 0.118916 __skb_dequeue 0.117715 selinux_socket_sendmsg 0.117536 __inc_zone_state 0.115907 sk_wake_async 0.113504 selinux_ipv4_output 0.113017 sel_netif_sid 0.112431 skb_reset_network_header 0.111170 check_preempt_wakeup 0.111061 bictcp_acked 0.110882 sel_netnode_find 0.109978 update_min_vruntime 0.109889 resched_task 0.109879 current_kernel_time 0.109432 tcp_checksum_complete_user 0.107476 ip_dont_fragment 0.107386 sysret_audit 0.106979 inet_csk_reset_xmit_timer 0.106006 skb_entail 0.105777 sysret_signal 0.105420 avc_hash 0.105251 __skb_clone 0.105211 tcp_init_tso_segs 0.103523 __dequeue_entity 0.101715 PageLRU 0.101378 tcp_parse_aligned_timestamp 0.101219 __xchg 0.100544 constant_test_bit 0.097991 __kmalloc 0.097584 test_tsk_thread_flag 0.097475 autoremove_wake_function 0.095747 selinux_task_kill 0.094416 get_page 0.093353 dequeue_task 0.092728 __local_bh_disable 0.091943 selinux_netlbl_sock_rcv_skb 0.091655 path_put 0.090970 skb_headroom 0.090950 PageTail 0.090642 dst_destroy 0.090523 netpoll_rx 
0.089589 skb_header_pointer 0.085935 security_socket_recvmsg 0.084008 alloc_pages_current 0.083184 compare_ether_addr 0.082479 rb_next 0.082439 sk_wmem_schedule 0.081635 next_zones_zonelist 0.080135 tcp_cwnd_validate 0.079877 tcp_event_new_data_sent 0.079817 fcheck_files 0.079082 ip_skb_dst_mtu 0.078804 ip_finish_output 0.078278 wakeup_preempt_entity 0.077026 sel_netif_find 0.076788 __skb_queue_tail 0.076570 sock_flag 0.076520 tcp_win_from_space 0.076510 zone_watermark_ok 0.076282 sel_netnode_sid 0.076162 policy_zonelist 0.074732 __wake_up_common 0.074613 compound_head 0.074593 task_has_perm 0.073243 __find_general_cachep 0.073064 tcp_push 0.072925 skb_cloned 0.072309 pskb_may_pull 0.071852 TCP_ECN_check_ce 0.071495 cap_task_to_inode 0.070770 default_wake_function 0.069429 xfrm4_policy_check 0.069091 tcp_parse_md5sig_option 0.068287 tcp_v4_md5_do_lookup 0.068059 tcp_v4_tw_remember_stamp 0.067344 tcp_ca_event 0.067125 tcp_ca_event 0.065457 place_entity 0.065318 write_seqlock 0.065089 device_not_available 0.065069 test_ti_thread_flag 0.063878 tcp_set_skb_tso_segs 0.063550 selinux_netlbl_inode_permission 0.063391 sock_wfree 0.063311 prepare_to_wait 0.058872 pid_vnr 0.058803 __cycles_2_ns 0.057631 ip_local_out 0.057333 tcp_ack_saw_tstamp 0.056896 copy_to_user 0.056628 set_bit 0.055913 free_pages_check 0.054969 tcp_rcv_rtt_measure_ts 0.053797 init_rootdomain 0.053708 selinux_socket_recvmsg 0.053698 pid_nr_ns 0.053629 sk_eat_skb 0.052814 _local_bh_enable 0.052645 nf_hook_thresh 0.052516 sched_info_queued 0.052457 enqueue_task 0.052228 sk_filter 0.052159 __cpu_clear 0.051980 local_bh_enable_ip 0.050292 update_rq_clock 0.048981 task_tgid_vnr 0.048881 copy_from_user 0.048782 tcp_parse_options 0.048484 lock_sock 0.047779 net_timestamp 0.047044 open_softirq 0.046955 tcp_win_from_space 0.045981 __skb_dequeue 0.043846 getboottime 0.043777 account_group_exec_runtime 0.043519 can_checksum_protocol 0.043469 set_user_nice 0.042784 skb_fill_page_desc 0.042247 security_socket_sendmsg 
0.041989 read_profile 0.041930 tcp_validate_incoming 0.041612 check_preempt_curr 0.041413 skb_pull 0.041026 generic_smp_call_function_interrupt 0.041016 calc_delta_fair 0.040936 clear_buddies 0.040768 tcp_data_queue 0.040698 page_count 0.039695 lock_sock 0.039099 skb_headroom 0.038851 system_call_fastpath 0.038622 zone_statistics 0.037500 tcp_sack_extend 0.037381 __kmalloc_node 0.036587 first_zones_zonelist 0.036497 mntput 0.036179 pick_next_task 0.035991 kmap 0.035911 sock_put 0.035613 deactivate_task 0.035027 __nr_to_section 0.033985 page_zone 0.033190 native_load_tls 0.032882 netif_tx_queue_stopped 0.032713 __skb_insert 0.032187 sock_flag 0.031988 check_kill_permission 0.031790 policy_nodemask 0.031621 detach_timer 0.030558 inet_csk_clear_xmit_timer 0.030469 task_rq_unlock 0.029883 tcp_nagle_test 0.029744 tracesys 0.028383 virt_to_slab 0.028115 tcp_v4_check 0.028046 __cpu_set 0.027658 page_get_cache 0.027063 tcp_store_ts_recent 0.027053 __skb_pull 0.026953 gfp_zone 0.026586 sock_rcvlowat 0.026576 csum_partial 0.026397 init_waitqueue_head 0.026109 finish_wait 0.026040 kill_pid_info 0.025404 tcp_full_space 0.024888 __skb_queue_before 0.024550 dst_confirm 0.022603 inet_ehash_bucket 0.021888 activate_task 0.021650 tcp_rto_min 0.021283 d_callback 0.020965 signal_pending 0.020925 avc_node_free 0.020915 empty_bucket 0.020746 group_send_sig_info 0.020657 skb_reset_transport_header 0.020061 sock_put 0.019992 signal_pending_state 0.019684 tcp_sync_mss 0.019346 skb_network_offset 0.019276 skb_split 0.018988 tcp_adjust_fackets_out 0.018204 tcp_fast_path_check 0.017727 __skb_unlink 0.017687 napi_disable_pending 0.017678 sg_set_page 0.017022 get_pageblock_bitmap 0.016972 tcp_cong_avoid 0.016962 pid_task 0.016754 skb_set_tail_pointer 0.016039 selinux_ipv4_postroute 0.015930 idle_cpu 0.015632 skb_reset_network_header 0.015552 __count_vm_events 0.015483 source_load 0.014867 __skb_unlink 0.014738 skb_reset_transport_header 0.014599 set_bit 0.014241 audit_zero_context 0.014231 
zone_page_state 0.014152 clear_bit 0.013874 PageSlab 0.013546 __memset 0.013238 get_pageblock_migratetype 0.012623 __rb_rotate_right 0.012543 kmem_find_general_cachep 0.012414 __kprobes_text_start 0.012344 security_sock_rcv_skb 0.012344 node_zonelist 0.012335 dnotify_parent 0.012096 skb_headroom 0.011778 tcp_push_one 0.011540 mnt_want_write 0.011143 kmalloc 0.011073 retint_swapgs 0.010954 __rb_rotate_left 0.010805 check_pgd_range 0.010785 tcp_mss_split_point 0.010755 migrate_timer_list 0.010338 __send_IPI_dest_field 0.010229 reschedule_interrupt 0.010179 sock_flag 0.009882 smp_call_function_mask 0.009673 test_tsk_need_resched 0.009564 tcp_urg 0.009504 generic_file_aio_read 0.009176 PageReserved 0.009147 net_invalid_timestamp 0.009087 __node_set 0.008749 do_tcp_setsockopt 0.008730 set_tsk_thread_flag 0.008720 tcp_enter_loss 0.008422 sock_error 0.008362 target_load 0.008302 crypto_hash_update 0.008104 PageReadahead 0.008044 tcp_poll 0.007915 tcp_checksum_complete 0.007329 tcp_snd_test 0.007309 selinux_file_permission 0.007290 sel_netif_destroy 0.007220 put_pages_list 0.006992 dst_output 0.006743 prepare_to_copy 0.006694 tcp_init_cwnd 0.006555 clear_bit 0.006535 set_bit 0.006425 normal_prio 0.006366 msleep 0.006346 error_sti 0.006336 tcp_rcv_rtt_update 0.006167 tcp_send_ack 0.005989 tcp_init_nondata_skb 0.005720 kfree_skb 0.005502 call_function_interrupt 0.005413 __count_vm_event 0.005403 __skb_checksum_complete_head 0.005363 page_cache_get_speculative 0.005323 dev_kfree_skb_irq 0.005174 skb_store_bits 0.004956 cpu_avg_load_per_task 0.004916 dev_cpu_callback 0.004807 __kmem_cache_destroy 0.004777 tcp_init_metrics 0.004777 io_schedule 0.004777 find_get_page 0.004707 eth_header_parse 0.004688 cap_task_kill 0.004678 error_exit 0.004668 rb_prev 0.004658 tso_fragment 0.004648 mmdrop 0.004628 skb_reset_tail_pointer 0.004598 apic_timer_interrupt 0.004588 clear_bit 0.004519 tcp_simple_retransmit 0.004449 get_max_files 0.004370 sk_stop_timer 0.004340 tcp_reset 0.004251 
netlbl_cache_add 0.004201 tcp_add_reno_sack 0.004151 __pskb_trim_head 0.004102 __profile_flip_buffers 0.004092 sk_common_release 0.004052 audit_copy_inode 0.003953 eth_change_mtu 0.003943 vfs_read 0.003923 run_timer_softirq 0.003843 mnt_drop_write 0.003814 clear_page_c 0.003804 do_sync_read 0.003744 unset_migratetype_isolate 0.003714 sk_stream_moderate_sndbuf 0.003545 tcp_try_rmem_schedule 0.003476 native_apic_mem_write 0.003466 sys_read 0.003446 skb_checksum 0.003436 timer_set_base 0.003426 security_task_kill 0.003416 __flow_cache_shrink 0.003406 __skb_checksum_complete 0.003277 alloc_skb 0.003267 physflat_send_IPI_mask 0.003218 skb_gso_ok 0.003178 constant_test_bit 0.003168 find_next_bit 0.003158 selinux_netlbl_skbuff_getsid 0.003118 constant_test_bit 0.003099 pull_task 0.003079 hrtimer_run_queues 0.003049 free_hot_page 0.003009 scheduler_tick 0.002900 set_32bit_tls 0.002890 tcp_acceptable_seq 0.002811 rw_verify_area 0.002751 radix_tree_lookup_slot 0.002731 zero_user_segment 0.002731 sock_common_setsockopt 0.002612 __load_balance_iterator 0.002473 run_posix_cpu_timers 0.002264 task_utime 0.002254 switched_to_fair 0.002185 fsnotify_access 0.002145 __rmqueue_smallest 0.002125 __schedule_bug 0.002095 __task_rq_lock 0.002086 tcp_may_update_window 0.002076 restore_args 0.002066 hrtimer_run_pending 0.002056 generic_segment_checks 0.002026 getnstimeofday 0.002006 idle_task 0.001976 touch_atime 0.001956 __wake_up_locked 0.001927 sk_mem_charge 0.001877 smp_apic_timer_interrupt 0.001827 native_smp_send_reschedule 0.001798 __tcp_fast_path_on 0.001788 file_read_actor 0.001768 _cond_resched 0.001738 avc_policy_seqno 0.001718 tcp_ack_snd_check 0.001629 ip_send_check 0.001619 account_system_time 0.001579 __xapic_wait_icr_idle 0.001579 get_stats 0.001539 tcp_set_state 0.001539 bictcp_state 0.001529 tcp_fast_path_on 0.001519 file_accessed 0.001480 get_seconds 0.001450 kernel_math_error 0.001410 ktime_set 0.001331 kmap_atomic 0.001281 printk_tick 0.001281 __next_cpu_nr 0.001271 
account_group_system_time 0.001261 __mod_zone_page_state 0.001222 weighted_cpuload 0.001192 security_file_permission 0.001162 ack_APIC_irq 0.001152 __free_one_page 0.001142 rcu_pending 0.001142 drain_array 0.001122 sched_clock_tick 0.001122 csum_fold 0.001102 ret_from_intr 0.001083 retint_careful 0.001073 need_resched 0.001073 calc_delta_mine 0.001043 tcp_v4_md5_do_del 0.001043 PageActive 0.001033 mark_page_accessed 0.001033 ktime_get_ts 0.001023 tcp_insert_write_queue_after 0.001013 tcp_delack_timer 0.001013 task_tick_fair 0.000973 delay_tsc 0.000963 nv_nic_irq_optimized 0.000904 tick_periodic 0.000894 skb_reserve 0.000884 cache_reap 0.000874 timespec_trunc 0.000864 skb_header_release 0.000854 zone_page_state_add 0.000844 update_process_times 0.000834 sk_rmem_schedule 0.000824 find_busiest_group 0.000804 current_fs_time 0.000785 tick_handle_periodic 0.000785 __sk_mem_schedule 0.000785 irq_enter 0.000755 use_cpu_writer_for_mount 0.000755 tcp_ratehalving_spur_to_response 0.000745 update_wall_time 0.000745 tcp_sendpage 0.000745 __alloc_pages_nodemask 0.000725 ktime_get 0.000725 irq_exit 0.000705 inotify_inode_queue_event 0.000665 set_pageblock_flags_group 0.000646 inotify_dentry_parent_queue_event 0.000626 ack_APIC_irq 0.000606 write_profile 0.000566 set_normalized_timespec 0.000566 raise_softirq 0.000526 task_cputime_zero 0.000516 smp_reschedule_interrupt 0.000516 __skb_insert 0.000497 page_fault 0.000497 __copy_user_nocache 0.000487 run_local_timers 0.000487 read_tsc 0.000487 nf_unregister_hook 0.000477 __rcu_pending 0.000477 jiffies_to_usecs 0.000457 timespec_to_ktime 0.000437 __skb_trim 0.000427 __call_rcu 0.000417 free_pages_bulk 0.000407 smp_call_function_interrupt 0.000397 set_irq_regs 0.000397 radix_tree_deref_slot 0.000397 expand 0.000387 handle_mm_fault 0.000387 handle_IRQ_event 0.000387 fput_light 0.000377 refresh_cpu_vm_stats 0.000377 n_tty_write 0.000367 get_page 0.000358 run_rebalance_domains 0.000358 get_cpu_mask 0.000348 task_hot 0.000348 
__skb_queue_after 0.000348 retint_check 0.000348 do_select 0.000338 PageUptodate 0.000338 copy_page_c 0.000328 cond_resched 0.000318 unmap_vmas 0.000318 sk_mem_reclaim 0.000318 rmqueue_bulk 0.000318 reciprocal_value 0.000318 irq_return 0.000308 rb_first 0.000308 alloc_skb 0.000308 account_process_tick 0.000298 net_enable_timestamp 0.000298 clocksource_read 0.000298 account_system_time_scaled 0.000288 sched_slice 0.000278 ip_compute_csum 0.000278 constant_test_bit 0.000278 constant_test_bit 0.000268 set_curr_task_fair 0.000268 note_interrupt 0.000268 exit_idle 0.000258 native_apic_mem_write 0.000258 exit_intr 0.000248 PageReferenced 0.000238 usb_hcd_irq 0.000238 __mnt_is_readonly 0.000238 constant_test_bit 0.000218 IRQ0xba_interrupt 0.000218 handle_fasteoi_irq 0.000209 raise_softirq_irqoff 0.000209 __find_get_block 0.000199 tcp_current_ssthresh 0.000199 n_tty_receive_buf 0.000189 wake_up_page 0.000189 vgacon_save_screen 0.000189 free_block 0.000189 constant_test_bit 0.000179 pagefault_disable 0.000169 clocksource_get_next 0.000169 __bitmap_weight 0.000159 tty_ldisc_deref 0.000159 tcp_write_timer 0.000159 kmem_cache_alloc 0.000159 free_alien_cache 0.000159 ext3_mark_iloc_dirty 0.000159 constant_test_bit 0.000159 __bitmap_equal 0.000149 transfer_objects 0.000149 __rcu_process_callbacks 0.000149 page_waitqueue 0.000149 constant_test_bit 0.000139 __rmqueue 0.000139 release_pages 0.000139 constant_test_bit 0.000129 __tcp_checksum_complete 0.000129 run_workqueue 0.000129 poll_freewait 0.000129 n_tty_read 0.000129 iommu_area_free 0.000129 generic_file_llseek 0.000129 __cpus_setall 0.000129 cond_resched_softirq 0.000129 avc_node_populate 0.000129 add_to_page_cache_lru 0.000129 account_user_time 0.000119 wait_consider_task 0.000119 sys_select 0.000119 round_jiffies_common 0.000119 nv_start_xmit_optimized 0.000119 core_sys_select 0.000109 tcp_tso_segment 0.000109 sigprocmask 0.000109 proc_reg_read 0.000109 path_to_nameidata 0.000109 PageBuddy 0.000109 ohci_irq 0.000109 
nv_tx_done_optimized 0.000109 nv_msi_workaround 0.000109 IRQ0xc2_interrupt 0.000109 __ext3_get_inode_loc 0.000109 account_group_user_time 0.000099 __wake_up_sync 0.000099 __up_read 0.000099 update_vsyscall 0.000099 memmove 0.000099 kmalloc 0.000099 ext3_get_blocks_handle 0.000099 do_device_not_available 0.000099 constant_test_bit 0.000089 tcp_incr_quickack 0.000089 smp_send_reschedule 0.000089 remove_from_page_cache 0.000089 rcu_process_callbacks 0.000089 prepare_to_wait_exclusive 0.000089 pde_users_dec 0.000089 find_first_bit 0.000089 constant_test_bit 0.000089 common_interrupt 0.000089 add_wait_queue 0.000079 task_gtime 0.000079 sys_lseek 0.000079 start_this_handle 0.000079 schedule_hrtimeout_range 0.000079 __sched_fork 0.000079 journal_put_journal_head 0.000079 find_first_zero_bit 0.000079 do_syslog 0.000079 do_sync_write 0.000079 constant_test_bit 0.000079 ack_apic_level 0.000070 write_seqlock 0.000070 slab_get_obj 0.000070 remove_wait_queue 0.000070 pty_chars_in_buffer 0.000070 ____pagevec_lru_add 0.000070 lock_hrtimer_base 0.000070 kstat_incr_irqs_this_cpu 0.000070 journal_dirty_data 0.000070 journal_add_journal_head 0.000070 find_lock_page 0.000070 copy_from_read_buf 0.000070 bit_waitqueue 0.000070 alloc_page_vma 0.000060 vfs_write 0.000060 tty_write 0.000060 __strnlen_user 0.000060 sk_mem_uncharge 0.000060 rt_worker_func 0.000060 radix_tree_preload 0.000060 poll_select_copy_remaining 0.000060 pagefault_enable 0.000060 __mark_inode_dirty 0.000060 lru_add_drain_all 0.000060 lock_page 0.000060 list_replace_init 0.000060 journal_stop 0.000060 iowrite8 0.000060 hrtimer_forward 0.000060 gart_unmap_single 0.000060 find_vma 0.000060 __down_read_trylock 0.000060 do_page_fault 0.000060 do_IRQ 0.000060 create_empty_buffers 0.000060 constant_test_bit 0.000060 constant_test_bit 0.000060 alloc_iommu 0.000060 add_to_page_cache_locked 0.000050 zero_fd_set 0.000050 vsnprintf 0.000050 unlock_page 0.000050 tty_read 0.000050 tty_poll 0.000050 sock_poll 0.000050 
sock_def_error_report 0.000050 set_wq_data 0.000050 rcu_check_callbacks 0.000050 radix_tree_node_rcu_free 0.000050 pipe_poll 0.000050 opost 0.000050 n_tty_chars_in_buffer 0.000050 __next_cpu 0.000050 mutex_trylock 0.000050 msecs_to_jiffies 0.000050 mempool_alloc_slab 0.000050 load_elf_binary 0.000050 __link_path_walk 0.000050 __journal_remove_journal_head 0.000050 journal_commit_transaction 0.000050 journal_cancel_revoke 0.000050 irq_complete_move 0.000050 irq_cfg 0.000050 fsnotify_modify 0.000050 __first_cpu 0.000050 file_update_time 0.000050 filemap_fault 0.000050 ext3_new_blocks 0.000050 ext3_mark_inode_dirty 0.000050 do_wp_page 0.000050 __do_fault 0.000050 buffer_dirty 0.000050 anon_vma_prepare 0.000040 yield 0.000040 wq_per_cpu 0.000040 walk_page_buffers 0.000040 __wake_up_bit 0.000040 vma_adjust 0.000040 tty_put_char 0.000040 tty_paranoia_check 0.000040 tcp_current_ssthresh 0.000040 sys_write 0.000040 sys_rt_sigprocmask 0.000040 sock_no_bind 0.000040 show_stat 0.000040 SetPageSwapBacked 0.000040 set_irq_regs 0.000040 set_buffer_write_io_error 0.000040 recalc_sigpending 0.000040 radix_tree_delete 0.000040 queue_delayed_work_on 0.000040 pty_write 0.000040 __pollwait 0.000040 physflat_send_IPI_allbutself 0.000040 page_zone 0.000040 page_remove_rmap 0.000040 page_is_file_cache 0.000040 page_evictable 0.000040 nv_get_empty_tx_slots 0.000040 n_tty_poll 0.000040 next_zone 0.000040 next_online_pgdat 0.000040 need_resched 0.000040 mutex_unlock 0.000040 mpol_needs_cond_ref 0.000040 __lookup 0.000040 journal_invalidatepage 0.000040 journal_dirty_metadata 0.000040 ioread8 0.000040 input_available_p 0.000040 inet_csk_reset_xmit_timer 0.000040 get_fd_set 0.000040 generic_write_checks 0.000040 free_poll_entry 0.000040 fput 0.000040 __ext3_journal_stop 0.000040 ext3_get_group_desc 0.000040 ext3_get_block 0.000040 do_mpage_readpage 0.000040 __d_lookup 0.000040 del_page_from_lru 0.000040 __dec_zone_state 0.000040 copy_user_generic 0.000040 __bitmap_and 0.000040 
add_page_to_lru_list 0.000040 account_user_time_scaled 0.000040 account_steal_time 0.000030 worker_thread 0.000030 wake_up_bit 0.000030 vmstat_update 0.000030 vm_normal_page 0.000030 tty_write_unlock 0.000030 tty_write_lock 0.000030 tty_wakeup 0.000030 tty_ldisc_try 0.000030 tty_ioctl 0.000030 tag_get 0.000030 sys_pread64 0.000030 submit_bh 0.000030 stop_this_cpu 0.000030 sock_aio_write 0.000030 sk_mem_reclaim 0.000030 sk_backlog_rcv 0.000030 show_interrupts 0.000030 sg_next 0.000030 seq_printf 0.000030 send_remote_softirq 0.000030 remove_vma 0.000030 reg_delay 0.000030 radix_tree_lookup 0.000030 radix_tree_insert 0.000030 proc_lookup_de 0.000030 pipe_write 0.000030 __percpu_counter_add 0.000030 pci_map_single 0.000030 nv_napi_poll 0.000030 __next_node 0.000030 native_send_call_func_ipi 0.000030 mpage_readpages 0.000030 mix_pool_bytes_extract 0.000030 mii_rw 0.000030 mempool_alloc 0.000030 __make_request 0.000030 jbd_lock_bh_state 0.000030 iov_iter_copy_from_user_atomic 0.000030 insert_work 0.000030 hrtimer_try_to_cancel 0.000030 get_dma_ops 0.000030 __generic_file_aio_write_nolock 0.000030 gart_map_sg 0.000030 __fput 0.000030 fixup_irqs 0.000030 __find_get_block_slow 0.000030 filp_close 0.000030 ext3_get_branch 0.000030 ext3_dirty_inode 0.000030 ext3_block_to_path 0.000030 do_get_write_access 0.000030 delayed_work_timer_fn 0.000030 csum_block_add 0.000030 copy_process 0.000030 copy_page_range 0.000030 constant_test_bit 0.000030 constant_test_bit 0.000030 check_irqs_on 0.000030 call_rcu 0.000030 __brelse 0.000030 _atomic_dec_and_lock 0.000020 __xchg 0.000020 vm_stat_account 0.000020 vma_prio_tree_remove 0.000020 tty_mode_ioctl 0.000020 tty_audit_add_data 0.000020 try_to_free_buffers 0.000020 truncate_inode_pages_range 0.000020 tcp_slow_start 0.000020 task_curr 0.000020 sys_setpgid 0.000020 sys_rt_sigreturn 0.000020 sys_getppid 0.000020 strncpy_from_user 0.000020 sock_put 0.000020 smp_call_function 0.000020 __sk_mem_reclaim 0.000020 signal_wake_up 0.000020 
signal_pending 0.000020 set_termios 0.000020 SetPageUptodate 0.000020 SetPageLRU 0.000020 set_fd_set 0.000020 set_bit 0.000020 __send_IPI_shortcut 0.000020 security_inode_need_killpriv 0.000020 scsi_request_fn 0.000020 sb_bread 0.000020 restore_i387_xstate 0.000020 __qdisc_run 0.000020 pud_alloc 0.000020 pmd_alloc 0.000020 pfn_pte 0.000020 pfifo_fast_enqueue 0.000020 pfifo_fast_dequeue 0.000020 pci_map_page 0.000020 path_get 0.000020 __pagevec_free 0.000020 pagevec_add 0.000020 PageUnevictable 0.000020 page_mapping 0.000020 nv_get_hw_stats 0.000020 number 0.000020 normalize_rt_tasks 0.000020 __netif_tx_lock 0.000020 mk_pid 0.000020 memscan 0.000020 memcpy_c 0.000020 __lru_cache_add 0.000020 __lookup_mnt 0.000020 load_balance_rt 0.000020 kthread_should_stop 0.000020 journal_start 0.000020 journal_remove_journal_head 0.000020 __journal_file_buffer 0.000020 jbd_unlock_bh_journal_head 0.000020 itimer_get_remtime 0.000020 irq_to_desc 0.000020 iowrite32 0.000020 inotify_remove_watch_locked 0.000020 inode_permission 0.000020 inode_has_perm 0.000020 init_timer 0.000020 goal_in_my_reservation 0.000020 get_vma_policy 0.000020 __get_free_pages 0.000020 generic_sync_sb_inodes 0.000020 gart_map_single 0.000020 freezing 0.000020 free_pgtables 0.000020 free_pages_and_swap_cache 0.000020 free_buffer_head 0.000020 __follow_mount 0.000020 flush_tlb_page 0.000020 find_busiest_queue 0.000020 file_has_perm 0.000020 ext3_try_to_allocate 0.000020 ext3_journal_start 0.000020 __ext3_journal_dirty_metadata 0.000020 ext3_file_write 0.000020 enqueue_hrtimer 0.000020 dup_mm 0.000020 do_wait 0.000020 do_vfs_ioctl 0.000020 do_path_lookup 0.000020 do_munmap 0.000020 do_machine_check 0.000020 do_lookup 0.000020 do_follow_link 0.000020 dma_unmap_single 0.000020 __dec_zone_page_state 0.000020 count_vm_event 0.000020 constant_test_bit 0.000020 constant_test_bit 0.000020 compound_head 0.000020 clear_buffer_jbddirty 0.000020 clear_buffer_delay 0.000020 claim_block 0.000020 cascade 0.000020 
cancel_dirty_page 0.000020 cache_grow 0.000020 brelse 0.000020 __block_prepare_write 0.000020 __blocking_notifier_call_chain 0.000020 blk_rq_map_sg 0.000020 __bitmap_empty 0.000020 __bitmap_andnot 0.000020 anon_vma_unlink 0.000010 zone_page_state 0.000010 zero_user_segments 0.000010 __xchg 0.000010 __vma_link_rb 0.000010 vma_link 0.000010 vfs_llseek 0.000010 __up_write 0.000010 update_xtime_cache 0.000010 unmap_underlying_metadata 0.000010 unmap_region 0.000010 unix_poll 0.000010 tty_write_room 0.000010 tty_unthrottle 0.000010 tty_ldisc_ref_wait 0.000010 tty_ldisc_ref 0.000010 tty_fasync 0.000010 tty_check_change 0.000010 tty_chars_in_buffer 0.000010 tty_audit_fork 0.000010 truncate_complete_page 0.000010 test_tsk_thread_flag 0.000010 taskstats_exit 0.000010 sys_writev 0.000010 sys_readahead 0.000010 sys_poll 0.000010 sys_newstat 0.000010 sys_nanosleep 0.000010 sys_ioctl 0.000010 syscall_trace_leave 0.000010 sync_supers 0.000010 stub_execve 0.000010 split_page 0.000010 sock_kfree_s 0.000010 __sleep_on_page_lock 0.000010 skip_atoi 0.000010 signal_pending 0.000010 signal_pending 0.000010 sg_init_table 0.000010 set_task_cpu 0.000010 __set_page_dirty 0.000010 SetPageActive 0.000010 set_bit 0.000010 seq_puts 0.000010 selinux_task_setpgid 0.000010 selinux_secctx_to_secid 0.000010 selinux_sb_show_options 0.000010 selinux_inode_permission 0.000010 selinux_inode_need_killpriv 0.000010 selinux_inode_free_security 0.000010 selinux_inode_alloc_security 0.000010 selinux_d_instantiate 0.000010 security_vm_enough_memory 0.000010 second_overflow 0.000010 scsi_run_queue 0.000010 __scsi_put_command 0.000010 scsi_init_sgtable 0.000010 scsi_end_request 0.000010 schedule_tail 0.000010 schedule_delayed_work 0.000010 sb_any_quota_enabled 0.000010 rt_hash 0.000010 round_jiffies_relative 0.000010 remove_hrtimer 0.000010 __remove_hrtimer 0.000010 __remove_from_page_cache 0.000010 rcu_bh_qsctr_inc 0.000010 radix_tree_tag_clear 0.000010 radix_tree_gang_lookup_tag_slot 0.000010 
radix_tree_gang_lookup_slot 0.000010 queue_delayed_work 0.000010 qdisc_run 0.000010 put_tty_queue_nolock 0.000010 put_io_context 0.000010 pty_write_room 0.000010 pty_open 0.000010 ptep_set_access_flags 0.000010 profile_munmap 0.000010 proc_pident_lookup 0.000010 proc_get_inode 0.000010 prio_tree_replace 0.000010 prio_tree_remove 0.000010 prio_tree_insert 0.000010 pmd_none_or_clear_bad 0.000010 pipe_release 0.000010 pipe_read 0.000010 pid_revalidate 0.000010 pgd_alloc 0.000010 pci_unmap_single 0.000010 pci_read_config_dword 0.000010 pci_conf1_write 0.000010 pci_bus_read_config_dword 0.000010 path_walk 0.000010 page_zone 0.000010 PageSwapCache 0.000010 PageSwapCache 0.000010 PageSwapCache 0.000010 __page_set_anon_rmap 0.000010 PagePrivate 0.000010 PagePrivate 0.000010 PagePrivate 0.000010 page_add_file_rmap 0.000010 on_each_cpu 0.000010 nv_do_interrupt 0.000010 net_tx_action 0.000010 netif_start_queue 0.000010 netif_carrier_ok 0.000010 need_resched 0.000010 need_iommu 0.000010 native_pte_clear 0.000010 native_io_delay 0.000010 mutex_lock 0.000010 mprotect_fixup 0.000010 mod_zone_page_state 0.000010 mntput_no_expire 0.000010 mm_init 0.000010 mmap_region 0.000010 mempool_free 0.000010 memcmp 0.000010 mcheck_check_cpu 0.000010 may_open 0.000010 __lookup_tag 0.000010 locks_remove_posix 0.000010 locks_remove_flock 0.000010 lock_buffer 0.000010 load_elf_binary 0.000010 load_balance_fair 0.000010 ll_back_merge_fn 0.000010 kzalloc 0.000010 ktime_add_safe 0.000010 kill_fasync 0.000010 __journal_temp_unlink_buffer 0.000010 journal_switch_revoke_table 0.000010 __journal_remove_checkpoint 0.000010 journal_get_write_access 0.000010 journal_get_undo_access 0.000010 journal_get_descriptor_buffer 0.000010 journal_bmap 0.000010 jbd_unlock_bh_state 0.000010 jbd_unlock_bh_state 0.000010 IRQ0xd2_interrupt 0.000010 ip_append_data 0.000010 iov_iter_advance 0.000010 iov_fault_in_pages_read 0.000010 iommu_area_alloc 0.000010 inode_sub_bytes 0.000010 inode_doinit_with_dentry 0.000010 
inode_add_bytes 0.000010 __inc_zone_page_state 0.000010 inc_zone_page_state 0.000010 hweight_long 0.000010 hweight64 0.000010 hrtimer_wakeup 0.000010 hrtimer_init 0.000010 hash_64 0.000010 half_md4_transform 0.000010 __grab_cache_page 0.000010 get_user_pages 0.000010 get_signal_to_deliver 0.000010 get_random_int 0.000010 getname 0.000010 get_empty_filp 0.000010 __getblk 0.000010 generic_permission 0.000010 generic_make_request 0.000010 generic_fillattr 0.000010 generic_file_open 0.000010 generic_file_llseek_unlocked 0.000010 generic_file_buffered_write 0.000010 generic_file_aio_write 0.000010 generic_cont_expand_simple 0.000010 generic_block_bmap 0.000010 freezing 0.000010 free_swap_cache 0.000010 free_pid 0.000010 free_pgd_range 0.000010 free_pages 0.000010 flush_old_exec 0.000010 first_online_pgdat 0.000010 find_vma_prepare 0.000010 find_task_by_pid_type_ns 0.000010 find_next_zero_bit 0.000010 find_inode_fast 0.000010 file_remove_suid 0.000010 file_mask_to_av 0.000010 file_free_rcu 0.000010 __FD_CLR 0.000010 ext3_write_begin 0.000010 ext3_try_to_allocate_with_rsv 0.000010 ext3_ordered_write_end 0.000010 ext3_journalled_set_page_dirty 0.000010 ext3_invalidatepage 0.000010 ext3_iget_acl 0.000010 ext3_get_inode_flags 0.000010 ext3_free_data 0.000010 ext3_discard_reservation 0.000010 exit_thread 0.000010 exit_task_namespaces 0.000010 exit_sem 0.000010 end_that_request_last 0.000010 end_buffer_write_sync 0.000010 end_buffer_async_write 0.000010 elv_rb_del 0.000010 elv_queue_empty 0.000010 elv_merged_request 0.000010 elv_completed_request 0.000010 elf_map 0.000010 echo_char 0.000010 e1000_watchdog 0.000010 e1000_read_phy_reg 0.000010 __drain_alien_cache 0.000010 __d_path 0.000010 __down_write_nested 0.000010 __down_write 0.000010 double_rq_lock 0.000010 do_timer 0.000010 do_sys_open 0.000010 do_sigaltstack 0.000010 do_sigaction 0.000010 do_setitimer 0.000010 do_pipe_flags 0.000010 __do_page_cache_readahead 0.000010 do_notify_parent 0.000010 do_filp_open 0.000010 
do_exit 0.000010 dnotify_flush 0.000010 d_kill 0.000010 destroy_inode 0.000010 dequeue_signal 0.000010 de_put 0.000010 delayacct_end 0.000010 create_write_pipe 0.000010 create_workqueue_thread 0.000010 __cpus_equal 0.000010 cpu_quiet 0.000010 __cpu_clear 0.000010 __cpu_clear 0.000010 count 0.000010 copy_thread 0.000010 copy_namespaces 0.000010 constant_test_bit 0.000010 constant_test_bit 0.000010 constant_test_bit 0.000010 constant_test_bit 0.000010 constant_test_bit 0.000010 __cond_resched 0.000010 clocksource_forward_now 0.000010 __clear_user 0.000010 clear_inode 0.000010 clear_buffer_new 0.000010 clear_bit 0.000010 clear_bit 0.000010 check_for_bios_corruption 0.000010 __cfq_slice_expired 0.000010 cfq_set_request 0.000010 cfq_dispatch_requests 0.000010 cfq_completed_request 0.000010 cap_set_effective 0.000010 can_share_swap_page 0.000010 bvec_alloc_bs 0.000010 buffer_uptodate 0.000010 buffer_mapped 0.000010 buffer_locked 0.000010 buffer_jbd 0.000010 buffer_jbd 0.000010 brelse 0.000010 __bread 0.000010 blk_invoke_request_fn 0.000010 __blk_complete_request 0.000010 blk_add_trace_generic 0.000010 blk_add_trace_bio 0.000010 bit_spin_lock 0.000010 bio_put 0.000010 bio_alloc_bioset 0.000010 bdi_read_congested 0.000010 balance_runtime 0.000010 balance_dirty_pages_ratelimited_nr 0.000010 audit_log_task_context 0.000010 ata_sff_qc_prep 0.000010 ata_scsi_queuecmd 0.000010 ata_link_max_devices 0.000010 ata_get_xlat_func 0.000010 arp_process 0.000010 arch_pick_mmap_layout 0.000010 arch_irq_stat_cpu 0.000010 arch_dup_task_struct 0.000010 alloc_pid 0.000010 alloc_fdtable 0.000010 alloc_fd 0.000010 add_mm_rss 0.000010 acct_collect ^ permalink raw reply [flat|nested] 191+ messages in thread
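The per-subsystem bucketing described in the message above (categorize per-symbol profile hits with regex patterns, then normalize against the total) can be sketched roughly as follows. The subsystem patterns below are illustrative guesses for this demonstration, not the actual patterns used on the list:

```python
import re

# Illustrative subsystem patterns (assumptions, not the real ones from the mail):
# each bucket claims symbols whose names match a known prefix family.
PATTERNS = [
    ("NET",      re.compile(r"^(tcp_|ip_|skb_|sock_|__inet_|netif_|eth_|nf_)")),
    ("security", re.compile(r"^(avc_|selinux_|security_|sel_net)")),
    ("usercopy", re.compile(r"^(copy_user_|copy_to_user|copy_from_user)")),
    ("sched",    re.compile(r"^(schedule|sched_|__switch_to|update_curr|enqueue_|dequeue_)")),
]

def bucket(profile):
    """profile: list of (hits, symbol). Returns {subsystem: percent of total},
    with uncategorized symbols accumulated under 'left', as in the mail."""
    total = sum(hits for hits, _ in profile)
    out = {name: 0 for name, _ in PATTERNS}
    out["left"] = 0
    for hits, sym in profile:
        for name, pat in PATTERNS:
            if pat.match(sym):
                out[name] += hits
                break
        else:
            out["left"] += hits
    # Normalize to percentages of all collected hits.
    return {name: 100.0 * hits / total for name, hits in out.items()}
```

As the mail notes, a handful of prefix patterns covers the top of the profile quickly, and the residue lands in 'left'.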
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-17 18:49 ` Ingo Molnar
@ 2008-11-17 19:30 ` Eric Dumazet
  2008-11-17 19:39 ` David Miller
  ` (13 subsequent siblings)
14 siblings, 0 replies; 191+ messages in thread
From: Eric Dumazet @ 2008-11-17 19:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > The place for the sock_rfree() hit looks a bit weird, and i'll
> > investigate it now a bit more to place the real overhead point
> > properly. (i already mapped the test-bit overhead: that comes from
> > napi_disable_pending())
>
> ok, here's a new set of profiles. (again for tbench 64-thread on a
> 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i
> posted before.)
>
> Here are the per major subsystem percentages:
>
>   NET overhead       ( 5786945/10096751): 57.31%
>   security overhead  (  925933/10096751):  9.17%
>   usercopy overhead  (  837887/10096751):  8.30%
>   sched overhead     (  753662/10096751):  7.46%
>   syscall overhead   (  268809/10096751):  2.66%
>   IRQ overhead       (  266500/10096751):  2.64%
>   slab overhead      (  180258/10096751):  1.79%
>   timer overhead     (   92986/10096751):  0.92%
>   pagealloc overhead (   87381/10096751):  0.87%
>   VFS overhead       (   53295/10096751):  0.53%
>   PID overhead       (   44469/10096751):  0.44%
>   pagecache overhead (   33452/10096751):  0.33%
>   gtod overhead      (   11064/10096751):  0.11%
>   IDLE overhead      (       0/10096751):  0.00%
>   ---------------------------------------------------------
>   left               (  753878/10096751):  7.47%
>
> The breakdown is very similar to what i sent before, within noise.
>
> [ 'left' is random overhead from all around the place - i categorized
>   the 500 most expensive functions in the profile per subsystem.
>   I stopped short of doing it for all 1300+ functions: it's rather
>   laborious manual work even with hefty use of regex patterns.
> It's also less meaningful in practice: the trend in the first 500 > functions is present in the remaining 800 functions as well. I > watched the breakdown evolve as i increased the coverage - in > practice it is the first 100 functions that matter - it just doesn't > change after that. ] > > The readprofile output below seems structured in a more useful way now > - i tweaked compiler options to have the profiler hits spread out in a > more meaningful way. I collected 10 million NMI profiler hits, and > normalized the readprofile output up to 100%. > > [ I'll post per function analysis as i complete them, as a reply to > this mail. ] > > Ingo > > 100.000000 total > ................ > 7.253355 copy_user_generic_string > 3.934833 avc_has_perm_noaudit > 3.356152 ip_queue_xmit > 3.038025 skb_release_data > 2.118525 skb_release_head_state > 1.997533 tcp_ack > 1.833688 tcp_recvmsg > 1.717771 eth_type_trans Strange, in my profile, eth_type_trans is not in the top 20. Maybe an alignment problem? Oh, I understand, you hit the netdevice->last_rx update problem, already corrected on net-next-2.6 > 1.673249 __inet_lookup_established TCP established/timewait table is now RCUified (for linux-2.6.29), this one should go down in profiles. > 1.508888 system_call > 1.469183 tcp_current_mss Yes, there is a divide that might be expensive. Discussion on netdev. > 1.431553 tcp_transmit_skb > 1.385125 tcp_sendmsg > 1.327643 tcp_v4_rcv > 1.292328 nf_hook_thresh > 1.203205 schedule > 1.059501 nf_hook_slow > 1.027373 constant_test_bit > 0.945183 sock_rfree > 0.922748 __switch_to > 0.911605 netif_rx > 0.876270 register_gifconf > 0.788200 ip_local_deliver_finish > 0.781467 dev_queue_xmit > 0.766530 constant_test_bit > 0.758208 _local_bh_enable_ip > 0.747184 load_cr3 > 0.704341 memset_c > 0.671260 sysret_check > 0.651845 ip_finish_output2 > 0.620204 audit_free_names ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar 2008-11-17 19:30 ` Eric Dumazet @ 2008-11-17 19:39 ` David Miller 2008-11-17 19:43 ` Eric Dumazet ` (2 more replies) 2008-11-17 19:57 ` Ingo Molnar ` (12 subsequent siblings) 14 siblings, 3 replies; 191+ messages in thread From: David Miller @ 2008-11-17 19:39 UTC (permalink / raw) To: mingo Cc: torvalds, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger From: Ingo Molnar <mingo@elte.hu> Date: Mon, 17 Nov 2008 19:49:51 +0100 > > * Ingo Molnar <mingo@elte.hu> wrote: > > 4> The place for the sock_rfree() hit looks a bit weird, and i'll > > investigate it now a bit more to place the real overhead point > > properly. (i already mapped the test-bit overhead: that comes from > > napi_disable_pending()) > > ok, here's a new set of profiles. (again for tbench 64-thread on a > 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i > posted before.) Again, do a non-NMI profile and the top (at least for me) looks like this: samples % app name symbol name 473 6.3928 vmlinux finish_task_switch 349 4.7169 vmlinux tcp_v4_rcv 327 4.4195 vmlinux U3copy_from_user 322 4.3519 vmlinux tl0_linux32 178 2.4057 vmlinux tcp_ack 170 2.2976 vmlinux tcp_sendmsg 167 2.2571 vmlinux U3copy_to_user That tcp_v4_rcv() hit is 98% on the wake_up() call it does. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 19:39 ` David Miller @ 2008-11-17 19:43 ` Eric Dumazet 2008-11-17 19:55 ` Linus Torvalds 2008-11-18 12:29 ` Mike Galbraith 2 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 19:43 UTC (permalink / raw) To: David Miller Cc: mingo, torvalds, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger David Miller wrote: > From: Ingo Molnar <mingo@elte.hu> > Date: Mon, 17 Nov 2008 19:49:51 +0100 > >> * Ingo Molnar <mingo@elte.hu> wrote: >> >> 4> The place for the sock_rfree() hit looks a bit weird, and i'll >>> investigate it now a bit more to place the real overhead point >>> properly. (i already mapped the test-bit overhead: that comes from >>> napi_disable_pending()) >> ok, here's a new set of profiles. (again for tbench 64-thread on a >> 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i >> posted before.) > > Again, do a non-NMI profile and the top (at least for me) > looks like this: > > samples % app name symbol name > 473 6.3928 vmlinux finish_task_switch > 349 4.7169 vmlinux tcp_v4_rcv > 327 4.4195 vmlinux U3copy_from_user > 322 4.3519 vmlinux tl0_linux32 > 178 2.4057 vmlinux tcp_ack > 170 2.2976 vmlinux tcp_sendmsg > 167 2.2571 vmlinux U3copy_to_user > > That tcp_v4_rcv() hit is 98% on the wake_up() call it does. 
> > Another profile from my tree (net-next-2.6 + some patches), on my machine CPU: Core 2, speed 3000.22 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples % symbol name 223265 9.2711 __copy_user_zeroing_intel 87525 3.6345 __copy_user_intel 73203 3.0398 tcp_sendmsg 53229 2.2103 netif_rx 53041 2.2025 tcp_recvmsg 47241 1.9617 sysenter_past_esp 42888 1.7809 __copy_from_user_ll 40858 1.6966 tcp_transmit_skb 39390 1.6357 __switch_to 37363 1.5515 dst_release 36823 1.5291 __sk_dst_check_get 36050 1.4970 tcp_v4_rcv 35829 1.4878 __do_softirq 32333 1.3426 tcp_rcv_established 30451 1.2645 tcp_clean_rtx_queue 29758 1.2357 ip_queue_xmit 28497 1.1833 __copy_to_user_ll 28119 1.1676 release_sock 25218 1.0472 lock_sock_nested 23701 0.9842 __inet_lookup_established 23463 0.9743 tcp_ack 22989 0.9546 netif_receive_skb 21880 0.9086 sched_clock_cpu 20730 0.8608 tcp_write_xmit 20372 0.8460 ip_rcv 20336 0.8445 local_bh_enable 19153 0.7953 __update_sched_clock 18603 0.7725 skb_release_data 17020 0.7068 local_bh_enable_ip 16932 0.7031 process_backlog 16299 0.6768 ip_finish_output 16279 0.6760 dev_queue_xmit 15858 0.6585 sock_recvmsg 15641 0.6495 native_read_tsc 15454 0.6417 sock_wfree 15366 0.6381 update_curr 14585 0.6056 sys_socketcall 14564 0.6048 __alloc_skb 14519 0.6029 __tcp_select_window 14417 0.5987 tcp_current_mss 14391 0.5976 nf_iterate 14221 0.5905 page_address 14122 0.5864 local_bh_disable ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 19:39 ` David Miller 2008-11-17 19:43 ` Eric Dumazet @ 2008-11-17 19:55 ` Linus Torvalds 2008-11-17 20:16 ` David Miller 2008-11-18 12:29 ` Mike Galbraith 2 siblings, 1 reply; 191+ messages in thread From: Linus Torvalds @ 2008-11-17 19:55 UTC (permalink / raw) To: David Miller Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger On Mon, 17 Nov 2008, David Miller wrote: > > Again, do a non-NMI profile and the top (at least for me) > looks like this: Can _you_ please do an NMI profile and see what your real problem is? I can't imagine that Niagara (or whatever) is so weak that it can't do NMI's. The fact is, David, that Ingo just posted a profile that was _better_ than anything you have ever posted, and it doesn't show what you complain about. So he's not seeing it. Asking him to do a _stupid_ profile is just that: stupid. So try to figure out why his (better) profile doesn't match your (inferior) one, instead of asking him to do stupid things. It's some difference in architectures, likely: maybe the sparc timekeeping is crap, maybe it's a cache issue and sparc caches are crap, maybe it's something where Niagara (is it niagara) has some oddness that shows up because it has that odd four-threads+four-cores or whatever. Linus ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 19:55 ` Linus Torvalds @ 2008-11-17 20:16 ` David Miller 2008-11-17 20:30 ` Linus Torvalds 0 siblings, 1 reply; 191+ messages in thread From: David Miller @ 2008-11-17 20:16 UTC (permalink / raw) To: torvalds Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 17 Nov 2008 11:55:35 -0800 (PST) > So try to figure out why his (better) profile doesn't match your > (inferior) one, instead of asking him to do stupid things. It's some > difference in architectures, likely: maybe the sparc timekeeping is crap, > maybe it's a cache issue and sparc caches are crap, maybe it's something > where Niagara (is it niagara) has some oddness that shows up because it > has that odd four-threads+four-cores or whatever. It's on my workstation which is a much simpler 2 processor UltraSPARC-IIIi (1.5Ghz) system. And yes I will investigate, it's all I've been doing in my spare time these past few weeks. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 20:16 ` David Miller @ 2008-11-17 20:30 ` Linus Torvalds 2008-11-17 20:58 ` David Miller 0 siblings, 1 reply; 191+ messages in thread From: Linus Torvalds @ 2008-11-17 20:30 UTC (permalink / raw) To: David Miller Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger On Mon, 17 Nov 2008, David Miller wrote: > > It's on my workstation which is a much simpler 2 processor > UltraSPARC-IIIi (1.5Ghz) system. Ok. It could easily be something like a cache footprint issue. And while I don't know my sparc cpu's very well, I think the Ultrasparc-IIIi is superscalar but does no out-of-order and speculation, no? So I could easily see that the indirect branches in the scheduler hurt much more, and might explain why the x86 profile looks so different. One thing that non-NMI profiles also tend to show is "clumping", which in turn tends to rather excessively pinpoint code sequences that release the irq flag - just because those points show up in profiles, rather than being a spread-out-mush. So it's possible that Ingo's profile did show the scheduler more, but it was in the form of much more spread out "noise" rather than the single spike you saw. Linus ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 20:30 ` Linus Torvalds @ 2008-11-17 20:58 ` David Miller 2008-11-18 9:44 ` Nick Piggin 0 siblings, 1 reply; 191+ messages in thread From: David Miller @ 2008-11-17 20:58 UTC (permalink / raw) To: torvalds Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 17 Nov 2008 12:30:00 -0800 (PST) > On Mon, 17 Nov 2008, David Miller wrote: > > > > It's on my workstation which is a much simpler 2 processor > > UltraSPARC-IIIi (1.5Ghz) system. > > Ok. It could easily be something like a cache footprint issue. And while I > don't know my sparc cpu's very well, I think the Ultrasparc-IIIi is > superscalar but does no out-of-order and speculation, no? It does only very simple speculation, but your description is accurate. > So I could easily see that the indirect branches in the scheduler > hurt much more, and might explain why the x86 profile looks so > different. Right. > One thing that non-NMI profiles also tend to show is "clumping", which in > turn tends to rather excessively pinpoint code sequences that release the > irq flag - just because those points show up in profiles, rather than > being a spread-out-mush. So it's possible that Ingo's profile did show the > scheduler more, but it was in the form of much more spread out "noise" > rather than the single spike you saw. Sure. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 20:58 ` David Miller @ 2008-11-18 9:44 ` Nick Piggin 2008-11-18 15:58 ` Linus Torvalds 2008-11-20 9:06 ` David Miller 0 siblings, 2 replies; 191+ messages in thread From: Nick Piggin @ 2008-11-18 9:44 UTC (permalink / raw) To: David Miller Cc: torvalds, mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger On Tuesday 18 November 2008 07:58, David Miller wrote: > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Mon, 17 Nov 2008 12:30:00 -0800 (PST) > > > On Mon, 17 Nov 2008, David Miller wrote: > > > It's on my workstation which is a much simpler 2 processor > > > UltraSPARC-IIIi (1.5Ghz) system. > > > > Ok. It could easily be something like a cache footprint issue. And while > > I don't know my sparc cpu's very well, I think the Ultrasparc-IIIi is > > superscalar but does no out-of-order and speculation, no? > > It does only very simple speculation, but your description is accurate. Surely it would do branch prediction, but maybe not indirect branch? I did wonder why those indirect function calls were added everywhere in the scheduler... They didn't show up in the newest generation of x86 CPUs, but simpler implementations won't handle them as well. I wouldn't expect that to cause such a big regression on its own, but it would still be interesting to test changing them to direct calls. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-18 9:44 ` Nick Piggin @ 2008-11-18 15:58 ` Linus Torvalds 2008-11-19 4:31 ` Nick Piggin 2008-11-20 9:14 ` David Miller 2008-11-20 9:06 ` David Miller 1 sibling, 2 replies; 191+ messages in thread From: Linus Torvalds @ 2008-11-18 15:58 UTC (permalink / raw) To: Nick Piggin Cc: David Miller, mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger On Tue, 18 Nov 2008, Nick Piggin wrote: > On Tuesday 18 November 2008 07:58, David Miller wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > > > > > > Ok. It could easily be something like a cache footprint issue. And while > > > I don't know my sparc cpu's very well, I think the Ultrasparc-IIIi is > > > superscalar but does no out-of-order and speculation, no? > > > > It does only very simple speculation, but your description is accurate. > > Surely it would do branch prediction, but maybe not indirect branch? That would be "branch target prediction" (and a BTB - "Branch Target Buffer" to hold it), and no, I don't think Sparc does that. You can certainly do it for in-order machines too, but I think it's fairly rare. It's sufficiently different from the regular "pick up the address from the static instruction stream, and also yank the kill-chain on mispredicted direction" to be real work to do. Unlike a compare or test instruction, it's not at all likely that you can resolve the final address in just a single pipeline stage, and without that, it's usually too late to yank the kill-chain. (And perhaps equally importantly, indirect branches are relatively rare on old-style Unix benchmarks - ie SpecInt/FP - or in databases. So it's not something that Sparc would necessarily have spent the effort on.) There is obviously one very special indirect jump: "ret". That's the one that is common, and that tends to have a special branch target buffer that is a pure stack. 
And for that, there is usually a special branch target register that needs to be set up 'x' cycles before the ret in order to avoid the stall (then the prediction is checking that register against the branch target stack, which is somewhat akin to a regular conditional branch comparison). So I strongly suspect that an indirect (non-ret) branch flushes the pipeline on sparc. It is possible that there is a "prepare to jump" instruction that prepares the indirect branch stack (kind of a "push prediction information"). I suspect Java sees a lot more indirect branches than traditional Unix loads, so maybe Sun did do that. Linus ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-18 15:58 ` Linus Torvalds @ 2008-11-19 4:31 ` Nick Piggin 2008-11-20 9:14 ` David Miller 1 sibling, 0 replies; 191+ messages in thread From: Nick Piggin @ 2008-11-19 4:31 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger On Wednesday 19 November 2008 02:58, Linus Torvalds wrote: > On Tue, 18 Nov 2008, Nick Piggin wrote: > > On Tuesday 18 November 2008 07:58, David Miller wrote: > > > From: Linus Torvalds <torvalds@linux-foundation.org> > > > > > > > Ok. It could easily be something like a cache footprint issue. And > > > > while I don't know my sparc cpu's very well, I think the > > > > Ultrasparc-IIIi is superscalar but does no out-of-order and > > > > speculation, no? > > > > > > It does only very simple speculation, but your description is > > > accurate. > > > > Surely it would do branch prediction, but maybe not indirect branch? > > That would be "branch target prediction" (and a BTB - "Branch Target > Buffer" to hold it), and no, I don't think Sparc does that. You can > certainly do it for in-order machines too, but I think it's fairly rare. > > It's sufficiently different from the regular "pick up the address from the > static instruction stream, and also yank the kill-chain on mispredicted > direction" to be real work to do. Unlike a compare or test instruction, > it's not at all likely that you can resolve the final address in just a > single pipeline stage, and without that, it's usually too late to yank the > kill-chain. > > (And perhaps equally importantly, indirect branches are relatively rare on > old-style Unix benchmarks - ie SpecInt/FP - or in databases. So it's not > something that Sparc would necessarily have spent the effort on.) > > There is obviously one very special indirect jump: "ret". 
That's the one > that is common, and that tends to have a special branch target buffer that > is a pure stack. And for that, there is usually a special branch target > register that needs to be set up 'x' cycles before the ret in order to > avoid the stall (then the prediction is checking that register against the > branch target stack, which is somewhat akin to a regular conditional > branch comparison). > > So I strongly suspect that an indirect (non-ret) branch flushes the > pipeline on sparc. It is possible that there is a "prepare to jump" > instruction that prepares the indirect branch stack (kind of a "push > prediction information"). I suspect Java sees a lot more indirect > branches than traditional Unix loads, so maybe Sun did do that. Probably true. OTOH, I've seen indirect branches get compiled to direct branches, or the common case special-cased into a direct branch: if (object->fn == default_object_fn) default_object_fn(); That might be an easy way to test suspicions about CPU scheduler slowdowns... (adding a likely() there, and using likely profiling would help ensure you got the default case right). ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-18 15:58 ` Linus Torvalds 2008-11-19 4:31 ` Nick Piggin @ 2008-11-20 9:14 ` David Miller 1 sibling, 0 replies; 191+ messages in thread From: David Miller @ 2008-11-20 9:14 UTC (permalink / raw) To: torvalds Cc: nickpiggin, mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger From: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue, 18 Nov 2008 07:58:49 -0800 (PST) > There is obviously one very special indirect jump: "ret". That's the one > that is common, and that tends to have a special branch target buffer that > is a pure stack. And for that, there is usually a special branch target > register that needs to be set up 'x' cycles before the ret in order to > avoid the stall (then the prediction is checking that register against the > branch target stack, which is somewhat akin to a regular conditional > branch comparison). Yes, UltraSPARC has a RAS or Return Address Stack. I think it has effectively zero latency (ie. you can call some function, immediately "ret" and it hits the RAS). This is probably because, due to delay slots, there is always going to be one instruction in between anyways. :) > So I strongly suspect that an indirect (non-ret) branch flushes the > pipeline on sparc. It is possible that there is a "prepare to jump" > instruction that prepares the indirect branch stack (kind of a "push > prediction information"). It doesn't flush the pipeline, it just stalls it waiting for the address computation. Branches are predicted and can execute in the same cycle as the condition-code setting instruction they depend upon. > I suspect Java sees a lot more indirect > branches than traditional > Unix loads, so maybe Sun did do that. There really isn't anything special done here for indirect jumps, other than pushing onto the RAS. Indirects just suck :) ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-18 9:44 ` Nick Piggin 2008-11-18 15:58 ` Linus Torvalds @ 2008-11-20 9:06 ` David Miller 1 sibling, 0 replies; 191+ messages in thread From: David Miller @ 2008-11-20 9:06 UTC (permalink / raw) To: nickpiggin Cc: torvalds, mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger From: Nick Piggin <nickpiggin@yahoo.com.au> Date: Tue, 18 Nov 2008 20:44:10 +1100 > On Tuesday 18 November 2008 07:58, David Miller wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > > Date: Mon, 17 Nov 2008 12:30:00 -0800 (PST) > > > > > On Mon, 17 Nov 2008, David Miller wrote: > > > > It's on my workstation which is a much simpler 2 processor > > > > UltraSPARC-IIIi (1.5Ghz) system. > > > > > > Ok. It could easily be something like a cache footprint issue. And while > > > I don't know my sparc cpu's very well, I think the Ultrasparc-IIIi is > > > superscalar but does no out-of-order and speculation, no? > > > > It does only very simple speculation, but your description is accurate. > > Surely it would do branch prediction, but maybe not indirect branch? Right. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 19:39 ` David Miller 2008-11-17 19:43 ` Eric Dumazet 2008-11-17 19:55 ` Linus Torvalds @ 2008-11-18 12:29 ` Mike Galbraith 2 siblings, 0 replies; 191+ messages in thread From: Mike Galbraith @ 2008-11-18 12:29 UTC (permalink / raw) To: David Miller Cc: mingo, torvalds, dada1, rjw, linux-kernel, kernel-testers, cl, a.p.zijlstra, shemminger On Mon, 2008-11-17 at 11:39 -0800, David Miller wrote: > From: Ingo Molnar <mingo@elte.hu> > Date: Mon, 17 Nov 2008 19:49:51 +0100 > > > > > * Ingo Molnar <mingo@elte.hu> wrote: > > > > 4> The place for the sock_rfree() hit looks a bit weird, and i'll > > > investigate it now a bit more to place the real overhead point > > > properly. (i already mapped the test-bit overhead: that comes from > > > napi_disable_pending()) > > > > ok, here's a new set of profiles. (again for tbench 64-thread on a > > 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i > > posted before.) > > Again, do a non-NMI profile and the top (at least for me) > looks like this: > > samples % app name symbol name > 473 6.3928 vmlinux finish_task_switch > 349 4.7169 vmlinux tcp_v4_rcv > 327 4.4195 vmlinux U3copy_from_user > 322 4.3519 vmlinux tl0_linux32 > 178 2.4057 vmlinux tcp_ack > 170 2.2976 vmlinux tcp_sendmsg > 167 2.2571 vmlinux U3copy_to_user > > That tcp_v4_rcv() hit is %98 on the wake_up() call it does. Easy enough, since i don't know how to do spiffy NMI profile.. yet ;-) I revived the 2.6.25 kernel where I tested back-ports of recent sched fixes, and did a non-NMI profile of 2.6.22.19 and the back-port kernel. The test kernel has all clock fixes 25->.git, min_vruntime accuracy fix native_read_tsc() fix, and back looking buddy. No knobs turned, and only testing one pair per CPU, as to not take unfair advantage of back looking buddy. Netperf TCP_RR (hits sched harder) looks about the same. 
Tbench 4 throughput was so close you would call these two twins. 2.6.22.19-smp CPU: Core 2, speed 2400 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 vma samples % symbol name ffffffff802e6670 575909 13.7425 copy_user_generic_string ffffffff80422ad8 175649 4.1914 schedule ffffffff803a522d 133152 3.1773 tcp_sendmsg ffffffff803a9387 128911 3.0761 tcp_ack ffffffff803b65f7 116562 2.7814 tcp_v4_rcv ffffffff803aeac8 116541 2.7809 tcp_transmit_skb ffffffff8039eb95 112133 2.6757 ip_queue_xmit ffffffff80209e20 110945 2.6474 system_call ffffffff8037b720 108277 2.5837 __kfree_skb ffffffff803a65cd 105493 2.5173 tcp_recvmsg ffffffff80210f87 97947 2.3372 read_tsc ffffffff802085b6 95255 2.2730 __switch_to ffffffff803803f1 82069 1.9584 netif_rx ffffffff8039f645 80937 1.9313 ip_output ffffffff8027617d 74585 1.7798 __slab_alloc ffffffff803824a0 70928 1.6925 process_backlog ffffffff803ad9a5 69574 1.6602 tcp_rcv_established ffffffff80399d40 55453 1.3232 ip_rcv ffffffff803b07d1 53256 1.2708 __tcp_push_pending_frames ffffffff8037b49c 52565 1.2543 skb_clone ffffffff80276e97 49690 1.1857 __kmalloc_track_caller ffffffff80379d05 45450 1.0845 sock_wfree ffffffff80223d82 44851 1.0702 effective_prio ffffffff803826b6 42417 1.0122 net_rx_action ffffffff8027684c 42341 1.0104 kfree 2.6.25.20-test-smp CPU: Core 2, speed 2400 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 vma samples % symbol name ffffffff80301450 576125 14.0874 copy_user_generic_string ffffffff803cf8d9 127997 3.1298 tcp_transmit_skb ffffffff803c9eac 125402 3.0663 tcp_ack ffffffff80454da3 122337 2.9914 schedule ffffffff803c673c 120401 2.9440 tcp_sendmsg ffffffff8039aa9e 116554 2.8500 skb_release_all ffffffff803c5abb 104840 2.5635 tcp_recvmsg ffffffff8020a63d 92180 2.2540 __switch_to ffffffff8020be20 79703 1.9489 system_call ffffffff803bf460 79384 
1.9411 ip_queue_xmit ffffffff803a005c 78035 1.9081 netif_rx ffffffff803ce56b 71223 1.7415 tcp_rcv_established ffffffff8039ff70 66493 1.6259 process_backlog ffffffff803d5a2d 61635 1.5071 tcp_v4_rcv ffffffff803c1dae 60889 1.4889 __inet_lookup_established ffffffff802126bc 54711 1.3378 native_read_tsc ffffffff803d23bc 51843 1.2677 __tcp_push_pending_frames ffffffff803bfb24 51821 1.2671 ip_finish_output ffffffff8023700c 48248 1.1798 local_bh_enable ffffffff803979bc 42221 1.0324 sock_wfree ffffffff8039b12c 41279 1.0094 __alloc_skb ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar 2008-11-17 19:30 ` Eric Dumazet 2008-11-17 19:39 ` David Miller @ 2008-11-17 19:57 ` Ingo Molnar 2008-11-17 20:20 ` (avc_has_perm_noaudit()) " Ingo Molnar ` (11 subsequent siblings) 14 siblings, 0 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 19:57 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger > [ I'll post per function analysis as i complete them, as a reply to > this mail. ] [ i'll do a separate mail for every function analyzed, the discussion spreads better that way. ] > 100.000000 total > ................ > 7.253355 copy_user_generic_string This is the well-known pattern of user-copy overhead, which centers around this single REP MOVS instruction: nr-of-hits ......... ffffffff80341eea: 42 83 e2 07 and $0x7,%edx ffffffff80341eed: 677398 f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi) ffffffff80341ef0: 3642 89 d1 mov %edx,%ecx ffffffff80341ef2: 16260 f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) ffffffff80341ef4: 6554 31 c0 xor %eax,%eax ffffffff80341ef6: 1958 c3 retq ffffffff80341ef7: 0 90 nop ffffffff80341ef8: 0 90 nop That's to be expected - tbench shuffles 3.5 GB of effective data to/from sockets. That's 7.5 GB due to double-copy. So for every 64 bytes of data transferred we spend 1.4 CPU cycles in this specific function - that is OK-ish. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* (avc_has_perm_noaudit()) Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (2 preceding siblings ...) 2008-11-17 19:57 ` Ingo Molnar @ 2008-11-17 20:20 ` Ingo Molnar 2008-11-17 20:32 ` ip_queue_xmit(): " Ingo Molnar ` (10 subsequent siblings) 14 siblings, 0 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 20:20 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 3.934833 avc_has_perm_noaudit this one seems spread out: hits (total: 393483 hits) ......... ffffffff80312af3: 1426 <avc_has_perm_noaudit>: ffffffff80312af3: 1426 41 57 push %r15 ffffffff80312af5: 6124 41 56 push %r14 ffffffff80312af7: 0 41 55 push %r13 ffffffff80312af9: 1443 41 89 f5 mov %esi,%r13d ffffffff80312afc: 1577 41 54 push %r12 ffffffff80312afe: 0 41 89 fc mov %edi,%r12d ffffffff80312b01: 1310 55 push %rbp ffffffff80312b02: 1531 53 push %rbx ffffffff80312b03: 3 48 83 ec 68 sub $0x68,%rsp ffffffff80312b07: 2202 85 c9 test %ecx,%ecx ffffffff80312b09: 0 89 4c 24 0c mov %ecx,0xc(%rsp) ffffffff80312b0d: 550 44 89 44 24 08 mov %r8d,0x8(%rsp) ffffffff80312b12: 1572 4c 89 0c 24 mov %r9,(%rsp) ffffffff80312b16: 0 66 89 54 24 12 mov %dx,0x12(%rsp) ffffffff80312b1b: 588 75 04 jne ffffffff80312b21 <avc_has_perm_noaudit+0x2e> ffffffff80312b1d: 0 0f 0b ud2a ffffffff80312b1f: 0 eb fe jmp ffffffff80312b1f <avc_has_perm_noaudit+0x2c> ffffffff80312b21: 1646 0f b7 44 24 12 movzwl 0x12(%rsp),%eax ffffffff80312b26: 829 48 c7 c2 d0 26 93 80 mov $0xffffffff809326d0,%rdx ffffffff80312b2d: 589 89 44 24 14 mov %eax,0x14(%rsp) ffffffff80312b31: 698 65 8b 04 25 24 00 00 mov %gs:0x24,%eax ffffffff80312b38: 0 00 ffffffff80312b39: 791 89 c0 mov %eax,%eax ffffffff80312b3b: 549 48 c1 e0 03 shl $0x3,%rax ffffffff80312b3f: 791 48 03 05 fa 30 5a 00 add 0x5a30fa(%rip),%rax # 
ffffffff808b5c40 <_cpu_pda> ffffffff80312b46: 864 48 8b 00 mov (%rax),%rax ffffffff80312b49: 533 48 03 50 08 add 0x8(%rax),%rdx ffffffff80312b4d: 732 ff 02 incl (%rdx) ffffffff80312b4f: 860 8b 54 24 14 mov 0x14(%rsp),%edx ffffffff80312b53: 1259 e8 54 fc ff ff callq ffffffff803127ac <avc_hash> ffffffff80312b58: 2087 48 98 cltq ffffffff80312b5a: 1015 48 89 44 24 18 mov %rax,0x18(%rsp) ffffffff80312b5f: 0 48 c1 e0 04 shl $0x4,%rax ffffffff80312b63: 2944 4c 8d b8 60 6b a9 80 lea -0x7f5694a0(%rax),%r15 ffffffff80312b6a: 71 48 8b 80 60 6b a9 80 mov -0x7f5694a0(%rax),%rax ffffffff80312b71: 3943 eb 1a jmp ffffffff80312b8d <avc_has_perm_noaudit+0x9a> ffffffff80312b73: 5184 44 3b 23 cmp (%rbx),%r12d ffffffff80312b76: 62007 75 11 jne ffffffff80312b89 <avc_has_perm_noaudit+0x96> ffffffff80312b78: 11 66 8b 44 24 12 mov 0x12(%rsp),%ax ffffffff80312b7d: 0 66 3b 43 08 cmp 0x8(%rbx),%ax ffffffff80312b81: 11115 75 06 jne ffffffff80312b89 <avc_has_perm_noaudit+0x96> ffffffff80312b83: 4 44 3b 6b 04 cmp 0x4(%rbx),%r13d ffffffff80312b87: 14224 74 1a je ffffffff80312ba3 <avc_has_perm_noaudit+0xb0> ffffffff80312b89: 1 48 8b 43 28 mov 0x28(%rbx),%rax ffffffff80312b8d: 6921 48 8d 58 d8 lea -0x28(%rax),%rbx ffffffff80312b91: 9654 48 8b 43 28 mov 0x28(%rbx),%rax ffffffff80312b95: 414 0f 18 08 prefetcht0 (%rax) ffffffff80312b98: 227 48 8d 43 28 lea 0x28(%rbx),%rax ffffffff80312b9c: 9617 4c 39 f8 cmp %r15,%rax ffffffff80312b9f: 1402 75 d2 jne ffffffff80312b73 <avc_has_perm_noaudit+0x80> ffffffff80312ba1: 0 eb 41 jmp ffffffff80312be4 <avc_has_perm_noaudit+0xf1> ffffffff80312ba3: 0 83 7b 20 01 cmpl $0x1,0x20(%rbx) ffffffff80312ba7: 671 0f 84 70 02 00 00 je ffffffff80312e1d <avc_has_perm_noaudit+0x32a> ffffffff80312bad: 0 c7 43 20 01 00 00 00 movl $0x1,0x20(%rbx) ffffffff80312bb4: 0 e9 64 02 00 00 jmpq ffffffff80312e1d <avc_has_perm_noaudit+0x32a> ffffffff80312bb9: 2118 65 8b 14 25 24 00 00 mov %gs:0x24,%edx ffffffff80312bc0: 0 00 ffffffff80312bc1: 8245 89 d2 mov %edx,%edx ffffffff80312bc3: 0 48 
c7 c0 d0 26 93 80 mov $0xffffffff809326d0,%rax ffffffff80312bca: 511 48 c1 e2 03 shl $0x3,%rdx ffffffff80312bce: 11308 48 03 15 6b 30 5a 00 add 0x5a306b(%rip),%rdx # ffffffff808b5c40 <_cpu_pda> ffffffff80312bd5: 0 48 8b 12 mov (%rdx),%rdx ffffffff80312bd8: 35 48 03 42 08 add 0x8(%rdx),%rax ffffffff80312bdc: 2224 ff 40 04 incl 0x4(%rax) ffffffff80312bdf: 1 e9 06 01 00 00 jmpq ffffffff80312cea <avc_has_perm_noaudit+0x1f7> ffffffff80312be4: 0 65 8b 14 25 24 00 00 mov %gs:0x24,%edx ffffffff80312beb: 0 00 ffffffff80312bec: 0 89 d2 mov %edx,%edx ffffffff80312bee: 0 48 c7 c0 d0 26 93 80 mov $0xffffffff809326d0,%rax ffffffff80312bf5: 0 48 8d 6c 24 30 lea 0x30(%rsp),%rbp ffffffff80312bfa: 0 48 c1 e2 03 shl $0x3,%rdx ffffffff80312bfe: 0 48 03 15 3b 30 5a 00 add 0x5a303b(%rip),%rdx # ffffffff808b5c40 <_cpu_pda> ffffffff80312c05: 0 44 89 ee mov %r13d,%esi ffffffff80312c08: 0 4c 8d 45 0c lea 0xc(%rbp),%r8 ffffffff80312c0c: 0 44 89 e7 mov %r12d,%edi ffffffff80312c0f: 0 48 8b 12 mov (%rdx),%rdx ffffffff80312c12: 0 48 03 42 08 add 0x8(%rdx),%rax ffffffff80312c16: 0 ff 40 08 incl 0x8(%rax) ffffffff80312c19: 0 8b 4c 24 0c mov 0xc(%rsp),%ecx ffffffff80312c1d: 0 8b 54 24 14 mov 0x14(%rsp),%edx ffffffff80312c21: 0 e8 ee 0a 01 00 callq ffffffff80323714 <security_compute_av> ffffffff80312c26: 0 85 c0 test %eax,%eax ffffffff80312c28: 0 41 89 c6 mov %eax,%r14d ffffffff80312c2b: 0 0f 85 02 02 00 00 jne ffffffff80312e33 <avc_has_perm_noaudit+0x340> ffffffff80312c31: 0 8b 7c 24 4c mov 0x4c(%rsp),%edi ffffffff80312c35: 0 be 01 00 00 00 mov $0x1,%esi ffffffff80312c3a: 0 e8 a5 fb ff ff callq ffffffff803127e4 <avc_latest_notif_update> ffffffff80312c3f: 0 85 c0 test %eax,%eax ffffffff80312c41: 0 0f 85 9c 00 00 00 jne ffffffff80312ce3 <avc_has_perm_noaudit+0x1f0> ffffffff80312c47: 0 e8 23 fd ff ff callq ffffffff8031296f <avc_alloc_node> ffffffff80312c4c: 0 48 85 c0 test %rax,%rax ffffffff80312c4f: 0 48 89 c3 mov %rax,%rbx ffffffff80312c52: 0 0f 84 8b 00 00 00 je ffffffff80312ce3 
<avc_has_perm_noaudit+0x1f0> ffffffff80312c58: 0 8b 4c 24 14 mov 0x14(%rsp),%ecx ffffffff80312c5c: 0 49 89 e8 mov %rbp,%r8 ffffffff80312c5f: 0 44 89 e6 mov %r12d,%esi ffffffff80312c62: 0 48 89 c7 mov %rax,%rdi ffffffff80312c65: 0 44 89 ea mov %r13d,%edx ffffffff80312c68: 0 e8 5d fb ff ff callq ffffffff803127ca <avc_node_populate> ffffffff80312c6d: 0 48 8b 44 24 18 mov 0x18(%rsp),%rax ffffffff80312c72: 0 48 8d 2c 85 60 8b a9 lea -0x7f5674a0(,%rax,4),%rbp ffffffff80312c79: 0 80 ffffffff80312c7a: 0 48 89 ef mov %rbp,%rdi ffffffff80312c7d: 0 e8 44 3c 20 00 callq ffffffff805168c6 <_spin_lock_irqsave> ffffffff80312c82: 0 49 8b 37 mov (%r15),%rsi ffffffff80312c85: 0 49 89 c6 mov %rax,%r14 ffffffff80312c88: 0 eb 24 jmp ffffffff80312cae <avc_has_perm_noaudit+0x1bb> ffffffff80312c8a: 0 44 39 26 cmp %r12d,(%rsi) ffffffff80312c8d: 0 75 1b jne ffffffff80312caa <avc_has_perm_noaudit+0x1b7> ffffffff80312c8f: 0 44 39 6e 04 cmp %r13d,0x4(%rsi) ffffffff80312c93: 0 75 15 jne ffffffff80312caa <avc_has_perm_noaudit+0x1b7> ffffffff80312c95: 0 66 8b 44 24 12 mov 0x12(%rsp),%ax ffffffff80312c9a: 0 66 39 46 08 cmp %ax,0x8(%rsi) ffffffff80312c9e: 0 75 0a jne ffffffff80312caa <avc_has_perm_noaudit+0x1b7> ffffffff80312ca0: 0 48 89 df mov %rbx,%rdi ffffffff80312ca3: 0 e8 9e fb ff ff callq ffffffff80312846 <avc_node_replace> ffffffff80312ca8: 0 eb 2c jmp ffffffff80312cd6 <avc_has_perm_noaudit+0x1e3> ffffffff80312caa: 0 48 8b 76 28 mov 0x28(%rsi),%rsi ffffffff80312cae: 0 48 83 ee 28 sub $0x28,%rsi ffffffff80312cb2: 0 48 8b 56 28 mov 0x28(%rsi),%rdx ffffffff80312cb6: 0 48 8d 46 28 lea 0x28(%rsi),%rax ffffffff80312cba: 0 4c 39 f8 cmp %r15,%rax ffffffff80312cbd: 0 0f 18 0a prefetcht0 (%rdx) ffffffff80312cc0: 0 75 c8 jne ffffffff80312c8a <avc_has_perm_noaudit+0x197> ffffffff80312cc2: 0 48 8d 43 28 lea 0x28(%rbx),%rax ffffffff80312cc6: 0 48 89 53 28 mov %rdx,0x28(%rbx) ffffffff80312cca: 0 4c 89 78 08 mov %r15,0x8(%rax) ffffffff80312cce: 0 48 89 46 28 mov %rax,0x28(%rsi) ffffffff80312cd2: 0 48 89 42 
08 mov %rax,0x8(%rdx) ffffffff80312cd6: 0 4c 89 f6 mov %r14,%rsi ffffffff80312cd9: 0 48 89 ef mov %rbp,%rdi ffffffff80312cdc: 0 e8 20 3d 20 00 callq ffffffff80516a01 <_spin_unlock_irqrestore> ffffffff80312ce1: 0 eb 07 jmp ffffffff80312cea <avc_has_perm_noaudit+0x1f7> ffffffff80312ce3: 0 48 8d 44 24 30 lea 0x30(%rsp),%rax ffffffff80312ce8: 0 eb 06 jmp ffffffff80312cf0 <avc_has_perm_noaudit+0x1fd> ffffffff80312cea: 2116 48 89 d8 mov %rbx,%rax ffffffff80312ced: 7632 45 31 f6 xor %r14d,%r14d ffffffff80312cf0: 1 48 83 3c 24 00 cmpq $0x0,(%rsp) ffffffff80312cf5: 404 74 10 je ffffffff80312d07 <avc_has_perm_noaudit+0x214> ffffffff80312cf7: 1804 48 8d 70 0c lea 0xc(%rax),%rsi ffffffff80312cfb: 0 b9 05 00 00 00 mov $0x5,%ecx ffffffff80312d00: 378 48 8b 3c 24 mov (%rsp),%rdi ffffffff80312d04: 8174 fc cld ffffffff80312d05: 26860 f3 a5 rep movsl %ds:(%rsi),%es:(%rdi) ffffffff80312d07: 11573 8b 40 0c mov 0xc(%rax),%eax ffffffff80312d0a: 1997 f7 d0 not %eax ffffffff80312d0c: 0 85 44 24 0c test %eax,0xc(%rsp) ffffffff80312d10: 0 0f 84 1d 01 00 00 je ffffffff80312e33 <avc_has_perm_noaudit+0x340> ffffffff80312d16: 0 f6 44 24 08 01 testb $0x1,0x8(%rsp) ffffffff80312d1b: 0 0f 85 f4 00 00 00 jne ffffffff80312e15 <avc_has_perm_noaudit+0x322> ffffffff80312d21: 0 83 3d 5c 66 78 00 00 cmpl $0x0,0x78665c(%rip) # ffffffff80a99384 <selinux_enforcing> ffffffff80312d28: 0 74 10 je ffffffff80312d3a <avc_has_perm_noaudit+0x247> ffffffff80312d2a: 0 44 89 e7 mov %r12d,%edi ffffffff80312d2d: 0 e8 87 f9 00 00 callq ffffffff803226b9 <security_permissive_sid> ffffffff80312d32: 0 85 c0 test %eax,%eax ffffffff80312d34: 0 0f 84 db 00 00 00 je ffffffff80312e15 <avc_has_perm_noaudit+0x322> ffffffff80312d3a: 0 e8 30 fc ff ff callq ffffffff8031296f <avc_alloc_node> ffffffff80312d3f: 0 48 85 c0 test %rax,%rax ffffffff80312d42: 0 48 89 c5 mov %rax,%rbp ffffffff80312d45: 0 0f 84 e8 00 00 00 je ffffffff80312e33 <avc_has_perm_noaudit+0x340> ffffffff80312d4b: 0 48 8b 44 24 18 mov 0x18(%rsp),%rax ffffffff80312d50: 0 
48 8d 04 85 60 8b a9 lea -0x7f5674a0(,%rax,4),%rax ffffffff80312d57: 0 80 ffffffff80312d58: 0 48 89 c7 mov %rax,%rdi ffffffff80312d5b: 0 48 89 44 24 28 mov %rax,0x28(%rsp) ffffffff80312d60: 0 e8 61 3b 20 00 callq ffffffff805168c6 <_spin_lock_irqsave> ffffffff80312d65: 0 49 8b 1f mov (%r15),%rbx ffffffff80312d68: 0 48 89 44 24 20 mov %rax,0x20(%rsp) ffffffff80312d6d: 0 eb 1a jmp ffffffff80312d89 <avc_has_perm_noaudit+0x296> ffffffff80312d6f: 0 44 3b 23 cmp (%rbx),%r12d ffffffff80312d72: 0 75 11 jne ffffffff80312d85 <avc_has_perm_noaudit+0x292> ffffffff80312d74: 0 44 3b 6b 04 cmp 0x4(%rbx),%r13d ffffffff80312d78: 0 75 0b jne ffffffff80312d85 <avc_has_perm_noaudit+0x292> ffffffff80312d7a: 0 66 8b 44 24 12 mov 0x12(%rsp),%ax ffffffff80312d7f: 0 66 3b 43 08 cmp 0x8(%rbx),%ax ffffffff80312d83: 0 74 1a je ffffffff80312d9f <avc_has_perm_noaudit+0x2ac> ffffffff80312d85: 0 48 8b 5b 28 mov 0x28(%rbx),%rbx ffffffff80312d89: 0 48 83 eb 28 sub $0x28,%rbx ffffffff80312d8d: 0 48 8b 43 28 mov 0x28(%rbx),%rax ffffffff80312d91: 0 0f 18 08 prefetcht0 (%rax) ffffffff80312d94: 0 48 8d 43 28 lea 0x28(%rbx),%rax ffffffff80312d98: 0 4c 39 f8 cmp %r15,%rax ffffffff80312d9b: 0 75 d2 jne ffffffff80312d6f <avc_has_perm_noaudit+0x27c> ffffffff80312d9d: 0 eb 29 jmp ffffffff80312dc8 <avc_has_perm_noaudit+0x2d5> ffffffff80312d9f: 0 8b 4c 24 14 mov 0x14(%rsp),%ecx ffffffff80312da3: 0 44 89 e6 mov %r12d,%esi ffffffff80312da6: 0 48 89 ef mov %rbp,%rdi ffffffff80312da9: 0 49 89 d8 mov %rbx,%r8 ffffffff80312dac: 0 44 89 ea mov %r13d,%edx ffffffff80312daf: 0 e8 16 fa ff ff callq ffffffff803127ca <avc_node_populate> ffffffff80312db4: 0 8b 44 24 0c mov 0xc(%rsp),%eax ffffffff80312db8: 0 09 45 0c or %eax,0xc(%rbp) ffffffff80312dbb: 0 48 89 de mov %rbx,%rsi ffffffff80312dbe: 0 48 89 ef mov %rbp,%rdi ffffffff80312dc1: 0 e8 80 fa ff ff callq ffffffff80312846 <avc_node_replace> ffffffff80312dc6: 0 eb 3c jmp ffffffff80312e04 <avc_has_perm_noaudit+0x311> ffffffff80312dc8: 0 48 8b 3d a9 65 78 00 mov 
0x7865a9(%rip),%rdi # ffffffff80a99378 <avc_node_cachep> ffffffff80312dcf: 0 48 89 ee mov %rbp,%rsi ffffffff80312dd2: 0 e8 7b c6 f7 ff callq ffffffff8028f452 <kmem_cache_free> ffffffff80312dd7: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax ffffffff80312dde: 0 00 ffffffff80312ddf: 0 89 c0 mov %eax,%eax ffffffff80312de1: 0 48 c7 c2 d0 26 93 80 mov $0xffffffff809326d0,%rdx ffffffff80312de8: 0 48 c1 e0 03 shl $0x3,%rax ffffffff80312dec: 0 48 03 05 4d 2e 5a 00 add 0x5a2e4d(%rip),%rax # ffffffff808b5c40 <_cpu_pda> ffffffff80312df3: 0 48 8b 00 mov (%rax),%rax ffffffff80312df6: 0 48 03 50 08 add 0x8(%rax),%rdx ffffffff80312dfa: 0 ff 42 14 incl 0x14(%rdx) ffffffff80312dfd: 0 f0 ff 0d 60 65 78 00 lock decl 0x786560(%rip) # ffffffff80a99364 <avc_cache+0x2804> ffffffff80312e04: 0 48 8b 74 24 20 mov 0x20(%rsp),%rsi ffffffff80312e09: 0 48 8b 7c 24 28 mov 0x28(%rsp),%rdi ffffffff80312e0e: 0 e8 ee 3b 20 00 callq ffffffff80516a01 <_spin_unlock_irqrestore> ffffffff80312e13: 0 eb 1e jmp ffffffff80312e33 <avc_has_perm_noaudit+0x340> ffffffff80312e15: 0 41 be f3 ff ff ff mov $0xfffffff3,%r14d ffffffff80312e1b: 0 eb 16 jmp ffffffff80312e33 <avc_has_perm_noaudit+0x340> ffffffff80312e1d: 35502 8b 44 24 0c mov 0xc(%rsp),%eax ffffffff80312e21: 4360 23 43 10 and 0x10(%rbx),%eax ffffffff80312e24: 0 3b 44 24 0c cmp 0xc(%rsp),%eax ffffffff80312e28: 0 0f 85 b6 fd ff ff jne ffffffff80312be4 <avc_has_perm_noaudit+0xf1> ffffffff80312e2e: 104641 e9 86 fd ff ff jmpq ffffffff80312bb9 <avc_has_perm_noaudit+0xc6> ffffffff80312e33: 2106 48 83 c4 68 add $0x68,%rsp ffffffff80312e37: 1 44 89 f0 mov %r14d,%eax ffffffff80312e3a: 2068 5b pop %rbx ffffffff80312e3b: 0 5d pop %rbp ffffffff80312e3c: 8 41 5c pop %r12 ffffffff80312e3e: 2001 41 5d pop %r13 ffffffff80312e40: 0 41 5e pop %r14 ffffffff80312e42: 162 41 5f pop %r15 ffffffff80312e44: 2107 c3 retq its main callsite is: ffffffff8031368c: 2809 <avc_has_perm>: [...] 
ffffffff803136b6: 651 e8 38 f4 ff ff callq ffffffff80312af3 <avc_has_perm_noaudit>

avc_has_perm() usage is spread out amongst 3 callsites in 2 selinux functions:

selinux_ip_postroute():
ffffffff80314d02: 491 e8 85 e9 ff ff callq ffffffff8031368c <avc_has_perm>

selinux_socket_sock_rcv_skb():
ffffffff80314eea: 461 e8 9d e7 ff ff callq ffffffff8031368c <avc_has_perm>
ffffffff80314faf: 476 e8 d8 e6 ff ff callq ffffffff8031368c <avc_has_perm>

all related to networking.

Regarding avc_has_perm_noaudit() itself, it has a couple of hot spots:

ffffffff80312b73: 5184 44 3b 23 cmp (%rbx),%r12d
ffffffff80312b76: 62007 75 11 jne ffffffff80312b89 <avc_has_perm_noaudit+0x96>

quick guess: cache-cold-miss site.

ffffffff80312d04: 8174 fc cld
ffffffff80312d05: 26860 f3 a5 rep movsl %ds:(%rsi),%es:(%rdi)

quick guess: copying of something largish via memcpy. Probably security/selinux/avc.c:avc_has_perm_noaudit()'s:

	[...]
	if (avd)
		memcpy(avd, &p_ae->avd, sizeof(*avd));

but one of the fattest ones:

ffffffff80312e28: 0 0f 85 b6 fd ff ff jne ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e: 104641 e9 86 fd ff ff jmpq ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33: 2106 48 83 c4 68 add $0x68,%rsp

seems to be either a branch mispredict (it seems a tad expensive for that though), or a cachemiss delayed to the first non-predicted branch. Ah, that's most likely the case - we fall through straight from here:

ffffffff80312dfd: 0 f0 ff 0d 60 65 78 00 lock decl 0x786560(%rip)

that's an atomic op on some global address, in the hotpath. Not good.

the wider context is:

ffffffff80312e1d: 35502 8b 44 24 0c mov 0xc(%rsp),%eax
ffffffff80312e21: 4360 23 43 10 and 0x10(%rbx),%eax
ffffffff80312e24: 0 3b 44 24 0c cmp 0xc(%rsp),%eax
ffffffff80312e28: 0 0f 85 b6 fd ff ff jne ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e: 104641 e9 86 fd ff ff jmpq ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33: 2106 48 83 c4 68 add $0x68,%rsp

ah, yes.
My guess is that the "and (%rbx)" at ffffffff80312e21 generated this miss, and this all is avc_update_node()'s for-each-list loop, and:

	spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);

that hash doesn't seem to be working well here. It's done via:

	static inline int avc_hash(u32 ssid, u32 tsid, u16 tclass)
	{
		return (ssid ^ (tsid<<2) ^ (tclass<<4)) & (AVC_CACHE_SLOTS - 1);
	}

AVC_CACHE_SLOTS is 512 - but my usecase likely has a much narrower hash key space than that. Increasing the hash size won't work; these kinds of things really only start scaling once some natural per-CPU construct is found for them. And things like this:

	/* cache hit */
	if (atomic_read(&ret->ae.used) != 1)
		atomic_set(&ret->ae.used, 1);

in avc_search_node() don't really help either, as they immediately dirty the cacheline in the cache-hit case. Hashed fastpath lookup really should only be used to validate security rules in a read-mostly way, and cachelines should never be dirtied, as long as it can be avoided.

Anyway, this function needs a good scalability look, as it represents 3.9% of the total tbench cost. I'd not be surprised if it was possible to recover more than half of that cost via not-too-ugly changes.

	Ingo

^ permalink raw reply	[flat|nested] 191+ messages in thread
* ip_queue_xmit(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (3 preceding siblings ...) 2008-11-17 20:20 ` (avc_has_perm_noaudit()) " Ingo Molnar @ 2008-11-17 20:32 ` Ingo Molnar 2008-11-17 20:57 ` Eric Dumazet 2008-11-18 9:12 ` Nick Piggin 2008-11-17 20:47 ` Ingo Molnar ` (9 subsequent siblings) 14 siblings, 2 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 20:32 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 3.356152 ip_queue_xmit hits (335615 total) ......... ffffffff804b7045: 1001 <ip_queue_xmit>: ffffffff804b7045: 1001 41 57 push %r15 ffffffff804b7047: 36698 41 56 push %r14 ffffffff804b7049: 0 49 89 fe mov %rdi,%r14 ffffffff804b704c: 0 41 55 push %r13 ffffffff804b704e: 447 41 54 push %r12 ffffffff804b7050: 0 55 push %rbp ffffffff804b7051: 4 53 push %rbx ffffffff804b7052: 465 48 83 ec 68 sub $0x68,%rsp ffffffff804b7056: 1 89 74 24 08 mov %esi,0x8(%rsp) ffffffff804b705a: 486 48 8b 47 28 mov 0x28(%rdi),%rax ffffffff804b705e: 0 48 8b 6f 10 mov 0x10(%rdi),%rbp ffffffff804b7062: 7 48 85 c0 test %rax,%rax ffffffff804b7065: 480 48 89 44 24 58 mov %rax,0x58(%rsp) ffffffff804b706a: 0 4c 8b bd 48 02 00 00 mov 0x248(%rbp),%r15 ffffffff804b7071: 7 0f 85 0d 01 00 00 jne ffffffff804b7184 <ip_queue_xmit+0x13f> ffffffff804b7077: 452 31 f6 xor %esi,%esi ffffffff804b7079: 0 48 89 ef mov %rbp,%rdi ffffffff804b707c: 5 e8 c1 eb fc ff callq ffffffff80485c42 <__sk_dst_check> ffffffff804b7081: 434 48 85 c0 test %rax,%rax ffffffff804b7084: 54 48 89 44 24 58 mov %rax,0x58(%rsp) ffffffff804b7089: 0 0f 85 e0 00 00 00 jne ffffffff804b716f <ip_queue_xmit+0x12a> ffffffff804b708f: 0 4d 85 ff test %r15,%r15 ffffffff804b7092: 0 44 8b ad 30 02 00 00 mov 0x230(%rbp),%r13d ffffffff804b7099: 0 74 0a je ffffffff804b70a5 
<ip_queue_xmit+0x60> ffffffff804b709b: 0 41 80 7f 05 00 cmpb $0x0,0x5(%r15) ffffffff804b70a0: 0 74 03 je ffffffff804b70a5 <ip_queue_xmit+0x60> ffffffff804b70a2: 0 45 8b 2f mov (%r15),%r13d ffffffff804b70a5: 0 8b 85 3c 02 00 00 mov 0x23c(%rbp),%eax ffffffff804b70ab: 0 48 8d b5 10 01 00 00 lea 0x110(%rbp),%rsi ffffffff804b70b2: 0 44 8b 65 04 mov 0x4(%rbp),%r12d ffffffff804b70b6: 0 bf 0d 00 00 00 mov $0xd,%edi ffffffff804b70bb: 0 89 44 24 0c mov %eax,0xc(%rsp) ffffffff804b70bf: 0 8a 9d 54 02 00 00 mov 0x254(%rbp),%bl ffffffff804b70c5: 0 e8 9a df ff ff callq ffffffff804b5064 <constant_test_bit> ffffffff804b70ca: 0 31 d2 xor %edx,%edx ffffffff804b70cc: 0 48 8d 7c 24 10 lea 0x10(%rsp),%rdi ffffffff804b70d1: 0 41 89 c3 mov %eax,%r11d ffffffff804b70d4: 0 fc cld ffffffff804b70d5: 0 89 d0 mov %edx,%eax ffffffff804b70d7: 0 b9 10 00 00 00 mov $0x10,%ecx ffffffff804b70dc: 0 44 8a 45 39 mov 0x39(%rbp),%r8b ffffffff804b70e0: 0 40 8a b5 57 02 00 00 mov 0x257(%rbp),%sil ffffffff804b70e7: 0 44 8b 8d 50 02 00 00 mov 0x250(%rbp),%r9d ffffffff804b70ee: 0 83 e3 1e and $0x1e,%ebx ffffffff804b70f1: 0 44 8b 95 38 02 00 00 mov 0x238(%rbp),%r10d ffffffff804b70f8: 0 44 09 db or %r11d,%ebx ffffffff804b70fb: 0 f3 ab rep stos %eax,%es:(%rdi) ffffffff804b70fd: 0 40 c0 ee 05 shr $0x5,%sil ffffffff804b7101: 0 88 5c 24 24 mov %bl,0x24(%rsp) ffffffff804b7105: 0 48 8d 5c 24 10 lea 0x10(%rsp),%rbx ffffffff804b710a: 0 83 e6 01 and $0x1,%esi ffffffff804b710d: 0 48 89 ef mov %rbp,%rdi ffffffff804b7110: 0 44 88 44 24 40 mov %r8b,0x40(%rsp) ffffffff804b7115: 0 8b 44 24 0c mov 0xc(%rsp),%eax ffffffff804b7119: 0 40 88 74 24 41 mov %sil,0x41(%rsp) ffffffff804b711e: 0 48 89 de mov %rbx,%rsi ffffffff804b7121: 0 66 44 89 4c 24 44 mov %r9w,0x44(%rsp) ffffffff804b7127: 0 66 44 89 54 24 46 mov %r10w,0x46(%rsp) ffffffff804b712d: 0 44 89 64 24 10 mov %r12d,0x10(%rsp) ffffffff804b7132: 0 44 89 6c 24 1c mov %r13d,0x1c(%rsp) ffffffff804b7137: 0 89 44 24 20 mov %eax,0x20(%rsp) ffffffff804b713b: 0 e8 2d 9f e5 ff callq 
ffffffff8031106d <security_sk_classify_flow> ffffffff804b7140: 0 48 8d 74 24 58 lea 0x58(%rsp),%rsi ffffffff804b7145: 0 45 31 c0 xor %r8d,%r8d ffffffff804b7148: 0 48 89 e9 mov %rbp,%rcx ffffffff804b714b: 0 48 89 da mov %rbx,%rdx ffffffff804b714e: 0 48 c7 c7 d0 15 ab 80 mov $0xffffffff80ab15d0,%rdi ffffffff804b7155: 0 e8 1a 91 ff ff callq ffffffff804b0274 <ip_route_output_flow> ffffffff804b715a: 0 85 c0 test %eax,%eax ffffffff804b715c: 0 0f 85 9f 01 00 00 jne ffffffff804b7301 <ip_queue_xmit+0x2bc> ffffffff804b7162: 0 48 8b 74 24 58 mov 0x58(%rsp),%rsi ffffffff804b7167: 0 48 89 ef mov %rbp,%rdi ffffffff804b716a: 0 e8 a8 eb fc ff callq ffffffff80485d17 <sk_setup_caps> ffffffff804b716f: 441 48 8b 44 24 58 mov 0x58(%rsp),%rax ffffffff804b7174: 1388 48 85 c0 test %rax,%rax ffffffff804b7177: 0 74 07 je ffffffff804b7180 <ip_queue_xmit+0x13b> ffffffff804b7179: 0 f0 ff 80 b0 00 00 00 lock incl 0xb0(%rax) ffffffff804b7180: 556 49 89 46 28 mov %rax,0x28(%r14) ffffffff804b7184: 8351 4d 85 ff test %r15,%r15 ffffffff804b7187: 0 be 14 00 00 00 mov $0x14,%esi ffffffff804b718c: 461 74 26 je ffffffff804b71b4 <ip_queue_xmit+0x16f> ffffffff804b718e: 0 41 f6 47 08 01 testb $0x1,0x8(%r15) ffffffff804b7193: 0 74 17 je ffffffff804b71ac <ip_queue_xmit+0x167> ffffffff804b7195: 0 48 8b 54 24 58 mov 0x58(%rsp),%rdx ffffffff804b719a: 0 8b 82 28 01 00 00 mov 0x128(%rdx),%eax ffffffff804b71a0: 0 39 82 1c 01 00 00 cmp %eax,0x11c(%rdx) ffffffff804b71a6: 0 0f 85 55 01 00 00 jne ffffffff804b7301 <ip_queue_xmit+0x2bc> ffffffff804b71ac: 0 41 0f b6 47 04 movzbl 0x4(%r15),%eax ffffffff804b71b1: 0 8d 70 14 lea 0x14(%rax),%esi ffffffff804b71b4: 39 4c 89 f7 mov %r14,%rdi ffffffff804b71b7: 493 e8 f8 18 fd ff callq ffffffff80488ab4 <skb_push> ffffffff804b71bc: 0 4c 89 f7 mov %r14,%rdi ffffffff804b71bf: 1701 e8 99 df ff ff callq ffffffff804b515d <skb_reset_network_header> ffffffff804b71c4: 481 0f b6 85 54 02 00 00 movzbl 0x254(%rbp),%eax ffffffff804b71cb: 4202 41 8b 9e bc 00 00 00 mov 0xbc(%r14),%ebx 
ffffffff804b71d2: 3 48 89 ef mov %rbp,%rdi ffffffff804b71d5: 0 49 03 9e d0 00 00 00 add 0xd0(%r14),%rbx ffffffff804b71dc: 466 80 cc 45 or $0x45,%ah ffffffff804b71df: 7 66 c1 c0 08 rol $0x8,%ax ffffffff804b71e3: 0 66 89 03 mov %ax,(%rbx) ffffffff804b71e6: 492 48 8b 74 24 58 mov 0x58(%rsp),%rsi ffffffff804b71eb: 3 e8 a0 df ff ff callq ffffffff804b5190 <ip_dont_fragment> ffffffff804b71f0: 1405 85 c0 test %eax,%eax ffffffff804b71f2: 4391 74 0f je ffffffff804b7203 <ip_queue_xmit+0x1be> ffffffff804b71f4: 0 83 7c 24 08 00 cmpl $0x0,0x8(%rsp) ffffffff804b71f9: 417 75 08 jne ffffffff804b7203 <ip_queue_xmit+0x1be> ffffffff804b71fb: 503 66 c7 43 06 40 00 movw $0x40,0x6(%rbx) ffffffff804b7201: 6743 eb 06 jmp ffffffff804b7209 <ip_queue_xmit+0x1c4> ffffffff804b7203: 0 66 c7 43 06 00 00 movw $0x0,0x6(%rbx) ffffffff804b7209: 118 0f bf 85 40 02 00 00 movswl 0x240(%rbp),%eax ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx ffffffff804b7215: 340 85 c0 test %eax,%eax ffffffff804b7217: 0 79 06 jns ffffffff804b721f <ip_queue_xmit+0x1da> ffffffff804b7219: 107464 8b 82 9c 00 00 00 mov 0x9c(%rdx),%eax ffffffff804b721f: 4963 88 43 08 mov %al,0x8(%rbx) ffffffff804b7222: 26297 8a 45 39 mov 0x39(%rbp),%al ffffffff804b7225: 76658 4d 85 ff test %r15,%r15 ffffffff804b7228: 1712 88 43 09 mov %al,0x9(%rbx) ffffffff804b722b: 148 48 8b 44 24 58 mov 0x58(%rsp),%rax ffffffff804b7230: 2971 8b 80 20 01 00 00 mov 0x120(%rax),%eax ffffffff804b7236: 14849 89 43 0c mov %eax,0xc(%rbx) ffffffff804b7239: 84 48 8b 44 24 58 mov 0x58(%rsp),%rax ffffffff804b723e: 360 8b 80 1c 01 00 00 mov 0x11c(%rax),%eax ffffffff804b7244: 174 89 43 10 mov %eax,0x10(%rbx) ffffffff804b7247: 96 74 32 je ffffffff804b727b <ip_queue_xmit+0x236> ffffffff804b7249: 0 41 8a 57 04 mov 0x4(%r15),%dl ffffffff804b724d: 0 84 d2 test %dl,%dl ffffffff804b724f: 0 74 2a je ffffffff804b727b <ip_queue_xmit+0x236> ffffffff804b7251: 0 c0 ea 02 shr $0x2,%dl ffffffff804b7254: 0 03 13 add (%rbx),%edx ffffffff804b7256: 0 8a 03 mov (%rbx),%al 
ffffffff804b7258: 0 45 31 c0 xor %r8d,%r8d ffffffff804b725b: 0 4c 89 fe mov %r15,%rsi ffffffff804b725e: 0 4c 89 f7 mov %r14,%rdi ffffffff804b7261: 0 83 e0 f0 and $0xfffffffffffffff0,%eax ffffffff804b7264: 0 83 e2 0f and $0xf,%edx ffffffff804b7267: 0 09 d0 or %edx,%eax ffffffff804b7269: 0 88 03 mov %al,(%rbx) ffffffff804b726b: 0 48 8b 4c 24 58 mov 0x58(%rsp),%rcx ffffffff804b7270: 0 8b 95 30 02 00 00 mov 0x230(%rbp),%edx ffffffff804b7276: 0 e8 e4 d8 ff ff callq ffffffff804b4b5f <ip_options_build> ffffffff804b727b: 541 41 8b 86 c8 00 00 00 mov 0xc8(%r14),%eax ffffffff804b7282: 570 31 d2 xor %edx,%edx ffffffff804b7284: 0 49 03 86 d0 00 00 00 add 0xd0(%r14),%rax ffffffff804b728b: 34 8b 40 08 mov 0x8(%rax),%eax ffffffff804b728e: 496 66 85 c0 test %ax,%ax ffffffff804b7291: 11 74 06 je ffffffff804b7299 <ip_queue_xmit+0x254> ffffffff804b7293: 9 0f b7 c0 movzwl %ax,%eax ffffffff804b7296: 495 8d 50 ff lea -0x1(%rax),%edx ffffffff804b7299: 2 f6 43 06 40 testb $0x40,0x6(%rbx) ffffffff804b729d: 9 48 8b 74 24 58 mov 0x58(%rsp),%rsi ffffffff804b72a2: 497 74 34 je ffffffff804b72d8 <ip_queue_xmit+0x293> ffffffff804b72a4: 8 83 bd 30 02 00 00 00 cmpl $0x0,0x230(%rbp) ffffffff804b72ab: 10 74 23 je ffffffff804b72d0 <ip_queue_xmit+0x28b> ffffffff804b72ad: 1044 66 8b 85 52 02 00 00 mov 0x252(%rbp),%ax ffffffff804b72b4: 7 66 c1 c0 08 rol $0x8,%ax ffffffff804b72b8: 8 66 89 43 04 mov %ax,0x4(%rbx) ffffffff804b72bc: 432 66 8b 85 52 02 00 00 mov 0x252(%rbp),%ax ffffffff804b72c3: 9 ff c0 inc %eax ffffffff804b72c5: 14 01 d0 add %edx,%eax ffffffff804b72c7: 1141 66 89 85 52 02 00 00 mov %ax,0x252(%rbp) ffffffff804b72ce: 7 eb 10 jmp ffffffff804b72e0 <ip_queue_xmit+0x29b> ffffffff804b72d0: 0 66 c7 43 04 00 00 movw $0x0,0x4(%rbx) ffffffff804b72d6: 0 eb 08 jmp ffffffff804b72e0 <ip_queue_xmit+0x29b> ffffffff804b72d8: 0 48 89 df mov %rbx,%rdi ffffffff804b72db: 0 e8 b7 9d ff ff callq ffffffff804b1097 <__ip_select_ident> ffffffff804b72e0: 6 8b 85 54 01 00 00 mov 0x154(%rbp),%eax ffffffff804b72e6: 458 4c 
89 f7 mov %r14,%rdi ffffffff804b72e9: 2 41 89 46 78 mov %eax,0x78(%r14) ffffffff804b72ed: 4 8b 85 f0 01 00 00 mov 0x1f0(%rbp),%eax ffffffff804b72f3: 841 41 89 86 b0 00 00 00 mov %eax,0xb0(%r14) ffffffff804b72fa: 11 e8 30 f2 ff ff callq ffffffff804b652f <ip_local_out> ffffffff804b72ff: 0 eb 44 jmp ffffffff804b7345 <ip_queue_xmit+0x300> ffffffff804b7301: 0 65 48 8b 04 25 10 00 mov %gs:0x10,%rax ffffffff804b7308: 0 00 00 ffffffff804b730a: 0 8b 80 48 e0 ff ff mov -0x1fb8(%rax),%eax ffffffff804b7310: 0 4c 89 f7 mov %r14,%rdi ffffffff804b7313: 0 30 c0 xor %al,%al ffffffff804b7315: 0 66 83 f8 01 cmp $0x1,%ax ffffffff804b7319: 0 48 19 c0 sbb %rax,%rax ffffffff804b731c: 0 83 e0 08 and $0x8,%eax ffffffff804b731f: 0 48 8b 90 a8 16 ab 80 mov -0x7f54e958(%rax),%rdx ffffffff804b7326: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax ffffffff804b732d: 0 00 ffffffff804b732e: 0 89 c0 mov %eax,%eax ffffffff804b7330: 0 48 f7 d2 not %rdx ffffffff804b7333: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax ffffffff804b7337: 0 48 ff 40 68 incq 0x68(%rax) ffffffff804b733b: 0 e8 b1 18 fd ff callq ffffffff80488bf1 <kfree_skb> ffffffff804b7340: 0 b8 8f ff ff ff mov $0xffffff8f,%eax ffffffff804b7345: 9196 48 83 c4 68 add $0x68,%rsp ffffffff804b7349: 892 5b pop %rbx ffffffff804b734a: 0 5d pop %rbp ffffffff804b734b: 488 41 5c pop %r12 ffffffff804b734d: 0 41 5d pop %r13 ffffffff804b734f: 0 41 5e pop %r14 ffffffff804b7351: 513 41 5f pop %r15 ffffffff804b7353: 0 c3 retq about 10% of this function's cost is artificial: ffffffff804b7045: 1001 <ip_queue_xmit>: ffffffff804b7045: 1001 41 57 push %r15 ffffffff804b7047: 36698 41 56 push %r14 there are profiler hits that leaked in via out-of-order execution from the callsites. The callsites are hard to map unfortunately, as this function is called via function pointers. the most likely callsite is tcp_transmit_skb(). 
30% of the overhead of this function comes from:

ffffffff804b7203: 0 66 c7 43 06 00 00 movw $0x0,0x6(%rbx)
ffffffff804b7209: 118 0f bf 85 40 02 00 00 movswl 0x240(%rbp),%eax
ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
ffffffff804b7215: 340 85 c0 test %eax,%eax
ffffffff804b7217: 0 79 06 jns ffffffff804b721f <ip_queue_xmit+0x1da>
ffffffff804b7219: 107464 8b 82 9c 00 00 00 mov 0x9c(%rdx),%eax
ffffffff804b721f: 4963 88 43 08 mov %al,0x8(%rbx)

the 16-bit movw looks a bit weird. It comes from line 372; 0xffffffff804b7203 is in ip_queue_xmit (net/ipv4/ip_output.c:372):

367		iph = ip_hdr(skb);
368		*((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
369		if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
370			iph->frag_off = htons(IP_DF);
371		else
372			iph->frag_off = 0;
373		iph->ttl = ip_select_ttl(inet, &rt->u.dst);
374		iph->protocol = sk->sk_protocol;
375		iph->saddr = rt->rt_src;
376		iph->daddr = rt->rt_dst;

i.e. the IP-header fragment flag being set to zero. 16-bit ops are an on-off love/hate affair on x86 CPUs; the trend is towards eliminating them as much as possible. _But_ the real overhead probably comes from:

ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx

which is the next line, the ttl field:

373		iph->ttl = ip_select_ttl(inet, &rt->u.dst);

this shows that we are doing a hard cachemiss on the net-localhost route dst structure cacheline. We do a plain load instruction from it here and get a hefty cachemiss (because 16 CPUs are banging on that single route).

And let's make sure we see this in perspective as well: that single cachemiss is _1.0 percent_ of the total tbench cost. (!) We could make the scheduler 10% slower straight away and it would have less real-life effect than this single iph->ttl field setting.

	Ingo

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: ip_queue_xmit(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 20:32 ` ip_queue_xmit(): " Ingo Molnar @ 2008-11-17 20:57 ` Eric Dumazet 2008-11-18 9:12 ` Nick Piggin 1 sibling, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 20:57 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger Ingo Molnar a écrit : > * Ingo Molnar <mingo@elte.hu> wrote: > >> 100.000000 total >> ................ >> 3.356152 ip_queue_xmit > > hits (335615 total) > ......... > ffffffff804b7045: 1001 <ip_queue_xmit>: > ffffffff804b7045: 1001 41 57 push %r15 > ffffffff804b7047: 36698 41 56 push %r14 > ffffffff804b7049: 0 49 89 fe mov %rdi,%r14 > ffffffff804b704c: 0 41 55 push %r13 > ffffffff804b704e: 447 41 54 push %r12 > ffffffff804b7050: 0 55 push %rbp > ffffffff804b7051: 4 53 push %rbx > ffffffff804b7052: 465 48 83 ec 68 sub $0x68,%rsp > ffffffff804b7056: 1 89 74 24 08 mov %esi,0x8(%rsp) > ffffffff804b705a: 486 48 8b 47 28 mov 0x28(%rdi),%rax > ffffffff804b705e: 0 48 8b 6f 10 mov 0x10(%rdi),%rbp > ffffffff804b7062: 7 48 85 c0 test %rax,%rax > ffffffff804b7065: 480 48 89 44 24 58 mov %rax,0x58(%rsp) > ffffffff804b706a: 0 4c 8b bd 48 02 00 00 mov 0x248(%rbp),%r15 > ffffffff804b7071: 7 0f 85 0d 01 00 00 jne ffffffff804b7184 <ip_queue_xmit+0x13f> > ffffffff804b7077: 452 31 f6 xor %esi,%esi > ffffffff804b7079: 0 48 89 ef mov %rbp,%rdi > ffffffff804b707c: 5 e8 c1 eb fc ff callq ffffffff80485c42 <__sk_dst_check> > ffffffff804b7081: 434 48 85 c0 test %rax,%rax > ffffffff804b7084: 54 48 89 44 24 58 mov %rax,0x58(%rsp) > ffffffff804b7089: 0 0f 85 e0 00 00 00 jne ffffffff804b716f <ip_queue_xmit+0x12a> > ffffffff804b708f: 0 4d 85 ff test %r15,%r15 > ffffffff804b7092: 0 44 8b ad 30 02 00 00 mov 0x230(%rbp),%r13d > ffffffff804b7099: 0 74 0a je ffffffff804b70a5 <ip_queue_xmit+0x60> > ffffffff804b709b: 0 41 80 7f 05 00 cmpb $0x0,0x5(%r15) > 
ffffffff804b70a0: 0 74 03 je ffffffff804b70a5 <ip_queue_xmit+0x60> > ffffffff804b70a2: 0 45 8b 2f mov (%r15),%r13d > ffffffff804b70a5: 0 8b 85 3c 02 00 00 mov 0x23c(%rbp),%eax > ffffffff804b70ab: 0 48 8d b5 10 01 00 00 lea 0x110(%rbp),%rsi > ffffffff804b70b2: 0 44 8b 65 04 mov 0x4(%rbp),%r12d > ffffffff804b70b6: 0 bf 0d 00 00 00 mov $0xd,%edi > ffffffff804b70bb: 0 89 44 24 0c mov %eax,0xc(%rsp) > ffffffff804b70bf: 0 8a 9d 54 02 00 00 mov 0x254(%rbp),%bl > ffffffff804b70c5: 0 e8 9a df ff ff callq ffffffff804b5064 <constant_test_bit> > ffffffff804b70ca: 0 31 d2 xor %edx,%edx > ffffffff804b70cc: 0 48 8d 7c 24 10 lea 0x10(%rsp),%rdi > ffffffff804b70d1: 0 41 89 c3 mov %eax,%r11d > ffffffff804b70d4: 0 fc cld > ffffffff804b70d5: 0 89 d0 mov %edx,%eax > ffffffff804b70d7: 0 b9 10 00 00 00 mov $0x10,%ecx > ffffffff804b70dc: 0 44 8a 45 39 mov 0x39(%rbp),%r8b > ffffffff804b70e0: 0 40 8a b5 57 02 00 00 mov 0x257(%rbp),%sil > ffffffff804b70e7: 0 44 8b 8d 50 02 00 00 mov 0x250(%rbp),%r9d > ffffffff804b70ee: 0 83 e3 1e and $0x1e,%ebx > ffffffff804b70f1: 0 44 8b 95 38 02 00 00 mov 0x238(%rbp),%r10d > ffffffff804b70f8: 0 44 09 db or %r11d,%ebx > ffffffff804b70fb: 0 f3 ab rep stos %eax,%es:(%rdi) > ffffffff804b70fd: 0 40 c0 ee 05 shr $0x5,%sil > ffffffff804b7101: 0 88 5c 24 24 mov %bl,0x24(%rsp) > ffffffff804b7105: 0 48 8d 5c 24 10 lea 0x10(%rsp),%rbx > ffffffff804b710a: 0 83 e6 01 and $0x1,%esi > ffffffff804b710d: 0 48 89 ef mov %rbp,%rdi > ffffffff804b7110: 0 44 88 44 24 40 mov %r8b,0x40(%rsp) > ffffffff804b7115: 0 8b 44 24 0c mov 0xc(%rsp),%eax > ffffffff804b7119: 0 40 88 74 24 41 mov %sil,0x41(%rsp) > ffffffff804b711e: 0 48 89 de mov %rbx,%rsi > ffffffff804b7121: 0 66 44 89 4c 24 44 mov %r9w,0x44(%rsp) > ffffffff804b7127: 0 66 44 89 54 24 46 mov %r10w,0x46(%rsp) > ffffffff804b712d: 0 44 89 64 24 10 mov %r12d,0x10(%rsp) > ffffffff804b7132: 0 44 89 6c 24 1c mov %r13d,0x1c(%rsp) > ffffffff804b7137: 0 89 44 24 20 mov %eax,0x20(%rsp) > ffffffff804b713b: 0 e8 2d 9f e5 ff callq 
ffffffff8031106d <security_sk_classify_flow> > ffffffff804b7140: 0 48 8d 74 24 58 lea 0x58(%rsp),%rsi > ffffffff804b7145: 0 45 31 c0 xor %r8d,%r8d > ffffffff804b7148: 0 48 89 e9 mov %rbp,%rcx > ffffffff804b714b: 0 48 89 da mov %rbx,%rdx > ffffffff804b714e: 0 48 c7 c7 d0 15 ab 80 mov $0xffffffff80ab15d0,%rdi > ffffffff804b7155: 0 e8 1a 91 ff ff callq ffffffff804b0274 <ip_route_output_flow> > ffffffff804b715a: 0 85 c0 test %eax,%eax > ffffffff804b715c: 0 0f 85 9f 01 00 00 jne ffffffff804b7301 <ip_queue_xmit+0x2bc> > ffffffff804b7162: 0 48 8b 74 24 58 mov 0x58(%rsp),%rsi > ffffffff804b7167: 0 48 89 ef mov %rbp,%rdi > ffffffff804b716a: 0 e8 a8 eb fc ff callq ffffffff80485d17 <sk_setup_caps> > ffffffff804b716f: 441 48 8b 44 24 58 mov 0x58(%rsp),%rax > ffffffff804b7174: 1388 48 85 c0 test %rax,%rax > ffffffff804b7177: 0 74 07 je ffffffff804b7180 <ip_queue_xmit+0x13b> > ffffffff804b7179: 0 f0 ff 80 b0 00 00 00 lock incl 0xb0(%rax) > ffffffff804b7180: 556 49 89 46 28 mov %rax,0x28(%r14) > ffffffff804b7184: 8351 4d 85 ff test %r15,%r15 > ffffffff804b7187: 0 be 14 00 00 00 mov $0x14,%esi > ffffffff804b718c: 461 74 26 je ffffffff804b71b4 <ip_queue_xmit+0x16f> > ffffffff804b718e: 0 41 f6 47 08 01 testb $0x1,0x8(%r15) > ffffffff804b7193: 0 74 17 je ffffffff804b71ac <ip_queue_xmit+0x167> > ffffffff804b7195: 0 48 8b 54 24 58 mov 0x58(%rsp),%rdx > ffffffff804b719a: 0 8b 82 28 01 00 00 mov 0x128(%rdx),%eax > ffffffff804b71a0: 0 39 82 1c 01 00 00 cmp %eax,0x11c(%rdx) > ffffffff804b71a6: 0 0f 85 55 01 00 00 jne ffffffff804b7301 <ip_queue_xmit+0x2bc> > ffffffff804b71ac: 0 41 0f b6 47 04 movzbl 0x4(%r15),%eax > ffffffff804b71b1: 0 8d 70 14 lea 0x14(%rax),%esi > ffffffff804b71b4: 39 4c 89 f7 mov %r14,%rdi > ffffffff804b71b7: 493 e8 f8 18 fd ff callq ffffffff80488ab4 <skb_push> > ffffffff804b71bc: 0 4c 89 f7 mov %r14,%rdi > ffffffff804b71bf: 1701 e8 99 df ff ff callq ffffffff804b515d <skb_reset_network_header> > ffffffff804b71c4: 481 0f b6 85 54 02 00 00 movzbl 0x254(%rbp),%eax > 
ffffffff804b71cb: 4202 41 8b 9e bc 00 00 00 mov 0xbc(%r14),%ebx > ffffffff804b71d2: 3 48 89 ef mov %rbp,%rdi > ffffffff804b71d5: 0 49 03 9e d0 00 00 00 add 0xd0(%r14),%rbx > ffffffff804b71dc: 466 80 cc 45 or $0x45,%ah > ffffffff804b71df: 7 66 c1 c0 08 rol $0x8,%ax > ffffffff804b71e3: 0 66 89 03 mov %ax,(%rbx) > ffffffff804b71e6: 492 48 8b 74 24 58 mov 0x58(%rsp),%rsi > ffffffff804b71eb: 3 e8 a0 df ff ff callq ffffffff804b5190 <ip_dont_fragment> > ffffffff804b71f0: 1405 85 c0 test %eax,%eax > ffffffff804b71f2: 4391 74 0f je ffffffff804b7203 <ip_queue_xmit+0x1be> > ffffffff804b71f4: 0 83 7c 24 08 00 cmpl $0x0,0x8(%rsp) > ffffffff804b71f9: 417 75 08 jne ffffffff804b7203 <ip_queue_xmit+0x1be> > ffffffff804b71fb: 503 66 c7 43 06 40 00 movw $0x40,0x6(%rbx) > ffffffff804b7201: 6743 eb 06 jmp ffffffff804b7209 <ip_queue_xmit+0x1c4> > ffffffff804b7203: 0 66 c7 43 06 00 00 movw $0x0,0x6(%rbx) > ffffffff804b7209: 118 0f bf 85 40 02 00 00 movswl 0x240(%rbp),%eax > ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx > ffffffff804b7215: 340 85 c0 test %eax,%eax > ffffffff804b7217: 0 79 06 jns ffffffff804b721f <ip_queue_xmit+0x1da> > ffffffff804b7219: 107464 8b 82 9c 00 00 00 mov 0x9c(%rdx),%eax > ffffffff804b721f: 4963 88 43 08 mov %al,0x8(%rbx) > ffffffff804b7222: 26297 8a 45 39 mov 0x39(%rbp),%al > ffffffff804b7225: 76658 4d 85 ff test %r15,%r15 > ffffffff804b7228: 1712 88 43 09 mov %al,0x9(%rbx) > ffffffff804b722b: 148 48 8b 44 24 58 mov 0x58(%rsp),%rax > ffffffff804b7230: 2971 8b 80 20 01 00 00 mov 0x120(%rax),%eax > ffffffff804b7236: 14849 89 43 0c mov %eax,0xc(%rbx) > ffffffff804b7239: 84 48 8b 44 24 58 mov 0x58(%rsp),%rax > ffffffff804b723e: 360 8b 80 1c 01 00 00 mov 0x11c(%rax),%eax > ffffffff804b7244: 174 89 43 10 mov %eax,0x10(%rbx) > ffffffff804b7247: 96 74 32 je ffffffff804b727b <ip_queue_xmit+0x236> > ffffffff804b7249: 0 41 8a 57 04 mov 0x4(%r15),%dl > ffffffff804b724d: 0 84 d2 test %dl,%dl > ffffffff804b724f: 0 74 2a je ffffffff804b727b <ip_queue_xmit+0x236> 
> ffffffff804b7251: 0 c0 ea 02 shr $0x2,%dl > ffffffff804b7254: 0 03 13 add (%rbx),%edx > ffffffff804b7256: 0 8a 03 mov (%rbx),%al > ffffffff804b7258: 0 45 31 c0 xor %r8d,%r8d > ffffffff804b725b: 0 4c 89 fe mov %r15,%rsi > ffffffff804b725e: 0 4c 89 f7 mov %r14,%rdi > ffffffff804b7261: 0 83 e0 f0 and $0xfffffffffffffff0,%eax > ffffffff804b7264: 0 83 e2 0f and $0xf,%edx > ffffffff804b7267: 0 09 d0 or %edx,%eax > ffffffff804b7269: 0 88 03 mov %al,(%rbx) > ffffffff804b726b: 0 48 8b 4c 24 58 mov 0x58(%rsp),%rcx > ffffffff804b7270: 0 8b 95 30 02 00 00 mov 0x230(%rbp),%edx > ffffffff804b7276: 0 e8 e4 d8 ff ff callq ffffffff804b4b5f <ip_options_build> > ffffffff804b727b: 541 41 8b 86 c8 00 00 00 mov 0xc8(%r14),%eax > ffffffff804b7282: 570 31 d2 xor %edx,%edx > ffffffff804b7284: 0 49 03 86 d0 00 00 00 add 0xd0(%r14),%rax > ffffffff804b728b: 34 8b 40 08 mov 0x8(%rax),%eax > ffffffff804b728e: 496 66 85 c0 test %ax,%ax > ffffffff804b7291: 11 74 06 je ffffffff804b7299 <ip_queue_xmit+0x254> > ffffffff804b7293: 9 0f b7 c0 movzwl %ax,%eax > ffffffff804b7296: 495 8d 50 ff lea -0x1(%rax),%edx > ffffffff804b7299: 2 f6 43 06 40 testb $0x40,0x6(%rbx) > ffffffff804b729d: 9 48 8b 74 24 58 mov 0x58(%rsp),%rsi > ffffffff804b72a2: 497 74 34 je ffffffff804b72d8 <ip_queue_xmit+0x293> > ffffffff804b72a4: 8 83 bd 30 02 00 00 00 cmpl $0x0,0x230(%rbp) > ffffffff804b72ab: 10 74 23 je ffffffff804b72d0 <ip_queue_xmit+0x28b> > ffffffff804b72ad: 1044 66 8b 85 52 02 00 00 mov 0x252(%rbp),%ax > ffffffff804b72b4: 7 66 c1 c0 08 rol $0x8,%ax > ffffffff804b72b8: 8 66 89 43 04 mov %ax,0x4(%rbx) > ffffffff804b72bc: 432 66 8b 85 52 02 00 00 mov 0x252(%rbp),%ax > ffffffff804b72c3: 9 ff c0 inc %eax > ffffffff804b72c5: 14 01 d0 add %edx,%eax > ffffffff804b72c7: 1141 66 89 85 52 02 00 00 mov %ax,0x252(%rbp) > ffffffff804b72ce: 7 eb 10 jmp ffffffff804b72e0 <ip_queue_xmit+0x29b> > ffffffff804b72d0: 0 66 c7 43 04 00 00 movw $0x0,0x4(%rbx) > ffffffff804b72d6: 0 eb 08 jmp ffffffff804b72e0 <ip_queue_xmit+0x29b> > 
ffffffff804b72d8: 0 48 89 df mov %rbx,%rdi > ffffffff804b72db: 0 e8 b7 9d ff ff callq ffffffff804b1097 <__ip_select_ident> > ffffffff804b72e0: 6 8b 85 54 01 00 00 mov 0x154(%rbp),%eax > ffffffff804b72e6: 458 4c 89 f7 mov %r14,%rdi > ffffffff804b72e9: 2 41 89 46 78 mov %eax,0x78(%r14) > ffffffff804b72ed: 4 8b 85 f0 01 00 00 mov 0x1f0(%rbp),%eax > ffffffff804b72f3: 841 41 89 86 b0 00 00 00 mov %eax,0xb0(%r14) > ffffffff804b72fa: 11 e8 30 f2 ff ff callq ffffffff804b652f <ip_local_out> > ffffffff804b72ff: 0 eb 44 jmp ffffffff804b7345 <ip_queue_xmit+0x300> > ffffffff804b7301: 0 65 48 8b 04 25 10 00 mov %gs:0x10,%rax > ffffffff804b7308: 0 00 00 > ffffffff804b730a: 0 8b 80 48 e0 ff ff mov -0x1fb8(%rax),%eax > ffffffff804b7310: 0 4c 89 f7 mov %r14,%rdi > ffffffff804b7313: 0 30 c0 xor %al,%al > ffffffff804b7315: 0 66 83 f8 01 cmp $0x1,%ax > ffffffff804b7319: 0 48 19 c0 sbb %rax,%rax > ffffffff804b731c: 0 83 e0 08 and $0x8,%eax > ffffffff804b731f: 0 48 8b 90 a8 16 ab 80 mov -0x7f54e958(%rax),%rdx > ffffffff804b7326: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax > ffffffff804b732d: 0 00 > ffffffff804b732e: 0 89 c0 mov %eax,%eax > ffffffff804b7330: 0 48 f7 d2 not %rdx > ffffffff804b7333: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax > ffffffff804b7337: 0 48 ff 40 68 incq 0x68(%rax) > ffffffff804b733b: 0 e8 b1 18 fd ff callq ffffffff80488bf1 <kfree_skb> > ffffffff804b7340: 0 b8 8f ff ff ff mov $0xffffff8f,%eax > ffffffff804b7345: 9196 48 83 c4 68 add $0x68,%rsp > ffffffff804b7349: 892 5b pop %rbx > ffffffff804b734a: 0 5d pop %rbp > ffffffff804b734b: 488 41 5c pop %r12 > ffffffff804b734d: 0 41 5d pop %r13 > ffffffff804b734f: 0 41 5e pop %r14 > ffffffff804b7351: 513 41 5f pop %r15 > ffffffff804b7353: 0 c3 retq > > about 10% of this function's cost is artificial: > > ffffffff804b7045: 1001 <ip_queue_xmit>: > ffffffff804b7045: 1001 41 57 push %r15 > ffffffff804b7047: 36698 41 56 push %r14 > > there are profiler hits that leaked in via out-of-order execution from > the callsites. 
The callsites are hard to map unfortunately, as this
> function is called via function pointers.
>
> the most likely callsite is tcp_transmit_skb().
>
> 30% of the overhead of this function comes from:
>
> ffffffff804b7203: 0 66 c7 43 06 00 00 movw $0x0,0x6(%rbx)
> ffffffff804b7209: 118 0f bf 85 40 02 00 00 movswl 0x240(%rbp),%eax
> ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
> ffffffff804b7215: 340 85 c0 test %eax,%eax
> ffffffff804b7217: 0 79 06 jns ffffffff804b721f <ip_queue_xmit+0x1da>
> ffffffff804b7219: 107464 8b 82 9c 00 00 00 mov 0x9c(%rdx),%eax
> ffffffff804b721f: 4963 88 43 08 mov %al,0x8(%rbx)
>
> the 16-bit movw looks a bit weird. It comes from line 372:
>
> 0xffffffff804b7203 is in ip_queue_xmit (net/ipv4/ip_output.c:372).
> 367 iph = ip_hdr(skb);
> 368 *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
> 369 if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
> 370 iph->frag_off = htons(IP_DF);
> 371 else
> 372 iph->frag_off = 0;
> 373 iph->ttl = ip_select_ttl(inet, &rt->u.dst);
> 374 iph->protocol = sk->sk_protocol;
> 375 iph->saddr = rt->rt_src;
> 376 iph->daddr = rt->rt_dst;
>
> that is the IP-header fragment flag being set to zero.
>
> 16-bit ops are an on-off love/hate affair on x86 CPUs. The trend is
> towards eliminating them as much as possible.
>
> _But_, the real overhead probably comes from:
>
> ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
>
> which is the next line, the ttl field:
>
> 373 iph->ttl = ip_select_ttl(inet, &rt->u.dst);
>
> this shows that we are doing a hard cachemiss on the net-localhost
> route dst structure cacheline. We do a plain load instruction from it
> here and get a hefty cachemiss. (because 16 CPUs are banging on that
> single route)
>
> And let's make sure we see this in perspective as well: that single
> cachemiss is _1.0 percent_ of the total tbench cost. (!)
We could make
> the scheduler 10% slower straight away and it would have less of a
> real-life effect than this single iph->ttl field setting.
>

If you applied my patch against dst_entry, then you should not have any
cache line miss accessing the first and second cache lines of dst_entry,
which are mostly read (and contain all the metrics, like the ttl at
offset 0x58). Or something is really wrong...

Now if your cpu cache is blown away because of the huge send()/receive()
done by tbench, we are stuck of course.

I don't know what you want to prove here. We already have one dst_entry
per route in the rt cache, and it can already consume a *lot* of ram if
you have 1 million entries in the rt cache.

tbench is mostly a network benchmark (and one using the loopback
device), so it's no surprise that it can stress the network part of the
kernel :)

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: ip_queue_xmit(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 20:32 ` ip_queue_xmit(): " Ingo Molnar 2008-11-17 20:57 ` Eric Dumazet @ 2008-11-18 9:12 ` Nick Piggin 1 sibling, 0 replies; 191+ messages in thread From: Nick Piggin @ 2008-11-18 9:12 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger On Tuesday 18 November 2008 07:32, Ingo Molnar wrote: > * Ingo Molnar <mingo@elte.hu> wrote: > > 100.000000 total > > ................ > > 3.356152 ip_queue_xmit > 30% of the overhead of this function comes from: > > ffffffff804b7203: 0 66 c7 43 06 00 00 movw $0x0,0x6(%rbx) > ffffffff804b7209: 118 0f bf 85 40 02 00 00 movswl 0x240(%rbp),%eax > ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx > ffffffff804b7215: 340 85 c0 test %eax,%eax > ffffffff804b7217: 0 79 06 jns ffffffff804b721f > <ip_queue_xmit+0x1da> ffffffff804b7219: 107464 8b 82 9c 00 00 00 mov > 0x9c(%rdx),%eax ffffffff804b721f: 4963 88 43 08 mov > %al,0x8(%rbx) > > the 16-bit movw looks a bit weird. It comes from line 372: > > 0xffffffff804b7203 is in ip_queue_xmit (net/ipv4/ip_output.c:372). > 367 iph = ip_hdr(skb); > 368 *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff)); > 369 if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok) > 370 iph->frag_off = htons(IP_DF); > 371 else > 372 iph->frag_off = 0; > 373 iph->ttl = ip_select_ttl(inet, &rt->u.dst); > 374 iph->protocol = sk->sk_protocol; > 375 iph->saddr = rt->rt_src; > 376 iph->daddr = rt->rt_dst; > > the ip-header fragment flag setting to zero. > > 16-bit ops are an on-off love/hate affair on x86 CPUs. The trend is > towards eliminating them as much as possible. 
>
> _But_, the real overhead probably comes from:
>
> ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
>
> which is the next line, the ttl field:
>
> 373 iph->ttl = ip_select_ttl(inet, &rt->u.dst);
>
> this shows that we are doing a hard cachemiss on the net-localhost
> route dst structure cacheline. We do a plain load instruction from it
> here and get a hefty cachemiss. (because 16 CPUs are banging on that
> single route)

Why would that show up right there, though? An instruction like this
should be non-blocking. Shouldn't the cost show up at some point where
the CPU executes an instruction depending on rdx? (and good luck
working out when that happens!)

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (4 preceding siblings ...) 2008-11-17 20:32 ` ip_queue_xmit(): " Ingo Molnar @ 2008-11-17 20:47 ` Ingo Molnar 2008-11-17 20:56 ` Eric Dumazet 2008-11-17 20:55 ` skb_release_head_state(): " Ingo Molnar ` (8 subsequent siblings) 14 siblings, 1 reply; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 20:47 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 3.038025 skb_release_data hits (303802 total) ......... ffffffff80488c7e: 780 <skb_release_data>: ffffffff80488c7e: 780 55 push %rbp ffffffff80488c7f: 267141 53 push %rbx ffffffff80488c80: 0 48 89 fb mov %rdi,%rbx ffffffff80488c83: 3552 48 83 ec 08 sub $0x8,%rsp ffffffff80488c87: 604 8a 47 7c mov 0x7c(%rdi),%al ffffffff80488c8a: 2644 a8 02 test $0x2,%al ffffffff80488c8c: 49 74 2a je ffffffff80488cb8 <skb_release_data+0x3a> ffffffff80488c8e: 0 83 e0 10 and $0x10,%eax ffffffff80488c91: 2079 8b 97 c8 00 00 00 mov 0xc8(%rdi),%edx ffffffff80488c97: 53 3c 01 cmp $0x1,%al ffffffff80488c99: 0 19 c0 sbb %eax,%eax ffffffff80488c9b: 870 48 03 97 d0 00 00 00 add 0xd0(%rdi),%rdx ffffffff80488ca2: 65 66 31 c0 xor %ax,%ax ffffffff80488ca5: 0 05 01 00 01 00 add $0x10001,%eax ffffffff80488caa: 888 f7 d8 neg %eax ffffffff80488cac: 49 89 c1 mov %eax,%ecx ffffffff80488cae: 0 f0 0f c1 0a lock xadd %ecx,(%rdx) ffffffff80488cb2: 1909 01 c8 add %ecx,%eax ffffffff80488cb4: 1040 85 c0 test %eax,%eax ffffffff80488cb6: 0 75 6d jne ffffffff80488d25 <skb_release_data+0xa7> ffffffff80488cb8: 0 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx ffffffff80488cbe: 4199 48 8b 83 d0 00 00 00 mov 0xd0(%rbx),%rax ffffffff80488cc5: 4995 31 ed xor %ebp,%ebp ffffffff80488cc7: 0 66 83 7c 10 04 00 cmpw $0x0,0x4(%rax,%rdx,1) ffffffff80488ccd: 983 75 15 jne ffffffff80488ce4 
<skb_release_data+0x66> ffffffff80488ccf: 15 eb 28 jmp ffffffff80488cf9 <skb_release_data+0x7b> ffffffff80488cd1: 665 48 63 c5 movslq %ebp,%rax ffffffff80488cd4: 546 ff c5 inc %ebp ffffffff80488cd6: 328 48 c1 e0 04 shl $0x4,%rax ffffffff80488cda: 356 48 8b 7c 02 20 mov 0x20(%rdx,%rax,1),%rdi ffffffff80488cdf: 95 e8 be 87 de ff callq ffffffff802714a2 <put_page> ffffffff80488ce4: 66 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx ffffffff80488cea: 1321 48 03 93 d0 00 00 00 add 0xd0(%rbx),%rdx ffffffff80488cf1: 439 0f b7 42 04 movzwl 0x4(%rdx),%eax ffffffff80488cf5: 0 39 c5 cmp %eax,%ebp ffffffff80488cf7: 1887 7c d8 jl ffffffff80488cd1 <skb_release_data+0x53> ffffffff80488cf9: 2187 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx ffffffff80488cff: 1784 48 8b 83 d0 00 00 00 mov 0xd0(%rbx),%rax ffffffff80488d06: 422 48 83 7c 10 18 00 cmpq $0x0,0x18(%rax,%rdx,1) ffffffff80488d0c: 110 74 08 je ffffffff80488d16 <skb_release_data+0x98> ffffffff80488d0e: 0 48 89 df mov %rbx,%rdi ffffffff80488d11: 0 e8 52 ff ff ff callq ffffffff80488c68 <skb_drop_fraglist> ffffffff80488d16: 14 48 8b bb d0 00 00 00 mov 0xd0(%rbx),%rdi ffffffff80488d1d: 715 5e pop %rsi ffffffff80488d1e: 109 5b pop %rbx ffffffff80488d1f: 20 5d pop %rbp ffffffff80488d20: 980 e9 b7 66 e0 ff jmpq ffffffff8028f3dc <kfree> ffffffff80488d25: 0 59 pop %rcx ffffffff80488d26: 1948 5b pop %rbx ffffffff80488d27: 0 5d pop %rbp ffffffff80488d28: 0 c3 retq this is a short function, and 90% of the overhead is false leaked-in overhead from callsites: ffffffff80488c7f: 267141 53 push %rbx unfortunately i have a hard time mapping its callsites. pskb_expand_head() is the only static callsite, but it's not active in the profile. 
The _usual_ callsite is normally skb_release_all(), which does have
overhead:

ffffffff80489449: 925 <skb_release_all>:
ffffffff80489449: 925 53 push %rbx
ffffffff8048944a: 5249 48 89 fb mov %rdi,%rbx
ffffffff8048944d: 4 e8 3c ff ff ff callq ffffffff8048938e <skb_release_head_state>
ffffffff80489452: 1149 48 89 df mov %rbx,%rdi
ffffffff80489455: 13163 5b pop %rbx
ffffffff80489456: 0 e9 23 f8 ff ff jmpq ffffffff80488c7e <skb_release_data>

it is also tail-optimized, which explains why i found so few
callsites. The main callsite of skb_release_all() is:

ffffffff80488b86: 26 e8 be 08 00 00 callq ffffffff80489449 <skb_release_all>

which is __kfree_skb(). That is a frequently referenced function, and
in my profile there's a single callsite active:

ffffffff804c1027: 432 e8 56 7b fc ff callq ffffffff80488b82 <__kfree_skb>

which is tcp_ack() - subject of a later email. The wider context is:

ffffffff804c0ffc: 433 41 2b 85 e0 00 00 00 sub 0xe0(%r13),%eax
ffffffff804c1003: 4843 89 85 f0 00 00 00 mov %eax,0xf0(%rbp)
ffffffff804c1009: 1730 48 8b 45 30 mov 0x30(%rbp),%rax
ffffffff804c100d: 311 41 8b 95 e0 00 00 00 mov 0xe0(%r13),%edx
ffffffff804c1014: 0 48 83 b8 b0 00 00 00 cmpq $0x0,0xb0(%rax)
ffffffff804c101b: 0 00
ffffffff804c101c: 418 74 06 je ffffffff804c1024 <tcp_ack+0x50d>
ffffffff804c101e: 37 01 95 f4 00 00 00 add %edx,0xf4(%rbp)
ffffffff804c1024: 2 4c 89 ef mov %r13,%rdi
ffffffff804c1027: 432 e8 56 7b fc ff callq ffffffff80488b82 <__kfree_skb>

this is a good, top-of-the-line x86 CPU with a really good BTB
implementation that seems to be able to fall through calls and tail
optimizations as if they weren't there.

some guesses are:

(gdb) list *0xffffffff804c1003
0xffffffff804c1003 is in tcp_ack (include/net/sock.h:789).
784
785 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
786 {
787 skb_truesize_check(skb);
788 sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
789 sk->sk_wmem_queued -= skb->truesize;
790 sk_mem_uncharge(sk, skb->truesize);
791 __kfree_skb(skb);
792 }
793

both sk and skb should be cache-hot here so this seems unlikely.

(gdb) list *0xffffffff804c1009
0xffffffff804c1009 is in tcp_ack (include/net/sock.h:736).
731 }
732
733 static inline int sk_has_account(struct sock *sk)
734 {
735 /* return true if protocol supports memory accounting */
736 return !!sk->sk_prot->memory_allocated;
737 }
738
739 static inline int sk_wmem_schedule(struct sock *sk, int size)
740 {

this cannot be it - unless sk_prot somehow ends up being dirtied or
false-shared?

Still, my guess would be on ffffffff804c1009 and a
sk_prot->memory_allocated cachemiss: look at how this instruction uses
%ebp, and the one that shows the many hits in skb_release_data()
pushes %ebp to the stack - that's where the CPU's OOO trick ends: it
has to compute the result and serialize on the cachemiss.

	Ingo

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 20:47 ` Ingo Molnar @ 2008-11-17 20:56 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 20:56 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger Ingo Molnar a écrit : > * Ingo Molnar <mingo@elte.hu> wrote: > >> 100.000000 total >> ................ >> 3.038025 skb_release_data > > hits (303802 total) > ......... > ffffffff80488c7e: 780 <skb_release_data>: > ffffffff80488c7e: 780 55 push %rbp > ffffffff80488c7f: 267141 53 push %rbx > ffffffff80488c80: 0 48 89 fb mov %rdi,%rbx > ffffffff80488c83: 3552 48 83 ec 08 sub $0x8,%rsp > ffffffff80488c87: 604 8a 47 7c mov 0x7c(%rdi),%al > ffffffff80488c8a: 2644 a8 02 test $0x2,%al > ffffffff80488c8c: 49 74 2a je ffffffff80488cb8 <skb_release_data+0x3a> > ffffffff80488c8e: 0 83 e0 10 and $0x10,%eax > ffffffff80488c91: 2079 8b 97 c8 00 00 00 mov 0xc8(%rdi),%edx > ffffffff80488c97: 53 3c 01 cmp $0x1,%al > ffffffff80488c99: 0 19 c0 sbb %eax,%eax > ffffffff80488c9b: 870 48 03 97 d0 00 00 00 add 0xd0(%rdi),%rdx > ffffffff80488ca2: 65 66 31 c0 xor %ax,%ax > ffffffff80488ca5: 0 05 01 00 01 00 add $0x10001,%eax > ffffffff80488caa: 888 f7 d8 neg %eax > ffffffff80488cac: 49 89 c1 mov %eax,%ecx > ffffffff80488cae: 0 f0 0f c1 0a lock xadd %ecx,(%rdx) > ffffffff80488cb2: 1909 01 c8 add %ecx,%eax > ffffffff80488cb4: 1040 85 c0 test %eax,%eax > ffffffff80488cb6: 0 75 6d jne ffffffff80488d25 <skb_release_data+0xa7> > ffffffff80488cb8: 0 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx > ffffffff80488cbe: 4199 48 8b 83 d0 00 00 00 mov 0xd0(%rbx),%rax > ffffffff80488cc5: 4995 31 ed xor %ebp,%ebp > ffffffff80488cc7: 0 66 83 7c 10 04 00 cmpw $0x0,0x4(%rax,%rdx,1) > ffffffff80488ccd: 983 75 15 jne ffffffff80488ce4 <skb_release_data+0x66> > ffffffff80488ccf: 15 eb 28 jmp ffffffff80488cf9 <skb_release_data+0x7b> > 
ffffffff80488cd1: 665 48 63 c5 movslq %ebp,%rax > ffffffff80488cd4: 546 ff c5 inc %ebp > ffffffff80488cd6: 328 48 c1 e0 04 shl $0x4,%rax > ffffffff80488cda: 356 48 8b 7c 02 20 mov 0x20(%rdx,%rax,1),%rdi > ffffffff80488cdf: 95 e8 be 87 de ff callq ffffffff802714a2 <put_page> > ffffffff80488ce4: 66 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx > ffffffff80488cea: 1321 48 03 93 d0 00 00 00 add 0xd0(%rbx),%rdx > ffffffff80488cf1: 439 0f b7 42 04 movzwl 0x4(%rdx),%eax > ffffffff80488cf5: 0 39 c5 cmp %eax,%ebp > ffffffff80488cf7: 1887 7c d8 jl ffffffff80488cd1 <skb_release_data+0x53> > ffffffff80488cf9: 2187 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx > ffffffff80488cff: 1784 48 8b 83 d0 00 00 00 mov 0xd0(%rbx),%rax > ffffffff80488d06: 422 48 83 7c 10 18 00 cmpq $0x0,0x18(%rax,%rdx,1) > ffffffff80488d0c: 110 74 08 je ffffffff80488d16 <skb_release_data+0x98> > ffffffff80488d0e: 0 48 89 df mov %rbx,%rdi > ffffffff80488d11: 0 e8 52 ff ff ff callq ffffffff80488c68 <skb_drop_fraglist> > ffffffff80488d16: 14 48 8b bb d0 00 00 00 mov 0xd0(%rbx),%rdi > ffffffff80488d1d: 715 5e pop %rsi > ffffffff80488d1e: 109 5b pop %rbx > ffffffff80488d1f: 20 5d pop %rbp > ffffffff80488d20: 980 e9 b7 66 e0 ff jmpq ffffffff8028f3dc <kfree> > ffffffff80488d25: 0 59 pop %rcx > ffffffff80488d26: 1948 5b pop %rbx > ffffffff80488d27: 0 5d pop %rbp > ffffffff80488d28: 0 c3 retq > > this is a short function, and 90% of the overhead is false leaked-in > overhead from callsites: > > ffffffff80488c7f: 267141 53 push %rbx > > unfortunately i have a hard time mapping its callsites. > pskb_expand_head() is the only static callsite, but it's not active in > the profile. 
> > The _usual_ callsite is normally skb_release_all(), which does have > overhead: > > ffffffff80489449: 925 <skb_release_all>: > ffffffff80489449: 925 53 push %rbx > ffffffff8048944a: 5249 48 89 fb mov %rdi,%rbx > ffffffff8048944d: 4 e8 3c ff ff ff callq ffffffff8048938e <skb_release_head_state> > ffffffff80489452: 1149 48 89 df mov %rbx,%rdi > ffffffff80489455: 13163 5b pop %rbx > ffffffff80489456: 0 e9 23 f8 ff ff jmpq ffffffff80488c7e <skb_release_data> > > it is also tail-optimized, which explains why i found little > callsites. The main callsite of skb_release_all() is: > > ffffffff80488b86: 26 e8 be 08 00 00 callq ffffffff80489449 <skb_release_all> > > which is __kfree_skb(). That is a frequently referenced function, and > in my profile there's a single callsite active: > > ffffffff804c1027: 432 e8 56 7b fc ff callq ffffffff80488b82 <__kfree_skb> > > which is tcp_ack() - subject of a later email. The wider context is: > > ffffffff804c0ffc: 433 41 2b 85 e0 00 00 00 sub 0xe0(%r13),%eax > ffffffff804c1003: 4843 89 85 f0 00 00 00 mov %eax,0xf0(%rbp) > ffffffff804c1009: 1730 48 8b 45 30 mov 0x30(%rbp),%rax > ffffffff804c100d: 311 41 8b 95 e0 00 00 00 mov 0xe0(%r13),%edx > ffffffff804c1014: 0 48 83 b8 b0 00 00 00 cmpq $0x0,0xb0(%rax) > ffffffff804c101b: 0 00 > ffffffff804c101c: 418 74 06 je ffffffff804c1024 <tcp_ack+0x50d> > ffffffff804c101e: 37 01 95 f4 00 00 00 add %edx,0xf4(%rbp) > ffffffff804c1024: 2 4c 89 ef mov %r13,%rdi > ffffffff804c1027: 432 e8 56 7b fc ff callq ffffffff80488b82 <__kfree_skb> > > this is a good, top-of-the-line x86 CPU with a really good BTB > implementation that seems to be able to fall through calls and tail > optimizations as if they werent there. > > some guesses are: > > (gdb) list *0xffffffff804c1003 > 0xffffffff804c1003 is in tcp_ack (include/net/sock.h:789). 
> 784
> 785 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
> 786 {
> 787 skb_truesize_check(skb);
> 788 sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
> 789 sk->sk_wmem_queued -= skb->truesize;
> 790 sk_mem_uncharge(sk, skb->truesize);
> 791 __kfree_skb(skb);
> 792 }
> 793
>
> both sk and skb should be cache-hot here so this seems unlikely.
>
> (gdb) list *0xffffffff804c1009
> 0xffffffff804c1009 is in tcp_ack (include/net/sock.h:736).
> 731 }
> 732
> 733 static inline int sk_has_account(struct sock *sk)
> 734 {
> 735 /* return true if protocol supports memory accounting */
> 736 return !!sk->sk_prot->memory_allocated;
> 737 }
> 738
> 739 static inline int sk_wmem_schedule(struct sock *sk, int size)
> 740 {
>
> this cannot be it - unless sk_prot somehow ends up being dirtied or
> false-shared?
>
> Still, my guess would be on ffffffff804c1009 and a
> sk_prot->memory_allocated cachemiss: look at how this instruction uses
> %ebp, and the one that shows the many hits in skb_release_data()
> pushes %ebp to the stack - that's where the CPU's OOO trick ends: it
> has to compute the result and serialize on the cachemiss.
>

I did some investigation on this part (memory_allocated) and discovered
that UDP had a problem, not TCP (and tbench):

commit 270acefafeb74ce2fe93d35b75733870bf1e11e7

net: sk_free_datagram() should use sk_mem_reclaim_partial()

I noticed a contention on udp_memory_allocated on regular UDP
applications. While tcp_memory_allocated is seldom used, it appears each
incoming UDP frame is currently touching udp_memory_allocated when
queued, and when received by the application.

One possible solution is to use sk_mem_reclaim_partial() instead of
sk_mem_reclaim(), so that we keep a small reserve (less than one page)
of memory for each UDP socket.
We did something very similar on TCP side in commit 9993e7d313e80bdc005d09c7def91903e0068f07 ([TCP]: Do not purge sk_forward_alloc entirely in tcp_delack_timer()) A more complex solution would need to convert prot->memory_allocated to use a percpu_counter with batches of 64 or 128 pages. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net> ^ permalink raw reply [flat|nested] 191+ messages in thread
* skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (5 preceding siblings ...) 2008-11-17 20:47 ` Ingo Molnar @ 2008-11-17 20:55 ` Ingo Molnar 2008-11-17 21:01 ` David Miller ` (2 more replies) 2008-11-17 21:09 ` tcp_ack(): " Ingo Molnar ` (7 subsequent siblings) 14 siblings, 3 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 20:55 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 2.118525 skb_release_head_state hits (total: 211852) ......... ffffffff8048938e: 967 <skb_release_head_state>: ffffffff8048938e: 967 53 push %rbx ffffffff8048938f: 3975 48 89 fb mov %rdi,%rbx ffffffff80489392: 17 48 8b 7f 28 mov 0x28(%rdi),%rdi ffffffff80489396: 0 e8 9c 93 00 00 callq ffffffff80492737 <dst_release> ffffffff8048939b: 6 48 8b 7b 30 mov 0x30(%rbx),%rdi ffffffff8048939f: 2887 48 85 ff test %rdi,%rdi ffffffff804893a2: 859 74 0f je ffffffff804893b3 <skb_release_head_state+0x25> ffffffff804893a4: 0 f0 ff 0f lock decl (%rdi) ffffffff804893a7: 0 0f 94 c0 sete %al ffffffff804893aa: 0 84 c0 test %al,%al ffffffff804893ac: 0 74 05 je ffffffff804893b3 <skb_release_head_state+0x25> ffffffff804893ae: 0 e8 7a 14 06 00 callq ffffffff804ea82d <__secpath_destroy> ffffffff804893b3: 16 48 83 bb 80 00 00 00 cmpq $0x0,0x80(%rbx) ffffffff804893ba: 0 00 ffffffff804893bb: 4294 74 31 je ffffffff804893ee <skb_release_head_state+0x60> ffffffff804893bd: 0 65 48 8b 04 25 10 00 mov %gs:0x10,%rax ffffffff804893c4: 0 00 00 ffffffff804893c6: 6540 48 63 80 48 e0 ff ff movslq -0x1fb8(%rax),%rax ffffffff804893cd: 14 a9 00 00 ff 0f test $0xfff0000,%eax ffffffff804893d2: 471 74 11 je ffffffff804893e5 <skb_release_head_state+0x57> ffffffff804893d4: 0 be 89 01 00 00 mov $0x189,%esi ffffffff804893d9: 0 48 c7 c7 cc b1 6a 80 mov 
$0xffffffff806ab1cc,%rdi ffffffff804893e0: 0 e8 d0 cd da ff callq ffffffff802361b5 <warn_on_slowpath> ffffffff804893e5: 0 48 89 df mov %rbx,%rdi ffffffff804893e8: 1733 ff 93 80 00 00 00 callq *0x80(%rbx) ffffffff804893ee: 888 48 8b bb 88 00 00 00 mov 0x88(%rbx),%rdi ffffffff804893f5: 3959 48 85 ff test %rdi,%rdi ffffffff804893f8: 0 74 0f je ffffffff80489409 <skb_release_head_state+0x7b> ffffffff804893fa: 0 f0 ff 0f lock decl (%rdi) ffffffff804893fd: 0 0f 94 c0 sete %al ffffffff80489400: 0 84 c0 test %al,%al ffffffff80489402: 0 74 05 je ffffffff80489409 <skb_release_head_state+0x7b> ffffffff80489404: 0 e8 48 f2 01 00 callq ffffffff804a8651 <nf_conntrack_destroy> ffffffff80489409: 0 48 8b bb 90 00 00 00 mov 0x90(%rbx),%rdi ffffffff80489410: 3132 48 85 ff test %rdi,%rdi ffffffff80489413: 1 74 05 je ffffffff8048941a <skb_release_head_state+0x8c> ffffffff80489415: 0 e8 d7 f7 ff ff callq ffffffff80488bf1 <kfree_skb> ffffffff8048941a: 958 48 8b bb 98 00 00 00 mov 0x98(%rbx),%rdi ffffffff80489421: 1999 48 85 ff test %rdi,%rdi ffffffff80489424: 0 74 0f je ffffffff80489435 <skb_release_head_state+0xa7> ffffffff80489426: 0 f0 ff 0f lock decl (%rdi) ffffffff80489429: 0 0f 94 c0 sete %al ffffffff8048942c: 0 84 c0 test %al,%al ffffffff8048942e: 0 74 05 je ffffffff80489435 <skb_release_head_state+0xa7> ffffffff80489430: 0 e8 a7 5f e0 ff callq ffffffff8028f3dc <kfree> ffffffff80489435: 0 66 c7 83 a6 00 00 00 movw $0x0,0xa6(%rbx) ffffffff8048943c: 0 00 00 ffffffff8048943e: 6503 66 c7 83 a8 00 00 00 movw $0x0,0xa8(%rbx) ffffffff80489445: 0 00 00 ffffffff80489447: 174101 5b pop %rbx ffffffff80489448: 0 c3 retq this function _really_ hurts from a 16-bit op: ffffffff8048943e: 6503 66 c7 83 a8 00 00 00 movw $0x0,0xa8(%rbx) ffffffff80489445: 0 00 00 ffffffff80489447: 174101 5b pop %rbx (gdb) list *0xffffffff8048943e 0xffffffff8048943e is in skb_release_head_state (net/core/skbuff.c:407). 402 #endif 403 /* XXX: IS this still necessary? 
- JHS */
404 #ifdef CONFIG_NET_SCHED
405 skb->tc_index = 0;
406 #ifdef CONFIG_NET_CLS_ACT
407 skb->tc_verd = 0;
408 #endif
409 #endif
410 }
411

dirtying skb->tc_verd. I do have:

CONFIG_NET_CLS_ACT=y

BUT, on a second look, i don't think it's really this 16-bit op that
hurts us. The wider context is:

ffffffff80489426: 0 f0 ff 0f lock decl (%rdi)
ffffffff80489429: 0 0f 94 c0 sete %al
ffffffff8048942c: 0 84 c0 test %al,%al
ffffffff8048942e: 0 74 05 je ffffffff80489435 <skb_release_head_state+0xa7>
ffffffff80489430: 0 e8 a7 5f e0 ff callq ffffffff8028f3dc <kfree>
ffffffff80489435: 0 66 c7 83 a6 00 00 00 movw $0x0,0xa6(%rbx)
ffffffff8048943c: 0 00 00
ffffffff8048943e: 6503 66 c7 83 a8 00 00 00 movw $0x0,0xa8(%rbx)
ffffffff80489445: 0 00 00
ffffffff80489447: 174101 5b pop %rbx
ffffffff80489448: 0 c3 retq

look how we jump over the callq most of the time - so what we are
seeing here i believe is the cost of the atomic op at ffffffff80489426.
That comes from:

(gdb) list *0xffffffff8048942e
0xffffffff8048942e is in skb_release_head_state (include/linux/skbuff.h:1783).
1778 }
1779 #endif
1780 #ifdef CONFIG_BRIDGE_NETFILTER
1781 static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
1782 {
1783 if (nf_bridge && atomic_dec_and_test(&nf_bridge->use))
1784 kfree(nf_bridge);
1785 }
1786 static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
1787 {

and ouch does that global dec on &nf_bridge->use hurt!

i do have:

CONFIG_BRIDGE_NETFILTER=y

(this is a Fedora distro kernel derived .config)

	Ingo

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 20:55 ` skb_release_head_state(): " Ingo Molnar @ 2008-11-17 21:01 ` David Miller 2008-11-17 21:04 ` Eric Dumazet 2008-11-17 21:34 ` Linus Torvalds 2 siblings, 0 replies; 191+ messages in thread From: David Miller @ 2008-11-17 21:01 UTC (permalink / raw) To: mingo Cc: torvalds, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 21:55:30 +0100

> and ouch does that global dec on &nf_bridge->use hurt!

nf_bridge should always be NULL on your system.

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 20:55 ` skb_release_head_state(): " Ingo Molnar 2008-11-17 21:01 ` David Miller @ 2008-11-17 21:04 ` Eric Dumazet 2008-11-17 21:34 ` Linus Torvalds 2 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 21:04 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger

Ingo Molnar a écrit :

> (gdb) list *0xffffffff8048942e
> 0xffffffff8048942e is in skb_release_head_state (include/linux/skbuff.h:1783).
> 1778 }
> 1779 #endif
> 1780 #ifdef CONFIG_BRIDGE_NETFILTER
> 1781 static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
> 1782 {
> 1783 if (nf_bridge && atomic_dec_and_test(&nf_bridge->use))
> 1784 kfree(nf_bridge);
> 1785 }
> 1786 static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
> 1787 {
>
> and ouch does that global dec on &nf_bridge->use hurt!
>
> i do have:
>
> CONFIG_BRIDGE_NETFILTER=y
>
> (this is a Fedora distro kernel derived .config)

Hum, you also should hit this cache line at the atomic_inc() site
then... Strange, I never caught this one.

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-17 20:55 ` skb_release_head_state(): " Ingo Molnar
  2008-11-17 21:01   ` David Miller
  2008-11-17 21:04   ` Eric Dumazet
@ 2008-11-17 21:34   ` Linus Torvalds
  2008-11-17 21:38     ` Ingo Molnar
  2 siblings, 1 reply; 191+ messages in thread
From: Linus Torvalds @ 2008-11-17 21:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
      cl, efault, a.p.zijlstra, Stephen Hemminger

On Mon, 17 Nov 2008, Ingo Molnar wrote:
>
> this function _really_ hurts from a 16-bit op:
>
>  ffffffff8048943e:     6503   66 c7 83 a8 00 00 00   movw   $0x0,0xa8(%rbx)
>  ffffffff80489445:        0   00 00
>  ffffffff80489447:   174101   5b                     pop    %rbx

I don't think that is it, actually. The 16-bit store just before it had
a zero count, even though anything that executes the second one will
always execute the first one too.

The fact is, x86 profiles are subtle at an instruction level, and you
tend to get profile hits _after_ the instruction that caused the cost
because an interrupt (even an NMI) is always delayed to the next
instruction (the one that didn't complete). And since the core will
execute out-of-order, you don't even know what that one is, since there
could easily be branches, but even in the absence of branches you have
many instructions executing together.

For example, in many situations the two 16-bit stores will happily
execute together, and what you see may simply be a cache miss on the
line that was stored to. The store buffer needs to resolve the read of
the "pop" in order to complete, so having a big count in between stores
and a subsequent load is not all that unlikely.

So doing per-instruction profiling is not useful unless you start
looking at what preceded the instruction, and because of the
out-of-order nature, you really almost have to look for cache misses or
branch mispredicts.
One common reason for such a big count on an instruction that looks
perfectly simple is often that there is a branch to that instruction
that was mispredicted. Or that there was an instruction that was costly
_long_ before, and that other instructions were in the shadow of that
one completing (ie they had actually completed first, but didn't retire
until the earlier instruction did).

So you really should never just look at the previous instruction or
anything as simplistic as that. The time of in-order execution is long
past.

		Linus
* Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-17 21:34 ` Linus Torvalds
@ 2008-11-17 21:38   ` Ingo Molnar
  0 siblings, 0 replies; 191+ messages in thread
From: Ingo Molnar @ 2008-11-17 21:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers,
      cl, efault, a.p.zijlstra, Stephen Hemminger

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 17 Nov 2008, Ingo Molnar wrote:
> >
> > this function _really_ hurts from a 16-bit op:
> >
> >  ffffffff8048943e:     6503   66 c7 83 a8 00 00 00   movw   $0x0,0xa8(%rbx)
> >  ffffffff80489445:        0   00 00
> >  ffffffff80489447:   174101   5b                     pop    %rbx
>
> I don't think that is it, actually. The 16-bit store just before it
> had a zero count, even though anything that executes the second one
> will always execute the first one too.

yeah - look at the followup bits that identify the likely real source
of that overhead:

>> _But_, the real overhead probably comes from:
>>
>>  ffffffff804b7210:    10867   48 8b 54 24 58   mov    0x58(%rsp),%rdx
>>
>> which is the next line, the ttl field:
>>
>>   373              iph->ttl = ip_select_ttl(inet, &rt->u.dst);
>>
>> this shows that we are doing a hard cachemiss on the net-localhost
>> route dst structure cacheline. We do a plain load instruction from
>> it here and get a hefty cachemiss. (because 16 CPUs are banging on
>> that single route)
>>
>> And let's make sure we see this in perspective as well: that single
>> cachemiss is _1.0 percent_ of the total tbench cost. (!) We could
>> make the scheduler 10% slower straight away and it would have less
>> of a real-life effect than this single iph->ttl field setting.
* tcp_ack(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (6 preceding siblings ...) 2008-11-17 20:55 ` skb_release_head_state(): " Ingo Molnar @ 2008-11-17 21:09 ` Ingo Molnar 2008-11-17 21:19 ` tcp_recvmsg(): " Ingo Molnar ` (6 subsequent siblings) 14 siblings, 0 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 21:09 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 1.997533 tcp_ack hits (total: 199753) ......... ffffffff804c0b17: 452 <tcp_ack>: ffffffff804c0b17: 452 41 57 push %r15 ffffffff804c0b19: 9569 41 56 push %r14 ffffffff804c0b1b: 0 41 55 push %r13 ffffffff804c0b1d: 0 49 89 f5 mov %rsi,%r13 ffffffff804c0b20: 493 41 54 push %r12 ffffffff804c0b22: 104 41 89 d4 mov %edx,%r12d ffffffff804c0b25: 0 55 push %rbp ffffffff804c0b26: 425 48 89 fd mov %rdi,%rbp ffffffff804c0b29: 21 53 push %rbx ffffffff804c0b2a: 0 48 81 ec 88 00 00 00 sub $0x88,%rsp ffffffff804c0b31: 445 8b 87 00 04 00 00 mov 0x400(%rdi),%eax ffffffff804c0b37: 0 89 44 24 18 mov %eax,0x18(%rsp) ffffffff804c0b3b: 443 48 8d 46 38 lea 0x38(%rsi),%rax ffffffff804c0b3f: 18 8b 50 28 mov 0x28(%rax),%edx ffffffff804c0b42: 2565 44 8b 70 18 mov 0x18(%rax),%r14d ffffffff804c0b46: 358 89 54 24 1c mov %edx,0x1c(%rsp) ffffffff804c0b4a: 2 39 97 fc 03 00 00 cmp %edx,0x3fc(%rdi) ffffffff804c0b50: 368 0f 88 af 13 00 00 js ffffffff804c1f05 <tcp_ack+0x13ee> ffffffff804c0b56: 106 89 d1 mov %edx,%ecx ffffffff804c0b58: 2 2b 4c 24 18 sub 0x18(%rsp),%ecx ffffffff804c0b5c: 328 0f 88 83 13 00 00 js ffffffff804c1ee5 <tcp_ack+0x13ce> ffffffff804c0b62: 1440 8b 44 24 18 mov 0x18(%rsp),%eax ffffffff804c0b66: 2 29 d0 sub %edx,%eax ffffffff804c0b68: 77 44 89 e2 mov %r12d,%edx ffffffff804c0b6b: 398 89 c6 mov %eax,%esi ffffffff804c0b6d: 3 80 ce 04 or $0x4,%dh ffffffff804c0b70: 65 
c1 ee 1f shr $0x1f,%esi ffffffff804c0b73: 362 44 0f 45 e2 cmovne %edx,%r12d ffffffff804c0b77: 1 83 3d ea 78 3f 00 00 cmpl $0x0,0x3f78ea(%rip) # ffffffff808b8468 <sysctl_tcp_abc> ffffffff804c0b7e: 64 74 27 je ffffffff804c0ba7 <tcp_ack+0x90> ffffffff804c0b80: 0 8a 87 78 03 00 00 mov 0x378(%rdi),%al ffffffff804c0b86: 0 3c 01 cmp $0x1,%al ffffffff804c0b88: 0 77 08 ja ffffffff804c0b92 <tcp_ack+0x7b> ffffffff804c0b8a: 0 01 8f dc 04 00 00 add %ecx,0x4dc(%rdi) ffffffff804c0b90: 0 eb 15 jmp ffffffff804c0ba7 <tcp_ack+0x90> ffffffff804c0b92: 0 3c 04 cmp $0x4,%al ffffffff804c0b94: 0 75 11 jne ffffffff804c0ba7 <tcp_ack+0x90> ffffffff804c0b96: 0 8b 87 4c 04 00 00 mov 0x44c(%rdi),%eax ffffffff804c0b9c: 0 39 c1 cmp %eax,%ecx ffffffff804c0b9e: 0 0f 46 c1 cmovbe %ecx,%eax ffffffff804c0ba1: 0 01 87 dc 04 00 00 add %eax,0x4dc(%rdi) ffffffff804c0ba7: 377 8b 9d d4 04 00 00 mov 0x4d4(%rbp),%ebx ffffffff804c0bad: 3672 41 f7 c4 00 01 00 00 test $0x100,%r12d ffffffff804c0bb4: 282 89 5c 24 20 mov %ebx,0x20(%rsp) ffffffff804c0bb8: 0 8b 85 74 04 00 00 mov 0x474(%rbp),%eax ffffffff804c0bbe: 140 89 44 24 30 mov %eax,0x30(%rsp) ffffffff804c0bc2: 7592 8b 95 d0 04 00 00 mov 0x4d0(%rbp),%edx ffffffff804c0bc8: 1580 89 54 24 24 mov %edx,0x24(%rsp) ffffffff804c0bcc: 3 8b 9d cc 04 00 00 mov 0x4cc(%rbp),%ebx ffffffff804c0bd2: 58 89 5c 24 28 mov %ebx,0x28(%rsp) ffffffff804c0bd6: 419 8b 85 78 04 00 00 mov 0x478(%rbp),%eax ffffffff804c0bdc: 0 89 44 24 2c mov %eax,0x2c(%rsp) ffffffff804c0be0: 65 75 4f jne ffffffff804c0c31 <tcp_ack+0x11a> ffffffff804c0be2: 423 85 f6 test %esi,%esi ffffffff804c0be4: 55 74 4b je ffffffff804c0c31 <tcp_ack+0x11a> ffffffff804c0be6: 36 44 89 b5 40 04 00 00 mov %r14d,0x440(%rbp) ffffffff804c0bed: 368 8b 54 24 1c mov 0x1c(%rsp),%edx ffffffff804c0bf1: 4 41 83 cc 02 or $0x2,%r12d ffffffff804c0bf5: 32 be 05 00 00 00 mov $0x5,%esi ffffffff804c0bfa: 392 48 89 ef mov %rbp,%rdi ffffffff804c0bfd: 4 89 95 00 04 00 00 mov %edx,0x400(%rbp) ffffffff804c0c03: 3341 44 89 64 24 5c mov 
%r12d,0x5c(%rsp) ffffffff804c0c08: 855 e8 98 dc ff ff callq ffffffff804be8a5 <tcp_ca_event> ffffffff804c0c0d: 2018 48 8b 05 a4 0a 5f 00 mov 0x5f0aa4(%rip),%rax # ffffffff80ab16b8 <init_net+0xe8> ffffffff804c0c14: 858 65 8b 14 25 24 00 00 mov %gs:0x24,%edx ffffffff804c0c1b: 0 00 ffffffff804c0c1c: 0 89 d2 mov %edx,%edx ffffffff804c0c1e: 0 48 f7 d0 not %rax ffffffff804c0c21: 425 48 8b 04 d0 mov (%rax,%rdx,8),%rax ffffffff804c0c25: 0 48 ff 80 e8 00 00 00 incq 0xe8(%rax) ffffffff804c0c2c: 0 e9 1b 01 00 00 jmpq ffffffff804c0d4c <tcp_ack+0x235> ffffffff804c0c31: 41 45 3b 75 54 cmp 0x54(%r13),%r14d ffffffff804c0c35: 360 74 06 je ffffffff804c0c3d <tcp_ack+0x126> ffffffff804c0c37: 1 41 83 cc 01 or $0x1,%r12d ffffffff804c0c3b: 80 eb 1f jmp ffffffff804c0c5c <tcp_ack+0x145> ffffffff804c0c3d: 1 48 8b 05 74 0a 5f 00 mov 0x5f0a74(%rip),%rax # ffffffff80ab16b8 <init_net+0xe8> ffffffff804c0c44: 303 65 8b 14 25 24 00 00 mov %gs:0x24,%edx ffffffff804c0c4b: 0 00 ffffffff804c0c4c: 56 89 d2 mov %edx,%edx ffffffff804c0c4e: 0 48 f7 d0 not %rax ffffffff804c0c51: 4 48 8b 04 d0 mov (%rax,%rdx,8),%rax ffffffff804c0c55: 13 48 ff 80 e0 00 00 00 incq 0xe0(%rax) ffffffff804c0c5c: 12 41 8b 95 b8 00 00 00 mov 0xb8(%r13),%edx ffffffff804c0c63: 300 49 03 95 d0 00 00 00 add 0xd0(%r13),%rdx ffffffff804c0c6a: 17 66 8b 42 0e mov 0xe(%rdx),%ax ffffffff804c0c6e: 0 66 c1 c0 08 rol $0x8,%ax ffffffff804c0c72: 22 f6 42 0d 02 testb $0x2,0xd(%rdx) ffffffff804c0c76: 13 0f b7 d8 movzwl %ax,%ebx ffffffff804c0c79: 0 75 0b jne ffffffff804c0c86 <tcp_ack+0x16f> ffffffff804c0c7b: 26 8a 8d 9d 04 00 00 mov 0x49d(%rbp),%cl ffffffff804c0c81: 343 83 e1 0f and $0xf,%ecx ffffffff804c0c84: 0 d3 e3 shl %cl,%ebx ffffffff804c0c86: 82 8b 74 24 1c mov 0x1c(%rsp),%esi ffffffff804c0c8a: 18 44 89 f2 mov %r14d,%edx ffffffff804c0c8d: 0 89 d9 mov %ebx,%ecx ffffffff804c0c8f: 12 48 89 ef mov %rbp,%rdi ffffffff804c0c92: 12 e8 47 e6 ff ff callq ffffffff804bf2de <tcp_may_update_window> ffffffff804c0c97: 16 31 d2 xor %edx,%edx ffffffff804c0c99: 
66 85 c0 test %eax,%eax ffffffff804c0c9b: 0 74 48 je ffffffff804c0ce5 <tcp_ack+0x1ce> ffffffff804c0c9d: 12 39 9d 44 04 00 00 cmp %ebx,0x444(%rbp) ffffffff804c0ca3: 29 44 89 b5 40 04 00 00 mov %r14d,0x440(%rbp) ffffffff804c0caa: 0 74 34 je ffffffff804c0ce0 <tcp_ack+0x1c9> ffffffff804c0cac: 7 89 9d 44 04 00 00 mov %ebx,0x444(%rbp) ffffffff804c0cb2: 59 c7 85 ec 03 00 00 00 movl $0x0,0x3ec(%rbp) ffffffff804c0cb9: 0 00 00 00 ffffffff804c0cbc: 0 48 89 ef mov %rbp,%rdi ffffffff804c0cbf: 7 e8 13 e8 ff ff callq ffffffff804bf4d7 <tcp_fast_path_check> ffffffff804c0cc4: 23 3b 9d 48 04 00 00 cmp 0x448(%rbp),%ebx ffffffff804c0cca: 48 76 14 jbe ffffffff804c0ce0 <tcp_ack+0x1c9> ffffffff804c0ccc: 0 8b b5 5c 03 00 00 mov 0x35c(%rbp),%esi ffffffff804c0cd2: 0 89 9d 48 04 00 00 mov %ebx,0x448(%rbp) ffffffff804c0cd8: 0 48 89 ef mov %rbp,%rdi ffffffff804c0cdb: 0 e8 40 41 00 00 callq ffffffff804c4e20 <tcp_sync_mss> ffffffff804c0ce0: 6 ba 02 00 00 00 mov $0x2,%edx ffffffff804c0ce5: 141 8b 5c 24 1c mov 0x1c(%rsp),%ebx ffffffff804c0ce9: 1 44 09 e2 or %r12d,%edx ffffffff804c0cec: 3 89 9d 00 04 00 00 mov %ebx,0x400(%rbp) ffffffff804c0cf2: 34 89 54 24 5c mov %edx,0x5c(%rsp) ffffffff804c0cf6: 0 41 80 7d 5d 00 cmpb $0x0,0x5d(%r13) ffffffff804c0cfb: 6 74 13 je ffffffff804c0d10 <tcp_ack+0x1f9> ffffffff804c0cfd: 0 8b 54 24 18 mov 0x18(%rsp),%edx ffffffff804c0d01: 0 4c 89 ee mov %r13,%rsi ffffffff804c0d04: 0 48 89 ef mov %rbp,%rdi ffffffff804c0d07: 0 e8 b4 f5 ff ff callq ffffffff804c02c0 <tcp_sacktag_write_queue> ffffffff804c0d0c: 0 09 44 24 5c or %eax,0x5c(%rsp) ffffffff804c0d10: 29 41 8b 85 b8 00 00 00 mov 0xb8(%r13),%eax ffffffff804c0d17: 128 49 03 85 d0 00 00 00 add 0xd0(%r13),%rax ffffffff804c0d1e: 0 8a 40 0d mov 0xd(%rax),%al ffffffff804c0d21: 33 83 e0 42 and $0x42,%eax ffffffff804c0d24: 0 3c 40 cmp $0x40,%al ffffffff804c0d26: 0 75 17 jne ffffffff804c0d3f <tcp_ack+0x228> ffffffff804c0d28: 0 8b 44 24 5c mov 0x5c(%rsp),%eax ffffffff804c0d2c: 0 83 c8 40 or $0x40,%eax ffffffff804c0d2f: 0 f6 85 7e 
04 00 00 01 testb $0x1,0x47e(%rbp) ffffffff804c0d36: 0 0f 44 44 24 5c cmove 0x5c(%rsp),%eax ffffffff804c0d3b: 0 89 44 24 5c mov %eax,0x5c(%rsp) ffffffff804c0d3f: 36 be 06 00 00 00 mov $0x6,%esi ffffffff804c0d44: 167 48 89 ef mov %rbp,%rdi ffffffff804c0d47: 1 e8 59 db ff ff callq ffffffff804be8a5 <tcp_ca_event> ffffffff804c0d4c: 581 c7 85 48 01 00 00 00 movl $0x0,0x148(%rbp) ffffffff804c0d53: 0 00 00 00 ffffffff804c0d56: 6076 c6 85 7d 03 00 00 00 movb $0x0,0x37d(%rbp) ffffffff804c0d5d: 0 48 8b 05 1c 8b 3f 00 mov 0x3f8b1c(%rip),%rax # ffffffff808b9880 <jiffies> ffffffff804c0d64: 443 89 85 08 04 00 00 mov %eax,0x408(%rbp) ffffffff804c0d6a: 0 8b 85 74 04 00 00 mov 0x474(%rbp),%eax ffffffff804c0d70: 0 85 c0 test %eax,%eax ffffffff804c0d72: 845 89 44 24 14 mov %eax,0x14(%rsp) ffffffff804c0d76: 0 0f 84 fb 10 00 00 je ffffffff804c1e77 <tcp_ack+0x1360> ffffffff804c0d7c: 0 48 8b 05 fd 8a 3f 00 mov 0x3f8afd(%rip),%rax # ffffffff808b9880 <jiffies> ffffffff804c0d83: 586 8b 54 24 14 mov 0x14(%rsp),%edx ffffffff804c0d87: 1 41 83 cc ff or $0xffffffffffffffff,%r12d ffffffff804c0d8b: 2 89 44 24 48 mov %eax,0x48(%rsp) ffffffff804c0d8f: 879 89 54 24 34 mov %edx,0x34(%rsp) ffffffff804c0d93: 1 8b 9d d0 04 00 00 mov 0x4d0(%rbp),%ebx ffffffff804c0d99: 0 89 5c 24 40 mov %ebx,0x40(%rsp) ffffffff804c0d9d: 889 e8 e2 e8 ff ff callq ffffffff804bf684 <net_invalid_timestamp> ffffffff804c0da2: 0 48 89 44 24 08 mov %rax,0x8(%rsp) ffffffff804c0da7: 16 48 8d 85 c0 00 00 00 lea 0xc0(%rbp),%rax ffffffff804c0dae: 445 c7 44 24 44 01 00 00 movl $0x1,0x44(%rsp) ffffffff804c0db5: 0 00 ffffffff804c0db6: 0 c7 44 24 50 00 00 00 movl $0x0,0x50(%rsp) ffffffff804c0dbd: 0 00 ffffffff804c0dbe: 10 c7 44 24 38 00 00 00 movl $0x0,0x38(%rsp) ffffffff804c0dc5: 0 00 ffffffff804c0dc6: 1308 44 89 64 24 4c mov %r12d,0x4c(%rsp) ffffffff804c0dcb: 225 48 89 04 24 mov %rax,(%rsp) ffffffff804c0dcf: 2 e9 8b 02 00 00 jmpq ffffffff804c105f <tcp_ack+0x548> ffffffff804c0dd4: 488 4d 8d 7d 38 lea 0x38(%r13),%r15 ffffffff804c0dd8: 2298 
41 8a 57 25 mov 0x25(%r15),%dl ffffffff804c0ddc: 0 88 54 24 3f mov %dl,0x3f(%rsp) ffffffff804c0de0: 6 41 8b 77 1c mov 0x1c(%r15),%esi ffffffff804c0de4: 455 8b 95 00 04 00 00 mov 0x400(%rbp),%edx ffffffff804c0dea: 3 49 8b 8d d0 00 00 00 mov 0xd0(%r13),%rcx ffffffff804c0df1: 0 41 8b 85 c8 00 00 00 mov 0xc8(%r13),%eax ffffffff804c0df8: 440 39 f2 cmp %esi,%edx ffffffff804c0dfa: 0 79 6f jns ffffffff804c0e6b <tcp_ack+0x354> ffffffff804c0dfc: 0 89 c0 mov %eax,%eax ffffffff804c0dfe: 39 8b 5c 08 08 mov 0x8(%rax,%rcx,1),%ebx ffffffff804c0e02: 0 66 83 fb 01 cmp $0x1,%bx ffffffff804c0e06: 2 0f 84 77 02 00 00 je ffffffff804c1083 <tcp_ack+0x56c> ffffffff804c0e0c: 0 41 8b 47 18 mov 0x18(%r15),%eax ffffffff804c0e10: 0 39 d0 cmp %edx,%eax ffffffff804c0e12: 0 0f 89 6b 02 00 00 jns ffffffff804c1083 <tcp_ack+0x56c> ffffffff804c0e18: 0 29 c2 sub %eax,%edx ffffffff804c0e1a: 0 4c 89 ee mov %r13,%rsi ffffffff804c0e1d: 0 48 89 ef mov %rbp,%rdi ffffffff804c0e20: 0 e8 8f 4f 00 00 callq ffffffff804c5db4 <tcp_trim_head> ffffffff804c0e25: 0 85 c0 test %eax,%eax ffffffff804c0e27: 0 0f 85 56 02 00 00 jne ffffffff804c1083 <tcp_ack+0x56c> ffffffff804c0e2d: 0 41 8b 85 c8 00 00 00 mov 0xc8(%r13),%eax ffffffff804c0e34: 0 0f b7 d3 movzwl %bx,%edx ffffffff804c0e37: 0 49 03 85 d0 00 00 00 add 0xd0(%r13),%rax ffffffff804c0e3e: 0 41 89 d6 mov %edx,%r14d ffffffff804c0e41: 0 8b 48 08 mov 0x8(%rax),%ecx ffffffff804c0e44: 0 0f b7 c1 movzwl %cx,%eax ffffffff804c0e47: 0 41 29 c6 sub %eax,%r14d ffffffff804c0e4a: 0 0f 84 33 02 00 00 je ffffffff804c1083 <tcp_ack+0x56c> ffffffff804c0e50: 0 66 85 c9 test %cx,%cx ffffffff804c0e53: 0 75 04 jne ffffffff804c0e59 <tcp_ack+0x342> ffffffff804c0e55: 0 0f 0b ud2a ffffffff804c0e57: 0 eb fe jmp ffffffff804c0e57 <tcp_ack+0x340> ffffffff804c0e59: 0 41 8b 5f 1c mov 0x1c(%r15),%ebx ffffffff804c0e5d: 0 41 39 5f 18 cmp %ebx,0x18(%r15) ffffffff804c0e61: 0 0f 88 d6 10 00 00 js ffffffff804c1f3d <tcp_ack+0x1426> ffffffff804c0e67: 0 0f 0b ud2a ffffffff804c0e69: 0 eb fe jmp 
ffffffff804c0e69 <tcp_ack+0x352> ffffffff804c0e6b: 0 83 7c 24 44 00 cmpl $0x0,0x44(%rsp) ffffffff804c0e70: 6326 89 c0 mov %eax,%eax ffffffff804c0e72: 348 44 0f b7 74 08 08 movzwl 0x8(%rax,%rcx,1),%r14d ffffffff804c0e78: 0 0f 84 8f 00 00 00 je ffffffff804c0f0d <tcp_ack+0x3f6> ffffffff804c0e7e: 132 83 bd a4 03 00 00 00 cmpl $0x0,0x3a4(%rbp) ffffffff804c0e85: 5840 0f 84 82 00 00 00 je ffffffff804c0f0d <tcp_ack+0x3f6> ffffffff804c0e8b: 0 3b b5 b4 05 00 00 cmp 0x5b4(%rbp),%esi ffffffff804c0e91: 0 78 7a js ffffffff804c0f0d <tcp_ack+0x3f6> ffffffff804c0e93: 0 48 89 ef mov %rbp,%rdi ffffffff804c0e96: 0 e8 21 da ff ff callq ffffffff804be8bc <tcp_current_ssthresh> ffffffff804c0e9b: 0 8b b5 4c 04 00 00 mov 0x44c(%rbp),%esi ffffffff804c0ea1: 0 44 8b a5 ac 04 00 00 mov 0x4ac(%rbp),%r12d ffffffff804c0ea8: 0 48 89 ef mov %rbp,%rdi ffffffff804c0eab: 0 89 85 6c 05 00 00 mov %eax,0x56c(%rbp) ffffffff804c0eb1: 0 e8 c7 3e 00 00 callq ffffffff804c4d7d <tcp_mss_to_mtu> ffffffff804c0eb6: 0 8b 9d a4 03 00 00 mov 0x3a4(%rbp),%ebx ffffffff804c0ebc: 0 31 d2 xor %edx,%edx ffffffff804c0ebe: 0 c7 85 b0 04 00 00 00 movl $0x0,0x4b0(%rbp) ffffffff804c0ec5: 0 00 00 00 ffffffff804c0ec8: 0 41 0f af c4 imul %r12d,%eax ffffffff804c0ecc: 0 48 89 ef mov %rbp,%rdi ffffffff804c0ecf: 0 f7 f3 div %ebx ffffffff804c0ed1: 0 89 85 ac 04 00 00 mov %eax,0x4ac(%rbp) ffffffff804c0ed7: 0 48 8b 05 a2 89 3f 00 mov 0x3f89a2(%rip),%rax # ffffffff808b9880 <jiffies> ffffffff804c0ede: 0 89 85 bc 04 00 00 mov %eax,0x4bc(%rbp) ffffffff804c0ee4: 0 e8 d3 d9 ff ff callq ffffffff804be8bc <tcp_current_ssthresh> ffffffff804c0ee9: 0 8b b5 5c 03 00 00 mov 0x35c(%rbp),%esi ffffffff804c0eef: 0 89 85 54 04 00 00 mov %eax,0x454(%rbp) ffffffff804c0ef5: 0 48 89 ef mov %rbp,%rdi ffffffff804c0ef8: 0 89 9d a0 03 00 00 mov %ebx,0x3a0(%rbp) ffffffff804c0efe: 0 c7 85 a4 03 00 00 00 movl $0x0,0x3a4(%rbp) ffffffff804c0f05: 0 00 00 00 ffffffff804c0f08: 0 e8 13 3f 00 00 callq ffffffff804c4e20 <tcp_sync_mss> ffffffff804c0f0d: 945 0f b6 44 24 3f 
movzbl 0x3f(%rsp),%eax ffffffff804c0f12: 6361 a8 82 test $0x82,%al ffffffff804c0f14: 0 74 30 je ffffffff804c0f46 <tcp_ack+0x42f> ffffffff804c0f16: 0 a8 02 test $0x2,%al ffffffff804c0f18: 0 74 07 je ffffffff804c0f21 <tcp_ack+0x40a> ffffffff804c0f1a: 0 44 29 b5 78 04 00 00 sub %r14d,0x478(%rbp) ffffffff804c0f21: 0 83 4c 24 50 08 orl $0x8,0x50(%rsp) ffffffff804c0f26: 0 f6 44 24 50 04 testb $0x4,0x50(%rsp) ffffffff804c0f2b: 0 75 06 jne ffffffff804c0f33 <tcp_ack+0x41c> ffffffff804c0f2d: 0 41 83 fe 01 cmp $0x1,%r14d ffffffff804c0f31: 0 76 08 jbe ffffffff804c0f3b <tcp_ack+0x424> ffffffff804c0f33: 0 81 4c 24 50 00 10 00 orl $0x1000,0x50(%rsp) ffffffff804c0f3a: 0 00 ffffffff804c0f3b: 0 41 83 cc ff or $0xffffffffffffffff,%r12d ffffffff804c0f3f: 0 44 89 64 24 4c mov %r12d,0x4c(%rsp) ffffffff804c0f44: 0 eb 38 jmp ffffffff804c0f7e <tcp_ack+0x467> ffffffff804c0f46: 0 44 8b 64 24 48 mov 0x48(%rsp),%r12d ffffffff804c0f4b: 5837 45 2b 67 20 sub 0x20(%r15),%r12d ffffffff804c0f4f: 1 83 7c 24 4c 00 cmpl $0x0,0x4c(%rsp) ffffffff804c0f54: 167 8b 5c 24 4c mov 0x4c(%rsp),%ebx ffffffff804c0f58: 514 49 8b 55 18 mov 0x18(%r13),%rdx ffffffff804c0f5c: 0 41 0f 48 dc cmovs %r12d,%ebx ffffffff804c0f60: 164 a8 01 test $0x1,%al ffffffff804c0f62: 413 48 89 54 24 08 mov %rdx,0x8(%rsp) ffffffff804c0f67: 0 89 5c 24 4c mov %ebx,0x4c(%rsp) ffffffff804c0f6b: 148 75 11 jne ffffffff804c0f7e <tcp_ack+0x467> ffffffff804c0f6d: 1608 8b 54 24 38 mov 0x38(%rsp),%edx ffffffff804c0f71: 0 39 54 24 34 cmp %edx,0x34(%rsp) ffffffff804c0f75: 272 0f 46 54 24 34 cmovbe 0x34(%rsp),%edx ffffffff804c0f7a: 266 89 54 24 34 mov %edx,0x34(%rsp) ffffffff804c0f7e: 0 a8 01 test $0x1,%al ffffffff804c0f80: 164 74 07 je ffffffff804c0f89 <tcp_ack+0x472> ffffffff804c0f82: 0 44 29 b5 d0 04 00 00 sub %r14d,0x4d0(%rbp) ffffffff804c0f89: 3955 a8 04 test $0x4,%al ffffffff804c0f8b: 8510 74 07 je ffffffff804c0f94 <tcp_ack+0x47d> ffffffff804c0f8d: 0 44 29 b5 cc 04 00 00 sub %r14d,0x4cc(%rbp) ffffffff804c0f94: 11 44 29 b5 74 04 00 00 sub 
%r14d,0x474(%rbp) ffffffff804c0f9b: 1426 44 01 74 24 38 add %r14d,0x38(%rsp) ffffffff804c0fa0: 6 41 f6 47 24 02 testb $0x2,0x24(%r15) ffffffff804c0fa5: 548 75 07 jne ffffffff804c0fae <tcp_ack+0x497> ffffffff804c0fa7: 2 83 4c 24 50 04 orl $0x4,0x50(%rsp) ffffffff804c0fac: 0 eb 0f jmp ffffffff804c0fbd <tcp_ack+0x4a6> ffffffff804c0fae: 0 83 4c 24 50 10 orl $0x10,0x50(%rsp) ffffffff804c0fb3: 0 c7 85 74 05 00 00 00 movl $0x0,0x574(%rbp) ffffffff804c0fba: 0 00 00 00 ffffffff804c0fbd: 517 83 7c 24 44 00 cmpl $0x0,0x44(%rsp) ffffffff804c0fc2: 6012 0f 84 bb 00 00 00 je ffffffff804c1083 <tcp_ack+0x56c> ffffffff804c0fc8: 1111 48 8b 34 24 mov (%rsp),%rsi ffffffff804c0fcc: 0 4c 89 ef mov %r13,%rdi ffffffff804c0fcf: 184 e8 0d d8 ff ff callq ffffffff804be7e1 <__skb_unlink> ffffffff804c0fd4: 5 41 8b 45 68 mov 0x68(%r13),%eax ffffffff804c0fd8: 517 05 e8 00 00 00 add $0xe8,%eax ffffffff804c0fdd: 0 41 39 85 e0 00 00 00 cmp %eax,0xe0(%r13) ffffffff804c0fe4: 31 7d 08 jge ffffffff804c0fee <tcp_ack+0x4d7> ffffffff804c0fe6: 0 4c 89 ef mov %r13,%rdi ffffffff804c0fe9: 0 e8 d4 66 fc ff callq ffffffff804876c2 <skb_truesize_bug> ffffffff804c0fee: 1142 0f ba ad 10 01 00 00 btsl $0xe,0x110(%rbp) ffffffff804c0ff5: 0 0e ffffffff804c0ff6: 2576 8b 85 f0 00 00 00 mov 0xf0(%rbp),%eax ffffffff804c0ffc: 433 41 2b 85 e0 00 00 00 sub 0xe0(%r13),%eax ffffffff804c1003: 4843 89 85 f0 00 00 00 mov %eax,0xf0(%rbp) ffffffff804c1009: 1730 48 8b 45 30 mov 0x30(%rbp),%rax ffffffff804c100d: 311 41 8b 95 e0 00 00 00 mov 0xe0(%r13),%edx ffffffff804c1014: 0 48 83 b8 b0 00 00 00 cmpq $0x0,0xb0(%rax) ffffffff804c101b: 0 00 ffffffff804c101c: 418 74 06 je ffffffff804c1024 <tcp_ack+0x50d> ffffffff804c101e: 37 01 95 f4 00 00 00 add %edx,0xf4(%rbp) ffffffff804c1024: 2 4c 89 ef mov %r13,%rdi ffffffff804c1027: 432 e8 56 7b fc ff callq ffffffff80488b82 <__kfree_skb> ffffffff804c102c: 44 4c 3b ad f0 04 00 00 cmp 0x4f0(%rbp),%r13 ffffffff804c1033: 511 48 c7 85 e8 04 00 00 movq $0x0,0x4e8(%rbp) ffffffff804c103a: 0 00 00 00 00 
ffffffff804c103e: 1 75 0b jne ffffffff804c104b <tcp_ack+0x534> ffffffff804c1040: 0 48 c7 85 f0 04 00 00 movq $0x0,0x4f0(%rbp) ffffffff804c1047: 0 00 00 00 00 ffffffff804c104b: 0 4c 3b ad e0 04 00 00 cmp 0x4e0(%rbp),%r13 ffffffff804c1052: 518 75 0b jne ffffffff804c105f <tcp_ack+0x548> ffffffff804c1054: 0 48 c7 85 e0 04 00 00 movq $0x0,0x4e0(%rbp) ffffffff804c105b: 0 00 00 00 00 ffffffff804c105f: 439 4c 8b ad c0 00 00 00 mov 0xc0(%rbp),%r13 ffffffff804c1066: 5655 4c 3b 2c 24 cmp (%rsp),%r13 ffffffff804c106a: 0 75 05 jne ffffffff804c1071 <tcp_ack+0x55a> ffffffff804c106c: 0 45 31 ed xor %r13d,%r13d ffffffff804c106f: 810 eb 12 jmp ffffffff804c1083 <tcp_ack+0x56c> ffffffff804c1071: 0 4d 85 ed test %r13,%r13 ffffffff804c1074: 2574 74 0d je ffffffff804c1083 <tcp_ack+0x56c> ffffffff804c1076: 0 4c 3b ad d8 01 00 00 cmp 0x1d8(%rbp),%r13 ffffffff804c107d: 0 0f 85 51 fd ff ff jne ffffffff804c0dd4 <tcp_ack+0x2bd> ffffffff804c1083: 454 8b 8d 00 04 00 00 mov 0x400(%rbp),%ecx ffffffff804c1089: 497 8b 85 80 04 00 00 mov 0x480(%rbp),%eax ffffffff804c108f: 0 2b 44 24 18 sub 0x18(%rsp),%eax ffffffff804c1093: 0 89 ca mov %ecx,%edx ffffffff804c1095: 534 2b 54 24 18 sub 0x18(%rsp),%edx ffffffff804c1099: 0 39 c2 cmp %eax,%edx ffffffff804c109b: 0 72 06 jb ffffffff804c10a3 <tcp_ack+0x58c> ffffffff804c109d: 458 89 8d 80 04 00 00 mov %ecx,0x480(%rbp) ffffffff804c10a3: 0 4d 85 ed test %r13,%r13 ffffffff804c10a6: 0 74 15 je ffffffff804c10bd <tcp_ack+0x5a6> ffffffff804c10a8: 0 8b 44 24 50 mov 0x50(%rsp),%eax ffffffff804c10ac: 2 80 cc 20 or $0x20,%ah ffffffff804c10af: 3 41 f6 45 5d 01 testb $0x1,0x5d(%r13) ffffffff804c10b4: 0 0f 44 44 24 50 cmove 0x50(%rsp),%eax ffffffff804c10b9: 0 89 44 24 50 mov %eax,0x50(%rsp) ffffffff804c10bd: 444 f6 44 24 50 14 testb $0x14,0x50(%rsp) ffffffff804c10c2: 551 0f 84 e1 01 00 00 je ffffffff804c12a9 <tcp_ack+0x792> ffffffff804c10c8: 1 f6 85 9c 04 00 00 01 testb $0x1,0x49c(%rbp) ffffffff804c10cf: 2 48 8b 9d 60 03 00 00 mov 0x360(%rbp),%rbx ffffffff804c10d6: 462 74 17 
je ffffffff804c10ef <tcp_ack+0x5d8> ffffffff804c10d8: 0 83 bd 98 04 00 00 00 cmpl $0x0,0x498(%rbp) ffffffff804c10df: 0 74 0e je ffffffff804c10ef <tcp_ack+0x5d8> ffffffff804c10e1: 451 8b 74 24 50 mov 0x50(%rsp),%esi ffffffff804c10e5: 43 48 89 ef mov %rbp,%rdi ffffffff804c10e8: 0 e8 ea e8 ff ff callq ffffffff804bf9d7 <tcp_ack_saw_tstamp> ffffffff804c10ed: 66 eb 47 jmp ffffffff804c1136 <tcp_ack+0x61f> ffffffff804c10ef: 0 83 7c 24 4c 00 cmpl $0x0,0x4c(%rsp) ffffffff804c10f4: 0 78 40 js ffffffff804c1136 <tcp_ack+0x61f> ffffffff804c10f6: 0 f6 44 24 50 08 testb $0x8,0x50(%rsp) ffffffff804c10fb: 0 75 39 jne ffffffff804c1136 <tcp_ack+0x61f> ffffffff804c10fd: 0 8b 74 24 4c mov 0x4c(%rsp),%esi ffffffff804c1101: 0 48 89 ef mov %rbp,%rdi ffffffff804c1104: 0 e8 b5 e7 ff ff callq ffffffff804bf8be <tcp_rtt_estimator> ffffffff804c1109: 0 8b 85 60 04 00 00 mov 0x460(%rbp),%eax ffffffff804c110f: 0 c6 85 7b 03 00 00 00 movb $0x0,0x37b(%rbp) ffffffff804c1116: 0 c1 e8 03 shr $0x3,%eax ffffffff804c1119: 0 03 85 6c 04 00 00 add 0x46c(%rbp),%eax ffffffff804c111f: 0 3d 30 75 00 00 cmp $0x7530,%eax ffffffff804c1124: 0 89 85 58 03 00 00 mov %eax,0x358(%rbp) ffffffff804c112a: 0 76 0a jbe ffffffff804c1136 <tcp_ack+0x61f> ffffffff804c112c: 0 c7 85 58 03 00 00 30 movl $0x7530,0x358(%rbp) ffffffff804c1133: 0 75 00 00 ffffffff804c1136: 732 83 bd 74 04 00 00 00 cmpl $0x0,0x474(%rbp) ffffffff804c113d: 1833 75 0f jne ffffffff804c114e <tcp_ack+0x637> ffffffff804c113f: 0 be 01 00 00 00 mov $0x1,%esi ffffffff804c1144: 493 48 89 ef mov %rbp,%rdi ffffffff804c1147: 0 e8 07 d7 ff ff callq ffffffff804be853 <inet_csk_clear_xmit_timer> ffffffff804c114c: 0 eb 18 jmp ffffffff804c1166 <tcp_ack+0x64f> ffffffff804c114e: 0 8b 95 58 03 00 00 mov 0x358(%rbp),%edx ffffffff804c1154: 0 b9 30 75 00 00 mov $0x7530,%ecx ffffffff804c1159: 0 be 01 00 00 00 mov $0x1,%esi ffffffff804c115e: 0 48 89 ef mov %rbp,%rdi ffffffff804c1161: 0 e8 7d e4 ff ff callq ffffffff804bf5e3 <inet_csk_reset_xmit_timer> ffffffff804c1166: 881 8a 85 9c 
04 00 00 mov 0x49c(%rbp),%al ffffffff804c116c: 845 c0 e8 04 shr $0x4,%al ffffffff804c116f: 1 75 63 jne ffffffff804c11d4 <tcp_ack+0x6bd> ffffffff804c1171: 0 83 7c 24 38 00 cmpl $0x0,0x38(%rsp) ffffffff804c1176: 0 7e 29 jle ffffffff804c11a1 <tcp_ack+0x68a> ffffffff804c1178: 0 8b 44 24 38 mov 0x38(%rsp),%eax ffffffff804c117c: 0 8b 95 d0 04 00 00 mov 0x4d0(%rbp),%edx ffffffff804c1182: 0 ff c8 dec %eax ffffffff804c1184: 0 39 d0 cmp %edx,%eax ffffffff804c1186: 0 72 0c jb ffffffff804c1194 <tcp_ack+0x67d> ffffffff804c1188: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp) ffffffff804c118f: 0 00 00 00 ffffffff804c1192: 0 eb 0d jmp ffffffff804c11a1 <tcp_ack+0x68a> ffffffff804c1194: 0 8d 42 01 lea 0x1(%rdx),%eax ffffffff804c1197: 0 2b 44 24 38 sub 0x38(%rsp),%eax ffffffff804c119b: 0 89 85 d0 04 00 00 mov %eax,0x4d0(%rbp) ffffffff804c11a1: 0 8b 74 24 38 mov 0x38(%rsp),%esi ffffffff804c11a5: 0 48 89 ef mov %rbp,%rdi ffffffff804c11a8: 0 e8 2d dd ff ff callq ffffffff804beeda <tcp_check_reno_reordering> ffffffff804c11ad: 0 8b 85 cc 04 00 00 mov 0x4cc(%rbp),%eax ffffffff804c11b3: 0 03 85 d0 04 00 00 add 0x4d0(%rbp),%eax ffffffff804c11b9: 0 3b 85 74 04 00 00 cmp 0x474(%rbp),%eax ffffffff804c11bf: 0 76 5e jbe ffffffff804c121f <tcp_ack+0x708> ffffffff804c11c1: 0 be b0 06 00 00 mov $0x6b0,%esi ffffffff804c11c6: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi ffffffff804c11cd: 0 e8 e3 4f d7 ff callq ffffffff802361b5 <warn_on_slowpath> ffffffff804c11d2: 0 eb 4b jmp ffffffff804c121f <tcp_ack+0x708> ffffffff804c11d4: 414 8b 44 24 20 mov 0x20(%rsp),%eax ffffffff804c11d8: 1591 39 44 24 34 cmp %eax,0x34(%rsp) ffffffff804c11dc: 2 73 14 jae ffffffff804c11f2 <tcp_ack+0x6db> ffffffff804c11de: 0 8b b5 d4 04 00 00 mov 0x4d4(%rbp),%esi ffffffff804c11e4: 0 2b 74 24 34 sub 0x34(%rsp),%esi ffffffff804c11e8: 0 31 d2 xor %edx,%edx ffffffff804c11ea: 0 48 89 ef mov %rbp,%rdi ffffffff804c11ed: 0 e8 9c db ff ff callq ffffffff804bed8e <tcp_update_reordering> ffffffff804c11f2: 0 8a 85 9c 04 00 00 mov 
0x49c(%rbp),%al ffffffff804c11f8: 865 c0 e8 04 shr $0x4,%al ffffffff804c11fb: 3 a8 02 test $0x2,%al ffffffff804c11fd: 0 8b 85 60 05 00 00 mov 0x560(%rbp),%eax ffffffff804c1203: 453 74 06 je ffffffff804c120b <tcp_ack+0x6f4> ffffffff804c1205: 8 2b 44 24 38 sub 0x38(%rsp),%eax ffffffff804c1209: 0 eb 0e jmp ffffffff804c1219 <tcp_ack+0x702> ffffffff804c120b: 0 8b 95 d0 04 00 00 mov 0x4d0(%rbp),%edx ffffffff804c1211: 0 29 54 24 40 sub %edx,0x40(%rsp) ffffffff804c1215: 0 2b 44 24 40 sub 0x40(%rsp),%eax ffffffff804c1219: 423 89 85 60 05 00 00 mov %eax,0x560(%rbp) ffffffff804c121f: 492 8b 85 d4 04 00 00 mov 0x4d4(%rbp),%eax ffffffff804c1225: 489 39 44 24 38 cmp %eax,0x38(%rsp) ffffffff804c1229: 0 8b 54 24 38 mov 0x38(%rsp),%edx ffffffff804c122d: 0 0f 47 d0 cmova %eax,%edx ffffffff804c1230: 438 29 d0 sub %edx,%eax ffffffff804c1232: 0 89 85 d4 04 00 00 mov %eax,0x4d4(%rbp) ffffffff804c1238: 1 48 83 7b 58 00 cmpq $0x0,0x58(%rbx) ffffffff804c123d: 446 74 6a je ffffffff804c12a9 <tcp_ack+0x792> ffffffff804c123f: 0 f6 44 24 50 08 testb $0x8,0x50(%rsp) ffffffff804c1244: 3 75 54 jne ffffffff804c129a <tcp_ack+0x783> ffffffff804c1246: 441 f6 43 10 02 testb $0x2,0x10(%rbx) ffffffff804c124a: 8 74 3f je ffffffff804c128b <tcp_ack+0x774> ffffffff804c124c: 0 e8 33 e4 ff ff callq ffffffff804bf684 <net_invalid_timestamp> ffffffff804c1251: 0 48 39 44 24 08 cmp %rax,0x8(%rsp) ffffffff804c1256: 0 74 33 je ffffffff804c128b <tcp_ack+0x774> ffffffff804c1258: 0 e8 17 8b d8 ff callq ffffffff80249d74 <ktime_get_real> ffffffff804c125d: 0 48 89 c7 mov %rax,%rdi ffffffff804c1260: 0 48 2b 7c 24 08 sub 0x8(%rsp),%rdi ffffffff804c1265: 0 e8 e3 8e d7 ff callq ffffffff8023a14d <ns_to_timeval> ffffffff804c126a: 0 48 89 44 24 60 mov %rax,0x60(%rsp) ffffffff804c126f: 0 48 89 44 24 70 mov %rax,0x70(%rsp) ffffffff804c1274: 0 48 69 c0 40 42 0f 00 imul $0xf4240,%rax,%rax ffffffff804c127b: 0 48 89 54 24 78 mov %rdx,0x78(%rsp) ffffffff804c1280: 0 48 89 54 24 68 mov %rdx,0x68(%rsp) ffffffff804c1285: 0 03 44 24 78 add 
0x78(%rsp),%eax
ffffffff804c1289: 0 eb 12 jmp ffffffff804c129d <tcp_ack+0x786>
ffffffff804c128b: 89 45 85 e4 test %r12d,%r12d
ffffffff804c128e: 414 7e 0a jle ffffffff804c129a <tcp_ack+0x783>
ffffffff804c1290: 0 49 63 fc movslq %r12d,%rdi
ffffffff804c1293: 65 e8 a8 8b d7 ff callq ffffffff80239e40 <jiffies_to_usecs>
ffffffff804c1298: 0 eb 03 jmp ffffffff804c129d <tcp_ack+0x786>
ffffffff804c129a: 0 83 c8 ff or $0xffffffffffffffff,%eax
ffffffff804c129d: 1136 89 c2 mov %eax,%edx
ffffffff804c129f: 7 8b 74 24 38 mov 0x38(%rsp),%esi
ffffffff804c12a3: 444 48 89 ef mov %rbp,%rdi
ffffffff804c12a6: 1 ff 53 58 callq *0x58(%rbx)
ffffffff804c12a9: 305 83 bd d0 04 00 00 00 cmpl $0x0,0x4d0(%rbp)
ffffffff804c12b0: 518 79 11 jns ffffffff804c12c3 <tcp_ack+0x7ac>
ffffffff804c12b2: 0 be ac 0b 00 00 mov $0xbac,%esi
ffffffff804c12b7: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c12be: 0 e8 f2 4e d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12c3: 415 83 bd cc 04 00 00 00 cmpl $0x0,0x4cc(%rbp)
ffffffff804c12ca: 2204 79 11 jns ffffffff804c12dd <tcp_ack+0x7c6>
ffffffff804c12cc: 0 be ad 0b 00 00 mov $0xbad,%esi
ffffffff804c12d1: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c12d8: 0 e8 d8 4e d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12dd: 0 83 bd 78 04 00 00 00 cmpl $0x0,0x478(%rbp)
ffffffff804c12e4: 1747 79 11 jns ffffffff804c12f7 <tcp_ack+0x7e0>
ffffffff804c12e6: 0 be ae 0b 00 00 mov $0xbae,%esi
ffffffff804c12eb: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c12f2: 0 e8 be 4e d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12f7: 0 83 bd 74 04 00 00 00 cmpl $0x0,0x474(%rbp)
ffffffff804c12fe: 878 0f 85 86 00 00 00 jne ffffffff804c138a <tcp_ack+0x873>
ffffffff804c1304: 4721 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c130a: 968 c0 e8 04 shr $0x4,%al
ffffffff804c130d: 2 74 7b je ffffffff804c138a <tcp_ack+0x873>
ffffffff804c130f: 171 8b b5 cc 04 00 00 mov 0x4cc(%rbp),%esi
ffffffff804c1315: 282 85 f6 test %esi,%esi
ffffffff804c1317: 0 74 1f je ffffffff804c1338 <tcp_ack+0x821>
ffffffff804c1319: 0 0f b6 95 78 03 00 00 movzbl 0x378(%rbp),%edx
ffffffff804c1320: 0 48 c7 c7 b2 d9 6a 80 mov $0xffffffff806ad9b2,%rdi
ffffffff804c1327: 0 31 c0 xor %eax,%eax
ffffffff804c1329: 0 e8 46 5a d7 ff callq ffffffff80236d74 <printk>
ffffffff804c132e: 0 c7 85 cc 04 00 00 00 movl $0x0,0x4cc(%rbp)
ffffffff804c1335: 0 00 00 00
ffffffff804c1338: 198 8b b5 d0 04 00 00 mov 0x4d0(%rbp),%esi
ffffffff804c133e: 257 85 f6 test %esi,%esi
ffffffff804c1340: 0 74 1f je ffffffff804c1361 <tcp_ack+0x84a>
ffffffff804c1342: 0 0f b6 95 78 03 00 00 movzbl 0x378(%rbp),%edx
ffffffff804c1349: 0 48 c7 c7 c3 d9 6a 80 mov $0xffffffff806ad9c3,%rdi
ffffffff804c1350: 0 31 c0 xor %eax,%eax
ffffffff804c1352: 0 e8 1d 5a d7 ff callq ffffffff80236d74 <printk>
ffffffff804c1357: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c135e: 0 00 00 00
ffffffff804c1361: 2524 8b b5 78 04 00 00 mov 0x478(%rbp),%esi
ffffffff804c1367: 1825 85 f6 test %esi,%esi
ffffffff804c1369: 0 74 1f je ffffffff804c138a <tcp_ack+0x873>
ffffffff804c136b: 0 0f b6 95 78 03 00 00 movzbl 0x378(%rbp),%edx
ffffffff804c1372: 0 48 c7 c7 d4 d9 6a 80 mov $0xffffffff806ad9d4,%rdi
ffffffff804c1379: 0 31 c0 xor %eax,%eax
ffffffff804c137b: 0 e8 f4 59 d7 ff callq ffffffff80236d74 <printk>
ffffffff804c1380: 0 c7 85 78 04 00 00 00 movl $0x0,0x478(%rbp)
ffffffff804c1387: 0 00 00 00
ffffffff804c138a: 46 44 8b 64 24 50 mov 0x50(%rsp),%r12d
ffffffff804c138f: 7369 31 c9 xor %ecx,%ecx
ffffffff804c1391: 348 44 0b 64 24 5c or 0x5c(%rsp),%r12d
ffffffff804c1396: 0 80 bd 5e 04 00 00 00 cmpb $0x0,0x45e(%rbp)
ffffffff804c139d: 96 0f 84 26 02 00 00 je ffffffff804c15c9 <tcp_ack+0xab2>
ffffffff804c13a3: 0 8b 85 cc 04 00 00 mov 0x4cc(%rbp),%eax
ffffffff804c13a9: 0 03 85 d0 04 00 00 add 0x4d0(%rbp),%eax
ffffffff804c13af: 0 3b 85 74 04 00 00 cmp 0x474(%rbp),%eax
ffffffff804c13b5: 0 76 11 jbe ffffffff804c13c8 <tcp_ack+0x8b1>
ffffffff804c13b7: 0 be 58 0c 00 00 mov $0xc58,%esi
ffffffff804c13bc: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c13c3: 0 e8 ed 4d d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c13c8: 0 44 89 e3 mov %r12d,%ebx
ffffffff804c13cb: 0 83 e3 04 and $0x4,%ebx
ffffffff804c13ce: 0 74 07 je ffffffff804c13d7 <tcp_ack+0x8c0>
ffffffff804c13d0: 0 c6 85 79 03 00 00 00 movb $0x0,0x379(%rbp)
ffffffff804c13d7: 0 41 f7 c4 00 10 00 00 test $0x1000,%r12d
ffffffff804c13de: 0 75 0f jne ffffffff804c13ef <tcp_ack+0x8d8>
ffffffff804c13e0: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c13e7: 0 76 10 jbe ffffffff804c13f9 <tcp_ack+0x8e2>
ffffffff804c13e9: 0 41 f6 c4 08 test $0x8,%r12b
ffffffff804c13ed: 0 74 0a je ffffffff804c13f9 <tcp_ack+0x8e2>
ffffffff804c13ef: 0 c7 85 78 05 00 00 00 movl $0x0,0x578(%rbp)
ffffffff804c13f6: 0 00 00 00
ffffffff804c13f9: 0 8b 85 58 04 00 00 mov 0x458(%rbp),%eax
ffffffff804c13ff: 0 39 85 00 04 00 00 cmp %eax,0x400(%rbp)
ffffffff804c1405: 0 78 12 js ffffffff804c1419 <tcp_ack+0x902>
ffffffff804c1407: 0 31 f6 xor %esi,%esi
ffffffff804c1409: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c1410: 0 40 0f 95 c6 setne %sil
ffffffff804c1414: 0 83 c6 02 add $0x2,%esi
ffffffff804c1417: 0 eb 37 jmp ffffffff804c1450 <tcp_ack+0x939>
ffffffff804c1419: 0 48 89 ef mov %rbp,%rdi
ffffffff804c141c: 0 e8 e0 da ff ff callq ffffffff804bef01 <tcp_is_sackfrto>
ffffffff804c1421: 0 85 c0 test %eax,%eax
ffffffff804c1423: 0 75 3b jne ffffffff804c1460 <tcp_ack+0x949>
ffffffff804c1425: 0 41 f7 c4 34 04 00 00 test $0x434,%r12d
ffffffff804c142c: 0 75 0a jne ffffffff804c1438 <tcp_ack+0x921>
ffffffff804c142e: 0 41 f6 c4 17 test $0x17,%r12b
ffffffff804c1432: 0 0f 85 8c 01 00 00 jne ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c1438: 0 85 db test %ebx,%ebx
ffffffff804c143a: 0 0f 85 8d 00 00 00 jne ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c1440: 0 31 f6 xor %esi,%esi
ffffffff804c1442: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c1449: 0 40 0f 95 c6 setne %sil
ffffffff804c144d: 0 8d 34 76 lea (%rsi,%rsi,2),%esi
ffffffff804c1450: 0 44 89 e2 mov %r12d,%edx
ffffffff804c1453: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1456: 0 e8 b8 e7 ff ff callq ffffffff804bfc13 <tcp_enter_frto_loss>
ffffffff804c145b: 0 e9 64 01 00 00 jmpq ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c1460: 0 85 db test %ebx,%ebx
ffffffff804c1462: 0 75 37 jne ffffffff804c149b <tcp_ack+0x984>
ffffffff804c1464: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c146b: 0 75 2e jne ffffffff804c149b <tcp_ack+0x984>
ffffffff804c146d: 0 8b 85 78 04 00 00 mov 0x478(%rbp),%eax
ffffffff804c1473: 0 03 85 74 04 00 00 add 0x474(%rbp),%eax
ffffffff804c1479: 0 2b 85 d0 04 00 00 sub 0x4d0(%rbp),%eax
ffffffff804c147f: 0 8b 95 ac 04 00 00 mov 0x4ac(%rbp),%edx
ffffffff804c1485: 0 2b 85 cc 04 00 00 sub 0x4cc(%rbp),%eax
ffffffff804c148b: 0 39 d0 cmp %edx,%eax
ffffffff804c148d: 0 0f 47 c2 cmova %edx,%eax
ffffffff804c1490: 0 89 85 ac 04 00 00 mov %eax,0x4ac(%rbp)
ffffffff804c1496: 0 e9 29 01 00 00 jmpq ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c149b: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c14a2: 0 76 29 jbe ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c14a4: 0 41 f6 c4 34 test $0x34,%r12b
ffffffff804c14a8: 0 74 0f je ffffffff804c14b9 <tcp_ack+0x9a2>
ffffffff804c14aa: 0 44 89 e0 mov %r12d,%eax
ffffffff804c14ad: 0 25 20 02 00 00 and $0x220,%eax
ffffffff804c14b2: 0 83 f8 20 cmp $0x20,%eax
ffffffff804c14b5: 0 75 16 jne ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c14b7: 0 eb 0a jmp ffffffff804c14c3 <tcp_ack+0x9ac>
ffffffff804c14b9: 0 41 f6 c4 17 test $0x17,%r12b
ffffffff804c14bd: 0 0f 85 01 01 00 00 jne ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c14c3: 0 44 89 e2 mov %r12d,%edx
ffffffff804c14c6: 0 be 03 00 00 00 mov $0x3,%esi
ffffffff804c14cb: 0 eb 86 jmp ffffffff804c1453 <tcp_ack+0x93c>
ffffffff804c14cd: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c14d4: 0 75 45 jne ffffffff804c151b <tcp_ack+0xa04>
ffffffff804c14d6: 0 8b 85 78 04 00 00 mov 0x478(%rbp),%eax
ffffffff804c14dc: 0 03 85 74 04 00 00 add 0x474(%rbp),%eax
ffffffff804c14e2: 0 48 89 ef mov %rbp,%rdi
ffffffff804c14e5: 0 c6 85 5e 04 00 00 02 movb $0x2,0x45e(%rbp)
ffffffff804c14ec: 0 83 c0 02 add $0x2,%eax
ffffffff804c14ef: 0 2b 85 cc 04 00 00 sub 0x4cc(%rbp),%eax
ffffffff804c14f5: 0 2b 85 d0 04 00 00 sub 0x4d0(%rbp),%eax
ffffffff804c14fb: 0 89 85 ac 04 00 00 mov %eax,0x4ac(%rbp)
ffffffff804c1501: 0 e8 0a 3e 00 00 callq ffffffff804c5310 <tcp_may_send_now>
ffffffff804c1506: 0 85 c0 test %eax,%eax
ffffffff804c1508: 0 0f 85 b6 00 00 00 jne ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c150e: 0 44 89 e2 mov %r12d,%edx
ffffffff804c1511: 0 be 02 00 00 00 mov $0x2,%esi
ffffffff804c1516: 0 e9 38 ff ff ff jmpq ffffffff804c1453 <tcp_ack+0x93c>
ffffffff804c151b: 0 8b 05 3f 6f 3f 00 mov 0x3f6f3f(%rip),%eax # ffffffff808b8460 <sysctl_tcp_frto_response>
ffffffff804c1521: 0 83 f8 01 cmp $0x1,%eax
ffffffff804c1524: 0 74 1a je ffffffff804c1540 <tcp_ack+0xa29>
ffffffff804c1526: 0 83 f8 02 cmp $0x2,%eax
ffffffff804c1529: 0 75 5d jne ffffffff804c1588 <tcp_ack+0xa71>
ffffffff804c152b: 0 41 f6 c4 40 test $0x40,%r12b
ffffffff804c152f: 0 75 57 jne ffffffff804c1588 <tcp_ack+0xa71>
ffffffff804c1531: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1536: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1539: 0 e8 5a db ff ff callq ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c153e: 0 eb 50 jmp ffffffff804c1590 <tcp_ack+0xa79>
ffffffff804c1540: 0 8b 85 ac 04 00 00 mov 0x4ac(%rbp),%eax
ffffffff804c1546: 0 8b 95 a8 04 00 00 mov 0x4a8(%rbp),%edx
ffffffff804c154c: 0 c7 85 b0 04 00 00 00 movl $0x0,0x4b0(%rbp)
ffffffff804c1553: 0 00 00 00
ffffffff804c1556: 0 c7 85 dc 04 00 00 00 movl $0x0,0x4dc(%rbp)
ffffffff804c155d: 0 00 00 00
ffffffff804c1560: 0 39 c2 cmp %eax,%edx
ffffffff804c1562: 0 0f 46 c2 cmovbe %edx,%eax
ffffffff804c1565: 0 89 85 ac 04 00 00 mov %eax,0x4ac(%rbp)
ffffffff804c156b: 0 8a 85 7e 04 00 00 mov 0x47e(%rbp),%al
ffffffff804c1571: 0 a8 01 test $0x1,%al
ffffffff804c1573: 0 74 09 je ffffffff804c157e <tcp_ack+0xa67>
ffffffff804c1575: 0 83 c8 02 or $0x2,%eax
ffffffff804c1578: 0 88 85 7e 04 00 00 mov %al,0x47e(%rbp)
ffffffff804c157e: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1581: 0 e8 27 da ff ff callq ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1586: 0 eb 08 jmp ffffffff804c1590 <tcp_ack+0xa79>
ffffffff804c1588: 0 48 89 ef mov %rbp,%rdi
ffffffff804c158b: 0 e8 78 dd ff ff callq ffffffff804bf308 <tcp_ratehalving_spur_to_response>
ffffffff804c1590: 0 c6 85 5e 04 00 00 00 movb $0x0,0x45e(%rbp)
ffffffff804c1597: 0 c7 85 78 05 00 00 00 movl $0x0,0x578(%rbp)
ffffffff804c159e: 0 00 00 00
ffffffff804c15a1: 0 31 c9 xor %ecx,%ecx
ffffffff804c15a3: 0 48 8b 05 0e 01 5f 00 mov 0x5f010e(%rip),%rax # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c15aa: 0 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff804c15b1: 0 00
ffffffff804c15b2: 0 89 d2 mov %edx,%edx
ffffffff804c15b4: 0 48 f7 d0 not %rax
ffffffff804c15b7: 0 48 8b 04 d0 mov (%rax,%rdx,8),%rax
ffffffff804c15bb: 0 48 ff 80 28 02 00 00 incq 0x228(%rax)
ffffffff804c15c2: 0 eb 05 jmp ffffffff804c15c9 <tcp_ack+0xab2>
ffffffff804c15c4: 0 b9 01 00 00 00 mov $0x1,%ecx
ffffffff804c15c9: 466 8b 95 00 04 00 00 mov 0x400(%rbp),%edx
ffffffff804c15cf: 5645 39 95 58 04 00 00 cmp %edx,0x458(%rbp)
ffffffff804c15d5: 176 79 0a jns ffffffff804c15e1 <tcp_ack+0xaca>
ffffffff804c15d7: 24 c7 85 58 04 00 00 00 movl $0x0,0x458(%rbp)
ffffffff804c15de: 0 00 00 00
ffffffff804c15e1: 620 8b 54 24 2c mov 0x2c(%rsp),%edx
ffffffff804c15e5: 639 03 54 24 30 add 0x30(%rsp),%edx
ffffffff804c15e9: 2 44 89 e3 mov %r12d,%ebx
ffffffff804c15ec: 283 2b 54 24 28 sub 0x28(%rsp),%edx
ffffffff804c15f0: 154 2b 54 24 24 sub 0x24(%rsp),%edx
ffffffff804c15f4: 0 83 e3 17 and $0x17,%ebx
ffffffff804c15f7: 266 89 5c 24 54 mov %ebx,0x54(%rsp)
ffffffff804c15fb: 168 74 13 je ffffffff804c1610 <tcp_ack+0xaf9>
ffffffff804c15fd: 0 41 f6 c4 60 test $0x60,%r12b
ffffffff804c1601: 6575 75 0d jne ffffffff804c1610 <tcp_ack+0xaf9>
ffffffff804c1603: 20 80 bd 78 03 00 00 00 cmpb $0x0,0x378(%rbp)
ffffffff804c160a: 1417 0f 84 3a 09 00 00 je ffffffff804c1f4a <tcp_ack+0x1433>
ffffffff804c1610: 0 44 89 e0 mov %r12d,%eax
ffffffff804c1613: 0 c1 e8 02 shr $0x2,%eax
ffffffff804c1616: 0 88 c3 mov %al,%bl
ffffffff804c1618: 0 80 e3 01 and $0x1,%bl
ffffffff804c161b: 0 41 88 de mov %bl,%r14b
ffffffff804c161e: 0 74 36 je ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1620: 0 85 c9 test %ecx,%ecx
ffffffff804c1622: 0 75 32 jne ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1624: 0 41 f6 c4 40 test $0x40,%r12b
ffffffff804c1628: 0 74 0e je ffffffff804c1638 <tcp_ack+0xb21>
ffffffff804c162a: 0 8b 85 a8 04 00 00 mov 0x4a8(%rbp),%eax
ffffffff804c1630: 0 39 85 ac 04 00 00 cmp %eax,0x4ac(%rbp)
ffffffff804c1636: 0 73 1e jae ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1638: 0 0f b6 8d 78 03 00 00 movzbl 0x378(%rbp),%ecx
ffffffff804c163f: 0 b8 0c 00 00 00 mov $0xc,%eax
ffffffff804c1644: 0 d3 f8 sar %cl,%eax
ffffffff804c1646: 0 a8 01 test $0x1,%al
ffffffff804c1648: 0 75 0c jne ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c164a: 0 8b 74 24 1c mov 0x1c(%rsp),%esi
ffffffff804c164e: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1651: 0 e8 6b dc ff ff callq ffffffff804bf2c1 <tcp_cong_avoid>
ffffffff804c1656: 0 31 db xor %ebx,%ebx
ffffffff804c1658: 0 41 f7 c4 17 04 00 00 test $0x417,%r12d
ffffffff804c165f: 0 44 8b bd 74 04 00 00 mov 0x474(%rbp),%r15d
ffffffff804c1666: 0 0f 94 c3 sete %bl
ffffffff804c1669: 0 41 bd 01 00 00 00 mov $0x1,%r13d
ffffffff804c166f: 0 85 db test %ebx,%ebx
ffffffff804c1671: 0 75 21 jne ffffffff804c1694 <tcp_ack+0xb7d>
ffffffff804c1673: 0 45 30 ed xor %r13b,%r13b
ffffffff804c1676: 0 41 f6 c4 20 test $0x20,%r12b
ffffffff804c167a: 0 74 18 je ffffffff804c1694 <tcp_ack+0xb7d>
ffffffff804c167c: 0 48 89 ef mov %rbp,%rdi
ffffffff804c167f: 0 45 31 ed xor %r13d,%r13d
ffffffff804c1682: 0 e8 cf d8 ff ff callq ffffffff804bef56 <tcp_fackets_out>
ffffffff804c1687: 0 0f b6 95 7f 04 00 00 movzbl 0x47f(%rbp),%edx
ffffffff804c168e: 0 39 d0 cmp %edx,%eax
ffffffff804c1690: 0 41 0f 9f c5 setg %r13b
ffffffff804c1694: 0 83 bd 74 04 00 00 00 cmpl $0x0,0x474(%rbp)
ffffffff804c169b: 0 75 24 jne ffffffff804c16c1 <tcp_ack+0xbaa>
ffffffff804c169d: 0 83 bd d0 04 00 00 00 cmpl $0x0,0x4d0(%rbp)
ffffffff804c16a4: 0 74 1b je ffffffff804c16c1 <tcp_ack+0xbaa>
ffffffff804c16a6: 0 be 16 0a 00 00 mov $0xa16,%esi
ffffffff804c16ab: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c16b2: 0 e8 fe 4a d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c16b7: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c16be: 0 00 00 00
ffffffff804c16c1: 0 83 bd d0 04 00 00 00 cmpl $0x0,0x4d0(%rbp)
ffffffff804c16c8: 0 75 24 jne ffffffff804c16ee <tcp_ack+0xbd7>
ffffffff804c16ca: 0 83 bd d4 04 00 00 00 cmpl $0x0,0x4d4(%rbp)
ffffffff804c16d1: 0 74 1b je ffffffff804c16ee <tcp_ack+0xbd7>
ffffffff804c16d3: 0 be 18 0a 00 00 mov $0xa18,%esi
ffffffff804c16d8: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c16df: 0 e8 d1 4a d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c16e4: 0 c7 85 d4 04 00 00 00 movl $0x0,0x4d4(%rbp)
ffffffff804c16eb: 0 00 00 00
ffffffff804c16ee: 0 44 89 e0 mov %r12d,%eax
ffffffff804c16f1: 0 83 e0 40 and $0x40,%eax
ffffffff804c16f4: 0 89 44 24 58 mov %eax,0x58(%rsp)
ffffffff804c16f8: 0 74 0a je ffffffff804c1704 <tcp_ack+0xbed>
ffffffff804c16fa: 0 c7 85 6c 05 00 00 00 movl $0x0,0x56c(%rbp)
ffffffff804c1701: 0 00 00 00
ffffffff804c1704: 0 41 f7 c4 00 20 00 00 test $0x2000,%r12d
ffffffff804c170b: 0 0f 84 50 08 00 00 je ffffffff804c1f61 <tcp_ack+0x144a>
ffffffff804c1711: 0 48 8b 15 a0 ff 5e 00 mov 0x5effa0(%rip),%rdx # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1718: 0 48 89 ef mov %rbp,%rdi
ffffffff804c171b: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1720: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff804c1727: 0 00
ffffffff804c1728: 0 89 c0 mov %eax,%eax
ffffffff804c172a: 0 48 f7 d2 not %rdx
ffffffff804c172d: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax
ffffffff804c1731: 0 48 ff 80 00 01 00 00 incq 0x100(%rax)
ffffffff804c1738: 0 e8 df e2 ff ff callq ffffffff804bfa1c <tcp_enter_loss>
ffffffff804c173d: 0 48 8b b5 c0 00 00 00 mov 0xc0(%rbp),%rsi
ffffffff804c1744: 0 fe 85 79 03 00 00 incb 0x379(%rbp)
ffffffff804c174a: 0 48 8d 85 c0 00 00 00 lea 0xc0(%rbp),%rax
ffffffff804c1751: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1754: 0 48 39 c6 cmp %rax,%rsi
ffffffff804c1757: 0 b8 00 00 00 00 mov $0x0,%eax
ffffffff804c175c: 0 48 0f 44 f0 cmove %rax,%rsi
ffffffff804c1760: 0 e8 2d 4b 00 00 callq ffffffff804c6292 <tcp_retransmit_skb>
ffffffff804c1765: 0 8b 95 58 03 00 00 mov 0x358(%rbp),%edx
ffffffff804c176b: 0 b9 30 75 00 00 mov $0x7530,%ecx
ffffffff804c1770: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1775: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1778: 0 e8 66 de ff ff callq ffffffff804bf5e3 <inet_csk_reset_xmit_timer>
ffffffff804c177d: 0 e9 dd 06 00 00 jmpq ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1782: 0 45 84 e4 test %r12b,%r12b
ffffffff804c1785: 0 79 51 jns ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1787: 0 8b 95 70 05 00 00 mov 0x570(%rbp),%edx
ffffffff804c178d: 0 39 95 00 04 00 00 cmp %edx,0x400(%rbp)
ffffffff804c1793: 0 79 43 jns ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1795: 0 80 bd 78 03 00 00 00 cmpb $0x0,0x378(%rbp)
ffffffff804c179c: 0 74 3a je ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c179e: 0 0f b6 85 7f 04 00 00 movzbl 0x47f(%rbp),%eax
ffffffff804c17a5: 0 8b b5 d4 04 00 00 mov 0x4d4(%rbp),%esi
ffffffff804c17ab: 0 39 c6 cmp %eax,%esi
ffffffff804c17ad: 0 76 29 jbe ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c17af: 0 29 c6 sub %eax,%esi
ffffffff804c17b1: 0 48 89 ef mov %rbp,%rdi
ffffffff804c17b4: 0 e8 58 e6 ff ff callq ffffffff804bfe11 <tcp_mark_head_lost>
ffffffff804c17b9: 0 48 8b 05 f8 fe 5e 00 mov 0x5efef8(%rip),%rax # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c17c0: 0 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff804c17c7: 0 00
ffffffff804c17c8: 0 89 d2 mov %edx,%edx
ffffffff804c17ca: 0 48 f7 d0 not %rax
ffffffff804c17cd: 0 48 8b 04 d0 mov (%rax,%rdx,8),%rax
ffffffff804c17d1: 0 48 ff 80 48 01 00 00 incq 0x148(%rax)
ffffffff804c17d8: 0 8b 85 cc 04 00 00 mov 0x4cc(%rbp),%eax
ffffffff804c17de: 0 03 85 d0 04 00 00 add 0x4d0(%rbp),%eax
ffffffff804c17e4: 0 3b 85 74 04 00 00 cmp 0x474(%rbp),%eax
ffffffff804c17ea: 0 76 11 jbe ffffffff804c17fd <tcp_ack+0xce6>
ffffffff804c17ec: 0 be 2e 0a 00 00 mov $0xa2e,%esi
ffffffff804c17f1: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c17f8: 0 e8 b8 49 d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c17fd: 0 8a 85 78 03 00 00 mov 0x378(%rbp),%al
ffffffff804c1803: 0 84 c0 test %al,%al
ffffffff804c1805: 0 75 29 jne ffffffff804c1830 <tcp_ack+0xd19>
ffffffff804c1807: 0 83 bd 78 04 00 00 00 cmpl $0x0,0x478(%rbp)
ffffffff804c180e: 0 74 11 je ffffffff804c1821 <tcp_ack+0xd0a>
ffffffff804c1810: 0 be 33 0a 00 00 mov $0xa33,%esi
ffffffff804c1815: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c181c: 0 e8 94 49 d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1821: 0 c7 85 74 05 00 00 00 movl $0x0,0x574(%rbp)
ffffffff804c1828: 0 00 00 00
ffffffff804c182b: 0 e9 c4 00 00 00 jmpq ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1830: 0 8b 8d 70 05 00 00 mov 0x570(%rbp),%ecx
ffffffff804c1836: 0 8b 95 00 04 00 00 mov 0x400(%rbp),%edx
ffffffff804c183c: 0 39 ca cmp %ecx,%edx
ffffffff804c183e: 0 0f 88 b0 00 00 00 js ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1844: 0 3c 02 cmp $0x2,%al
ffffffff804c1846: 0 74 31 je ffffffff804c1879 <tcp_ack+0xd62>
ffffffff804c1848: 0 77 0a ja ffffffff804c1854 <tcp_ack+0xd3d>
ffffffff804c184a: 0 fe c8 dec %al
ffffffff804c184c: 0 0f 85 a2 00 00 00 jne ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1852: 0 eb 33 jmp ffffffff804c1887 <tcp_ack+0xd70>
ffffffff804c1854: 0 3c 03 cmp $0x3,%al
ffffffff804c1856: 0 74 6f je ffffffff804c18c7 <tcp_ack+0xdb0>
ffffffff804c1858: 0 3c 04 cmp $0x4,%al
ffffffff804c185a: 0 0f 85 94 00 00 00 jne ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1860: 0 c6 85 79 03 00 00 00 movb $0x0,0x379(%rbp)
ffffffff804c1867: 0 48 89 ef mov %rbp,%rdi
ffffffff804c186a: 0 e8 fb d8 ff ff callq ffffffff804bf16a <tcp_try_undo_recovery>
ffffffff804c186f: 0 85 c0 test %eax,%eax
ffffffff804c1871: 0 0f 85 e8 05 00 00 jne ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1877: 0 eb 7b jmp ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1879: 0 39 ca cmp %ecx,%edx
ffffffff804c187b: 0 74 77 je ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c187d: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1880: 0 e8 b8 d9 ff ff callq ffffffff804bf23d <tcp_complete_cwr>
ffffffff804c1885: 0 eb 34 jmp ffffffff804c18bb <tcp_ack+0xda4>
ffffffff804c1887: 0 48 89 ef mov %rbp,%rdi
ffffffff804c188a: 0 e8 63 d9 ff ff callq ffffffff804bf1f2 <tcp_try_undo_dsack>
ffffffff804c188f: 0 83 bd 78 05 00 00 00 cmpl $0x0,0x578(%rbp)
ffffffff804c1896: 0 74 19 je ffffffff804c18b1 <tcp_ack+0xd9a>
ffffffff804c1898: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c189e: 0 c0 e8 04 shr $0x4,%al
ffffffff804c18a1: 0 74 0e je ffffffff804c18b1 <tcp_ack+0xd9a>
ffffffff804c18a3: 0 8b 85 70 05 00 00 mov 0x570(%rbp),%eax
ffffffff804c18a9: 0 39 85 00 04 00 00 cmp %eax,0x400(%rbp)
ffffffff804c18af: 0 74 43 je ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c18b1: 0 c7 85 78 05 00 00 00 movl $0x0,0x578(%rbp)
ffffffff804c18b8: 0 00 00 00
ffffffff804c18bb: 0 31 f6 xor %esi,%esi
ffffffff804c18bd: 0 48 89 ef mov %rbp,%rdi
ffffffff804c18c0: 0 e8 b4 cf ff ff callq ffffffff804be879 <tcp_set_ca_state>
ffffffff804c18c5: 0 eb 2d jmp ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c18c7: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c18cd: 0 c0 e8 04 shr $0x4,%al
ffffffff804c18d0: 0 75 0a jne ffffffff804c18dc <tcp_ack+0xdc5>
ffffffff804c18d2: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c18d9: 0 00 00 00
ffffffff804c18dc: 0 48 89 ef mov %rbp,%rdi
ffffffff804c18df: 0 e8 86 d8 ff ff callq ffffffff804bf16a <tcp_try_undo_recovery>
ffffffff804c18e4: 0 85 c0 test %eax,%eax
ffffffff804c18e6: 0 0f 85 73 05 00 00 jne ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c18ec: 0 48 89 ef mov %rbp,%rdi
ffffffff804c18ef: 0 e8 49 d9 ff ff callq ffffffff804bf23d <tcp_complete_cwr>
ffffffff804c18f4: 0 8a 85 78 03 00 00 mov 0x378(%rbp),%al
ffffffff804c18fa: 0 3c 03 cmp $0x3,%al
ffffffff804c18fc: 0 74 0d je ffffffff804c190b <tcp_ack+0xdf4>
ffffffff804c18fe: 0 3c 04 cmp $0x4,%al
ffffffff804c1900: 0 0f 85 b8 01 00 00 jne ffffffff804c1abe <tcp_ack+0xfa7>
ffffffff804c1906: 0 e9 c4 00 00 00 jmpq ffffffff804c19cf <tcp_ack+0xeb8>
ffffffff804c190b: 0 41 f7 c4 00 04 00 00 test $0x400,%r12d
ffffffff804c1912: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1918: 0 75 1e jne ffffffff804c1938 <tcp_ack+0xe21>
ffffffff804c191a: 0 c0 e8 04 shr $0x4,%al
ffffffff804c191d: 0 0f 85 fd 03 00 00 jne ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c1923: 0 85 db test %ebx,%ebx
ffffffff804c1925: 0 0f 84 f5 03 00 00 je ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c192b: 0 48 89 ef mov %rbp,%rdi
ffffffff804c192e: 0 e8 54 dd ff ff callq ffffffff804bf687 <tcp_add_reno_sack>
ffffffff804c1933: 0 e9 e8 03 00 00 jmpq ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c1938: 0 c0 e8 04 shr $0x4,%al
ffffffff804c193b: 0 41 bd 01 00 00 00 mov $0x1,%r13d
ffffffff804c1941: 0 74 18 je ffffffff804c195b <tcp_ack+0xe44>
ffffffff804c1943: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1946: 0 45 31 ed xor %r13d,%r13d
ffffffff804c1949: 0 e8 08 d6 ff ff callq ffffffff804bef56 <tcp_fackets_out>
ffffffff804c194e: 0 0f b6 95 7f 04 00 00 movzbl 0x47f(%rbp),%edx
ffffffff804c1955: 0 39 d0 cmp %edx,%eax
ffffffff804c1957: 0 41 0f 9f c5 setg %r13b
ffffffff804c195b: 0 48 89 ef mov %rbp,%rdi
ffffffff804c195e: 0 e8 c9 d7 ff ff callq ffffffff804bf12c <tcp_may_undo>
ffffffff804c1963: 0 85 c0 test %eax,%eax
ffffffff804c1965: 0 0f 84 b5 03 00 00 je ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c196b: 0 83 bd 78 04 00 00 00 cmpl $0x0,0x478(%rbp)
ffffffff804c1972: 0 75 0a jne ffffffff804c197e <tcp_ack+0xe67>
ffffffff804c1974: 0 c7 85 74 05 00 00 00 movl $0x0,0x574(%rbp)
ffffffff804c197b: 0 00 00 00
ffffffff804c197e: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1981: 0 45 31 ed xor %r13d,%r13d
ffffffff804c1984: 0 e8 cd d5 ff ff callq ffffffff804bef56 <tcp_fackets_out>
ffffffff804c1989: 0 44 29 7c 24 14 sub %r15d,0x14(%rsp)
ffffffff804c198e: 0 ba 01 00 00 00 mov $0x1,%edx
ffffffff804c1993: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1996: 0 8b 74 24 14 mov 0x14(%rsp),%esi
ffffffff804c199a: 0 01 c6 add %eax,%esi
ffffffff804c199c: 0 e8 ed d3 ff ff callq ffffffff804bed8e <tcp_update_reordering>
ffffffff804c19a1: 0 31 f6 xor %esi,%esi
ffffffff804c19a3: 0 48 89 ef mov %rbp,%rdi
ffffffff804c19a6: 0 e8 ed d6 ff ff callq ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c19ab: 0 48 8b 15 06 fd 5e 00 mov 0x5efd06(%rip),%rdx # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c19b2: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff804c19b9: 0 00
ffffffff804c19ba: 0 89 c0 mov %eax,%eax
ffffffff804c19bc: 0 48 f7 d2 not %rdx
ffffffff804c19bf: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax
ffffffff804c19c3: 0 48 ff 80 30 01 00 00 incq 0x130(%rax)
ffffffff804c19ca: 0 e9 51 03 00 00 jmpq ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c19cf: 0 45 84 f6 test %r14b,%r14b
ffffffff804c19d2: 0 74 07 je ffffffff804c19db <tcp_ack+0xec4>
ffffffff804c19d4: 0 c6 85 79 03 00 00 00 movb $0x0,0x379(%rbp)
ffffffff804c19db: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c19e1: 0 c0 e8 04 shr $0x4,%al
ffffffff804c19e4: 0 75 13 jne ffffffff804c19f9 <tcp_ack+0xee2>
ffffffff804c19e6: 0 41 f7 c4 00 04 00 00 test $0x400,%r12d
ffffffff804c19ed: 0 74 0a je ffffffff804c19f9 <tcp_ack+0xee2>
ffffffff804c19ef: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c19f6: 0 00 00 00
ffffffff804c19f9: 0 48 89 ef mov %rbp,%rdi
ffffffff804c19fc: 0 e8 2b d7 ff ff callq ffffffff804bf12c <tcp_may_undo>
ffffffff804c1a01: 0 85 c0 test %eax,%eax
ffffffff804c1a03: 0 0f 84 6e 05 00 00 je ffffffff804c1f77 <tcp_ack+0x1460>
ffffffff804c1a09: 0 48 8b 95 c0 00 00 00 mov 0xc0(%rbp),%rdx
ffffffff804c1a10: 0 48 8d 8d c0 00 00 00 lea 0xc0(%rbp),%rcx
ffffffff804c1a17: 0 eb 10 jmp ffffffff804c1a29 <tcp_ack+0xf12>
ffffffff804c1a19: 0 48 3b 95 d8 01 00 00 cmp 0x1d8(%rbp),%rdx
ffffffff804c1a20: 0 74 12 je ffffffff804c1a34 <tcp_ack+0xf1d>
ffffffff804c1a22: 0 80 62 5d fb andb $0xfb,0x5d(%rdx)
ffffffff804c1a26: 0 48 8b 12 mov (%rdx),%rdx
ffffffff804c1a29: 0 48 8b 02 mov (%rdx),%rax
ffffffff804c1a2c: 0 48 39 ca cmp %rcx,%rdx
ffffffff804c1a2f: 0 0f 18 08 prefetcht0 (%rax)
ffffffff804c1a32: 0 75 e5 jne ffffffff804c1a19 <tcp_ack+0xf02>
ffffffff804c1a34: 0 48 c7 85 e0 04 00 00 movq $0x0,0x4e0(%rbp)
ffffffff804c1a3b: 0 00 00 00 00
ffffffff804c1a3f: 0 48 c7 85 e8 04 00 00 movq $0x0,0x4e8(%rbp)
ffffffff804c1a46: 0 00 00 00 00
ffffffff804c1a4a: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1a4f: 0 48 c7 85 f0 04 00 00 movq $0x0,0x4f0(%rbp)
ffffffff804c1a56: 0 00 00 00 00
ffffffff804c1a5a: 0 c7 85 cc 04 00 00 00 movl $0x0,0x4cc(%rbp)
ffffffff804c1a61: 0 00 00 00
ffffffff804c1a64: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1a67: 0 e8 2c d6 ff ff callq ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c1a6c: 0 48 8b 15 45 fc 5e 00 mov 0x5efc45(%rip),%rdx # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1a73: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff804c1a7a: 0 00
ffffffff804c1a7b: 0 89 c0 mov %eax,%eax
ffffffff804c1a7d: 0 48 f7 d2 not %rdx
ffffffff804c1a80: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax
ffffffff804c1a84: 0 48 ff 80 40 01 00 00 incq 0x140(%rax)
ffffffff804c1a8b: 0 c6 85 79 03 00 00 00 movb $0x0,0x379(%rbp)
ffffffff804c1a92: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1a98: 0 c7 85 78 05 00 00 00 movl $0x0,0x578(%rbp)
ffffffff804c1a9f: 0 00 00 00
ffffffff804c1aa2: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1aa5: 0 74 0a je ffffffff804c1ab1 <tcp_ack+0xf9a>
ffffffff804c1aa7: 0 31 f6 xor %esi,%esi
ffffffff804c1aa9: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1aac: 0 e8 c8 cd ff ff callq ffffffff804be879 <tcp_set_ca_state>
ffffffff804c1ab1: 0 80 bd 78 03 00 00 00 cmpb $0x0,0x378(%rbp)
ffffffff804c1ab8: 0 0f 85 a1 03 00 00 jne ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1abe: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1ac4: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1ac7: 0 75 1f jne ffffffff804c1ae8 <tcp_ack+0xfd1>
ffffffff804c1ac9: 0 41 f7 c4 00 04 00 00 test $0x400,%r12d
ffffffff804c1ad0: 0 74 0a je ffffffff804c1adc <tcp_ack+0xfc5>
ffffffff804c1ad2: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c1ad9: 0 00 00 00
ffffffff804c1adc: 0 85 db test %ebx,%ebx
ffffffff804c1ade: 0 74 08 je ffffffff804c1ae8 <tcp_ack+0xfd1>
ffffffff804c1ae0: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1ae3: 0 e8 9f db ff ff callq ffffffff804bf687 <tcp_add_reno_sack>
ffffffff804c1ae8: 0 80 bd 78 03 00 00 01 cmpb $0x1,0x378(%rbp)
ffffffff804c1aef: 0 75 08 jne ffffffff804c1af9 <tcp_ack+0xfe2>
ffffffff804c1af1: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1af4: 0 e8 f9 d6 ff ff callq ffffffff804bf1f2 <tcp_try_undo_dsack>
ffffffff804c1af9: 0 80 bd 5e 04 00 00 00 cmpb $0x0,0x45e(%rbp)
ffffffff804c1b00: 0 0f 85 90 00 00 00 jne ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b06: 0 83 bd cc 04 00 00 00 cmpl $0x0,0x4cc(%rbp)
ffffffff804c1b0d: 0 0f 85 79 04 00 00 jne ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b13: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1b19: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1b1c: 0 a8 02 test $0x2,%al
ffffffff804c1b1e: 0 74 08 je ffffffff804c1b28 <tcp_ack+0x1011>
ffffffff804c1b20: 0 8b 95 d4 04 00 00 mov 0x4d4(%rbp),%edx
ffffffff804c1b26: 0 eb 08 jmp ffffffff804c1b30 <tcp_ack+0x1019>
ffffffff804c1b28: 0 8b 95 d0 04 00 00 mov 0x4d0(%rbp),%edx
ffffffff804c1b2e: 0 ff c2 inc %edx
ffffffff804c1b30: 0 0f b6 85 7f 04 00 00 movzbl 0x47f(%rbp),%eax
ffffffff804c1b37: 0 39 c2 cmp %eax,%edx
ffffffff804c1b39: 0 0f 8f 4d 04 00 00 jg ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b3f: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1b45: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1b48: 0 a8 02 test $0x2,%al
ffffffff804c1b4a: 0 74 10 je ffffffff804c1b5c <tcp_ack+0x1045>
ffffffff804c1b4c: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1b4f: 0 e8 1d d4 ff ff callq ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1b54: 0 85 c0 test %eax,%eax
ffffffff804c1b56: 0 0f 85 30 04 00 00 jne ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b5c: 0 0f b6 85 7f 04 00 00 movzbl 0x47f(%rbp),%eax
ffffffff804c1b63: 0 8b 95 74 04 00 00 mov 0x474(%rbp),%edx
ffffffff804c1b69: 0 39 c2 cmp %eax,%edx
ffffffff804c1b6b: 0 77 29 ja ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b6d: 0 89 d0 mov %edx,%eax
ffffffff804c1b6f: 0 d1 e8 shr %eax
ffffffff804c1b71: 0 39 05 c1 68 3f 00 cmp %eax,0x3f68c1(%rip) # ffffffff808b8438 <sysctl_tcp_reordering>
ffffffff804c1b77: 0 0f 43 05 ba 68 3f 00 cmovae 0x3f68ba(%rip),%eax # ffffffff808b8438 <sysctl_tcp_reordering>
ffffffff804c1b7e: 0 39 85 d0 04 00 00 cmp %eax,0x4d0(%rbp)
ffffffff804c1b84: 0 72 10 jb ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b86: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1b89: 0 e8 82 37 00 00 callq ffffffff804c5310 <tcp_may_send_now>
ffffffff804c1b8e: 0 85 c0 test %eax,%eax
ffffffff804c1b90: 0 0f 84 f6 03 00 00 je ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b96: 0 8b 85 cc 04 00 00 mov 0x4cc(%rbp),%eax
ffffffff804c1b9c: 0 03 85 d0 04 00 00 add 0x4d0(%rbp),%eax
ffffffff804c1ba2: 0 3b 85 74 04 00 00 cmp 0x474(%rbp),%eax
ffffffff804c1ba8: 0 76 11 jbe ffffffff804c1bbb <tcp_ack+0x10a4>
ffffffff804c1baa: 0 be d7 09 00 00 mov $0x9d7,%esi
ffffffff804c1baf: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c1bb6: 0 e8 fa 45 d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1bbb: 0 80 bd 5e 04 00 00 00 cmpb $0x0,0x45e(%rbp)
ffffffff804c1bc2: 0 75 13 jne ffffffff804c1bd7 <tcp_ack+0x10c0>
ffffffff804c1bc4: 0 83 bd 78 04 00 00 00 cmpl $0x0,0x478(%rbp)
ffffffff804c1bcb: 0 75 0a jne ffffffff804c1bd7 <tcp_ack+0x10c0>
ffffffff804c1bcd: 0 c7 85 74 05 00 00 00 movl $0x0,0x574(%rbp)
ffffffff804c1bd4: 0 00 00 00
ffffffff804c1bd7: 0 83 7c 24 58 00 cmpl $0x0,0x58(%rsp)
ffffffff804c1bdc: 0 74 0d je ffffffff804c1beb <tcp_ack+0x10d4>
ffffffff804c1bde: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1be3: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1be6: 0 e8 cf d0 ff ff callq ffffffff804becba <tcp_enter_cwr>
ffffffff804c1beb: 0 80 bd 78 03 00 00 02 cmpb $0x2,0x378(%rbp)
ffffffff804c1bf2: 0 74 15 je ffffffff804c1c09 <tcp_ack+0x10f2>
ffffffff804c1bf4: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1bf7: 0 e8 71 d6 ff ff callq ffffffff804bf26d <tcp_try_keep_open>
ffffffff804c1bfc: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1bff: 0 e8 a9 d3 ff ff callq ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1c04: 0 e9 56 02 00 00 jmpq ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c09: 0 44 89 e6 mov %r12d,%esi
ffffffff804c1c0c: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1c0f: 0 e8 d9 d3 ff ff callq ffffffff804befed <tcp_cwnd_down>
ffffffff804c1c14: 0 e9 46 02 00 00 jmpq ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c19: 0 8b 95 a4 03 00 00 mov 0x3a4(%rbp),%edx
ffffffff804c1c1f: 0 85 d2 test %edx,%edx
ffffffff804c1c21: 0 74 34 je ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1c23: 0 8b 85 b0 05 00 00 mov 0x5b0(%rbp),%eax
ffffffff804c1c29: 0 39 85 00 04 00 00 cmp %eax,0x400(%rbp)
ffffffff804c1c2f: 0 75 26 jne ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1c31: 0 ff 85 ac 04 00 00 incl 0x4ac(%rbp)
ffffffff804c1c37: 0 8d 42 ff lea -0x1(%rdx),%eax
ffffffff804c1c3a: 0 c7 85 a4 03 00 00 00 movl $0x0,0x3a4(%rbp)
ffffffff804c1c41: 0 00 00 00
ffffffff804c1c44: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1c47: 0 89 85 9c 03 00 00 mov %eax,0x39c(%rbp)
ffffffff804c1c4d: 0 e8 86 54 00 00 callq ffffffff804c70d8 <tcp_simple_retransmit>
ffffffff804c1c52: 0 e9 08 02 00 00 jmpq ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c57: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1c5d: 0 48 8b 15 54 fa 5e 00 mov 0x5efa54(%rip),%rdx # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1c64: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1c67: 0 48 f7 d2 not %rdx
ffffffff804c1c6a: 0 3c 01 cmp $0x1,%al
ffffffff804c1c6c: 0 19 c9 sbb %ecx,%ecx
ffffffff804c1c6e: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff804c1c75: 0 00
ffffffff804c1c76: 0 89 c0 mov %eax,%eax
ffffffff804c1c78: 0 83 c1 1f add $0x1f,%ecx
ffffffff804c1c7b: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax
ffffffff804c1c7f: 0 48 63 c9 movslq %ecx,%rcx
ffffffff804c1c82: 0 48 ff 04 c8 incq (%rax,%rcx,8)
ffffffff804c1c86: 0 c7 85 6c 05 00 00 00 movl $0x0,0x56c(%rbp)
ffffffff804c1c8d: 0 00 00 00
ffffffff804c1c90: 0 8b 85 fc 03 00 00 mov 0x3fc(%rbp),%eax
ffffffff804c1c96: 0 80 bd 78 03 00 00 01 cmpb $0x1,0x378(%rbp)
ffffffff804c1c9d: 0 89 85 70 05 00 00 mov %eax,0x570(%rbp)
ffffffff804c1ca3: 0 8b 85 00 04 00 00 mov 0x400(%rbp),%eax
ffffffff804c1ca9: 0 89 85 78 05 00 00 mov %eax,0x578(%rbp)
ffffffff804c1caf: 0 8b 85 78 04 00 00 mov 0x478(%rbp),%eax
ffffffff804c1cb5: 0 89 85 7c 05 00 00 mov %eax,0x57c(%rbp)
ffffffff804c1cbb: 0 77 3b ja ffffffff804c1cf8 <tcp_ack+0x11e1>
ffffffff804c1cbd: 0 83 7c 24 58 00 cmpl $0x0,0x58(%rsp)
ffffffff804c1cc2: 0 75 0e jne ffffffff804c1cd2 <tcp_ack+0x11bb>
ffffffff804c1cc4: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1cc7: 0 e8 f0 cb ff ff callq ffffffff804be8bc <tcp_current_ssthresh>
ffffffff804c1ccc: 0 89 85 6c 05 00 00 mov %eax,0x56c(%rbp)
ffffffff804c1cd2: 0 48 8b 85 60 03 00 00 mov 0x360(%rbp),%rax
ffffffff804c1cd9: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1cdc: 0 ff 50 28 callq *0x28(%rax)
ffffffff804c1cdf: 0 89 85 a8 04 00 00 mov %eax,0x4a8(%rbp)
ffffffff804c1ce5: 0 8a 85 7e 04 00 00 mov 0x47e(%rbp),%al
ffffffff804c1ceb: 0 a8 01 test $0x1,%al
ffffffff804c1ced: 0 74 09 je ffffffff804c1cf8 <tcp_ack+0x11e1>
ffffffff804c1cef: 0 83 c8 02 or $0x2,%eax
ffffffff804c1cf2: 0 88 85 7e 04 00 00 mov %al,0x47e(%rbp)
ffffffff804c1cf8: 0 c7 85 dc 04 00 00 00 movl $0x0,0x4dc(%rbp)
ffffffff804c1cff: 0 00 00 00
ffffffff804c1d02: 0 c7 85 b0 04 00 00 00 movl $0x0,0x4b0(%rbp)
ffffffff804c1d09: 0 00 00 00
ffffffff804c1d0c: 0 be 03 00 00 00 mov $0x3,%esi
ffffffff804c1d11: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1d14: 0 bb 01 00 00 00 mov $0x1,%ebx
ffffffff804c1d19: 0 e8 5b cb ff ff callq ffffffff804be879 <tcp_set_ca_state>
ffffffff804c1d1e: 0 eb 02 jmp ffffffff804c1d22 <tcp_ack+0x120b>
ffffffff804c1d20: 0 31 db xor %ebx,%ebx
ffffffff804c1d22: 0 45 85 ed test %r13d,%r13d
ffffffff804c1d25: 0 75 21 jne ffffffff804c1d48 <tcp_ack+0x1231>
ffffffff804c1d27: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1d2d: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1d30: 0 a8 02 test $0x2,%al
ffffffff804c1d32: 0 0f 84 0b 01 00 00 je ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1d38: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1d3b: 0 e8 31 d2 ff ff callq ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1d40: 0 85 c0 test %eax,%eax
ffffffff804c1d42: 0 0f 84 fb 00 00 00 je ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1d48: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1d4e: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1d51: 0 75 07 jne ffffffff804c1d5a <tcp_ack+0x1243>
ffffffff804c1d53: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1d58: 0 eb 31 jmp ffffffff804c1d8b <tcp_ack+0x1274>
ffffffff804c1d5a: 0 a8 02 test $0x2,%al
ffffffff804c1d5c: 0 8a 85 7f 04 00 00 mov 0x47f(%rbp),%al
ffffffff804c1d62: 0 74 17 je ffffffff804c1d7b <tcp_ack+0x1264>
ffffffff804c1d64: 0 8b b5 d4 04 00 00 mov 0x4d4(%rbp),%esi
ffffffff804c1d6a: 0 0f b6 c0 movzbl %al,%eax
ffffffff804c1d6d: 0 29 c6 sub %eax,%esi
ffffffff804c1d6f: 0 b8 01 00 00 00 mov $0x1,%eax
ffffffff804c1d74: 0 85 f6 test %esi,%esi
ffffffff804c1d76: 0 0f 4e f0 cmovle %eax,%esi
ffffffff804c1d79: 0 eb 10 jmp ffffffff804c1d8b <tcp_ack+0x1274>
ffffffff804c1d7b: 0 8b b5 d0 04 00 00 mov 0x4d0(%rbp),%esi
ffffffff804c1d81: 0 0f b6 c0 movzbl %al,%eax
ffffffff804c1d84: 0 29 c6 sub %eax,%esi
ffffffff804c1d86: 0 39 f3 cmp %esi,%ebx
ffffffff804c1d88: 0 0f 4d f3 cmovge %ebx,%esi
ffffffff804c1d8b: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1d8e: 0 e8 7e e0 ff ff callq ffffffff804bfe11 <tcp_mark_head_lost>
ffffffff804c1d93: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1d99: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1d9c: 0 a8 02 test $0x2,%al
ffffffff804c1d9e: 0 0f 84 9f 00 00 00 je
ffffffff804c1e43 <tcp_ack+0x132c> ffffffff804c1da4: 0 48 89 ef mov %rbp,%rdi ffffffff804c1da7: 0 e8 c5 d1 ff ff callq ffffffff804bef71 <tcp_head_timedout> ffffffff804c1dac: 0 85 c0 test %eax,%eax ffffffff804c1dae: 0 0f 84 8f 00 00 00 je ffffffff804c1e43 <tcp_ack+0x132c> ffffffff804c1db4: 0 48 8b 85 e8 04 00 00 mov 0x4e8(%rbp),%rax ffffffff804c1dbb: 0 48 85 c0 test %rax,%rax ffffffff804c1dbe: 0 48 89 c3 mov %rax,%rbx ffffffff804c1dc1: 0 75 42 jne ffffffff804c1e05 <tcp_ack+0x12ee> ffffffff804c1dc3: 0 48 8b 9d c0 00 00 00 mov 0xc0(%rbp),%rbx ffffffff804c1dca: 0 48 8d 85 c0 00 00 00 lea 0xc0(%rbp),%rax ffffffff804c1dd1: 0 48 39 c3 cmp %rax,%rbx ffffffff804c1dd4: 0 75 2f jne ffffffff804c1e05 <tcp_ack+0x12ee> ffffffff804c1dd6: 0 31 db xor %ebx,%ebx ffffffff804c1dd8: 0 eb 2b jmp ffffffff804c1e05 <tcp_ack+0x12ee> ffffffff804c1dda: 0 48 3b 9d d8 01 00 00 cmp 0x1d8(%rbp),%rbx ffffffff804c1de1: 0 74 34 je ffffffff804c1e17 <tcp_ack+0x1300> ffffffff804c1de3: 0 48 8b 05 96 7a 3f 00 mov 0x3f7a96(%rip),%rax # ffffffff808b9880 <jiffies> ffffffff804c1dea: 0 2b 43 58 sub 0x58(%rbx),%eax ffffffff804c1ded: 0 3b 85 58 03 00 00 cmp 0x358(%rbp),%eax ffffffff804c1df3: 0 76 22 jbe ffffffff804c1e17 <tcp_ack+0x1300> ffffffff804c1df5: 0 48 89 de mov %rbx,%rsi ffffffff804c1df8: 0 48 89 ef mov %rbp,%rdi ffffffff804c1dfb: 0 e8 28 d0 ff ff callq ffffffff804bee28 <tcp_skb_mark_lost> ffffffff804c1e00: 0 48 8b 1b mov (%rbx),%rbx ffffffff804c1e03: 0 eb 07 jmp ffffffff804c1e0c <tcp_ack+0x12f5> ffffffff804c1e05: 0 4c 8d ad c0 00 00 00 lea 0xc0(%rbp),%r13 ffffffff804c1e0c: 0 48 8b 03 mov (%rbx),%rax ffffffff804c1e0f: 0 4c 39 eb cmp %r13,%rbx ffffffff804c1e12: 0 0f 18 08 prefetcht0 (%rax) ffffffff804c1e15: 0 75 c3 jne ffffffff804c1dda <tcp_ack+0x12c3> ffffffff804c1e17: 0 8b 85 cc 04 00 00 mov 0x4cc(%rbp),%eax ffffffff804c1e1d: 0 03 85 d0 04 00 00 add 0x4d0(%rbp),%eax ffffffff804c1e23: 0 3b 85 74 04 00 00 cmp 0x474(%rbp),%eax ffffffff804c1e29: 0 48 89 9d e8 04 00 00 mov %rbx,0x4e8(%rbp) ffffffff804c1e30: 0 
76 11 jbe ffffffff804c1e43 <tcp_ack+0x132c> ffffffff804c1e32: 0 be e5 08 00 00 mov $0x8e5,%esi ffffffff804c1e37: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi ffffffff804c1e3e: 0 e8 72 43 d7 ff callq ffffffff802361b5 <warn_on_slowpath> ffffffff804c1e43: 0 44 89 e6 mov %r12d,%esi ffffffff804c1e46: 0 48 89 ef mov %rbp,%rdi ffffffff804c1e49: 0 e8 9f d1 ff ff callq ffffffff804befed <tcp_cwnd_down> ffffffff804c1e4e: 0 e9 2c 01 00 00 jmpq ffffffff804c1f7f <tcp_ack+0x1468> ffffffff804c1e53: 47 8b 74 24 1c mov 0x1c(%rsp),%esi ffffffff804c1e57: 513 48 89 ef mov %rbp,%rdi ffffffff804c1e5a: 0 e8 62 d4 ff ff callq ffffffff804bf2c1 <tcp_cong_avoid> ffffffff804c1e5f: 427 41 80 e4 34 and $0x34,%r12b ffffffff804c1e63: 1234 75 07 jne ffffffff804c1e6c <tcp_ack+0x1355> ffffffff804c1e65: 0 83 7c 24 54 00 cmpl $0x0,0x54(%rsp) ffffffff804c1e6a: 0 75 3c jne ffffffff804c1ea8 <tcp_ack+0x1391> ffffffff804c1e6c: 0 48 8b 7d 78 mov 0x78(%rbp),%rdi ffffffff804c1e70: 916 e8 8d c9 ff ff callq ffffffff804be802 <dst_confirm> ffffffff804c1e75: 3 eb 31 jmp ffffffff804c1ea8 <tcp_ack+0x1391> ffffffff804c1e77: 0 48 8b 95 d8 01 00 00 mov 0x1d8(%rbp),%rdx ffffffff804c1e7e: 99 48 85 d2 test %rdx,%rdx ffffffff804c1e81: 16 74 25 je ffffffff804c1ea8 <tcp_ack+0x1391> ffffffff804c1e83: 0 8b 85 44 04 00 00 mov 0x444(%rbp),%eax ffffffff804c1e89: 0 03 85 00 04 00 00 add 0x400(%rbp),%eax ffffffff804c1e8f: 0 3b 42 54 cmp 0x54(%rdx),%eax ffffffff804c1e92: 0 78 1e js ffffffff804c1eb2 <tcp_ack+0x139b> ffffffff804c1e94: 0 c6 85 7b 03 00 00 00 movb $0x0,0x37b(%rbp) ffffffff804c1e9b: 0 be 03 00 00 00 mov $0x3,%esi ffffffff804c1ea0: 0 48 89 ef mov %rbp,%rdi ffffffff804c1ea3: 0 e8 ab c9 ff ff callq ffffffff804be853 <inet_csk_clear_xmit_timer> ffffffff804c1ea8: 520 b8 01 00 00 00 mov $0x1,%eax ffffffff804c1ead: 994 e9 ec 00 00 00 jmpq ffffffff804c1f9e <tcp_ack+0x1487> ffffffff804c1eb2: 0 0f b6 8d 7b 03 00 00 movzbl 0x37b(%rbp),%ecx ffffffff804c1eb9: 0 8b 95 58 03 00 00 mov 0x358(%rbp),%edx ffffffff804c1ebf: 0 b8 30 75 
00 00 mov $0x7530,%eax ffffffff804c1ec4: 0 be 03 00 00 00 mov $0x3,%esi ffffffff804c1ec9: 0 48 89 ef mov %rbp,%rdi ffffffff804c1ecc: 0 d3 e2 shl %cl,%edx ffffffff804c1ece: 0 b9 30 75 00 00 mov $0x7530,%ecx ffffffff804c1ed3: 0 81 fa 30 75 00 00 cmp $0x7530,%edx ffffffff804c1ed9: 0 0f 47 d0 cmova %eax,%edx ffffffff804c1edc: 0 89 d2 mov %edx,%edx ffffffff804c1ede: 0 e8 00 d7 ff ff callq ffffffff804bf5e3 <inet_csk_reset_xmit_timer> ffffffff804c1ee3: 0 eb c3 jmp ffffffff804c1ea8 <tcp_ack+0x1391> ffffffff804c1ee5: 0 80 78 25 00 cmpb $0x0,0x25(%rax) ffffffff804c1ee9: 0 74 1a je ffffffff804c1f05 <tcp_ack+0x13ee> ffffffff804c1eeb: 0 8b 54 24 18 mov 0x18(%rsp),%edx ffffffff804c1eef: 0 e8 cc e3 ff ff callq ffffffff804c02c0 <tcp_sacktag_write_queue> ffffffff804c1ef4: 0 80 bd 78 03 00 00 00 cmpb $0x0,0x378(%rbp) ffffffff804c1efb: 0 75 08 jne ffffffff804c1f05 <tcp_ack+0x13ee> ffffffff804c1efd: 0 48 89 ef mov %rbp,%rdi ffffffff804c1f00: 0 e8 68 d3 ff ff callq ffffffff804bf26d <tcp_try_keep_open> ffffffff804c1f05: 0 48 85 ed test %rbp,%rbp ffffffff804c1f08: 0 74 2f je ffffffff804c1f39 <tcp_ack+0x1422> ffffffff804c1f0a: 0 be 0a 00 00 00 mov $0xa,%esi ffffffff804c1f0f: 0 48 89 ef mov %rbp,%rdi ffffffff804c1f12: 0 e8 f1 d5 ff ff callq ffffffff804bf508 <sock_flag> ffffffff804c1f17: 0 85 c0 test %eax,%eax ffffffff804c1f19: 0 74 1e je ffffffff804c1f39 <tcp_ack+0x1422> ffffffff804c1f1b: 0 8b 8d fc 03 00 00 mov 0x3fc(%rbp),%ecx ffffffff804c1f21: 0 8b 95 00 04 00 00 mov 0x400(%rbp),%edx ffffffff804c1f27: 0 48 c7 c7 e5 d9 6a 80 mov $0xffffffff806ad9e5,%rdi ffffffff804c1f2e: 0 8b 74 24 1c mov 0x1c(%rsp),%esi ffffffff804c1f32: 0 31 c0 xor %eax,%eax ffffffff804c1f34: 0 e8 3b 4e d7 ff callq ffffffff80236d74 <printk> ffffffff804c1f39: 0 31 c0 xor %eax,%eax ffffffff804c1f3b: 0 eb 61 jmp ffffffff804c1f9e <tcp_ack+0x1487> ffffffff804c1f3d: 0 c7 44 24 44 00 00 00 movl $0x0,0x44(%rsp) ffffffff804c1f44: 0 00 ffffffff804c1f45: 0 e9 c3 ef ff ff jmpq ffffffff804c0f0d <tcp_ack+0x3f6> ffffffff804c1f4a: 54 
41 f6 c4 04 test $0x4,%r12b ffffffff804c1f4e: 424 0f 84 0b ff ff ff je ffffffff804c1e5f <tcp_ack+0x1348> ffffffff804c1f54: 364 85 c9 test %ecx,%ecx ffffffff804c1f56: 0 0f 84 f7 fe ff ff je ffffffff804c1e53 <tcp_ack+0x133c> ffffffff804c1f5c: 0 e9 fe fe ff ff jmpq ffffffff804c1e5f <tcp_ack+0x1348> ffffffff804c1f61: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al ffffffff804c1f67: 0 c0 e8 04 shr $0x4,%al ffffffff804c1f6a: 0 a8 02 test $0x2,%al ffffffff804c1f6c: 0 0f 85 10 f8 ff ff jne ffffffff804c1782 <tcp_ack+0xc6b> ffffffff804c1f72: 0 e9 61 f8 ff ff jmpq ffffffff804c17d8 <tcp_ack+0xcc1> ffffffff804c1f77: 0 48 89 ef mov %rbp,%rdi ffffffff804c1f7a: 0 e8 2e d0 ff ff callq ffffffff804befad <tcp_moderate_cwnd> ffffffff804c1f7f: 0 48 89 ef mov %rbp,%rdi ffffffff804c1f82: 0 e8 f7 47 00 00 callq ffffffff804c677e <tcp_xmit_retransmit_queue> ffffffff804c1f87: 0 e9 d3 fe ff ff jmpq ffffffff804c1e5f <tcp_ack+0x1348> ffffffff804c1f8c: 0 80 bd 78 03 00 00 01 cmpb $0x1,0x378(%rbp) ffffffff804c1f93: 0 0f 87 be fc ff ff ja ffffffff804c1c57 <tcp_ack+0x1140> ffffffff804c1f99: 0 e9 7b fc ff ff jmpq ffffffff804c1c19 <tcp_ack+0x1102> ffffffff804c1f9e: 493 48 81 c4 88 00 00 00 add $0x88,%rsp ffffffff804c1fa5: 1288 5b pop %rbx ffffffff804c1fa6: 0 5d pop %rbp ffffffff804c1fa7: 446 41 5c pop %r12 ffffffff804c1fa9: 0 41 5d pop %r13 ffffffff804c1fab: 2 41 5e pop %r14 ffffffff804c1fad: 447 41 5f pop %r15 ffffffff804c1faf: 0 c3 retq

No real obvious single-instruction hotspots i can see. But i can see another problem: the function is too large and its flow is not fall-through in any way. As you can see from the profile distribution, it is broken into 25-30 separate code sequences. The function consists of more than 1200 instructions and is 5200 bytes large. According to the profile above, only ~350 of those instructions are executed by this workload and the other ~850 are never used. So in theory this function should only take up ~1.5K of the instruction cache.
But because execution is spread out into 25+ smaller pieces, it takes up ~4K of the instruction cache instead (there's a single ~1.2K hole in the middle, i subtracted that) - 2-3 times larger than it should be. So this code could make good use of the (brand-new ;-) branch-tracer ftrace plugin and grow a few well-placed likely()/unlikely() annotations - at least for this workload. I think.

	Ingo
* tcp_recvmsg(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (7 preceding siblings ...) 2008-11-17 21:09 ` tcp_ack(): " Ingo Molnar @ 2008-11-17 21:19 ` Ingo Molnar 2008-11-17 21:26 ` eth_type_trans(): " Ingo Molnar ` (5 subsequent siblings) 14 siblings, 0 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 21:19 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 1.833688 tcp_recvmsg hits (total: 183368) ......... ffffffff804bd46e: 882 <tcp_recvmsg>: ffffffff804bd46e: 882 41 57 push %r15 ffffffff804bd470: 15507 48 89 f7 mov %rsi,%rdi ffffffff804bd473: 179 41 56 push %r14 ffffffff804bd475: 0 49 89 ce mov %rcx,%r14 ffffffff804bd478: 744 41 55 push %r13 ffffffff804bd47a: 165 41 54 push %r12 ffffffff804bd47c: 0 45 89 c4 mov %r8d,%r12d ffffffff804bd47f: 692 55 push %rbp ffffffff804bd480: 178 44 89 cd mov %r9d,%ebp ffffffff804bd483: 3434 53 push %rbx ffffffff804bd484: 685 48 89 f3 mov %rsi,%rbx ffffffff804bd487: 11 48 83 ec 68 sub $0x68,%rsp ffffffff804bd48b: 949 48 89 54 24 30 mov %rdx,0x30(%rsp) ffffffff804bd490: 7 e8 e8 e8 ff ff callq ffffffff804bbd7d <lock_sock> ffffffff804bd495: 1771 8a 43 02 mov 0x2(%rbx),%al ffffffff804bd498: 6176 3c 0a cmp $0xa,%al ffffffff804bd49a: 0 0f 84 3a 06 00 00 je ffffffff804bdada <tcp_recvmsg+0x66c> ffffffff804bd4a0: 3121 31 c0 xor %eax,%eax ffffffff804bd4a2: 195 45 85 e4 test %r12d,%r12d ffffffff804bd4a5: 0 75 07 jne ffffffff804bd4ae <tcp_recvmsg+0x40> ffffffff804bd4a7: 926 48 8b 83 68 01 00 00 mov 0x168(%rbx),%rax ffffffff804bd4ae: 189 40 f6 c5 01 test $0x1,%bpl ffffffff804bd4b2: 0 48 89 44 24 58 mov %rax,0x58(%rsp) ffffffff804bd4b7: 819 0f 85 33 06 00 00 jne ffffffff804bdaf0 <tcp_recvmsg+0x682> ffffffff804bd4bd: 216 89 e8 mov %ebp,%eax ffffffff804bd4bf: 0 83 e0 02 and $0x2,%eax 
ffffffff804bd4c2: 638 89 44 24 3c mov %eax,0x3c(%rsp) ffffffff804bd4c6: 177 75 0e jne ffffffff804bd4d6 <tcp_recvmsg+0x68> ffffffff804bd4c8: 0 48 8d 93 f4 03 00 00 lea 0x3f4(%rbx),%rdx ffffffff804bd4cf: 661 48 89 54 24 40 mov %rdx,0x40(%rsp) ffffffff804bd4d4: 195 eb 14 jmp ffffffff804bd4ea <tcp_recvmsg+0x7c> ffffffff804bd4d6: 0 8b 83 f4 03 00 00 mov 0x3f4(%rbx),%eax ffffffff804bd4dc: 0 48 8d 4c 24 60 lea 0x60(%rsp),%rcx ffffffff804bd4e1: 0 48 89 4c 24 40 mov %rcx,0x40(%rsp) ffffffff804bd4e6: 0 89 44 24 60 mov %eax,0x60(%rsp) ffffffff804bd4ea: 867 89 ee mov %ebp,%esi ffffffff804bd4ec: 210 44 89 f2 mov %r14d,%edx ffffffff804bd4ef: 0 48 89 df mov %rbx,%rdi ffffffff804bd4f2: 894 81 e6 00 01 00 00 and $0x100,%esi ffffffff804bd4f8: 192 45 31 ff xor %r15d,%r15d ffffffff804bd4fb: 0 e8 fc df ff ff callq ffffffff804bb4fc <sock_rcvlowat> ffffffff804bd500: 853 89 44 24 4c mov %eax,0x4c(%rsp) ffffffff804bd504: 1857 48 8d 83 a8 00 00 00 lea 0xa8(%rbx),%rax ffffffff804bd50b: 0 89 e9 mov %ebp,%ecx ffffffff804bd50d: 595 48 8d 93 10 04 00 00 lea 0x410(%rbx),%rdx ffffffff804bd514: 263 83 e1 22 and $0x22,%ecx ffffffff804bd517: 0 83 e5 20 and $0x20,%ebp ffffffff804bd51a: 601 48 89 44 24 28 mov %rax,0x28(%rsp) ffffffff804bd51f: 254 48 8d 83 f8 04 00 00 lea 0x4f8(%rbx),%rax ffffffff804bd526: 2 48 c7 44 24 50 00 00 movq $0x0,0x50(%rsp) ffffffff804bd52d: 0 00 00 ffffffff804bd52f: 578 48 89 54 24 20 mov %rdx,0x20(%rsp) ffffffff804bd534: 290 89 4c 24 1c mov %ecx,0x1c(%rsp) ffffffff804bd538: 1 48 89 44 24 10 mov %rax,0x10(%rsp) ffffffff804bd53d: 593 89 6c 24 0c mov %ebp,0xc(%rsp) ffffffff804bd541: 568 66 83 bb 7c 04 00 00 cmpw $0x0,0x47c(%rbx) ffffffff804bd548: 0 00 ffffffff804bd549: 3956 74 55 je ffffffff804bd5a0 <tcp_recvmsg+0x132> ffffffff804bd54b: 0 48 8b 54 24 40 mov 0x40(%rsp),%rdx ffffffff804bd550: 0 8b 83 84 05 00 00 mov 0x584(%rbx),%eax ffffffff804bd556: 0 3b 02 cmp (%rdx),%eax ffffffff804bd558: 0 75 46 jne ffffffff804bd5a0 <tcp_recvmsg+0x132> ffffffff804bd55a: 0 45 85 ff test 
%r15d,%r15d ffffffff804bd55d: 0 0f 85 e6 04 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd563: 0 65 48 8b 3c 25 00 00 mov %gs:0x0,%rdi ffffffff804bd56a: 0 00 00 ffffffff804bd56c: 0 e8 4c e1 ff ff callq ffffffff804bb6bd <signal_pending> ffffffff804bd571: 0 85 c0 test %eax,%eax ffffffff804bd573: 0 74 2b je ffffffff804bd5a0 <tcp_recvmsg+0x132> ffffffff804bd575: 0 48 8b 54 24 58 mov 0x58(%rsp),%rdx ffffffff804bd57a: 0 41 bf f5 ff ff ff mov $0xfffffff5,%r15d ffffffff804bd580: 0 48 85 d2 test %rdx,%rdx ffffffff804bd583: 0 0f 84 c0 04 00 00 je ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd589: 0 48 b8 ff ff ff ff ff mov $0x7fffffffffffffff,%rax ffffffff804bd590: 0 ff ff 7f ffffffff804bd593: 0 66 41 bf 00 fe mov $0xfe00,%r15w ffffffff804bd598: 0 48 39 c2 cmp %rax,%rdx ffffffff804bd59b: 0 e9 89 01 00 00 jmpq ffffffff804bd729 <tcp_recvmsg+0x2bb> ffffffff804bd5a0: 597 48 8b ab a8 00 00 00 mov 0xa8(%rbx),%rbp ffffffff804bd5a7: 4601 48 3b 6c 24 28 cmp 0x28(%rsp),%rbp ffffffff804bd5ac: 1 b8 00 00 00 00 mov $0x0,%eax ffffffff804bd5b1: 1769 48 0f 44 e8 cmove %rax,%rbp ffffffff804bd5b5: 473 48 85 ed test %rbp,%rbp ffffffff804bd5b8: 0 74 76 je ffffffff804bd630 <tcp_recvmsg+0x1c2> ffffffff804bd5ba: 595 48 8b 4c 24 40 mov 0x40(%rsp),%rcx ffffffff804bd5bf: 897 8b 55 50 mov 0x50(%rbp),%edx ffffffff804bd5c2: 89 8b 31 mov (%rcx),%esi ffffffff804bd5c4: 581 41 89 f5 mov %esi,%r13d ffffffff804bd5c7: 301 41 29 d5 sub %edx,%r13d ffffffff804bd5ca: 33 79 10 jns ffffffff804bd5dc <tcp_recvmsg+0x16e> ffffffff804bd5cc: 0 48 c7 c7 48 d9 6a 80 mov $0xffffffff806ad948,%rdi ffffffff804bd5d3: 0 31 c0 xor %eax,%eax ffffffff804bd5d5: 0 e8 9a 97 d7 ff callq ffffffff80236d74 <printk> ffffffff804bd5da: 0 eb 54 jmp ffffffff804bd630 <tcp_recvmsg+0x1c2> ffffffff804bd5dc: 584 8b 85 b8 00 00 00 mov 0xb8(%rbp),%eax ffffffff804bd5e2: 1061 48 8b 95 d0 00 00 00 mov 0xd0(%rbp),%rdx ffffffff804bd5e9: 1 8a 54 02 0d mov 0xd(%rdx,%rax,1),%dl ffffffff804bd5ed: 0 88 d0 mov %dl,%al ffffffff804bd5ef: 876 83 
e0 02 and $0x2,%eax ffffffff804bd5f2: 0 3c 01 cmp $0x1,%al ffffffff804bd5f4: 0 8b 45 68 mov 0x68(%rbp),%eax ffffffff804bd5f7: 909 41 83 d5 ff adc $0xffffffffffffffff,%r13d ffffffff804bd5fb: 0 41 39 c5 cmp %eax,%r13d ffffffff804bd5fe: 0 0f 82 df 02 00 00 jb ffffffff804bd8e3 <tcp_recvmsg+0x475> ffffffff804bd604: 0 80 e2 01 and $0x1,%dl ffffffff804bd607: 0 0f 85 16 04 00 00 jne ffffffff804bda23 <tcp_recvmsg+0x5b5> ffffffff804bd60d: 0 83 7c 24 3c 00 cmpl $0x0,0x3c(%rsp) ffffffff804bd612: 0 75 11 jne ffffffff804bd625 <tcp_recvmsg+0x1b7> ffffffff804bd614: 0 be 53 05 00 00 mov $0x553,%esi ffffffff804bd619: 0 48 c7 c7 13 d9 6a 80 mov $0xffffffff806ad913,%rdi ffffffff804bd620: 0 e8 90 8b d7 ff callq ffffffff802361b5 <warn_on_slowpath> ffffffff804bd625: 0 48 8b 6d 00 mov 0x0(%rbp),%rbp ffffffff804bd629: 0 48 3b 6c 24 28 cmp 0x28(%rsp),%rbp ffffffff804bd62e: 0 75 85 jne ffffffff804bd5b5 <tcp_recvmsg+0x147> ffffffff804bd630: 80 44 3b 7c 24 4c cmp 0x4c(%rsp),%r15d ffffffff804bd635: 4164 7c 0b jl ffffffff804bd642 <tcp_recvmsg+0x1d4> ffffffff804bd637: 0 48 83 7b 68 00 cmpq $0x0,0x68(%rbx) ffffffff804bd63c: 0 0f 84 07 04 00 00 je ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd642: 1 45 85 ff test %r15d,%r15d ffffffff804bd645: 3438 74 49 je ffffffff804bd690 <tcp_recvmsg+0x222> ffffffff804bd647: 0 83 bb 44 01 00 00 00 cmpl $0x0,0x144(%rbx) ffffffff804bd64e: 0 0f 85 f5 03 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd654: 0 8a 43 02 mov 0x2(%rbx),%al ffffffff804bd657: 0 3c 07 cmp $0x7,%al ffffffff804bd659: 0 0f 84 ea 03 00 00 je ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd65f: 0 f6 43 38 01 testb $0x1,0x38(%rbx) ffffffff804bd663: 0 0f 85 e0 03 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd669: 0 48 83 7c 24 58 00 cmpq $0x0,0x58(%rsp) ffffffff804bd66f: 0 0f 84 d4 03 00 00 je ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd675: 0 65 48 8b 3c 25 00 00 mov %gs:0x0,%rdi ffffffff804bd67c: 0 00 00 ffffffff804bd67e: 0 e8 3a e0 ff ff callq 
ffffffff804bb6bd <signal_pending> ffffffff804bd683: 0 85 c0 test %eax,%eax ffffffff804bd685: 0 0f 84 ac 00 00 00 je ffffffff804bd737 <tcp_recvmsg+0x2c9> ffffffff804bd68b: 0 e9 b9 03 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd690: 0 be 01 00 00 00 mov $0x1,%esi ffffffff804bd695: 4166 48 89 df mov %rbx,%rdi ffffffff804bd698: 0 e8 7b de ff ff callq ffffffff804bb518 <sock_flag> ffffffff804bd69d: 0 85 c0 test %eax,%eax ffffffff804bd69f: 276 0f 85 a4 03 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd6a5: 126 83 bb 44 01 00 00 00 cmpl $0x0,0x144(%rbx) ffffffff804bd6ac: 0 74 10 je ffffffff804bd6be <tcp_recvmsg+0x250> ffffffff804bd6ae: 0 48 89 df mov %rbx,%rdi ffffffff804bd6b1: 0 e8 00 df ff ff callq ffffffff804bb5b6 <sock_error> ffffffff804bd6b6: 0 41 89 c7 mov %eax,%r15d ffffffff804bd6b9: 0 e9 8b 03 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd6be: 112 f6 43 38 01 testb $0x1,0x38(%rbx) ffffffff804bd6c2: 3451 0f 85 81 03 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd6c8: 497 8a 43 02 mov 0x2(%rbx),%al ffffffff804bd6cb: 0 3c 07 cmp $0x7,%al ffffffff804bd6cd: 113 75 20 jne ffffffff804bd6ef <tcp_recvmsg+0x281> ffffffff804bd6cf: 0 be 01 00 00 00 mov $0x1,%esi ffffffff804bd6d4: 0 48 89 df mov %rbx,%rdi ffffffff804bd6d7: 0 e8 3c de ff ff callq ffffffff804bb518 <sock_flag> ffffffff804bd6dc: 0 85 c0 test %eax,%eax ffffffff804bd6de: 0 0f 85 65 03 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd6e4: 0 41 bf 95 ff ff ff mov $0xffffff95,%r15d ffffffff804bd6ea: 0 e9 5a 03 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd6ef: 118 48 83 7c 24 58 00 cmpq $0x0,0x58(%rsp) ffffffff804bd6f5: 398 75 0b jne ffffffff804bd702 <tcp_recvmsg+0x294> ffffffff804bd6f7: 0 41 bf f5 ff ff ff mov $0xfffffff5,%r15d ffffffff804bd6fd: 0 e9 47 03 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd702: 0 65 48 8b 3c 25 00 00 mov %gs:0x0,%rdi ffffffff804bd709: 0 00 00 ffffffff804bd70b: 2993 e8 ad df ff ff callq 
ffffffff804bb6bd <signal_pending> ffffffff804bd710: 200 85 c0 test %eax,%eax ffffffff804bd712: 0 74 23 je ffffffff804bd737 <tcp_recvmsg+0x2c9> ffffffff804bd714: 0 48 b8 ff ff ff ff ff mov $0x7fffffffffffffff,%rax ffffffff804bd71b: 0 ff ff 7f ffffffff804bd71e: 0 48 39 44 24 58 cmp %rax,0x58(%rsp) ffffffff804bd723: 0 41 bf 00 fe ff ff mov $0xfffffe00,%r15d ffffffff804bd729: 0 b8 fc ff ff ff mov $0xfffffffc,%eax ffffffff804bd72e: 0 44 0f 45 f8 cmovne %eax,%r15d ffffffff804bd732: 0 e9 12 03 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd737: 207 44 89 fe mov %r15d,%esi ffffffff804bd73a: 198 48 89 df mov %rbx,%rdi ffffffff804bd73d: 0 e8 cc e9 ff ff callq ffffffff804bc10e <tcp_cleanup_rbuf> ffffffff804bd742: 227 83 3d 9b ad 3f 00 00 cmpl $0x0,0x3fad9b(%rip) # ffffffff808b84e4 <sysctl_tcp_low_latency> ffffffff804bd749: 210 0f 85 81 00 00 00 jne ffffffff804bd7d0 <tcp_recvmsg+0x362> ffffffff804bd74f: 0 48 8b ab 28 04 00 00 mov 0x428(%rbx),%rbp ffffffff804bd756: 0 48 3b 6c 24 50 cmp 0x50(%rsp),%rbp ffffffff804bd75b: 232 75 73 jne ffffffff804bd7d0 <tcp_recvmsg+0x362> ffffffff804bd75d: 0 48 83 7c 24 50 00 cmpq $0x0,0x50(%rsp) ffffffff804bd763: 7 75 27 jne ffffffff804bd78c <tcp_recvmsg+0x31e> ffffffff804bd765: 229 83 7c 24 1c 00 cmpl $0x0,0x1c(%rsp) ffffffff804bd76a: 30 75 20 jne ffffffff804bd78c <tcp_recvmsg+0x31e> ffffffff804bd76c: 7 48 8b 54 24 30 mov 0x30(%rsp),%rdx ffffffff804bd771: 191 65 48 8b 2c 25 00 00 mov %gs:0x0,%rbp ffffffff804bd778: 0 00 00 ffffffff804bd77a: 12 48 89 ab 28 04 00 00 mov %rbp,0x428(%rbx) ffffffff804bd781: 2617 48 8b 42 10 mov 0x10(%rdx),%rax ffffffff804bd785: 670 48 89 83 30 04 00 00 mov %rax,0x430(%rbx) ffffffff804bd78c: 11 8b 83 f4 03 00 00 mov 0x3f4(%rbx),%eax ffffffff804bd792: 188 3b 83 f0 03 00 00 cmp 0x3f0(%rbx),%eax ffffffff804bd798: 166 44 89 b3 3c 04 00 00 mov %r14d,0x43c(%rbx) ffffffff804bd79f: 5 74 18 je ffffffff804bd7b9 <tcp_recvmsg+0x34b> ffffffff804bd7a1: 0 83 7c 24 1c 00 cmpl $0x0,0x1c(%rsp) ffffffff804bd7a6: 0 75 11 jne 
ffffffff804bd7b9 <tcp_recvmsg+0x34b> ffffffff804bd7a8: 0 be 92 05 00 00 mov $0x592,%esi ffffffff804bd7ad: 0 48 c7 c7 13 d9 6a 80 mov $0xffffffff806ad913,%rdi ffffffff804bd7b4: 0 e8 fc 89 d7 ff callq ffffffff802361b5 <warn_on_slowpath> ffffffff804bd7b9: 336 48 8b 4c 24 20 mov 0x20(%rsp),%rcx ffffffff804bd7be: 302 48 39 8b 10 04 00 00 cmp %rcx,0x410(%rbx) ffffffff804bd7c5: 1176 48 89 6c 24 50 mov %rbp,0x50(%rsp) ffffffff804bd7ca: 244 0f 85 81 00 00 00 jne ffffffff804bd851 <tcp_recvmsg+0x3e3> ffffffff804bd7d0: 135 44 3b 7c 24 4c cmp 0x4c(%rsp),%r15d ffffffff804bd7d5: 112 7c 12 jl ffffffff804bd7e9 <tcp_recvmsg+0x37b> ffffffff804bd7d7: 0 48 89 df mov %rbx,%rdi ffffffff804bd7da: 0 e8 57 7f fc ff callq ffffffff80485736 <release_sock> ffffffff804bd7df: 0 48 89 df mov %rbx,%rdi ffffffff804bd7e2: 0 e8 96 e5 ff ff callq ffffffff804bbd7d <lock_sock> ffffffff804bd7e7: 0 eb 0d jmp ffffffff804bd7f6 <tcp_recvmsg+0x388> ffffffff804bd7e9: 152 48 8d 74 24 58 lea 0x58(%rsp),%rsi ffffffff804bd7ee: 563 48 89 df mov %rbx,%rdi ffffffff804bd7f1: 59 e8 83 99 fc ff callq ffffffff80487179 <sk_wait_data> ffffffff804bd7f6: 86 48 83 7c 24 50 00 cmpq $0x0,0x50(%rsp) ffffffff804bd7fc: 8550 0f 84 8a 00 00 00 je ffffffff804bd88c <tcp_recvmsg+0x41e> ffffffff804bd802: 4038 44 89 f1 mov %r14d,%ecx ffffffff804bd805: 900 2b 8b 3c 04 00 00 sub 0x43c(%rbx),%ecx ffffffff804bd80b: 5 74 28 je ffffffff804bd835 <tcp_recvmsg+0x3c7> ffffffff804bd80d: 0 48 8b 05 ac 3e 5f 00 mov 0x5f3eac(%rip),%rax # ffffffff80ab16c0 <init_net+0xf0> ffffffff804bd814: 1 41 01 cf add %ecx,%r15d ffffffff804bd817: 0 65 8b 14 25 24 00 00 mov %gs:0x24,%edx ffffffff804bd81e: 0 00 ffffffff804bd81f: 0 89 d2 mov %edx,%edx ffffffff804bd821: 0 48 f7 d0 not %rax ffffffff804bd824: 0 48 8b 04 d0 mov (%rax,%rdx,8),%rax ffffffff804bd828: 0 48 63 d1 movslq %ecx,%rdx ffffffff804bd82b: 0 49 29 d6 sub %rdx,%r14 ffffffff804bd82e: 0 48 01 90 b8 00 00 00 add %rdx,0xb8(%rax) ffffffff804bd835: 4 8b 83 f0 03 00 00 mov 0x3f0(%rbx),%eax ffffffff804bd83b: 373 
3b 83 f4 03 00 00 cmp 0x3f4(%rbx),%eax ffffffff804bd841: 3604 75 49 jne ffffffff804bd88c <tcp_recvmsg+0x41e> ffffffff804bd843: 0 48 8b 44 24 20 mov 0x20(%rsp),%rax ffffffff804bd848: 971 48 39 83 10 04 00 00 cmp %rax,0x410(%rbx) ffffffff804bd84f: 11 74 3b je ffffffff804bd88c <tcp_recvmsg+0x41e> ffffffff804bd851: 6 48 89 df mov %rbx,%rdi ffffffff804bd854: 267 e8 94 e6 ff ff callq ffffffff804bbeed <tcp_prequeue_process> ffffffff804bd859: 0 44 89 f1 mov %r14d,%ecx ffffffff804bd85c: 879 2b 8b 3c 04 00 00 sub 0x43c(%rbx),%ecx ffffffff804bd862: 256 74 28 je ffffffff804bd88c <tcp_recvmsg+0x41e> ffffffff804bd864: 0 48 8b 05 55 3e 5f 00 mov 0x5f3e55(%rip),%rax # ffffffff80ab16c0 <init_net+0xf0> ffffffff804bd86b: 116 41 01 cf add %ecx,%r15d ffffffff804bd86e: 17 65 8b 14 25 24 00 00 mov %gs:0x24,%edx ffffffff804bd875: 0 00 ffffffff804bd876: 0 89 d2 mov %edx,%edx ffffffff804bd878: 1 48 f7 d0 not %rax ffffffff804bd87b: 5 48 8b 04 d0 mov (%rax,%rdx,8),%rax ffffffff804bd87f: 0 48 63 d1 movslq %ecx,%rdx ffffffff804bd882: 6 49 29 d6 sub %rdx,%r14 ffffffff804bd885: 7 48 01 90 c0 00 00 00 add %rdx,0xc0(%rax) ffffffff804bd88c: 11 83 7c 24 3c 00 cmpl $0x0,0x3c(%rsp) ffffffff804bd891: 438 0f 84 a9 01 00 00 je ffffffff804bda40 <tcp_recvmsg+0x5d2> ffffffff804bd897: 0 8b 44 24 60 mov 0x60(%rsp),%eax ffffffff804bd89b: 0 3b 83 f4 03 00 00 cmp 0x3f4(%rbx),%eax ffffffff804bd8a1: 0 0f 84 99 01 00 00 je ffffffff804bda40 <tcp_recvmsg+0x5d2> ffffffff804bd8a7: 0 e8 19 ad fd ff callq ffffffff804985c5 <net_ratelimit> ffffffff804bd8ac: 0 85 c0 test %eax,%eax ffffffff804bd8ae: 0 74 24 je ffffffff804bd8d4 <tcp_recvmsg+0x466> ffffffff804bd8b0: 0 65 48 8b 34 25 00 00 mov %gs:0x0,%rsi ffffffff804bd8b7: 0 00 00 ffffffff804bd8b9: 0 8b 96 70 01 00 00 mov 0x170(%rsi),%edx ffffffff804bd8bf: 0 48 c7 c7 6a d9 6a 80 mov $0xffffffff806ad96a,%rdi ffffffff804bd8c6: 0 48 81 c6 68 03 00 00 add $0x368,%rsi ffffffff804bd8cd: 0 31 c0 xor %eax,%eax ffffffff804bd8cf: 0 e8 a0 94 d7 ff callq ffffffff80236d74 <printk> 
ffffffff804bd8d4: 0 8b 83 f4 03 00 00 mov 0x3f4(%rbx),%eax ffffffff804bd8da: 0 89 44 24 60 mov %eax,0x60(%rsp) ffffffff804bd8de: 0 e9 5d 01 00 00 jmpq ffffffff804bda40 <tcp_recvmsg+0x5d2> ffffffff804bd8e3: 4077 44 29 e8 sub %r13d,%eax ffffffff804bd8e6: 6031 4d 89 f4 mov %r14,%r12 ffffffff804bd8e9: 0 4c 39 f0 cmp %r14,%rax ffffffff804bd8ec: 0 4c 0f 46 e0 cmovbe %rax,%r12 ffffffff804bd8f0: 934 66 83 bb 7c 04 00 00 cmpw $0x0,0x47c(%rbx) ffffffff804bd8f7: 0 00 ffffffff804bd8f8: 0 74 38 je ffffffff804bd932 <tcp_recvmsg+0x4c4> ffffffff804bd8fa: 0 8b 83 84 05 00 00 mov 0x584(%rbx),%eax ffffffff804bd900: 0 29 f0 sub %esi,%eax ffffffff804bd902: 0 89 c2 mov %eax,%edx ffffffff804bd904: 0 4c 39 e2 cmp %r12,%rdx ffffffff804bd907: 0 73 29 jae ffffffff804bd932 <tcp_recvmsg+0x4c4> ffffffff804bd909: 0 85 c0 test %eax,%eax ffffffff804bd90b: 0 74 05 je ffffffff804bd912 <tcp_recvmsg+0x4a4> ffffffff804bd90d: 0 49 89 d4 mov %rdx,%r12 ffffffff804bd910: 0 eb 20 jmp ffffffff804bd932 <tcp_recvmsg+0x4c4> ffffffff804bd912: 0 be 02 00 00 00 mov $0x2,%esi ffffffff804bd917: 0 48 89 df mov %rbx,%rdi ffffffff804bd91a: 0 e8 f9 db ff ff callq ffffffff804bb518 <sock_flag> ffffffff804bd91f: 0 85 c0 test %eax,%eax ffffffff804bd921: 0 75 0f jne ffffffff804bd932 <tcp_recvmsg+0x4c4> ffffffff804bd923: 0 48 8b 54 24 40 mov 0x40(%rsp),%rdx ffffffff804bd928: 0 41 ff c5 inc %r13d ffffffff804bd92b: 0 ff 02 incl (%rdx) ffffffff804bd92d: 0 49 ff cc dec %r12 ffffffff804bd930: 0 74 4c je ffffffff804bd97e <tcp_recvmsg+0x510> ffffffff804bd932: 906 83 7c 24 0c 00 cmpl $0x0,0xc(%rsp) ffffffff804bd937: 6039 75 2f jne ffffffff804bd968 <tcp_recvmsg+0x4fa> ffffffff804bd939: 48 48 8b 4c 24 30 mov 0x30(%rsp),%rcx ffffffff804bd93e: 1412 44 89 ee mov %r13d,%esi ffffffff804bd941: 6648 48 89 ef mov %rbp,%rdi ffffffff804bd944: 0 48 8b 51 10 mov 0x10(%rcx),%rdx ffffffff804bd948: 1524 44 89 e1 mov %r12d,%ecx ffffffff804bd94b: 167 e8 c5 d3 fc ff callq ffffffff8048ad15 <skb_copy_datagram_iovec> ffffffff804bd950: 0 85 c0 test 
%eax,%eax ffffffff804bd952: 1038 74 14 je ffffffff804bd968 <tcp_recvmsg+0x4fa> ffffffff804bd954: 0 45 85 ff test %r15d,%r15d ffffffff804bd957: 0 0f 85 ec 00 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd95d: 0 41 bf f2 ff ff ff mov $0xfffffff2,%r15d ffffffff804bd963: 0 e9 e1 00 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bd968: 28 48 8b 54 24 40 mov 0x40(%rsp),%rdx ffffffff804bd96d: 5713 48 89 df mov %rbx,%rdi ffffffff804bd970: 241 45 01 e7 add %r12d,%r15d ffffffff804bd973: 27 4d 29 e6 sub %r12,%r14 ffffffff804bd976: 626 44 01 22 add %r12d,(%rdx) ffffffff804bd979: 221 e8 fe 11 00 00 callq ffffffff804beb7c <tcp_rcv_space_adjust> ffffffff804bd97e: 1425 66 83 bb 7c 04 00 00 cmpw $0x0,0x47c(%rbx) ffffffff804bd985: 0 00 ffffffff804bd986: 3430 74 63 je ffffffff804bd9eb <tcp_recvmsg+0x57d> ffffffff804bd988: 0 8b 8b f4 03 00 00 mov 0x3f4(%rbx),%ecx ffffffff804bd98e: 0 39 8b 84 05 00 00 cmp %ecx,0x584(%rbx) ffffffff804bd994: 0 79 55 jns ffffffff804bd9eb <tcp_recvmsg+0x57d> ffffffff804bd996: 0 48 8b 44 24 10 mov 0x10(%rsp),%rax ffffffff804bd99b: 0 48 39 83 f8 04 00 00 cmp %rax,0x4f8(%rbx) ffffffff804bd9a2: 0 66 c7 83 7c 04 00 00 movw $0x0,0x47c(%rbx) ffffffff804bd9a9: 0 00 00 ffffffff804bd9ab: 0 75 3e jne ffffffff804bd9eb <tcp_recvmsg+0x57d> ffffffff804bd9ad: 0 83 bb c0 04 00 00 00 cmpl $0x0,0x4c0(%rbx) ffffffff804bd9b4: 0 74 35 je ffffffff804bd9eb <tcp_recvmsg+0x57d> ffffffff804bd9b6: 0 8b 83 94 00 00 00 mov 0x94(%rbx),%eax ffffffff804bd9bc: 0 3b 43 3c cmp 0x3c(%rbx),%eax ffffffff804bd9bf: 0 7d 2a jge ffffffff804bd9eb <tcp_recvmsg+0x57d> ffffffff804bd9c1: 0 0f b7 83 e8 03 00 00 movzwl 0x3e8(%rbx),%eax ffffffff804bd9c8: 0 8a 8b 9d 04 00 00 mov 0x49d(%rbx),%cl ffffffff804bd9ce: 0 8b 93 44 04 00 00 mov 0x444(%rbx),%edx ffffffff804bd9d4: 0 83 e1 0f and $0xf,%ecx ffffffff804bd9d7: 0 c1 e0 1a shl $0x1a,%eax ffffffff804bd9da: 0 d3 ea shr %cl,%edx ffffffff804bd9dc: 0 09 d0 or %edx,%eax ffffffff804bd9de: 0 0d 00 00 10 00 or $0x100000,%eax 
ffffffff804bd9e3: 0 0f c8 bswap %eax ffffffff804bd9e5: 0 89 83 ec 03 00 00 mov %eax,0x3ec(%rbx) ffffffff804bd9eb: 0 8b 55 68 mov 0x68(%rbp),%edx ffffffff804bd9ee: 1655 44 89 e8 mov %r13d,%eax ffffffff804bd9f1: 32 4c 01 e0 add %r12,%rax ffffffff804bd9f4: 0 48 39 d0 cmp %rdx,%rax ffffffff804bd9f7: 847 72 47 jb ffffffff804bda40 <tcp_recvmsg+0x5d2> ffffffff804bd9f9: 0 8b 95 b8 00 00 00 mov 0xb8(%rbp),%edx ffffffff804bd9ff: 80 48 8b 85 d0 00 00 00 mov 0xd0(%rbp),%rax ffffffff804bda06: 441 f6 44 02 0d 01 testb $0x1,0xd(%rdx,%rax,1) ffffffff804bda0b: 0 75 16 jne ffffffff804bda23 <tcp_recvmsg+0x5b5> ffffffff804bda0d: 0 83 7c 24 3c 00 cmpl $0x0,0x3c(%rsp) ffffffff804bda12: 453 75 2c jne ffffffff804bda40 <tcp_recvmsg+0x5d2> ffffffff804bda14: 0 31 d2 xor %edx,%edx ffffffff804bda16: 0 48 89 ee mov %rbp,%rsi ffffffff804bda19: 477 48 89 df mov %rbx,%rdi ffffffff804bda1c: 0 e8 0f e4 ff ff callq ffffffff804bbe30 <sk_eat_skb> ffffffff804bda21: 562 eb 1d jmp ffffffff804bda40 <tcp_recvmsg+0x5d2> ffffffff804bda23: 0 48 8b 54 24 40 mov 0x40(%rsp),%rdx ffffffff804bda28: 0 ff 02 incl (%rdx) ffffffff804bda2a: 0 83 7c 24 3c 00 cmpl $0x0,0x3c(%rsp) ffffffff804bda2f: 0 75 18 jne ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bda31: 0 31 d2 xor %edx,%edx ffffffff804bda33: 0 48 89 ee mov %rbp,%rsi ffffffff804bda36: 0 48 89 df mov %rbx,%rdi ffffffff804bda39: 0 e8 f2 e3 ff ff callq ffffffff804bbe30 <sk_eat_skb> ffffffff804bda3e: 0 eb 09 jmp ffffffff804bda49 <tcp_recvmsg+0x5db> ffffffff804bda40: 959 4d 85 f6 test %r14,%r14 ffffffff804bda43: 4766 0f 85 f8 fa ff ff jne ffffffff804bd541 <tcp_recvmsg+0xd3> ffffffff804bda49: 217 48 83 7c 24 50 00 cmpq $0x0,0x50(%rsp) ffffffff804bda4f: 2084 74 71 je ffffffff804bdac2 <tcp_recvmsg+0x654> ffffffff804bda51: 40 48 8d 83 10 04 00 00 lea 0x410(%rbx),%rax ffffffff804bda58: 448 48 39 83 10 04 00 00 cmp %rax,0x410(%rbx) ffffffff804bda5f: 4 74 4c je ffffffff804bdaad <tcp_recvmsg+0x63f> ffffffff804bda61: 0 31 c0 xor %eax,%eax ffffffff804bda63: 0 45 85 ff test 
%r15d,%r15d ffffffff804bda66: 0 48 89 df mov %rbx,%rdi ffffffff804bda69: 0 41 0f 4f c6 cmovg %r14d,%eax ffffffff804bda6d: 0 89 83 3c 04 00 00 mov %eax,0x43c(%rbx) ffffffff804bda73: 0 e8 75 e4 ff ff callq ffffffff804bbeed <tcp_prequeue_process> ffffffff804bda78: 0 45 85 ff test %r15d,%r15d ffffffff804bda7b: 0 7e 30 jle ffffffff804bdaad <tcp_recvmsg+0x63f> ffffffff804bda7d: 0 44 89 f1 mov %r14d,%ecx ffffffff804bda80: 0 2b 8b 3c 04 00 00 sub 0x43c(%rbx),%ecx ffffffff804bda86: 0 74 25 je ffffffff804bdaad <tcp_recvmsg+0x63f> ffffffff804bda88: 0 48 8b 05 31 3c 5f 00 mov 0x5f3c31(%rip),%rax # ffffffff80ab16c0 <init_net+0xf0> ffffffff804bda8f: 0 41 01 cf add %ecx,%r15d ffffffff804bda92: 0 65 8b 14 25 24 00 00 mov %gs:0x24,%edx ffffffff804bda99: 0 00 ffffffff804bda9a: 0 89 d2 mov %edx,%edx ffffffff804bda9c: 0 48 f7 d0 not %rax ffffffff804bda9f: 0 48 8b 14 d0 mov (%rax,%rdx,8),%rdx ffffffff804bdaa3: 0 48 63 c1 movslq %ecx,%rax ffffffff804bdaa6: 0 48 01 82 c0 00 00 00 add %rax,0xc0(%rdx) ffffffff804bdaad: 214 48 c7 83 28 04 00 00 movq $0x0,0x428(%rbx) ffffffff804bdab4: 0 00 00 00 00 ffffffff804bdab8: 1530 c7 83 3c 04 00 00 00 movl $0x0,0x43c(%rbx) ffffffff804bdabf: 0 00 00 00 ffffffff804bdac2: 1135 48 89 df mov %rbx,%rdi ffffffff804bdac5: 3909 44 89 fe mov %r15d,%esi ffffffff804bdac8: 0 e8 41 e6 ff ff callq ffffffff804bc10e <tcp_cleanup_rbuf> ffffffff804bdacd: 1724 48 89 df mov %rbx,%rdi ffffffff804bdad0: 932 e8 61 7c fc ff callq ffffffff80485736 <release_sock> ffffffff804bdad5: 4661 e9 12 01 00 00 jmpq ffffffff804bdbec <tcp_recvmsg+0x77e> ffffffff804bdada: 0 41 bc 95 ff ff ff mov $0xffffff95,%r12d ffffffff804bdae0: 0 48 89 df mov %rbx,%rdi ffffffff804bdae3: 0 45 89 e7 mov %r12d,%r15d ffffffff804bdae6: 0 e8 4b 7c fc ff callq ffffffff80485736 <release_sock> ffffffff804bdaeb: 0 e9 fc 00 00 00 jmpq ffffffff804bdbec <tcp_recvmsg+0x77e> ffffffff804bdaf0: 0 be 02 00 00 00 mov $0x2,%esi ffffffff804bdaf5: 0 48 89 df mov %rbx,%rdi ffffffff804bdaf8: 0 e8 1b da ff ff callq 
ffffffff804bb518 <sock_flag> ffffffff804bdafd: 0 85 c0 test %eax,%eax ffffffff804bdaff: 0 0f 85 d4 00 00 00 jne ffffffff804bdbd9 <tcp_recvmsg+0x76b> ffffffff804bdb05: 0 8b 83 7c 04 00 00 mov 0x47c(%rbx),%eax ffffffff804bdb0b: 0 66 85 c0 test %ax,%ax ffffffff804bdb0e: 0 0f 84 c5 00 00 00 je ffffffff804bdbd9 <tcp_recvmsg+0x76b> ffffffff804bdb14: 0 66 3d 00 04 cmp $0x400,%ax ffffffff804bdb18: 0 0f 84 bb 00 00 00 je ffffffff804bdbd9 <tcp_recvmsg+0x76b> ffffffff804bdb1e: 0 8a 43 02 mov 0x2(%rbx),%al ffffffff804bdb21: 0 3c 07 cmp $0x7,%al ffffffff804bdb23: 0 75 17 jne ffffffff804bdb3c <tcp_recvmsg+0x6ce> ffffffff804bdb25: 0 be 01 00 00 00 mov $0x1,%esi ffffffff804bdb2a: 0 48 89 df mov %rbx,%rdi ffffffff804bdb2d: 0 41 bc 95 ff ff ff mov $0xffffff95,%r12d ffffffff804bdb33: 0 e8 e0 d9 ff ff callq ffffffff804bb518 <sock_flag> ffffffff804bdb38: 0 85 c0 test %eax,%eax ffffffff804bdb3a: 0 74 a4 je ffffffff804bdae0 <tcp_recvmsg+0x672> ffffffff804bdb3c: 0 8b 83 7c 04 00 00 mov 0x47c(%rbx),%eax ffffffff804bdb42: 0 f6 c4 01 test $0x1,%ah ffffffff804bdb45: 0 74 79 je ffffffff804bdbc0 <tcp_recvmsg+0x752> ffffffff804bdb47: 0 40 f6 c5 02 test $0x2,%bpl ffffffff804bdb4b: 0 88 44 24 67 mov %al,0x67(%rsp) ffffffff804bdb4f: 0 75 09 jne ffffffff804bdb5a <tcp_recvmsg+0x6ec> ffffffff804bdb51: 0 66 c7 83 7c 04 00 00 movw $0x400,0x47c(%rbx) ffffffff804bdb58: 0 00 04 ffffffff804bdb5a: 0 48 8b 4c 24 30 mov 0x30(%rsp),%rcx ffffffff804bdb5f: 0 45 89 f4 mov %r14d,%r12d ffffffff804bdb62: 0 8b 51 30 mov 0x30(%rcx),%edx ffffffff804bdb65: 0 89 d0 mov %edx,%eax ffffffff804bdb67: 0 83 c8 01 or $0x1,%eax ffffffff804bdb6a: 0 45 85 f6 test %r14d,%r14d ffffffff804bdb6d: 0 89 41 30 mov %eax,0x30(%rcx) ffffffff804bdb70: 0 7e 33 jle ffffffff804bdba5 <tcp_recvmsg+0x737> ffffffff804bdb72: 0 40 80 e5 20 and $0x20,%bpl ffffffff804bdb76: 0 41 bc 01 00 00 00 mov $0x1,%r12d ffffffff804bdb7c: 0 0f 85 5e ff ff ff jne ffffffff804bdae0 <tcp_recvmsg+0x672> ffffffff804bdb82: 0 48 8b 79 10 mov 0x10(%rcx),%rdi 
ffffffff804bdb86: 0 48 8d 74 24 67 lea 0x67(%rsp),%rsi ffffffff804bdb8b: 0 ba 01 00 00 00 mov $0x1,%edx ffffffff804bdb90: 0 41 bc f2 ff ff ff mov $0xfffffff2,%r12d ffffffff804bdb96: 0 e8 8a cb fc ff callq ffffffff8048a725 <memcpy_toiovec> ffffffff804bdb9b: 0 85 c0 test %eax,%eax ffffffff804bdb9d: 0 0f 85 3d ff ff ff jne ffffffff804bdae0 <tcp_recvmsg+0x672> ffffffff804bdba3: 0 eb 10 jmp ffffffff804bdbb5 <tcp_recvmsg+0x747> ffffffff804bdba5: 0 48 8b 44 24 30 mov 0x30(%rsp),%rax ffffffff804bdbaa: 0 83 ca 21 or $0x21,%edx ffffffff804bdbad: 0 89 50 30 mov %edx,0x30(%rax) ffffffff804bdbb0: 0 e9 2b ff ff ff jmpq ffffffff804bdae0 <tcp_recvmsg+0x672> ffffffff804bdbb5: 0 41 bc 01 00 00 00 mov $0x1,%r12d ffffffff804bdbbb: 0 e9 20 ff ff ff jmpq ffffffff804bdae0 <tcp_recvmsg+0x672> ffffffff804bdbc0: 0 8a 43 02 mov 0x2(%rbx),%al ffffffff804bdbc3: 0 3c 07 cmp $0x7,%al ffffffff804bdbc5: 0 74 1d je ffffffff804bdbe4 <tcp_recvmsg+0x776> ffffffff804bdbc7: 0 f6 43 38 01 testb $0x1,0x38(%rbx) ffffffff804bdbcb: 0 41 bc f5 ff ff ff mov $0xfffffff5,%r12d ffffffff804bdbd1: 0 0f 84 09 ff ff ff je ffffffff804bdae0 <tcp_recvmsg+0x672> ffffffff804bdbd7: 0 eb 0b jmp ffffffff804bdbe4 <tcp_recvmsg+0x776> ffffffff804bdbd9: 0 41 bc ea ff ff ff mov $0xffffffea,%r12d ffffffff804bdbdf: 0 e9 fc fe ff ff jmpq ffffffff804bdae0 <tcp_recvmsg+0x672> ffffffff804bdbe4: 0 45 31 e4 xor %r12d,%r12d ffffffff804bdbe7: 0 e9 f4 fe ff ff jmpq ffffffff804bdae0 <tcp_recvmsg+0x672> ffffffff804bdbec: 1206 48 83 c4 68 add $0x68,%rsp ffffffff804bdbf0: 498 44 89 f8 mov %r15d,%eax ffffffff804bdbf3: 387 5b pop %rbx ffffffff804bdbf4: 462 5d pop %rbp ffffffff804bdbf5: 0 41 5c pop %r12 ffffffff804bdbf7: 485 41 5d pop %r13 ffffffff804bdbf9: 466 41 5e pop %r14 ffffffff804bdbfb: 0 41 5f pop %r15 ffffffff804bdbfd: 796 c3 retq no real hotspots either - but a bit too fractured code sequence, so this function's icache footprint is too probably double the size of what it could be. 
a bit of overhead (8%) leaks in from a callsite: ffffffff804bd46e: 882 41 57 push %r15 ffffffff804bd470: 15507 48 89 f7 mov %rsi,%rdi (this is used as a dynamic function pointer too so i'm just guessing that the common callsite would be sock_common_recvmsg().) perhaps this sequence, about 7% of the total overhead of this function, warrants mention: ffffffff804bd7e2: 0 e8 96 e5 ff ff callq ffffffff804bbd7d <lock_sock> ffffffff804bd7e7: 0 eb 0d jmp ffffffff804bd7f6 <tcp_recvmsg+0x388> ffffffff804bd7e9: 152 48 8d 74 24 58 lea 0x58(%rsp),%rsi ffffffff804bd7ee: 563 48 89 df mov %rbx,%rdi ffffffff804bd7f1: 59 e8 83 99 fc ff callq ffffffff80487179 <sk_wait_data> ffffffff804bd7f6: 86 48 83 7c 24 50 00 cmpq $0x0,0x50(%rsp) ffffffff804bd7fc: 8550 0f 84 8a 00 00 00 je ffffffff804bd88c <tcp_recvmsg+0x41e> ffffffff804bd802: 4038 44 89 f1 mov %r14d,%ecx that's most likely lock_sock[_nested]()'s overhead leaking over into this function: ffffffff804857cb: 9392 <lock_sock_nested>: ffffffff804857cb: 9392 41 55 push %r13 ffffffff804857cd: 4112 41 54 push %r12 ffffffff804857cf: 2 55 push %rbp ffffffff804857d0: 7 48 8d 6f 40 lea 0x40(%rdi),%rbp ffffffff804857d4: 1515 53 push %rbx ffffffff804857d5: 0 48 89 fb mov %rdi,%rbx ffffffff804857d8: 4 48 89 ef mov %rbp,%rdi ffffffff804857db: 1461 48 83 ec 38 sub $0x38,%rsp ffffffff804857df: 8 e8 78 11 09 00 callq ffffffff8051695c <_spin_lock_bh> ffffffff804857e4: 4827 83 7b 44 00 cmpl $0x0,0x44(%rbx) ffffffff804857e8: 2937 74 6d je ffffffff80485857 <lock_sock_nested+0x8c> ffffffff804857ea: 0 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx ffffffff804857f1: 0 00 00 ffffffff804857f3: 0 fc cld ffffffff804857f4: 0 31 c0 xor %eax,%eax ffffffff804857f6: 0 48 89 e7 mov %rsp,%rdi ffffffff804857f9: 0 b9 0a 00 00 00 mov $0xa,%ecx ffffffff804857fe: 0 f3 ab rep stos %eax,%es:(%rdi) ffffffff80485800: 0 48 8d 44 24 18 lea 0x18(%rsp),%rax ffffffff80485805: 0 4c 8d 63 48 lea 0x48(%rbx),%r12 ffffffff80485809: 0 48 89 54 24 08 mov %rdx,0x8(%rsp) ffffffff8048580e: 0 48 c7 
44 24 10 80 78 movq $0xffffffff80247880,0x10(%rsp) ffffffff80485815: 0 24 80 ffffffff80485817: 0 48 89 44 24 18 mov %rax,0x18(%rsp) ffffffff8048581c: 0 48 89 44 24 20 mov %rax,0x20(%rsp) ffffffff80485821: 0 ba 02 00 00 00 mov $0x2,%edx ffffffff80485826: 0 48 89 e6 mov %rsp,%rsi ffffffff80485829: 0 4c 89 e7 mov %r12,%rdi ffffffff8048582c: 0 e8 fd 20 dc ff callq ffffffff8024792e <prepare_to_wait_exclusive> ffffffff80485831: 0 48 89 ef mov %rbp,%rdi ffffffff80485834: 0 e8 18 11 09 00 callq ffffffff80516951 <_spin_unlock_bh> ffffffff80485839: 0 e8 52 f9 08 00 callq ffffffff80515190 <schedule> ffffffff8048583e: 0 48 89 ef mov %rbp,%rdi ffffffff80485841: 0 e8 16 11 09 00 callq ffffffff8051695c <_spin_lock_bh> ffffffff80485846: 0 83 7b 44 00 cmpl $0x0,0x44(%rbx) ffffffff8048584a: 0 75 d5 jne ffffffff80485821 <lock_sock_nested+0x56> ffffffff8048584c: 0 48 89 e6 mov %rsp,%rsi ffffffff8048584f: 0 4c 89 e7 mov %r12,%rdi ffffffff80485852: 0 e8 7a 20 dc ff callq ffffffff802478d1 <finish_wait> ffffffff80485857: 88 c7 43 44 01 00 00 00 movl $0x1,0x44(%rbx) ffffffff8048585e: 3431 fe 43 40 incb 0x40(%rbx) ffffffff80485861: 1568 e8 00 4e db ff callq ffffffff8023a666 <local_bh_enable> ffffffff80485866: 1548 48 83 c4 38 add $0x38,%rsp ffffffff8048586a: 61 5b pop %rbx ffffffff8048586b: 1568 5d pop %rbp ffffffff8048586c: 36 41 5c pop %r12 ffffffff8048586e: 0 41 5d pop %r13 ffffffff80485870: 2753 c3 retq which is: 1748 void lock_sock_nested(struct sock *sk, int subclass) 1749 { 1750 might_sleep(); 1751 spin_lock_bh(&sk->sk_lock.slock); 1752 if (sk->sk_lock.owned) 1753 __lock_sock(sk); 1754 sk->sk_lock.owned = 1; 1755 spin_unlock(&sk->sk_lock.slock); that branch in the middle should perhaps be: if (unlikely(sk->sk_lock.owned)) to make this function fall-through. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (8 preceding siblings ...) 2008-11-17 21:19 ` tcp_recvmsg(): " Ingo Molnar @ 2008-11-17 21:26 ` Ingo Molnar 2008-11-17 21:40 ` Eric Dumazet ` (2 more replies) 2008-11-17 21:35 ` __inet_lookup_established(): " Ingo Molnar ` (4 subsequent siblings) 14 siblings, 3 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 21:26 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 1.717771 eth_type_trans hits (total: 171777) ......... ffffffff8049e215: 457 <eth_type_trans>: ffffffff8049e215: 457 41 54 push %r12 ffffffff8049e217: 6514 55 push %rbp ffffffff8049e218: 0 48 89 f5 mov %rsi,%rbp ffffffff8049e21b: 0 53 push %rbx ffffffff8049e21c: 441 48 8b 87 d8 00 00 00 mov 0xd8(%rdi),%rax ffffffff8049e223: 5 48 89 fb mov %rdi,%rbx ffffffff8049e226: 0 2b 87 d0 00 00 00 sub 0xd0(%rdi),%eax ffffffff8049e22c: 493 48 89 73 20 mov %rsi,0x20(%rbx) ffffffff8049e230: 2 be 0e 00 00 00 mov $0xe,%esi ffffffff8049e235: 0 89 87 c0 00 00 00 mov %eax,0xc0(%rdi) ffffffff8049e23b: 472 e8 2c 98 fe ff callq ffffffff80487a6c <skb_pull> ffffffff8049e240: 501 44 8b a3 c0 00 00 00 mov 0xc0(%rbx),%r12d ffffffff8049e247: 763 4c 03 a3 d0 00 00 00 add 0xd0(%rbx),%r12 ffffffff8049e24e: 0 41 f6 04 24 01 testb $0x1,(%r12) ffffffff8049e253: 497 74 26 je ffffffff8049e27b <eth_type_trans+0x66> ffffffff8049e255: 0 48 8d b5 38 02 00 00 lea 0x238(%rbp),%rsi ffffffff8049e25c: 0 4c 89 e7 mov %r12,%rdi ffffffff8049e25f: 0 e8 49 fc ff ff callq ffffffff8049dead <compare_ether_addr> ffffffff8049e264: 0 85 c0 test %eax,%eax ffffffff8049e266: 0 8a 43 7d mov 0x7d(%rbx),%al ffffffff8049e269: 0 75 08 jne ffffffff8049e273 <eth_type_trans+0x5e> ffffffff8049e26b: 0 83 e0 f8 and $0xfffffffffffffff8,%eax 
ffffffff8049e26e: 0 83 c8 01 or $0x1,%eax ffffffff8049e271: 0 eb 24 jmp ffffffff8049e297 <eth_type_trans+0x82> ffffffff8049e273: 0 83 e0 f8 and $0xfffffffffffffff8,%eax ffffffff8049e276: 0 83 c8 02 or $0x2,%eax ffffffff8049e279: 0 eb 1c jmp ffffffff8049e297 <eth_type_trans+0x82> ffffffff8049e27b: 82 48 8d b5 18 02 00 00 lea 0x218(%rbp),%rsi ffffffff8049e282: 8782 4c 89 e7 mov %r12,%rdi ffffffff8049e285: 1752 e8 23 fc ff ff callq ffffffff8049dead <compare_ether_addr> ffffffff8049e28a: 0 85 c0 test %eax,%eax ffffffff8049e28c: 757 74 0c je ffffffff8049e29a <eth_type_trans+0x85> ffffffff8049e28e: 0 8a 43 7d mov 0x7d(%rbx),%al ffffffff8049e291: 0 83 e0 f8 and $0xfffffffffffffff8,%eax ffffffff8049e294: 0 83 c8 03 or $0x3,%eax ffffffff8049e297: 0 88 43 7d mov %al,0x7d(%rbx) ffffffff8049e29a: 107 66 41 8b 44 24 0c mov 0xc(%r12),%ax ffffffff8049e2a0: 1031 0f b7 c8 movzwl %ax,%ecx ffffffff8049e2a3: 518 66 c1 e8 08 shr $0x8,%ax ffffffff8049e2a7: 0 89 ca mov %ecx,%edx ffffffff8049e2a9: 0 c1 e2 08 shl $0x8,%edx ffffffff8049e2ac: 484 09 d0 or %edx,%eax ffffffff8049e2ae: 0 0f b7 c0 movzwl %ax,%eax ffffffff8049e2b1: 0 3d ff 05 00 00 cmp $0x5ff,%eax ffffffff8049e2b6: 468 7f 18 jg ffffffff8049e2d0 <eth_type_trans+0xbb> ffffffff8049e2b8: 0 48 8b 83 d8 00 00 00 mov 0xd8(%rbx),%rax ffffffff8049e2bf: 0 b9 00 01 00 00 mov $0x100,%ecx ffffffff8049e2c4: 0 66 83 38 ff cmpw $0xffffffffffffffff,(%rax) ffffffff8049e2c8: 0 b8 00 04 00 00 mov $0x400,%eax ffffffff8049e2cd: 0 0f 45 c8 cmovne %eax,%ecx ffffffff8049e2d0: 0 5b pop %rbx ffffffff8049e2d1: 85064 5d pop %rbp ffffffff8049e2d2: 63776 41 5c pop %r12 ffffffff8049e2d4: 1 89 c8 mov %ecx,%eax ffffffff8049e2d6: 474 c3 retq small function, big bang - 1.7% of the total overhead. 90% of this function's cost is in the closing sequence. 
My guess would be that it originates from ffffffff8049e2ae (the branch
after that is not taken), which corresponds to this source code context:

(gdb) list *0xffffffff8049e2ae
0xffffffff8049e2ae is in eth_type_trans (net/ethernet/eth.c:199).
194		if (netdev_uses_dsa_tags(dev))
195			return htons(ETH_P_DSA);
196		if (netdev_uses_trailer_tags(dev))
197			return htons(ETH_P_TRAILER);
198
199		if (ntohs(eth->h_proto) >= 1536)
200			return eth->h_proto;
201
202		rawp = skb->data;
203

eth->h_proto access.

Given that this workload does localhost networking, my guess would be
that eth->h_proto is bouncing around between 16 CPUs? At minimum this
read-mostly field should be separated from the bouncing bits.

	Ingo

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 21:26 ` eth_type_trans(): " Ingo Molnar @ 2008-11-17 21:40 ` Eric Dumazet 2008-11-17 23:41 ` Eric Dumazet 2008-11-17 21:52 ` Linus Torvalds 2008-11-18 5:16 ` David Miller 2 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 21:40 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger Ingo Molnar a écrit : > * Ingo Molnar <mingo@elte.hu> wrote: > >> 100.000000 total >> ................ >> 1.717771 eth_type_trans > > hits (total: 171777) > ......... > ffffffff8049e215: 457 <eth_type_trans>: > ffffffff8049e215: 457 41 54 push %r12 > ffffffff8049e217: 6514 55 push %rbp > ffffffff8049e218: 0 48 89 f5 mov %rsi,%rbp > ffffffff8049e21b: 0 53 push %rbx > ffffffff8049e21c: 441 48 8b 87 d8 00 00 00 mov 0xd8(%rdi),%rax > ffffffff8049e223: 5 48 89 fb mov %rdi,%rbx > ffffffff8049e226: 0 2b 87 d0 00 00 00 sub 0xd0(%rdi),%eax > ffffffff8049e22c: 493 48 89 73 20 mov %rsi,0x20(%rbx) > ffffffff8049e230: 2 be 0e 00 00 00 mov $0xe,%esi > ffffffff8049e235: 0 89 87 c0 00 00 00 mov %eax,0xc0(%rdi) > ffffffff8049e23b: 472 e8 2c 98 fe ff callq ffffffff80487a6c <skb_pull> > ffffffff8049e240: 501 44 8b a3 c0 00 00 00 mov 0xc0(%rbx),%r12d > ffffffff8049e247: 763 4c 03 a3 d0 00 00 00 add 0xd0(%rbx),%r12 > ffffffff8049e24e: 0 41 f6 04 24 01 testb $0x1,(%r12) > ffffffff8049e253: 497 74 26 je ffffffff8049e27b <eth_type_trans+0x66> > ffffffff8049e255: 0 48 8d b5 38 02 00 00 lea 0x238(%rbp),%rsi > ffffffff8049e25c: 0 4c 89 e7 mov %r12,%rdi > ffffffff8049e25f: 0 e8 49 fc ff ff callq ffffffff8049dead <compare_ether_addr> > ffffffff8049e264: 0 85 c0 test %eax,%eax > ffffffff8049e266: 0 8a 43 7d mov 0x7d(%rbx),%al > ffffffff8049e269: 0 75 08 jne ffffffff8049e273 <eth_type_trans+0x5e> > ffffffff8049e26b: 0 83 e0 f8 and $0xfffffffffffffff8,%eax > ffffffff8049e26e: 0 83 c8 01 
or $0x1,%eax > ffffffff8049e271: 0 eb 24 jmp ffffffff8049e297 <eth_type_trans+0x82> > ffffffff8049e273: 0 83 e0 f8 and $0xfffffffffffffff8,%eax > ffffffff8049e276: 0 83 c8 02 or $0x2,%eax > ffffffff8049e279: 0 eb 1c jmp ffffffff8049e297 <eth_type_trans+0x82> > ffffffff8049e27b: 82 48 8d b5 18 02 00 00 lea 0x218(%rbp),%rsi > ffffffff8049e282: 8782 4c 89 e7 mov %r12,%rdi > ffffffff8049e285: 1752 e8 23 fc ff ff callq ffffffff8049dead <compare_ether_addr> > ffffffff8049e28a: 0 85 c0 test %eax,%eax > ffffffff8049e28c: 757 74 0c je ffffffff8049e29a <eth_type_trans+0x85> > ffffffff8049e28e: 0 8a 43 7d mov 0x7d(%rbx),%al > ffffffff8049e291: 0 83 e0 f8 and $0xfffffffffffffff8,%eax > ffffffff8049e294: 0 83 c8 03 or $0x3,%eax > ffffffff8049e297: 0 88 43 7d mov %al,0x7d(%rbx) > ffffffff8049e29a: 107 66 41 8b 44 24 0c mov 0xc(%r12),%ax > ffffffff8049e2a0: 1031 0f b7 c8 movzwl %ax,%ecx > ffffffff8049e2a3: 518 66 c1 e8 08 shr $0x8,%ax > ffffffff8049e2a7: 0 89 ca mov %ecx,%edx > ffffffff8049e2a9: 0 c1 e2 08 shl $0x8,%edx > ffffffff8049e2ac: 484 09 d0 or %edx,%eax > ffffffff8049e2ae: 0 0f b7 c0 movzwl %ax,%eax > ffffffff8049e2b1: 0 3d ff 05 00 00 cmp $0x5ff,%eax > ffffffff8049e2b6: 468 7f 18 jg ffffffff8049e2d0 <eth_type_trans+0xbb> > ffffffff8049e2b8: 0 48 8b 83 d8 00 00 00 mov 0xd8(%rbx),%rax > ffffffff8049e2bf: 0 b9 00 01 00 00 mov $0x100,%ecx > ffffffff8049e2c4: 0 66 83 38 ff cmpw $0xffffffffffffffff,(%rax) > ffffffff8049e2c8: 0 b8 00 04 00 00 mov $0x400,%eax > ffffffff8049e2cd: 0 0f 45 c8 cmovne %eax,%ecx > ffffffff8049e2d0: 0 5b pop %rbx > ffffffff8049e2d1: 85064 5d pop %rbp > ffffffff8049e2d2: 63776 41 5c pop %r12 > ffffffff8049e2d4: 1 89 c8 mov %ecx,%eax > ffffffff8049e2d6: 474 c3 retq > > small function, big bang - 1.7% of the total overhead. > > 90% of this function's cost is in the closing sequence. 
My guess would > be that it originates from ffffffff8049e2ae (the branch after that is > not taken), which corresponds to this source code context: > > (gdb) list *0xffffffff8049e2ae > 0xffffffff8049e2ae is in eth_type_trans (net/ethernet/eth.c:199). > 194 if (netdev_uses_dsa_tags(dev)) > 195 return htons(ETH_P_DSA); > 196 if (netdev_uses_trailer_tags(dev)) > 197 return htons(ETH_P_TRAILER); > 198 > 199 if (ntohs(eth->h_proto) >= 1536) > 200 return eth->h_proto; > 201 > 202 rawp = skb->data; > 203 > > eth->h_proto access. > > Given that this workload does localhost networking, my guess would be > that eth->h_proto is bouncing around between 16 CPUs? At minimum this > read-mostly field should be separated from the bouncing bits. > "eth" is on the frame itself, so each cpu is handling a skb it owns. If there is a cache line miss, then scheduler might have done a wrong schedule ? (tbench server and tbench client on different cpus) But seeing your disassembly, I can see compare_ether_addr() is not inlined. This sucks. /** * compare_ether_addr - Compare two Ethernet addresses * @addr1: Pointer to a six-byte array containing the Ethernet address * @addr2: Pointer other six-byte array containing the Ethernet address * * Compare two ethernet addresses, returns 0 if equal */ static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2) { const u16 *a = (const u16 *) addr1; const u16 *b = (const u16 *) addr2; BUILD_BUG_ON(ETH_ALEN != 6); return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0; } On my machine/compiler, it is inlined, that makes a big difference. c0420750 <eth_type_trans>: /* eth_type_trans total: 14417 0.4101 */ ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 21:40 ` Eric Dumazet @ 2008-11-17 23:41 ` Eric Dumazet 2008-11-18 0:01 ` Linus Torvalds 0 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 23:41 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger [-- Attachment #1: Type: text/plain, Size: 1648 bytes --] Eric Dumazet a écrit : > > But seeing your disassembly, I can see compare_ether_addr() is not inlined. > > This sucks. > > /** > * compare_ether_addr - Compare two Ethernet addresses > * @addr1: Pointer to a six-byte array containing the Ethernet address > * @addr2: Pointer other six-byte array containing the Ethernet address > * > * Compare two ethernet addresses, returns 0 if equal > */ > static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2) > { > const u16 *a = (const u16 *) addr1; > const u16 *b = (const u16 *) addr2; > > BUILD_BUG_ON(ETH_ALEN != 6); > return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0; > } > > On my machine/compiler, it is inlined, that makes a big difference. old gcc compiler... OK understood... > > c0420750 <eth_type_trans>: /* eth_type_trans total: 14417 0.4101 */ > > Could you try this patch Ingo ? Thanks [PATCH] net: eth_type_trans() should be a leaf function In old days, eth_type_trans() was a leaf function. It is not anymore the case. eth_type_trans() is a critical network function, called for each incoming packet. We should make sure it is not calling functions, especially trivial ones. 1) Adds an __always_inline to compare_ether_addr() : This one was created to be faster than memcmp(). 
It really should be faster (and inlined) 2) Hand code skb_put() call in eth_type_trans() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/linux/etherdevice.h | 2 +- net/ethernet/eth.c | 7 ++++++- 2 files changed, 7 insertions(+), 2 deletions(-) [-- Attachment #2: eth_type_trans_speedup.patch --] [-- Type: text/plain, Size: 1053 bytes --] diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h index 25d62e6..94af6a7 100644 --- a/include/linux/etherdevice.h +++ b/include/linux/etherdevice.h @@ -128,7 +128,7 @@ static inline void random_ether_addr(u8 *addr) * * Compare two ethernet addresses, returns 0 if equal */ -static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2) +static __always_inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2) { const u16 *a = (const u16 *) addr1; const u16 *b = (const u16 *) addr2; diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c index b9d85af..30b60b2 100644 --- a/net/ethernet/eth.c +++ b/net/ethernet/eth.c @@ -162,7 +162,12 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev) skb->dev = dev; skb_reset_mac_header(skb); - skb_pull(skb, ETH_HLEN); + /* + * Hand coded skb_pull(skb, ETH_HLEN) to avoid a function call + */ + if (likely(skb->len >= ETH_HLEN)) + __skb_pull(skb, ETH_HLEN); + eth = eth_hdr(skb); if (is_multicast_ether_addr(eth->h_dest)) { ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 23:41 ` Eric Dumazet @ 2008-11-18 0:01 ` Linus Torvalds 2008-11-18 8:35 ` Eric Dumazet 0 siblings, 1 reply; 191+ messages in thread From: Linus Torvalds @ 2008-11-18 0:01 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger On Tue, 18 Nov 2008, Eric Dumazet wrote: > > * > > * Compare two ethernet addresses, returns 0 if equal > > */ > > static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2) > > { > > const u16 *a = (const u16 *) addr1; > > const u16 *b = (const u16 *) addr2; > > > > BUILD_BUG_ON(ETH_ALEN != 6); > > return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0; Btw, at least on some Intel CPU's, it would be faster to do this as a 32-bit xor and a 16-bit xor. And if we can know that there is always 2 bytes at the end (because of how the thing was allocated), it's faster still to do it as a 64-bit xor and a mask. And that's true even if the addresses are only 2-byte aligned. The code that gcc generates for "memcmp()" for a constant-size small data thing is sadly crap. It always generates a "rep cmpsb", even if the size is something really trivial like 4 bytes, and even if you compare for exact equality rather than a smaller/greater-than. Gaah. Linus ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-18 0:01 ` Linus Torvalds @ 2008-11-18 8:35 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-18 8:35 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger [-- Attachment #1: Type: text/plain, Size: 1937 bytes --] Linus Torvalds a écrit : > > On Tue, 18 Nov 2008, Eric Dumazet wrote: >>> * >>> * Compare two ethernet addresses, returns 0 if equal >>> */ >>> static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2) >>> { >>> const u16 *a = (const u16 *) addr1; >>> const u16 *b = (const u16 *) addr2; >>> >>> BUILD_BUG_ON(ETH_ALEN != 6); >>> return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0; > > Btw, at least on some Intel CPU's, it would be faster to do this as a > 32-bit xor and a 16-bit xor. And if we can know that there is always 2 > bytes at the end (because of how the thing was allocated), it's faster > still to do it as a 64-bit xor and a mask. > > And that's true even if the addresses are only 2-byte aligned. > Yes, this is allowed, we always have at least 8 bytes for both arrays, when called from eth_type_trans() at least. 
I tried this idea and got nice assembly on 32 bits: 158: 33 82 38 01 00 00 xor 0x138(%edx),%eax 15e: 33 8a 34 01 00 00 xor 0x134(%edx),%ecx 164: c1 e0 10 shl $0x10,%eax 167: 09 c1 or %eax,%ecx 169: 74 0b je 176 <eth_type_trans+0x87> And very nice assembly on 64 bits of course (one xor, one shl) About alignments, we have aligned addr2, but not addr1 Nice oprofile improvement in eth_type_trans(), 0.17 % instead of 0.41 % opreport -l vmlinux | grep eth_type_trans 38797 0.1710 eth_type_trans [PATCH] eth: Declare an optimized compare_ether_addr_64bits() function Linus mentioned we could try to perform long word operations, even on potentially unaligned addresses, on x86 at least. This patch implements a compare_ether_addr_64bits() function, that handles the case of x86 cpus, but might be used on other arches as well. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- [-- Attachment #2: compare_ether_addr_64bits.patch --] [-- Type: text/plain, Size: 2438 bytes --] diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h index 25d62e6..ee0df09 100644 --- a/include/linux/etherdevice.h +++ b/include/linux/etherdevice.h @@ -136,6 +136,47 @@ static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2) BUILD_BUG_ON(ETH_ALEN != 6); return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0; } + +static inline unsigned long zap_last_2bytes(unsigned long value) +{ +#ifdef __BIG_ENDIAN + return value >> 16; +#else + return value << 16; +#endif +} + +/** + * compare_ether_addr_64bits - Compare two Ethernet addresses + * @addr1: Pointer to an array of 8 bytes + * @addr2: Pointer to an other array of 8 bytes + * + * Compare two ethernet addresses, returns 0 if equal. + * Same result than "memcmp(addr1, addr2, ETH_ALEN)" but without conditional + * branches, and possibly long word memory accesses on CPU allowing cheap + * unaligned memory reads. 
+ * arrays = { byte1, byte2, byte3, byte4, byte6, byte7, pad1, pad2} + * + * Please note that alignment of addr1 & addr2 is only guaranted to be 16 bits. + */ + +static inline unsigned compare_ether_addr_64bits(const u8 addr1[6+2], + const u8 addr2[6+2]) +{ +#if defined(CONFIG_X86) + unsigned long fold = *(const unsigned long *)addr1 ^ + *(const unsigned long *)addr2; + + if (sizeof(fold) == 8) + return zap_last_2bytes(fold) != 0; + + fold |= zap_last_2bytes(*(const unsigned long *)(addr1 + 4) ^ + *(const unsigned long *)(addr2 + 4)); + return fold != 0; +#else + return compare_ether_addr(addr1, addr2); +#endif +} #endif /* __KERNEL__ */ #endif /* _LINUX_ETHERDEVICE_H */ diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c index b9d85af..dcfeb9b 100644 --- a/net/ethernet/eth.c +++ b/net/ethernet/eth.c @@ -166,7 +166,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev) eth = eth_hdr(skb); if (is_multicast_ether_addr(eth->h_dest)) { - if (!compare_ether_addr(eth->h_dest, dev->broadcast)) + if (!compare_ether_addr_64bits(eth->h_dest, dev->broadcast)) skb->pkt_type = PACKET_BROADCAST; else skb->pkt_type = PACKET_MULTICAST; @@ -181,7 +181,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev) */ else if (1 /*dev->flags&IFF_PROMISC */ ) { - if (unlikely(compare_ether_addr(eth->h_dest, dev->dev_addr))) + if (unlikely(compare_ether_addr_64bits(eth->h_dest, dev->dev_addr))) skb->pkt_type = PACKET_OTHERHOST; } ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 21:26 ` eth_type_trans(): " Ingo Molnar 2008-11-17 21:40 ` Eric Dumazet @ 2008-11-17 21:52 ` Linus Torvalds 2008-11-18 5:16 ` David Miller 2 siblings, 0 replies; 191+ messages in thread From: Linus Torvalds @ 2008-11-17 21:52 UTC (permalink / raw) To: Ingo Molnar Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger On Mon, 17 Nov 2008, Ingo Molnar wrote: > ffffffff8049e2ae: 0 0f b7 c0 movzwl %ax,%eax > ffffffff8049e2b1: 0 3d ff 05 00 00 cmp $0x5ff,%eax > ffffffff8049e2b6: 468 7f 18 jg ffffffff8049e2d0 <eth_type_trans+0xbb> > ffffffff8049e2b8: 0 48 8b 83 d8 00 00 00 mov 0xd8(%rbx),%rax > ffffffff8049e2bf: 0 b9 00 01 00 00 mov $0x100,%ecx > ffffffff8049e2c4: 0 66 83 38 ff cmpw $0xffffffffffffffff,(%rax) > ffffffff8049e2c8: 0 b8 00 04 00 00 mov $0x400,%eax > ffffffff8049e2cd: 0 0f 45 c8 cmovne %eax,%ecx > ffffffff8049e2d0: 0 5b pop %rbx > ffffffff8049e2d1: 85064 5d pop %rbp > ffffffff8049e2d2: 63776 41 5c pop %r12 > ffffffff8049e2d4: 1 89 c8 mov %ecx,%eax > ffffffff8049e2d6: 474 c3 retq > > small function, big bang - 1.7% of the total overhead. > > 90% of this function's cost is in the closing sequence. My guess would > be that it originates from ffffffff8049e2ae (the branch after that is > not taken), which corresponds to this source code context: I would actually suspect that branch mispredicts may be an issue. If that thing falls out of the branch prediction table (which it could easily do), then a forward branch will be predicted as "not taken". And if it then turns out that the _common_ case is the other way around, the incorrectly predicted destination is often the one that shows up in profiles. Giving gcc likely()/unlikely() hints usually doesn't much help, I'm afraid. It _can_ make a difference, but often not for -Os in particular. 
Linus ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
  2008-11-17 21:26 ` eth_type_trans(): " Ingo Molnar
  2008-11-17 21:40   ` Eric Dumazet
  2008-11-17 21:52   ` Linus Torvalds
@ 2008-11-18  5:16   ` David Miller
  2008-11-18  5:35     ` Eric Dumazet
  2008-11-18  8:30     ` Ingo Molnar
  2 siblings, 2 replies; 191+ messages in thread
From: David Miller @ 2008-11-18  5:16 UTC (permalink / raw)
  To: mingo
  Cc: torvalds, dada1, rjw, linux-kernel, kernel-testers, cl, efault,
	a.p.zijlstra, shemminger

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 22:26:57 +0100

> eth->h_proto access.

Yes, this is the first time a packet is touched on receive.

> Given that this workload does localhost networking, my guess would be
> that eth->h_proto is bouncing around between 16 CPUs? At minimum this
> read-mostly field should be separated from the bouncing bits.

It's the packet contents, there is no way to "separate" it.

And it should be unlikely to bounce on your system under tbench; the
senders and receivers should hang out on the same cpu unless something
completely stupid is happening.

That's why I like running tbench with a num_threads command line
argument equal to the number of cpus, so that every cpu gets two
threads talking to each other over the TCP socket.

^ permalink raw reply	[flat|nested] 191+ messages in thread
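The invocation David describes, one tbench thread pair per CPU, can be sketched roughly as below (a hypothetical example, not from the thread; tbench's usual form is `tbench [-t secs] num_procs [server]`, so check your local build's usage):

```shell
# Run one tbench client process per online CPU against a local
# tbench_srv, so each sender/receiver pair can share a CPU.
NCPUS=$(getconf _NPROCESSORS_ONLN)
echo "running tbench with $NCPUS processes"
tbench_srv &
tbench -t 60 "$NCPUS" 127.0.0.1
```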
* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-18 5:16 ` David Miller @ 2008-11-18 5:35 ` Eric Dumazet 2008-11-18 7:00 ` David Miller 2008-11-18 8:30 ` Ingo Molnar 1 sibling, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-18 5:35 UTC (permalink / raw) To: David Miller Cc: mingo, torvalds, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger David Miller a écrit : > From: Ingo Molnar <mingo@elte.hu> > Date: Mon, 17 Nov 2008 22:26:57 +0100 > >> eth->h_proto access. > > Yes, this is the first time a packet is touched on receive. Well, not exactly, since we first do a if (is_multicast_ether_addr(eth->h_dest)) { ... } and one of the compare_ether_addr(eth->h_dest, {dev->dev_addr | dev->broadcast}) compares; probably it's a profiling effect... ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-18 5:35 ` Eric Dumazet @ 2008-11-18 7:00 ` David Miller 0 siblings, 0 replies; 191+ messages in thread From: David Miller @ 2008-11-18 7:00 UTC (permalink / raw) To: dada1 Cc: mingo, torvalds, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger From: Eric Dumazet <dada1@cosmosbay.com> Date: Tue, 18 Nov 2008 06:35:46 +0100 > David Miller a écrit : > > From: Ingo Molnar <mingo@elte.hu> > > Date: Mon, 17 Nov 2008 22:26:57 +0100 > > > >> eth->h_proto access. > > Yes, this is the first time a packet is touched on receive. > > Well, not exactly, since we do a > > if (is_multicast_ether_addr(eth->h_dest)) { > ...} > > and one of the > compare_ether_addr(eth->h_dest, {dev->dev_addr | dev->broadcast}) > > probably its a profiling effect... True. ^ permalink raw reply [flat|nested] 191+ messages in thread
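Eric's correction is about field order: the destination MAC is dereferenced before h_proto. A hedged layout sketch (the struct mirrors the standard Ethernet header layout and the helper mirrors the kernel's is_multicast_ether_addr() semantics, reimplemented here so the example is self-contained):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Layout of an Ethernet header: h_dest comes first, so the
 * multicast check Eric cites touches the packet's first cache
 * line before eth->h_proto is ever read.
 */
struct ethhdr_sketch {
	unsigned char	h_dest[6];	/* read by the multicast check */
	unsigned char	h_source[6];
	unsigned short	h_proto;	/* read later, same cache line */
};

/* An Ethernet address is multicast if bit 0 of its first octet is set. */
static int is_multicast_ether_addr(const unsigned char *addr)
{
	return addr[0] & 0x01;
}
```

All 14 header bytes sit well inside one 64-byte cache line, so whichever field is read first takes the miss; the later h_proto load is then cheap, which supports Eric's "profiling effect" reading.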
* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-18 5:16 ` David Miller 2008-11-18 5:35 ` Eric Dumazet @ 2008-11-18 8:30 ` Ingo Molnar 2008-11-18 8:49 ` Eric Dumazet 1 sibling, 1 reply; 191+ messages in thread From: Ingo Molnar @ 2008-11-18 8:30 UTC (permalink / raw) To: David Miller Cc: torvalds, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger * David Miller <davem@davemloft.net> wrote: > From: Ingo Molnar <mingo@elte.hu> > Date: Mon, 17 Nov 2008 22:26:57 +0100 > > > eth->h_proto access. > > Yes, this is the first time a packet is touched on receive. > > > Given that this workload does localhost networking, my guess would be > > that eth->h_proto is bouncing around between 16 CPUs? At minimum this > > read-mostly field should be separated from the bouncing bits. > > It's the packet contents, there is no way to "seperate it". > > And it should be unlikely bouncing on your system under tbench, the > senders and receivers should hang out on the same cpu unless the > something completely stupid is happening. > > That's why I like running tbench with a num_threads command line > argument equal to the number of cpus, every cpu gets the two thread > talking to eachother over the TCP socket. yeah - and i posted the numbers for that too - it's the same throughput, within ~1% of noise. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-18 8:30 ` Ingo Molnar @ 2008-11-18 8:49 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-18 8:49 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, torvalds, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger Ingo Molnar a écrit : > * David Miller <davem@davemloft.net> wrote: > >> From: Ingo Molnar <mingo@elte.hu> >> Date: Mon, 17 Nov 2008 22:26:57 +0100 >> >>> eth->h_proto access. >> Yes, this is the first time a packet is touched on receive. >> >>> Given that this workload does localhost networking, my guess would be >>> that eth->h_proto is bouncing around between 16 CPUs? At minimum this >>> read-mostly field should be separated from the bouncing bits. >> It's the packet contents, there is no way to "seperate it". >> >> And it should be unlikely bouncing on your system under tbench, the >> senders and receivers should hang out on the same cpu unless the >> something completely stupid is happening. >> >> That's why I like running tbench with a num_threads command line >> argument equal to the number of cpus, every cpu gets the two thread >> talking to eachother over the TCP socket. > > yeah - and i posted the numbers for that too - it's the same > throughput, within ~1% of noise. Thinking once again about the loopback driver, I recall a previous attempt to call netif_receive_skb() instead of netif_rx() and pay the price of cache line ping-pongs between cpus. http://kerneltrap.org/mailarchive/linux-netdev/2008/2/21/939644 Maybe we could do that, with a temporary percpu stack, like we do in softirq when CONFIG_4KSTACKS=y (arch/x86/kernel/irq_32.c : call_on_stack(func, stack)). And do this only if the current cpu doesn't already use its softirq_stack (think about loopback re-entering loopback xmit because of a TCP ACK, for example). Oh well... 
black magic, you are going to kill me :) ^ permalink raw reply [flat|nested] 191+ messages in thread
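The "black magic" Eric proposes, delivering directly via netif_receive_skb() on a borrowed per-cpu stack but only when that stack is free, can be caricatured with a simple re-entrancy guard. Everything below (names, counters) is invented for illustration; it models only the control flow, not the real call_on_stack() machinery:

```c
#include <assert.h>

/*
 * Toy model of the guard: deliver directly while the per-cpu
 * softirq stack is free; otherwise fall back to the deferred
 * netif_rx()-style path.  All identifiers are hypothetical.
 */
static int stack_in_use;
static int direct_deliveries, deferred_deliveries;

static void loopback_deliver(void)
{
	if (!stack_in_use) {
		stack_in_use = 1;
		/* stand-in for call_on_stack(netif_receive_skb, stack) */
		direct_deliveries++;
		stack_in_use = 0;
	} else {
		/* re-entered (e.g. TCP ACK looped back): defer */
		deferred_deliveries++;
	}
}
```

The guard is what makes the scheme safe against the loopback-re-entering-loopback case Eric flags.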
* __inet_lookup_established(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (9 preceding siblings ...) 2008-11-17 21:26 ` eth_type_trans(): " Ingo Molnar @ 2008-11-17 21:35 ` Ingo Molnar 2008-11-17 22:14 ` Eric Dumazet 2008-11-17 21:59 ` system_call() - " Ingo Molnar ` (3 subsequent siblings) 14 siblings, 1 reply; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 21:35 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 1.673249 __inet_lookup_established hits (total: 167324) ......... ffffffff804b9b12: 446 <__inet_lookup_established>: ffffffff804b9b12: 446 41 57 push %r15 ffffffff804b9b14: 4810 89 d0 mov %edx,%eax ffffffff804b9b16: 0 0f b7 c9 movzwl %cx,%ecx ffffffff804b9b19: 0 41 56 push %r14 ffffffff804b9b1b: 456 41 55 push %r13 ffffffff804b9b1d: 0 41 54 push %r12 ffffffff804b9b1f: 0 55 push %rbp ffffffff804b9b20: 427 53 push %rbx ffffffff804b9b21: 4 48 89 f3 mov %rsi,%rbx ffffffff804b9b24: 2 44 89 c6 mov %r8d,%esi ffffffff804b9b27: 504 41 89 c8 mov %ecx,%r8d ffffffff804b9b2a: 1 49 89 f7 mov %rsi,%r15 ffffffff804b9b2d: 1 48 83 ec 08 sub $0x8,%rsp ffffffff804b9b31: 462 49 c1 e7 20 shl $0x20,%r15 ffffffff804b9b35: 0 48 89 3c 24 mov %rdi,(%rsp) ffffffff804b9b39: 507 89 d7 mov %edx,%edi ffffffff804b9b3b: 38 41 0f b7 d1 movzwl %r9w,%edx ffffffff804b9b3f: 0 41 89 d6 mov %edx,%r14d ffffffff804b9b42: 863 49 09 c7 or %rax,%r15 ffffffff804b9b45: 24 41 c1 e6 10 shl $0x10,%r14d ffffffff804b9b49: 0 41 09 ce or %ecx,%r14d ffffffff804b9b4c: 479 89 f9 mov %edi,%ecx ffffffff804b9b4e: 8 48 8b 3c 24 mov (%rsp),%rdi ffffffff804b9b52: 0 e8 cc f4 ff ff callq ffffffff804b9023 <inet_ehashfn> ffffffff804b9b57: 413 48 89 df mov %rbx,%rdi ffffffff804b9b5a: 122 41 89 c5 mov %eax,%r13d ffffffff804b9b5d: 0 89 c6 mov %eax,%esi ffffffff804b9b5f: 635 
e8 3e f5 ff ff callq ffffffff804b90a2 <inet_ehash_bucket> ffffffff804b9b64: 511 48 89 c5 mov %rax,%rbp ffffffff804b9b67: 6 44 89 e8 mov %r13d,%eax ffffffff804b9b6a: 0 23 43 14 and 0x14(%rbx),%eax ffffffff804b9b6d: 497 4c 8d 24 85 00 00 00 lea 0x0(,%rax,4),%r12 ffffffff804b9b74: 0 00 ffffffff804b9b75: 1 4c 03 63 08 add 0x8(%rbx),%r12 ffffffff804b9b79: 0 48 8b 45 00 mov 0x0(%rbp),%rax ffffffff804b9b7d: 470 0f 18 08 prefetcht0 (%rax) ffffffff804b9b80: 0 4c 89 e7 mov %r12,%rdi ffffffff804b9b83: 1089 e8 32 cd 05 00 callq ffffffff805168ba <_read_lock> ffffffff804b9b88: 6752 48 8b 55 00 mov 0x0(%rbp),%rdx ffffffff804b9b8c: 598 eb 2c jmp ffffffff804b9bba <__inet_lookup_established+0xa8> ffffffff804b9b8e: 447 48 81 3c 24 d0 15 ab cmpq $0xffffffff80ab15d0,(%rsp) ffffffff804b9b95: 0 80 ffffffff804b9b96: 1119 75 1f jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> ffffffff804b9b98: 21 4c 39 b8 30 02 00 00 cmp %r15,0x230(%rax) ffffffff804b9b9f: 0 75 16 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> ffffffff804b9ba1: 492 44 39 b0 38 02 00 00 cmp %r14d,0x238(%rax) ffffffff804b9ba8: 0 75 0d jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> ffffffff804b9baa: 0 8b 52 fc mov -0x4(%rdx),%edx ffffffff804b9bad: 451 85 d2 test %edx,%edx ffffffff804b9baf: 0 74 67 je ffffffff804b9c18 <__inet_lookup_established+0x106> ffffffff804b9bb1: 0 3b 54 24 40 cmp 0x40(%rsp),%edx ffffffff804b9bb5: 0 74 61 je ffffffff804b9c18 <__inet_lookup_established+0x106> ffffffff804b9bb7: 0 48 89 ca mov %rcx,%rdx ffffffff804b9bba: 402 48 85 d2 test %rdx,%rdx ffffffff804b9bbd: 1006 74 12 je ffffffff804b9bd1 <__inet_lookup_established+0xbf> ffffffff804b9bbf: 0 48 8d 42 f8 lea -0x8(%rdx),%rax ffffffff804b9bc3: 821 48 8b 0a mov (%rdx),%rcx ffffffff804b9bc6: 78 44 39 68 2c cmp %r13d,0x2c(%rax) ffffffff804b9bca: 4 0f 18 09 prefetcht0 (%rcx) ffffffff804b9bcd: 685 75 e8 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> ffffffff804b9bcf: 139502 eb bd jmp ffffffff804b9b8e <__inet_lookup_established+0x7c> 
ffffffff804b9bd1: 0 48 8b 55 08 mov 0x8(%rbp),%rdx ffffffff804b9bd5: 0 eb 26 jmp ffffffff804b9bfd <__inet_lookup_established+0xeb> ffffffff804b9bd7: 0 48 81 3c 24 d0 15 ab cmpq $0xffffffff80ab15d0,(%rsp) ffffffff804b9bde: 0 80 ffffffff804b9bdf: 0 75 19 jne ffffffff804b9bfa <__inet_lookup_established+0xe8> ffffffff804b9be1: 0 4c 39 78 40 cmp %r15,0x40(%rax) ffffffff804b9be5: 0 75 13 jne ffffffff804b9bfa <__inet_lookup_established+0xe8> ffffffff804b9be7: 0 44 39 70 48 cmp %r14d,0x48(%rax) ffffffff804b9beb: 0 75 0d jne ffffffff804b9bfa <__inet_lookup_established+0xe8> ffffffff804b9bed: 0 8b 52 fc mov -0x4(%rdx),%edx ffffffff804b9bf0: 0 85 d2 test %edx,%edx ffffffff804b9bf2: 0 74 24 je ffffffff804b9c18 <__inet_lookup_established+0x106> ffffffff804b9bf4: 0 3b 54 24 40 cmp 0x40(%rsp),%edx ffffffff804b9bf8: 0 74 1e je ffffffff804b9c18 <__inet_lookup_established+0x106> ffffffff804b9bfa: 0 48 89 ca mov %rcx,%rdx ffffffff804b9bfd: 0 48 85 d2 test %rdx,%rdx ffffffff804b9c00: 0 74 12 je ffffffff804b9c14 <__inet_lookup_established+0x102> ffffffff804b9c02: 0 48 8d 42 f8 lea -0x8(%rdx),%rax ffffffff804b9c06: 0 48 8b 0a mov (%rdx),%rcx ffffffff804b9c09: 0 44 39 68 2c cmp %r13d,0x2c(%rax) ffffffff804b9c0d: 0 0f 18 09 prefetcht0 (%rcx) ffffffff804b9c10: 0 75 e8 jne ffffffff804b9bfa <__inet_lookup_established+0xe8> ffffffff804b9c12: 0 eb c3 jmp ffffffff804b9bd7 <__inet_lookup_established+0xc5> ffffffff804b9c14: 0 31 c0 xor %eax,%eax ffffffff804b9c16: 0 eb 04 jmp ffffffff804b9c1c <__inet_lookup_established+0x10a> ffffffff804b9c18: 441 f0 ff 40 28 lock incl 0x28(%rax) ffffffff804b9c1c: 1442 f0 41 ff 04 24 lock incl (%r12) ffffffff804b9c21: 476 41 5b pop %r11 ffffffff804b9c23: 1 5b pop %rbx ffffffff804b9c24: 0 5d pop %rbp ffffffff804b9c25: 475 41 5c pop %r12 ffffffff804b9c27: 0 41 5d pop %r13 ffffffff804b9c29: 1 41 5e pop %r14 ffffffff804b9c2b: 494 41 5f pop %r15 ffffffff804b9c2d: 0 c3 retq ffffffff804b9c2e: 0 90 nop ffffffff804b9c2f: 0 90 nop 80% of the overhead comes from cachemisses 
here: ffffffff804b9bc6: 78 44 39 68 2c cmp %r13d,0x2c(%rax) ffffffff804b9bca: 4 0f 18 09 prefetcht0 (%rcx) ffffffff804b9bcd: 685 75 e8 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> ffffffff804b9bcf: 139502 eb bd jmp ffffffff804b9b8e <__inet_lookup_established+0x7c> corresponding to: (gdb) list *0xffffffff804b9bc6 0xffffffff804b9bc6 is in __inet_lookup_established (net/ipv4/inet_hashtables.c:237). 232 rwlock_t *lock = inet_ehash_lockp(hashinfo, hash); 233 234 prefetch(head->chain.first); 235 read_lock(lock); 236 sk_for_each(sk, node, &head->chain) { 237 if (INET_MATCH(sk, net, hash, acookie, 238 saddr, daddr, ports, dif)) 239 goto hit; /* You sunk my battleship! */ 240 } 241 Seeing the first hard cachemiss on hash lookups is a familiar and partly expected pattern - it is the first thing that touches cache-cold data structures. Seeing 1.4% of the total tbench overhead go into this single cachemiss is a bit surprising to me though: tbench works via long-lived connections (TCP establish costs are nowhere to be seen in the profiles) so the socket hash should be relatively stable and read-mostly on most CPUs in theory. The CPUs here have 2MB of L2 cache per socket. Could we be somehow dirtying these cachelines perhaps, causing unnecessary cachemisses in hash lookups? Is the hash linkage portion of the socket data structure frequently dirtied? Padding that to 64 bytes (or next to 64 bytes worth of read-mostly fields) could perhaps give us a +1.7% tbench speedup. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
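Ingo's padding suggestion amounts to giving the read-mostly lookup fields their own cache line, so that writes to hot fields elsewhere in the socket cannot invalidate the line the hash walk reads. A hedged layout sketch (the struct is invented; the real candidate would be the hash linkage inside struct sock):

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE_BYTES 64

/*
 * Illustrative split: frequently-dirtied fields up front, the
 * read-mostly lookup keys pushed onto their own 64-byte line so
 * hash walks on other CPUs can keep that line in shared state.
 */
struct sock_sketch {
	int		refcnt;		/* dirtied on every lookup hit */
	long		rx_bytes;	/* dirtied on every packet */

	unsigned long	hash_key __attribute__((aligned(CACHELINE_BYTES)));
	unsigned int	hash;		/* read-mostly from here on */
};
```

As the thread goes on to note, the catch is that the lookup itself takes a reference, so refcnt gets dirtied by whichever CPU last found the socket regardless of layout.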
* Re: __inet_lookup_established(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 21:35 ` __inet_lookup_established(): " Ingo Molnar @ 2008-11-17 22:14 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 22:14 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger Ingo Molnar a écrit : > * Ingo Molnar <mingo@elte.hu> wrote: > >> 100.000000 total >> ................ >> 1.673249 __inet_lookup_established > > hits (total: 167324) > ......... > ffffffff804b9b12: 446 <__inet_lookup_established>: > ffffffff804b9b12: 446 41 57 push %r15 > ffffffff804b9b14: 4810 89 d0 mov %edx,%eax > ffffffff804b9b16: 0 0f b7 c9 movzwl %cx,%ecx > ffffffff804b9b19: 0 41 56 push %r14 > ffffffff804b9b1b: 456 41 55 push %r13 > ffffffff804b9b1d: 0 41 54 push %r12 > ffffffff804b9b1f: 0 55 push %rbp > ffffffff804b9b20: 427 53 push %rbx > ffffffff804b9b21: 4 48 89 f3 mov %rsi,%rbx > ffffffff804b9b24: 2 44 89 c6 mov %r8d,%esi > ffffffff804b9b27: 504 41 89 c8 mov %ecx,%r8d > ffffffff804b9b2a: 1 49 89 f7 mov %rsi,%r15 > ffffffff804b9b2d: 1 48 83 ec 08 sub $0x8,%rsp > ffffffff804b9b31: 462 49 c1 e7 20 shl $0x20,%r15 > ffffffff804b9b35: 0 48 89 3c 24 mov %rdi,(%rsp) > ffffffff804b9b39: 507 89 d7 mov %edx,%edi > ffffffff804b9b3b: 38 41 0f b7 d1 movzwl %r9w,%edx > ffffffff804b9b3f: 0 41 89 d6 mov %edx,%r14d > ffffffff804b9b42: 863 49 09 c7 or %rax,%r15 > ffffffff804b9b45: 24 41 c1 e6 10 shl $0x10,%r14d > ffffffff804b9b49: 0 41 09 ce or %ecx,%r14d > ffffffff804b9b4c: 479 89 f9 mov %edi,%ecx > ffffffff804b9b4e: 8 48 8b 3c 24 mov (%rsp),%rdi > ffffffff804b9b52: 0 e8 cc f4 ff ff callq ffffffff804b9023 <inet_ehashfn> > ffffffff804b9b57: 413 48 89 df mov %rbx,%rdi > ffffffff804b9b5a: 122 41 89 c5 mov %eax,%r13d > ffffffff804b9b5d: 0 89 c6 mov %eax,%esi > ffffffff804b9b5f: 635 e8 3e f5 ff ff callq ffffffff804b90a2 
<inet_ehash_bucket> > ffffffff804b9b64: 511 48 89 c5 mov %rax,%rbp > ffffffff804b9b67: 6 44 89 e8 mov %r13d,%eax > ffffffff804b9b6a: 0 23 43 14 and 0x14(%rbx),%eax > ffffffff804b9b6d: 497 4c 8d 24 85 00 00 00 lea 0x0(,%rax,4),%r12 > ffffffff804b9b74: 0 00 > ffffffff804b9b75: 1 4c 03 63 08 add 0x8(%rbx),%r12 > ffffffff804b9b79: 0 48 8b 45 00 mov 0x0(%rbp),%rax > ffffffff804b9b7d: 470 0f 18 08 prefetcht0 (%rax) > ffffffff804b9b80: 0 4c 89 e7 mov %r12,%rdi > ffffffff804b9b83: 1089 e8 32 cd 05 00 callq ffffffff805168ba <_read_lock> > ffffffff804b9b88: 6752 48 8b 55 00 mov 0x0(%rbp),%rdx > ffffffff804b9b8c: 598 eb 2c jmp ffffffff804b9bba <__inet_lookup_established+0xa8> > ffffffff804b9b8e: 447 48 81 3c 24 d0 15 ab cmpq $0xffffffff80ab15d0,(%rsp) > ffffffff804b9b95: 0 80 > ffffffff804b9b96: 1119 75 1f jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> > ffffffff804b9b98: 21 4c 39 b8 30 02 00 00 cmp %r15,0x230(%rax) > ffffffff804b9b9f: 0 75 16 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> > ffffffff804b9ba1: 492 44 39 b0 38 02 00 00 cmp %r14d,0x238(%rax) > ffffffff804b9ba8: 0 75 0d jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> > ffffffff804b9baa: 0 8b 52 fc mov -0x4(%rdx),%edx > ffffffff804b9bad: 451 85 d2 test %edx,%edx > ffffffff804b9baf: 0 74 67 je ffffffff804b9c18 <__inet_lookup_established+0x106> > ffffffff804b9bb1: 0 3b 54 24 40 cmp 0x40(%rsp),%edx > ffffffff804b9bb5: 0 74 61 je ffffffff804b9c18 <__inet_lookup_established+0x106> > ffffffff804b9bb7: 0 48 89 ca mov %rcx,%rdx > ffffffff804b9bba: 402 48 85 d2 test %rdx,%rdx > ffffffff804b9bbd: 1006 74 12 je ffffffff804b9bd1 <__inet_lookup_established+0xbf> > ffffffff804b9bbf: 0 48 8d 42 f8 lea -0x8(%rdx),%rax > ffffffff804b9bc3: 821 48 8b 0a mov (%rdx),%rcx > ffffffff804b9bc6: 78 44 39 68 2c cmp %r13d,0x2c(%rax) > ffffffff804b9bca: 4 0f 18 09 prefetcht0 (%rcx) > ffffffff804b9bcd: 685 75 e8 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> > ffffffff804b9bcf: 139502 eb bd jmp ffffffff804b9b8e 
<__inet_lookup_established+0x7c> > ffffffff804b9bd1: 0 48 8b 55 08 mov 0x8(%rbp),%rdx > ffffffff804b9bd5: 0 eb 26 jmp ffffffff804b9bfd <__inet_lookup_established+0xeb> > ffffffff804b9bd7: 0 48 81 3c 24 d0 15 ab cmpq $0xffffffff80ab15d0,(%rsp) > ffffffff804b9bde: 0 80 > ffffffff804b9bdf: 0 75 19 jne ffffffff804b9bfa <__inet_lookup_established+0xe8> > ffffffff804b9be1: 0 4c 39 78 40 cmp %r15,0x40(%rax) > ffffffff804b9be5: 0 75 13 jne ffffffff804b9bfa <__inet_lookup_established+0xe8> > ffffffff804b9be7: 0 44 39 70 48 cmp %r14d,0x48(%rax) > ffffffff804b9beb: 0 75 0d jne ffffffff804b9bfa <__inet_lookup_established+0xe8> > ffffffff804b9bed: 0 8b 52 fc mov -0x4(%rdx),%edx > ffffffff804b9bf0: 0 85 d2 test %edx,%edx > ffffffff804b9bf2: 0 74 24 je ffffffff804b9c18 <__inet_lookup_established+0x106> > ffffffff804b9bf4: 0 3b 54 24 40 cmp 0x40(%rsp),%edx > ffffffff804b9bf8: 0 74 1e je ffffffff804b9c18 <__inet_lookup_established+0x106> > ffffffff804b9bfa: 0 48 89 ca mov %rcx,%rdx > ffffffff804b9bfd: 0 48 85 d2 test %rdx,%rdx > ffffffff804b9c00: 0 74 12 je ffffffff804b9c14 <__inet_lookup_established+0x102> > ffffffff804b9c02: 0 48 8d 42 f8 lea -0x8(%rdx),%rax > ffffffff804b9c06: 0 48 8b 0a mov (%rdx),%rcx > ffffffff804b9c09: 0 44 39 68 2c cmp %r13d,0x2c(%rax) > ffffffff804b9c0d: 0 0f 18 09 prefetcht0 (%rcx) > ffffffff804b9c10: 0 75 e8 jne ffffffff804b9bfa <__inet_lookup_established+0xe8> > ffffffff804b9c12: 0 eb c3 jmp ffffffff804b9bd7 <__inet_lookup_established+0xc5> > ffffffff804b9c14: 0 31 c0 xor %eax,%eax > ffffffff804b9c16: 0 eb 04 jmp ffffffff804b9c1c <__inet_lookup_established+0x10a> > ffffffff804b9c18: 441 f0 ff 40 28 lock incl 0x28(%rax) > ffffffff804b9c1c: 1442 f0 41 ff 04 24 lock incl (%r12) > ffffffff804b9c21: 476 41 5b pop %r11 > ffffffff804b9c23: 1 5b pop %rbx > ffffffff804b9c24: 0 5d pop %rbp > ffffffff804b9c25: 475 41 5c pop %r12 > ffffffff804b9c27: 0 41 5d pop %r13 > ffffffff804b9c29: 1 41 5e pop %r14 > ffffffff804b9c2b: 494 41 5f pop %r15 > ffffffff804b9c2d: 0 c3 
retq > ffffffff804b9c2e: 0 90 nop > ffffffff804b9c2f: 0 90 nop > > 80% of the overhead comes from cachemisses here: > > ffffffff804b9bc6: 78 44 39 68 2c cmp %r13d,0x2c(%rax) > ffffffff804b9bca: 4 0f 18 09 prefetcht0 (%rcx) > ffffffff804b9bcd: 685 75 e8 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> > ffffffff804b9bcf: 139502 eb bd jmp ffffffff804b9b8e <__inet_lookup_established+0x7c> > > corresponding to: > > (gdb) list *0xffffffff804b9bc6 > 0xffffffff804b9bc6 is in __inet_lookup_established (net/ipv4/inet_hashtables.c:237). > 232 rwlock_t *lock = inet_ehash_lockp(hashinfo, hash); > 233 > 234 prefetch(head->chain.first); > 235 read_lock(lock); > 236 sk_for_each(sk, node, &head->chain) { > 237 if (INET_MATCH(sk, net, hash, acookie, > 238 saddr, daddr, ports, dif)) > 239 goto hit; /* You sunk my battleship! */ > 240 } > 241 > > Seeing the first hard cachemiss on hash lookups is a familiar and > partly expected pattern - it is the first thing that touches > cache-cold data structures. > > Seeing 1.4% of the totaly tbench overhead go into this single > cachemiss is a bit surprising to me though: tbench works via > long-lived connections (TCP establish costs and nowhere to be seen in > the profiles) so the socket hash should be relatively stable and > read-mostly on most CPUs in theory. The CPUs here have 2MB of L2 cache > per socket. > > Could we be somehow dirtying these cachelines perhaps, causing > unnecessary cachemisses in hash lookups? Is the hash linkage portion > of the socket data structure frequently dirtied? Padding that to 64 > bytes (or next to 64 bytes worth of read-mostly fields) could perhaps > give us a +1.7% tbench speedup. > I am not seeing this of course on net-next-2.6 thanks to RCU Could it be that several tbench sockets are hashed on same chain ? tbench uses dst address and src address 127.0.0.1 for its sockets. 
server binds on port 7003: static inline unsigned int inet_ehashfn(struct net *net, const __be32 laddr, const __u16 lport, const __be32 faddr, const __be16 fport) { return jhash_3words((__force __u32) laddr, (__force __u32) faddr, ((__u32) lport) << 16 | (__force __u32)fport, inet_ehash_secret + net_hash_mix(net)); } Hum... should be OK, thanks to jhash. Maybe the same problem as eth_type_trans: you have a cache line miss because the socket we handle in the chain was previously handled by another cpu. (sk->refcnt being dirtied by this other cpu) ffffffff804b9bc6: 78 44 39 68 2c cmp %r13d,0x2c(%rax) ffffffff804b9bca: 4 0f 18 09 prefetcht0 (%rcx) ffffffff804b9bcd: 685 75 e8 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5> < "jne" stalls because the CPU must bring 0x2c(%rax) into its cache to perform the compare > ffffffff804b9bcf: 139502 eb bd jmp ffffffff804b9b8e <__inet_lookup_established+0x7c> Even if you pad/move refcnt somewhere else in sk, you'll need to take a reference on it, so it won't help very much. ^ permalink raw reply [flat|nested] 191+ messages in thread
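Eric's point is that even with both addresses fixed at 127.0.0.1, the port pair still drives the hash, so tbench's connections should spread across chains. A sketch of the key packing (the port packing matches the quoted inet_ehashfn(); the mixer is a stand-in multiplicative hash, not the real jhash_3words() with its boot-time secret):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for inet_ehashfn(): both ports are packed into one
 * 32-bit word exactly as in the kernel (lport << 16 | fport) and
 * mixed with the addresses.  The multiplicative mixer below is
 * illustrative only.
 */
static uint32_t mix32(uint32_t x)
{
	return x * 2654435761u;	/* Knuth's multiplicative constant */
}

static uint32_t ehashfn_sketch(uint32_t laddr, uint32_t faddr,
			       uint16_t lport, uint16_t fport)
{
	uint32_t ports = ((uint32_t)lport << 16) | fport;

	return mix32(laddr ^ faddr ^ ports);
}
```

Even consecutive client ports land in different buckets, so chain collisions are an unlikely explanation for the miss; the dirtied refcnt theory fits better.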
* system_call() - Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (10 preceding siblings ...) 2008-11-17 21:35 ` __inet_lookup_established(): " Ingo Molnar @ 2008-11-17 21:59 ` Ingo Molnar 2008-11-17 22:09 ` Linus Torvalds 2008-11-17 22:08 ` Ingo Molnar ` (2 subsequent siblings) 14 siblings, 1 reply; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 21:59 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger, H. Peter Anvin, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 1.508888 system_call that's an easy one: ffffffff8020be00: 97321 <system_call>: ffffffff8020be00: 97321 0f 01 f8 swapgs ffffffff8020be03: 53089 66 66 66 90 xchg %ax,%ax ffffffff8020be07: 1524 66 66 90 xchg %ax,%ax ffffffff8020be0a: 0 66 66 90 xchg %ax,%ax ffffffff8020be0d: 0 66 66 90 xchg %ax,%ax ffffffff8020be10: 1511 <system_call_after_swapgs>: ffffffff8020be10: 1511 65 48 89 24 25 18 00 mov %rsp,%gs:0x18 ffffffff8020be17: 0 00 00 ffffffff8020be19: 0 65 48 8b 24 25 10 00 mov %gs:0x10,%rsp ffffffff8020be20: 0 00 00 ffffffff8020be22: 1490 fb sti syscall entry instruction costs - unavoidable security checks, etc. - hardware costs. But looking at this profile made me notice this detail: ENTRY(system_call_after_swapgs) Combined with this alignment rule we have in arch/x86/include/asm/linkage.h on 64-bit: #ifdef CONFIG_X86_64 #define __ALIGN .p2align 4,,15 #define __ALIGN_STR ".p2align 4,,15" #endif while it inserts NOP sequences, that is still +13 bytes of excessive, stupid, and straight in our syscall entry path alignment padding. system_call_after_swapgs is an utter slowpath in any case. The interim fix is below - although it needs more thinking and probably should be done via an ENTRY_UNALIGNED() method as well, for slowpath targets. 
With that we get this much nicer entry sequence: ffffffff8020be00: 544323 <system_call>: ffffffff8020be00: 544323 0f 01 f8 swapgs ffffffff8020be03: 197954 <system_call_after_swapgs>: ffffffff8020be03: 197954 65 48 89 24 25 18 00 mov %rsp,%gs:0x18 ffffffff8020be0a: 0 00 00 ffffffff8020be0c: 6578 65 48 8b 24 25 10 00 mov %gs:0x10,%rsp ffffffff8020be13: 0 00 00 ffffffff8020be15: 0 fb sti ffffffff8020be16: 0 48 83 ec 50 sub $0x50,%rsp And we should probably weaken the generic code alignment rules as well on x86. I'll do some measurements of it. Ingo Index: linux/arch/x86/kernel/entry_64.S =================================================================== --- linux.orig/arch/x86/kernel/entry_64.S +++ linux/arch/x86/kernel/entry_64.S @@ -315,7 +315,8 @@ ENTRY(system_call) * after the swapgs, so that it can do the swapgs * for the guest and jump here on syscall. */ -ENTRY(system_call_after_swapgs) +.globl system_call_after_swapgs +system_call_after_swapgs: movq %rsp,%gs:pda_oldrsp movq %gs:pda_kernelstack,%rsp ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: system_call() - Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 21:59 ` system_call() - " Ingo Molnar @ 2008-11-17 22:09 ` Linus Torvalds 0 siblings, 0 replies; 191+ messages in thread From: Linus Torvalds @ 2008-11-17 22:09 UTC (permalink / raw) To: Ingo Molnar Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger, H. Peter Anvin, Thomas Gleixner On Mon, 17 Nov 2008, Ingo Molnar wrote: > > syscall entry instruction costs - unavoidable security checks, etc. - > hardware costs. Yes. One thing to look out for on x86 is the system call _return_ path. It doesn't show up in kernel profiles (it shows up as user costs), and we had a bug where auditing essentially always caused us to use 'iret' instead of 'sysret' because it took us the long way around. And profiling doesn't show it, but things like lmbench did, iret being about five times slower than sysret. But yes: > -ENTRY(system_call_after_swapgs) > +.globl system_call_after_swapgs > +system_call_after_swapgs: This definitely makes sense. We definitely do not want to align that special case. Linus ^ permalink raw reply [flat|nested] 191+ messages in thread
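The "+13 bytes" Ingo measures falls straight out of the directive's arithmetic: `.p2align 4,,15` pads the location counter to a 16-byte boundary whenever that costs at most 15 filler bytes. A small model of the computation (function name invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Padding inserted by ".p2align pow2,,max": round the location
 * counter up to a 2^pow2 boundary, but only if that needs at most
 * 'max' filler bytes (for pow2 = 4, max = 15 it always fires).
 */
static unsigned int p2align_padding(uint64_t addr, unsigned int pow2,
				    unsigned int max)
{
	uint64_t mask = ((uint64_t)1 << pow2) - 1;
	unsigned int pad = (unsigned int)(-addr & mask);

	return pad <= max ? pad : 0;
}
```

In the first profile, swapgs ends at ...be03, so aligning system_call_after_swapgs to 16 bytes inserts exactly the 13 NOP bytes (the xchg sequences) that the hot path then executes.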
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (11 preceding siblings ...) 2008-11-17 21:59 ` system_call() - " Ingo Molnar @ 2008-11-17 22:08 ` Ingo Molnar 2008-11-17 22:15 ` Eric Dumazet 2008-11-17 22:14 ` tcp_transmit_skb() - " Ingo Molnar 2008-11-17 22:19 ` Ingo Molnar 14 siblings, 1 reply; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 22:08 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 1.469183 tcp_current_mss hits (total: 146918) ......... ffffffff804c5237: 526 <tcp_current_mss>: ffffffff804c5237: 526 41 54 push %r12 ffffffff804c5239: 5929 55 push %rbp ffffffff804c523a: 32 53 push %rbx ffffffff804c523b: 294 48 89 fb mov %rdi,%rbx ffffffff804c523e: 539 48 83 ec 30 sub $0x30,%rsp ffffffff804c5242: 2590 85 f6 test %esi,%esi ffffffff804c5244: 444 48 8b 4f 78 mov 0x78(%rdi),%rcx ffffffff804c5248: 521 8b af 4c 04 00 00 mov 0x44c(%rdi),%ebp ffffffff804c524e: 791 74 2a je ffffffff804c527a <tcp_current_mss+0x43> ffffffff804c5250: 433 8b 87 00 01 00 00 mov 0x100(%rdi),%eax ffffffff804c5256: 236 c1 e0 10 shl $0x10,%eax ffffffff804c5259: 191 89 c2 mov %eax,%edx ffffffff804c525b: 487 23 97 fc 00 00 00 and 0xfc(%rdi),%edx ffffffff804c5261: 362 39 c2 cmp %eax,%edx ffffffff804c5263: 342 75 15 jne ffffffff804c527a <tcp_current_mss+0x43> ffffffff804c5265: 473 45 31 e4 xor %r12d,%r12d ffffffff804c5268: 221 8b 87 00 04 00 00 mov 0x400(%rdi),%eax ffffffff804c526e: 194 3b 87 80 04 00 00 cmp 0x480(%rdi),%eax ffffffff804c5274: 445 41 0f 94 c4 sete %r12b ffffffff804c5278: 261 eb 03 jmp ffffffff804c527d <tcp_current_mss+0x46> ffffffff804c527a: 0 45 31 e4 xor %r12d,%r12d ffffffff804c527d: 185 48 85 c9 test %rcx,%rcx ffffffff804c5280: 686 74 15 je ffffffff804c5297 <tcp_current_mss+0x60> ffffffff804c5282: 1806 8b 71 7c mov 
0x7c(%rcx),%esi ffffffff804c5285: 1 3b b3 5c 03 00 00 cmp 0x35c(%rbx),%esi ffffffff804c528b: 21 74 0a je ffffffff804c5297 <tcp_current_mss+0x60> ffffffff804c528d: 0 48 89 df mov %rbx,%rdi ffffffff804c5290: 0 e8 8b fb ff ff callq ffffffff804c4e20 <tcp_sync_mss> ffffffff804c5295: 0 89 c5 mov %eax,%ebp ffffffff804c5297: 864 48 8d 4c 24 28 lea 0x28(%rsp),%rcx ffffffff804c529c: 634 48 8d 54 24 10 lea 0x10(%rsp),%rdx ffffffff804c52a1: 995 31 f6 xor %esi,%esi ffffffff804c52a3: 0 48 89 df mov %rbx,%rdi ffffffff804c52a6: 2 e8 f2 fe ff ff callq ffffffff804c519d <tcp_established_options> ffffffff804c52ab: 859 8b 8b e8 03 00 00 mov 0x3e8(%rbx),%ecx ffffffff804c52b1: 936 83 c0 14 add $0x14,%eax ffffffff804c52b4: 6 0f b7 d1 movzwl %cx,%edx ffffffff804c52b7: 0 39 d0 cmp %edx,%eax ffffffff804c52b9: 911 74 04 je ffffffff804c52bf <tcp_current_mss+0x88> ffffffff804c52bb: 0 29 d0 sub %edx,%eax ffffffff804c52bd: 0 29 c5 sub %eax,%ebp ffffffff804c52bf: 0 45 85 e4 test %r12d,%r12d ffffffff804c52c2: 6894 89 e8 mov %ebp,%eax ffffffff804c52c4: 0 74 38 je ffffffff804c52fe <tcp_current_mss+0xc7> ffffffff804c52c6: 990 48 8b 83 68 03 00 00 mov 0x368(%rbx),%rax ffffffff804c52cd: 642 8b b3 04 01 00 00 mov 0x104(%rbx),%esi ffffffff804c52d3: 3 48 89 df mov %rbx,%rdi ffffffff804c52d6: 240 66 2b 70 30 sub 0x30(%rax),%si ffffffff804c52da: 588 66 2b b3 7e 03 00 00 sub 0x37e(%rbx),%si ffffffff804c52e1: 2 66 29 ce sub %cx,%si ffffffff804c52e4: 284 ff ce dec %esi ffffffff804c52e6: 664 0f b7 f6 movzwl %si,%esi ffffffff804c52e9: 2 e8 0a fb ff ff callq ffffffff804c4df8 <tcp_bound_to_half_wnd> ffffffff804c52ee: 68 0f b7 d0 movzwl %ax,%edx ffffffff804c52f1: 1870 89 c1 mov %eax,%ecx ffffffff804c52f3: 0 89 d0 mov %edx,%eax ffffffff804c52f5: 0 31 d2 xor %edx,%edx ffffffff804c52f7: 2135 f7 f5 div %ebp ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax ffffffff804c52fb: 1670 66 29 d0 sub %dx,%ax ffffffff804c52fe: 0 66 89 83 ea 03 00 00 mov %ax,0x3ea(%rbx) ffffffff804c5305: 4 48 83 c4 30 add $0x30,%rsp ffffffff804c5309: 
855 89 e8 mov %ebp,%eax ffffffff804c530b: 0 5b pop %rbx ffffffff804c530c: 797 5d pop %rbp ffffffff804c530d: 0 41 5c pop %r12 ffffffff804c530f: 0 c3 retq apparently this division causes 1.0% of tbench overhead: ffffffff804c52f5: 0 31 d2 xor %edx,%edx ffffffff804c52f7: 2135 f7 f5 div %ebp ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax (gdb) list *0xffffffff804c52f7 0xffffffff804c52f7 is in tcp_current_mss (net/ipv4/tcp_output.c:1078). 1073 inet_csk(sk)->icsk_af_ops->net_header_len - 1074 inet_csk(sk)->icsk_ext_hdr_len - 1075 tp->tcp_header_len); 1076 1077 xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal); 1078 xmit_size_goal -= (xmit_size_goal % mss_now); 1079 } 1080 tp->xmit_size_goal = xmit_size_goal; 1081 1082 return mss_now; (gdb) it's this division: if (doing_tso) { [...] xmit_size_goal -= (xmit_size_goal % mss_now); Has no-one hit this before? Perhaps this is why switching loopback networking to TSO had a performance impact for others? It's still a bit weird ... how can a single division cause this much overhead? tcp_bound_to_half_wnd() [which is called straight before this sequence] seems low-overhead. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
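The line Ingo singles out rounds the send goal down to a whole number of MSS-sized segments; the modulo is what compiles to the expensive `div`. A hedged extract (function name invented; the logic is lifted from the quoted tcp_output.c line):

```c
#include <assert.h>

/*
 * xmit_size_goal -= (xmit_size_goal % mss_now): trim the TSO goal
 * to an exact multiple of the current MSS.  The '%' is the 'div'
 * instruction whose latency dominates this function's profile.
 */
static unsigned int trim_to_mss(unsigned int xmit_size_goal,
				unsigned int mss_now)
{
	return xmit_size_goal - (xmit_size_goal % mss_now);
}
```

Since mss_now changes rarely per connection, one common remedy for such hot divisions is caching a precomputed reciprocal and using a multiply-high instead; with a typical 1448-byte MSS, a 65535-byte goal trims to 45 full segments (65160 bytes).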
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 22:08 ` Ingo Molnar @ 2008-11-17 22:15 ` Eric Dumazet 2008-11-17 22:26 ` Ingo Molnar 2008-11-18 5:23 ` David Miller 0 siblings, 2 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 22:15 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger Ingo Molnar a écrit : > * Ingo Molnar <mingo@elte.hu> wrote: > >> 100.000000 total >> ................ >> 1.469183 tcp_current_mss > > hits (total: 146918) > ......... > ffffffff804c5237: 526 <tcp_current_mss>: > ffffffff804c5237: 526 41 54 push %r12 > ffffffff804c5239: 5929 55 push %rbp > ffffffff804c523a: 32 53 push %rbx > ffffffff804c523b: 294 48 89 fb mov %rdi,%rbx > ffffffff804c523e: 539 48 83 ec 30 sub $0x30,%rsp > ffffffff804c5242: 2590 85 f6 test %esi,%esi > ffffffff804c5244: 444 48 8b 4f 78 mov 0x78(%rdi),%rcx > ffffffff804c5248: 521 8b af 4c 04 00 00 mov 0x44c(%rdi),%ebp > ffffffff804c524e: 791 74 2a je ffffffff804c527a <tcp_current_mss+0x43> > ffffffff804c5250: 433 8b 87 00 01 00 00 mov 0x100(%rdi),%eax > ffffffff804c5256: 236 c1 e0 10 shl $0x10,%eax > ffffffff804c5259: 191 89 c2 mov %eax,%edx > ffffffff804c525b: 487 23 97 fc 00 00 00 and 0xfc(%rdi),%edx > ffffffff804c5261: 362 39 c2 cmp %eax,%edx > ffffffff804c5263: 342 75 15 jne ffffffff804c527a <tcp_current_mss+0x43> > ffffffff804c5265: 473 45 31 e4 xor %r12d,%r12d > ffffffff804c5268: 221 8b 87 00 04 00 00 mov 0x400(%rdi),%eax > ffffffff804c526e: 194 3b 87 80 04 00 00 cmp 0x480(%rdi),%eax > ffffffff804c5274: 445 41 0f 94 c4 sete %r12b > ffffffff804c5278: 261 eb 03 jmp ffffffff804c527d <tcp_current_mss+0x46> > ffffffff804c527a: 0 45 31 e4 xor %r12d,%r12d > ffffffff804c527d: 185 48 85 c9 test %rcx,%rcx > ffffffff804c5280: 686 74 15 je ffffffff804c5297 <tcp_current_mss+0x60> > ffffffff804c5282: 1806 8b 71 7c mov 0x7c(%rcx),%esi > ffffffff804c5285: 1 3b b3 5c 03 00 00 
cmp 0x35c(%rbx),%esi > ffffffff804c528b: 21 74 0a je ffffffff804c5297 <tcp_current_mss+0x60> > ffffffff804c528d: 0 48 89 df mov %rbx,%rdi > ffffffff804c5290: 0 e8 8b fb ff ff callq ffffffff804c4e20 <tcp_sync_mss> > ffffffff804c5295: 0 89 c5 mov %eax,%ebp > ffffffff804c5297: 864 48 8d 4c 24 28 lea 0x28(%rsp),%rcx > ffffffff804c529c: 634 48 8d 54 24 10 lea 0x10(%rsp),%rdx > ffffffff804c52a1: 995 31 f6 xor %esi,%esi > ffffffff804c52a3: 0 48 89 df mov %rbx,%rdi > ffffffff804c52a6: 2 e8 f2 fe ff ff callq ffffffff804c519d <tcp_established_options> > ffffffff804c52ab: 859 8b 8b e8 03 00 00 mov 0x3e8(%rbx),%ecx > ffffffff804c52b1: 936 83 c0 14 add $0x14,%eax > ffffffff804c52b4: 6 0f b7 d1 movzwl %cx,%edx > ffffffff804c52b7: 0 39 d0 cmp %edx,%eax > ffffffff804c52b9: 911 74 04 je ffffffff804c52bf <tcp_current_mss+0x88> > ffffffff804c52bb: 0 29 d0 sub %edx,%eax > ffffffff804c52bd: 0 29 c5 sub %eax,%ebp > ffffffff804c52bf: 0 45 85 e4 test %r12d,%r12d > ffffffff804c52c2: 6894 89 e8 mov %ebp,%eax > ffffffff804c52c4: 0 74 38 je ffffffff804c52fe <tcp_current_mss+0xc7> > ffffffff804c52c6: 990 48 8b 83 68 03 00 00 mov 0x368(%rbx),%rax > ffffffff804c52cd: 642 8b b3 04 01 00 00 mov 0x104(%rbx),%esi > ffffffff804c52d3: 3 48 89 df mov %rbx,%rdi > ffffffff804c52d6: 240 66 2b 70 30 sub 0x30(%rax),%si > ffffffff804c52da: 588 66 2b b3 7e 03 00 00 sub 0x37e(%rbx),%si > ffffffff804c52e1: 2 66 29 ce sub %cx,%si > ffffffff804c52e4: 284 ff ce dec %esi > ffffffff804c52e6: 664 0f b7 f6 movzwl %si,%esi > ffffffff804c52e9: 2 e8 0a fb ff ff callq ffffffff804c4df8 <tcp_bound_to_half_wnd> > ffffffff804c52ee: 68 0f b7 d0 movzwl %ax,%edx > ffffffff804c52f1: 1870 89 c1 mov %eax,%ecx > ffffffff804c52f3: 0 89 d0 mov %edx,%eax > ffffffff804c52f5: 0 31 d2 xor %edx,%edx > ffffffff804c52f7: 2135 f7 f5 div %ebp > ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax > ffffffff804c52fb: 1670 66 29 d0 sub %dx,%ax > ffffffff804c52fe: 0 66 89 83 ea 03 00 00 mov %ax,0x3ea(%rbx) > ffffffff804c5305: 4 48 83 c4 30 add $0x30,%rsp 
> ffffffff804c5309: 855 89 e8 mov %ebp,%eax > ffffffff804c530b: 0 5b pop %rbx > ffffffff804c530c: 797 5d pop %rbp > ffffffff804c530d: 0 41 5c pop %r12 > ffffffff804c530f: 0 c3 retq > > apparently this division causes 1.0% of tbench overhead: > > ffffffff804c52f5: 0 31 d2 xor %edx,%edx > ffffffff804c52f7: 2135 f7 f5 div %ebp > ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax > > (gdb) list *0xffffffff804c52f7 > 0xffffffff804c52f7 is in tcp_current_mss (net/ipv4/tcp_output.c:1078). > 1073 inet_csk(sk)->icsk_af_ops->net_header_len - > 1074 inet_csk(sk)->icsk_ext_hdr_len - > 1075 tp->tcp_header_len); > 1076 > 1077 xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal); > 1078 xmit_size_goal -= (xmit_size_goal % mss_now); > 1079 } > 1080 tp->xmit_size_goal = xmit_size_goal; > 1081 > 1082 return mss_now; > (gdb) > > it's this division: > > if (doing_tso) { > [...] > xmit_size_goal -= (xmit_size_goal % mss_now); > > Has no-one hit this before? Perhaps this is why switching loopback > networking to TSO had a performance impact for others? Yes, I mentioned it later. But apparently you dont read my mails, so I will just stop now. > > It's still a bit weird ... how can a single division cause this much > overhead? tcp_bound_to_half_wnd() [which is called straight before > this sequence] seems low-overhead. > > Ingo > > ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 22:15 ` Eric Dumazet @ 2008-11-17 22:26 ` Ingo Molnar 2008-11-17 22:39 ` Eric Dumazet 2008-11-18 5:23 ` David Miller 1 sibling, 1 reply; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 22:26 UTC (permalink / raw) To: Eric Dumazet Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Eric Dumazet <dada1@cosmosbay.com> wrote: > Ingo Molnar a écrit : >> * Ingo Molnar <mingo@elte.hu> wrote: >> >>> 100.000000 total >>> ................ >>> 1.469183 tcp_current_mss >> >> hits (total: 146918) >> ......... >> ffffffff804c5237: 526 <tcp_current_mss>: >> ffffffff804c5237: 526 41 54 push %r12 >> ffffffff804c5239: 5929 55 push %rbp >> ffffffff804c523a: 32 53 push %rbx >> ffffffff804c523b: 294 48 89 fb mov %rdi,%rbx >> ffffffff804c523e: 539 48 83 ec 30 sub $0x30,%rsp >> ffffffff804c5242: 2590 85 f6 test %esi,%esi >> ffffffff804c5244: 444 48 8b 4f 78 mov 0x78(%rdi),%rcx >> ffffffff804c5248: 521 8b af 4c 04 00 00 mov 0x44c(%rdi),%ebp >> ffffffff804c524e: 791 74 2a je ffffffff804c527a <tcp_current_mss+0x43> >> ffffffff804c5250: 433 8b 87 00 01 00 00 mov 0x100(%rdi),%eax >> ffffffff804c5256: 236 c1 e0 10 shl $0x10,%eax >> ffffffff804c5259: 191 89 c2 mov %eax,%edx >> ffffffff804c525b: 487 23 97 fc 00 00 00 and 0xfc(%rdi),%edx >> ffffffff804c5261: 362 39 c2 cmp %eax,%edx >> ffffffff804c5263: 342 75 15 jne ffffffff804c527a <tcp_current_mss+0x43> >> ffffffff804c5265: 473 45 31 e4 xor %r12d,%r12d >> ffffffff804c5268: 221 8b 87 00 04 00 00 mov 0x400(%rdi),%eax >> ffffffff804c526e: 194 3b 87 80 04 00 00 cmp 0x480(%rdi),%eax >> ffffffff804c5274: 445 41 0f 94 c4 sete %r12b >> ffffffff804c5278: 261 eb 03 jmp ffffffff804c527d <tcp_current_mss+0x46> >> ffffffff804c527a: 0 45 31 e4 xor %r12d,%r12d >> ffffffff804c527d: 185 48 85 c9 test %rcx,%rcx >> ffffffff804c5280: 686 74 15 je ffffffff804c5297 <tcp_current_mss+0x60> >> 
ffffffff804c5282: 1806 8b 71 7c mov 0x7c(%rcx),%esi >> ffffffff804c5285: 1 3b b3 5c 03 00 00 cmp 0x35c(%rbx),%esi >> ffffffff804c528b: 21 74 0a je ffffffff804c5297 <tcp_current_mss+0x60> >> ffffffff804c528d: 0 48 89 df mov %rbx,%rdi >> ffffffff804c5290: 0 e8 8b fb ff ff callq ffffffff804c4e20 <tcp_sync_mss> >> ffffffff804c5295: 0 89 c5 mov %eax,%ebp >> ffffffff804c5297: 864 48 8d 4c 24 28 lea 0x28(%rsp),%rcx >> ffffffff804c529c: 634 48 8d 54 24 10 lea 0x10(%rsp),%rdx >> ffffffff804c52a1: 995 31 f6 xor %esi,%esi >> ffffffff804c52a3: 0 48 89 df mov %rbx,%rdi >> ffffffff804c52a6: 2 e8 f2 fe ff ff callq ffffffff804c519d <tcp_established_options> >> ffffffff804c52ab: 859 8b 8b e8 03 00 00 mov 0x3e8(%rbx),%ecx >> ffffffff804c52b1: 936 83 c0 14 add $0x14,%eax >> ffffffff804c52b4: 6 0f b7 d1 movzwl %cx,%edx >> ffffffff804c52b7: 0 39 d0 cmp %edx,%eax >> ffffffff804c52b9: 911 74 04 je ffffffff804c52bf <tcp_current_mss+0x88> >> ffffffff804c52bb: 0 29 d0 sub %edx,%eax >> ffffffff804c52bd: 0 29 c5 sub %eax,%ebp >> ffffffff804c52bf: 0 45 85 e4 test %r12d,%r12d >> ffffffff804c52c2: 6894 89 e8 mov %ebp,%eax >> ffffffff804c52c4: 0 74 38 je ffffffff804c52fe <tcp_current_mss+0xc7> >> ffffffff804c52c6: 990 48 8b 83 68 03 00 00 mov 0x368(%rbx),%rax >> ffffffff804c52cd: 642 8b b3 04 01 00 00 mov 0x104(%rbx),%esi >> ffffffff804c52d3: 3 48 89 df mov %rbx,%rdi >> ffffffff804c52d6: 240 66 2b 70 30 sub 0x30(%rax),%si >> ffffffff804c52da: 588 66 2b b3 7e 03 00 00 sub 0x37e(%rbx),%si >> ffffffff804c52e1: 2 66 29 ce sub %cx,%si >> ffffffff804c52e4: 284 ff ce dec %esi >> ffffffff804c52e6: 664 0f b7 f6 movzwl %si,%esi >> ffffffff804c52e9: 2 e8 0a fb ff ff callq ffffffff804c4df8 <tcp_bound_to_half_wnd> >> ffffffff804c52ee: 68 0f b7 d0 movzwl %ax,%edx >> ffffffff804c52f1: 1870 89 c1 mov %eax,%ecx >> ffffffff804c52f3: 0 89 d0 mov %edx,%eax >> ffffffff804c52f5: 0 31 d2 xor %edx,%edx >> ffffffff804c52f7: 2135 f7 f5 div %ebp >> ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax >> ffffffff804c52fb: 1670 66 
29 d0 sub %dx,%ax >> ffffffff804c52fe: 0 66 89 83 ea 03 00 00 mov %ax,0x3ea(%rbx) >> ffffffff804c5305: 4 48 83 c4 30 add $0x30,%rsp >> ffffffff804c5309: 855 89 e8 mov %ebp,%eax >> ffffffff804c530b: 0 5b pop %rbx >> ffffffff804c530c: 797 5d pop %rbp >> ffffffff804c530d: 0 41 5c pop %r12 >> ffffffff804c530f: 0 c3 retq >> >> apparently this division causes 1.0% of tbench overhead: >> >> ffffffff804c52f5: 0 31 d2 xor %edx,%edx >> ffffffff804c52f7: 2135 f7 f5 div %ebp >> ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax >> >> (gdb) list *0xffffffff804c52f7 >> 0xffffffff804c52f7 is in tcp_current_mss (net/ipv4/tcp_output.c:1078). >> 1073 inet_csk(sk)->icsk_af_ops->net_header_len - >> 1074 inet_csk(sk)->icsk_ext_hdr_len - >> 1075 tp->tcp_header_len); >> 1076 >> 1077 xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal); >> 1078 xmit_size_goal -= (xmit_size_goal % mss_now); >> 1079 } >> 1080 tp->xmit_size_goal = xmit_size_goal; >> 1081 >> 1082 return mss_now; >> (gdb) >> >> it's this division: >> >> if (doing_tso) { >> [...] >> xmit_size_goal -= (xmit_size_goal % mss_now); >> >> Has no-one hit this before? Perhaps this is why switching loopback >> networking to TSO had a performance impact for others? > > Yes, I mentioned it later. [...] i see - i just caught up with some of my inbox from today. > [...] But apparently you dont read my mails, so I will just stop > now. Sorry, i spent my time looking at the profile output. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 22:26 ` Ingo Molnar @ 2008-11-17 22:39 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-17 22:39 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger Ingo Molnar a écrit : > * Eric Dumazet <dada1@cosmosbay.com> wrote: > >> Ingo Molnar a écrit : >>> it's this division: >>> >>> if (doing_tso) { >>> [...] >>> xmit_size_goal -= (xmit_size_goal % mss_now); >>> >>> Has no-one hit this before? Perhaps this is why switching loopback >>> networking to TSO had a performance impact for others? >> Yes, I mentioned it later. [...] > > i see - i just caught up with some of my inbox from today. > >> [...] But apparently you dont read my mails, so I will just stop >> now. > > Sorry, i spent my time looking at the profile output. > No problem, Ingo, I am very glad you take so much time to profile the kernel ;) I had too many problems with profilers on my dev machine lately :( ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 22:15 ` Eric Dumazet 2008-11-17 22:26 ` Ingo Molnar @ 2008-11-18 5:23 ` David Miller 2008-11-18 8:45 ` Ingo Molnar 1 sibling, 1 reply; 191+ messages in thread From: David Miller @ 2008-11-18 5:23 UTC (permalink / raw) To: dada1 Cc: mingo, torvalds, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger From: Eric Dumazet <dada1@cosmosbay.com> Date: Mon, 17 Nov 2008 23:15:50 +0100 > Yes, I mentioned it later. But apparently you dont read my mails, so > I will just stop now. Yeah I was going to mention this too :-/ ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-18 5:23 ` David Miller @ 2008-11-18 8:45 ` Ingo Molnar 0 siblings, 0 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-18 8:45 UTC (permalink / raw) To: David Miller Cc: dada1, torvalds, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, shemminger * David Miller <davem@davemloft.net> wrote: > From: Eric Dumazet <dada1@cosmosbay.com> > Date: Mon, 17 Nov 2008 23:15:50 +0100 > > > Yes, I mentioned it later. But apparently you dont read my mails, > > so I will just stop now. > > Yeah I was going to mention this too :-/ I spent hours profiling the networking code, and no, i didn't read all the incoming emails in parallel - i read them after that. I have established beyond reasonable doubt that the scheduler is doing the right thing with the config i've posted. Your "wakeup is two orders of magnitude more expensive" claim, which got me to measure and profile this stuff, is not reproducible here, and this regression should not be listed as a scheduler regression. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* tcp_transmit_skb() - Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 18:49 ` Ingo Molnar ` (12 preceding siblings ...) 2008-11-17 22:08 ` Ingo Molnar @ 2008-11-17 22:14 ` Ingo Molnar 2008-11-17 22:19 ` Ingo Molnar 14 siblings, 0 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 22:14 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger * Ingo Molnar <mingo@elte.hu> wrote: > 100.000000 total > ................ > 1.431553 tcp_transmit_skb hits (total: 143155) ......... ffffffff804c550e: 485 <tcp_transmit_skb>: ffffffff804c550e: 485 41 57 push %r15 ffffffff804c5510: 5692 41 56 push %r14 ffffffff804c5512: 390 49 89 f6 mov %rsi,%r14 ffffffff804c5515: 0 41 55 push %r13 ffffffff804c5517: 69 41 54 push %r12 ffffffff804c5519: 388 41 89 d4 mov %edx,%r12d ffffffff804c551c: 0 55 push %rbp ffffffff804c551d: 66 48 89 fd mov %rdi,%rbp ffffffff804c5520: 405 53 push %rbx ffffffff804c5521: 0 89 cb mov %ecx,%ebx ffffffff804c5523: 75 48 83 ec 38 sub $0x38,%rsp ffffffff804c5527: 396 48 85 f6 test %rsi,%rsi ffffffff804c552a: 51 74 15 je ffffffff804c5541 <tcp_transmit_skb+0x33> ffffffff804c552c: 396 8b 96 c8 00 00 00 mov 0xc8(%rsi),%edx ffffffff804c5532: 1 48 8b 86 d0 00 00 00 mov 0xd0(%rsi),%rax ffffffff804c5539: 63 66 83 7c 02 08 00 cmpw $0x0,0x8(%rdx,%rax,1) ffffffff804c553f: 417 75 04 jne ffffffff804c5545 <tcp_transmit_skb+0x37> ffffffff804c5541: 0 0f 0b ud2a ffffffff804c5543: 0 eb fe jmp ffffffff804c5543 <tcp_transmit_skb+0x35> ffffffff804c5545: 3719 48 8b 87 60 03 00 00 mov 0x360(%rdi),%rax ffffffff804c554c: 2873 f6 40 10 02 testb $0x2,0x10(%rax) ffffffff804c5550: 1 74 09 je ffffffff804c555b <tcp_transmit_skb+0x4d> ffffffff804c5552: 0 e8 1d 48 d8 ff callq ffffffff80249d74 <ktime_get_real> ffffffff804c5557: 0 49 89 46 18 mov %rax,0x18(%r14) ffffffff804c555b: 487 45 85 e4 test %r12d,%r12d ffffffff804c555e: 456 74 33 je 
ffffffff804c5593 <tcp_transmit_skb+0x85> ffffffff804c5560: 0 4c 89 f7 mov %r14,%rdi ffffffff804c5563: 482 e8 28 f4 ff ff callq ffffffff804c4990 <skb_cloned> ffffffff804c5568: 1469 85 c0 test %eax,%eax ffffffff804c556a: 1085 74 0c je ffffffff804c5578 <tcp_transmit_skb+0x6a> ffffffff804c556c: 0 89 de mov %ebx,%esi ffffffff804c556e: 0 4c 89 f7 mov %r14,%rdi ffffffff804c5571: 0 e8 47 41 fc ff callq ffffffff804896bd <pskb_copy> ffffffff804c5576: 0 eb 0a jmp ffffffff804c5582 <tcp_transmit_skb+0x74> ffffffff804c5578: 0 89 de mov %ebx,%esi ffffffff804c557a: 906 4c 89 f7 mov %r14,%rdi ffffffff804c557d: 0 e8 ab 35 fc ff callq ffffffff80488b2d <skb_clone> ffffffff804c5582: 0 48 85 c0 test %rax,%rax ffffffff804c5585: 7 49 89 c6 mov %rax,%r14 ffffffff804c5588: 576 bb 97 ff ff ff mov $0xffffff97,%ebx ffffffff804c558d: 0 0f 84 59 05 00 00 je ffffffff804c5aec <tcp_transmit_skb+0x5de> ffffffff804c5593: 0 49 8d 46 38 lea 0x38(%r14),%rax ffffffff804c5597: 699 48 8d 54 24 10 lea 0x10(%rsp),%rdx ffffffff804c559c: 1 fc cld ffffffff804c559d: 452 48 89 04 24 mov %rax,(%rsp) ffffffff804c55a1: 40 48 89 d7 mov %rdx,%rdi ffffffff804c55a4: 1 31 c0 xor %eax,%eax ffffffff804c55a6: 432 ab stos %eax,%es:(%rdi) ffffffff804c55a7: 956 ab stos %eax,%es:(%rdi) ffffffff804c55a8: 959 ab stos %eax,%es:(%rdi) ffffffff804c55a9: 910 ab stos %eax,%es:(%rdi) ffffffff804c55aa: 943 48 8b 0c 24 mov (%rsp),%rcx ffffffff804c55ae: 455 f6 41 24 02 testb $0x2,0x24(%rcx) ffffffff804c55b2: 0 0f 84 b7 00 00 00 je ffffffff804c566f <tcp_transmit_skb+0x161> ffffffff804c55b8: 0 48 8b 85 b8 05 00 00 mov 0x5b8(%rbp),%rax ffffffff804c55bf: 0 48 89 ee mov %rbp,%rsi ffffffff804c55c2: 0 48 89 ef mov %rbp,%rdi ffffffff804c55c5: 0 ff 10 callq *(%rax) ffffffff804c55c7: 0 31 f6 xor %esi,%esi ffffffff804c55c9: 0 48 85 c0 test %rax,%rax ffffffff804c55cc: 0 48 89 44 24 28 mov %rax,0x28(%rsp) ffffffff804c55d1: 0 74 08 je ffffffff804c55db <tcp_transmit_skb+0xcd> ffffffff804c55d3: 0 80 4c 24 10 04 orb $0x4,0x10(%rsp) ffffffff804c55d8: 0 40 
b6 14 mov $0x14,%sil ffffffff804c55db: 0 48 8b 55 78 mov 0x78(%rbp),%rdx ffffffff804c55df: 0 0f b7 85 5c 04 00 00 movzwl 0x45c(%rbp),%eax ffffffff804c55e6: 0 48 85 d2 test %rdx,%rdx ffffffff804c55e9: 0 74 13 je ffffffff804c55fe <tcp_transmit_skb+0xf0> ffffffff804c55eb: 0 8b 92 94 00 00 00 mov 0x94(%rdx),%edx ffffffff804c55f1: 0 39 c2 cmp %eax,%edx ffffffff804c55f3: 0 73 09 jae ffffffff804c55fe <tcp_transmit_skb+0xf0> ffffffff804c55f5: 0 89 d0 mov %edx,%eax ffffffff804c55f7: 0 66 89 95 5c 04 00 00 mov %dx,0x45c(%rbp) ffffffff804c55fe: 0 83 3d 23 2e 3f 00 00 cmpl $0x0,0x3f2e23(%rip) # ffffffff808b8428 <sysctl_tcp_timestamps> ffffffff804c5605: 0 66 89 44 24 14 mov %ax,0x14(%rsp) ffffffff804c560a: 0 8d 4e 04 lea 0x4(%rsi),%ecx ffffffff804c560d: 0 74 25 je ffffffff804c5634 <tcp_transmit_skb+0x126> ffffffff804c560f: 0 48 83 7c 24 28 00 cmpq $0x0,0x28(%rsp) ffffffff804c5615: 0 75 1d jne ffffffff804c5634 <tcp_transmit_skb+0x126> ffffffff804c5617: 0 48 8b 14 24 mov (%rsp),%rdx ffffffff804c561b: 0 80 4c 24 10 02 orb $0x2,0x10(%rsp) ffffffff804c5620: 0 8d 4e 10 lea 0x10(%rsi),%ecx ffffffff804c5623: 0 8b 42 20 mov 0x20(%rdx),%eax ffffffff804c5626: 0 89 44 24 18 mov %eax,0x18(%rsp) ffffffff804c562a: 0 8b 85 90 04 00 00 mov 0x490(%rbp),%eax ffffffff804c5630: 0 89 44 24 1c mov %eax,0x1c(%rsp) ffffffff804c5634: 0 83 3d f1 2d 3f 00 00 cmpl $0x0,0x3f2df1(%rip) # ffffffff808b842c <sysctl_tcp_window_scaling> ffffffff804c563b: 0 74 15 je ffffffff804c5652 <tcp_transmit_skb+0x144> ffffffff804c563d: 0 8a 85 9d 04 00 00 mov 0x49d(%rbp),%al ffffffff804c5643: 0 8d 51 04 lea 0x4(%rcx),%edx ffffffff804c5646: 0 c0 e8 04 shr $0x4,%al ffffffff804c5649: 0 84 c0 test %al,%al ffffffff804c564b: 0 88 44 24 11 mov %al,0x11(%rsp) ffffffff804c564f: 0 0f 45 ca cmovne %edx,%ecx ffffffff804c5652: 0 83 3d d7 2d 3f 00 00 cmpl $0x0,0x3f2dd7(%rip) # ffffffff808b8430 <sysctl_tcp_sack> ffffffff804c5659: 0 74 26 je ffffffff804c5681 <tcp_transmit_skb+0x173> ffffffff804c565b: 0 8a 44 24 10 mov 0x10(%rsp),%al 
ffffffff804c565f: 0 83 c8 01 or $0x1,%eax ffffffff804c5662: 0 a8 02 test $0x2,%al ffffffff804c5664: 0 88 44 24 10 mov %al,0x10(%rsp) ffffffff804c5668: 0 75 17 jne ffffffff804c5681 <tcp_transmit_skb+0x173> ffffffff804c566a: 0 83 c1 04 add $0x4,%ecx ffffffff804c566d: 0 eb 12 jmp ffffffff804c5681 <tcp_transmit_skb+0x173> ffffffff804c566f: 502 48 8d 4c 24 28 lea 0x28(%rsp),%rcx ffffffff804c5674: 638 4c 89 f6 mov %r14,%rsi ffffffff804c5677: 0 48 89 ef mov %rbp,%rdi ffffffff804c567a: 0 e8 1e fb ff ff callq ffffffff804c519d <tcp_established_options> ffffffff804c567f: 468 89 c1 mov %eax,%ecx ffffffff804c5681: 1605 8b 85 74 04 00 00 mov 0x474(%rbp),%eax ffffffff804c5687: 307 03 85 78 04 00 00 add 0x478(%rbp),%eax ffffffff804c568d: 0 44 8d 69 14 lea 0x14(%rcx),%r13d ffffffff804c5691: 409 2b 85 d0 04 00 00 sub 0x4d0(%rbp),%eax ffffffff804c5697: 89 3b 85 cc 04 00 00 cmp 0x4cc(%rbp),%eax ffffffff804c569d: 0 75 0a jne ffffffff804c56a9 <tcp_transmit_skb+0x19b> ffffffff804c569f: 415 31 f6 xor %esi,%esi ffffffff804c56a1: 210 48 89 ef mov %rbp,%rdi ffffffff804c56a4: 0 e8 b0 f3 ff ff callq ffffffff804c4a59 <tcp_ca_event> ffffffff804c56a9: 1050 44 89 ee mov %r13d,%esi ffffffff804c56ac: 1063 4c 89 f7 mov %r14,%rdi ffffffff804c56af: 0 e8 00 34 fc ff callq ffffffff80488ab4 <skb_push> ffffffff804c56b4: 0 4c 89 f7 mov %r14,%rdi ffffffff804c56b7: 789 e8 4f f3 ff ff callq ffffffff804c4a0b <skb_reset_transport_header> ffffffff804c56bc: 509 f0 ff 45 28 lock incl 0x28(%rbp) ffffffff804c56c0: 494 49 89 6e 10 mov %rbp,0x10(%r14) ffffffff804c56c4: 3510 49 c7 86 80 00 00 00 movq $0xffffffff80486679,0x80(%r14) ffffffff804c56cb: 0 79 66 48 80 ffffffff804c56cf: 102 41 8b 86 e0 00 00 00 mov 0xe0(%r14),%eax ffffffff804c56d6: 155 f0 01 85 98 00 00 00 lock add %eax,0x98(%rbp) ffffffff804c56dd: 437 41 8b 9e b8 00 00 00 mov 0xb8(%r14),%ebx ffffffff804c56e4: 219 8b 85 50 02 00 00 mov 0x250(%rbp),%eax ffffffff804c56ea: 71 49 03 9e d0 00 00 00 add 0xd0(%r14),%rbx ffffffff804c56f1: 735 66 89 03 mov %ax,(%rbx) 
ffffffff804c56f4: 0 8b 85 38 02 00 00 mov 0x238(%rbp),%eax ffffffff804c56fa: 75 66 89 43 02 mov %ax,0x2(%rbx) ffffffff804c56fe: 720 48 8b 0c 24 mov (%rsp),%rcx ffffffff804c5702: 5992 8b 41 18 mov 0x18(%rcx),%eax ffffffff804c5705: 1460 0f c8 bswap %eax ffffffff804c5707: 60 89 43 04 mov %eax,0x4(%rbx) ffffffff804c570a: 69 8b 85 f0 03 00 00 mov 0x3f0(%rbp),%eax ffffffff804c5710: 374 0f c8 bswap %eax ffffffff804c5712: 43 89 43 08 mov %eax,0x8(%rbx) ffffffff804c5715: 76 0f b6 51 24 movzbl 0x24(%rcx),%edx ffffffff804c5719: 337 44 89 e8 mov %r13d,%eax ffffffff804c571c: 36 c1 e8 02 shr $0x2,%eax ffffffff804c571f: 76 c1 e0 0c shl $0xc,%eax ffffffff804c5722: 476 09 d0 or %edx,%eax ffffffff804c5724: 48 66 c1 c0 08 rol $0x8,%ax ffffffff804c5728: 51 66 89 43 0c mov %ax,0xc(%rbx) ffffffff804c572c: 370 0f b6 41 24 movzbl 0x24(%rcx),%eax ffffffff804c5730: 137 89 c2 mov %eax,%edx ffffffff804c5732: 118 83 e2 02 and $0x2,%edx ffffffff804c5735: 377 74 1b je ffffffff804c5752 <tcp_transmit_skb+0x244> ffffffff804c5737: 0 81 bd c0 04 00 00 ff cmpl $0xffff,0x4c0(%rbp) ffffffff804c573e: 0 ff 00 00 ffffffff804c5741: 0 b8 ff ff 00 00 mov $0xffff,%eax ffffffff804c5746: 0 0f 46 85 c0 04 00 00 cmovbe 0x4c0(%rbp),%eax ffffffff804c574d: 0 e9 a0 00 00 00 jmpq ffffffff804c57f2 <tcp_transmit_skb+0x2e4> ffffffff804c5752: 34 8b 85 f8 03 00 00 mov 0x3f8(%rbp),%eax ffffffff804c5758: 5610 03 85 c0 04 00 00 add 0x4c0(%rbp),%eax ffffffff804c575e: 44 41 89 d4 mov %edx,%r12d ffffffff804c5761: 539 2b 85 f0 03 00 00 sub 0x3f0(%rbp),%eax ffffffff804c5767: 1 48 89 ef mov %rbp,%rdi ffffffff804c576a: 51 44 0f 49 e0 cmovns %eax,%r12d ffffffff804c576e: 495 e8 7e f8 ff ff callq ffffffff804c4ff1 <__tcp_select_window> ffffffff804c5773: 484 44 39 e0 cmp %r12d,%eax ffffffff804c5776: 244 89 c2 mov %eax,%edx ffffffff804c5778: 0 73 19 jae ffffffff804c5793 <tcp_transmit_skb+0x285> ffffffff804c577a: 0 8a 8d 9d 04 00 00 mov 0x49d(%rbp),%cl ffffffff804c5780: 0 b8 01 00 00 00 mov $0x1,%eax ffffffff804c5785: 0 c0 e9 04 shr 
$0x4,%cl ffffffff804c5788: 0 d3 e0 shl %cl,%eax ffffffff804c578a: 0 42 8d 54 20 ff lea -0x1(%rax,%r12,1),%edx ffffffff804c578f: 0 f7 d8 neg %eax ffffffff804c5791: 0 21 c2 and %eax,%edx ffffffff804c5793: 217 f6 85 9d 04 00 00 f0 testb $0xf0,0x49d(%rbp) ffffffff804c579a: 2014 8b 85 f0 03 00 00 mov 0x3f0(%rbp),%eax ffffffff804c57a0: 0 89 95 c0 04 00 00 mov %edx,0x4c0(%rbp) ffffffff804c57a6: 490 89 85 f8 03 00 00 mov %eax,0x3f8(%rbp) ffffffff804c57ac: 1 75 16 jne ffffffff804c57c4 <tcp_transmit_skb+0x2b6> ffffffff804c57ae: 0 83 3d bb 2c 3f 00 00 cmpl $0x0,0x3f2cbb(%rip) # ffffffff808b8470 <sysctl_tcp_workaround_signed_windows> ffffffff804c57b5: 0 74 0d je ffffffff804c57c4 <tcp_transmit_skb+0x2b6> ffffffff804c57b7: 0 b8 ff 7f 00 00 mov $0x7fff,%eax ffffffff804c57bc: 0 81 fa ff 7f 00 00 cmp $0x7fff,%edx ffffffff804c57c2: 0 eb 12 jmp ffffffff804c57d6 <tcp_transmit_skb+0x2c8> ffffffff804c57c4: 0 8a 8d 9d 04 00 00 mov 0x49d(%rbp),%cl ffffffff804c57ca: 7025 b8 ff ff 00 00 mov $0xffff,%eax ffffffff804c57cf: 0 c0 e9 04 shr $0x4,%cl ffffffff804c57d2: 418 d3 e0 shl %cl,%eax ffffffff804c57d4: 102 39 c2 cmp %eax,%edx ffffffff804c57d6: 0 8a 8d 9d 04 00 00 mov 0x49d(%rbp),%cl ffffffff804c57dc: 424 0f 46 c2 cmovbe %edx,%eax ffffffff804c57df: 105 c0 e9 04 shr $0x4,%cl ffffffff804c57e2: 9 d3 e8 shr %cl,%eax ffffffff804c57e4: 389 85 c0 test %eax,%eax ffffffff804c57e6: 76 75 0a jne ffffffff804c57f2 <tcp_transmit_skb+0x2e4> ffffffff804c57e8: 0 c7 85 ec 03 00 00 00 movl $0x0,0x3ec(%rbp) ffffffff804c57ef: 0 00 00 00 ffffffff804c57f2: 2 66 c1 c0 08 rol $0x8,%ax ffffffff804c57f6: 1657 66 c7 43 10 00 00 movw $0x0,0x10(%rbx) ffffffff804c57fc: 35 66 c7 43 12 00 00 movw $0x0,0x12(%rbx) ffffffff804c5802: 4377 66 89 43 0e mov %ax,0xe(%rbx) ffffffff804c5806: 954 8b 95 80 04 00 00 mov 0x480(%rbp),%edx ffffffff804c580c: 31 39 95 00 04 00 00 cmp %edx,0x400(%rbp) ffffffff804c5812: 186 74 27 je ffffffff804c583b <tcp_transmit_skb+0x32d> ffffffff804c5814: 0 48 8b 34 24 mov (%rsp),%rsi ffffffff804c5818: 0 8b 
4e 18 mov 0x18(%rsi),%ecx ffffffff804c581b: 0 89 d6 mov %edx,%esi ffffffff804c581d: 0 8d 41 01 lea 0x1(%rcx),%eax ffffffff804c5820: 0 29 c6 sub %eax,%esi ffffffff804c5822: 0 81 fe fe ff 00 00 cmp $0xfffe,%esi ffffffff804c5828: 0 77 11 ja ffffffff804c583b <tcp_transmit_skb+0x32d> ffffffff804c582a: 0 89 d0 mov %edx,%eax ffffffff804c582c: 0 80 4b 0d 20 orb $0x20,0xd(%rbx) ffffffff804c5830: 0 66 29 c8 sub %cx,%ax ffffffff804c5833: 0 66 c1 c0 08 rol $0x8,%ax ffffffff804c5837: 0 66 89 43 12 mov %ax,0x12(%rbx) ffffffff804c583b: 268 48 8d 7b 14 lea 0x14(%rbx),%rdi ffffffff804c583f: 187 48 8d 4c 24 20 lea 0x20(%rsp),%rcx ffffffff804c5844: 4006 48 8d 54 24 10 lea 0x10(%rsp),%rdx ffffffff804c5849: 1117 48 89 ee mov %rbp,%rsi ffffffff804c584c: 0 e8 a9 fb ff ff callq ffffffff804c53fa <tcp_options_write> ffffffff804c5851: 1285 48 8b 04 24 mov (%rsp),%rax ffffffff804c5855: 727 f6 40 24 02 testb $0x2,0x24(%rax) ffffffff804c5859: 0 0f 85 8f 00 00 00 jne ffffffff804c58ee <tcp_transmit_skb+0x3e0> ffffffff804c585f: 0 f6 85 7e 04 00 00 01 testb $0x1,0x47e(%rbp) ffffffff804c5866: 456 0f 84 82 00 00 00 je ffffffff804c58ee <tcp_transmit_skb+0x3e0> ffffffff804c586c: 0 45 39 6e 68 cmp %r13d,0x68(%r14) ffffffff804c5870: 0 74 53 je ffffffff804c58c5 <tcp_transmit_skb+0x3b7> ffffffff804c5872: 0 8b 95 fc 03 00 00 mov 0x3fc(%rbp),%edx ffffffff804c5878: 0 39 50 18 cmp %edx,0x18(%rax) ffffffff804c587b: 0 78 48 js ffffffff804c58c5 <tcp_transmit_skb+0x3b7> ffffffff804c587d: 0 8a 85 7e 04 00 00 mov 0x47e(%rbp),%al ffffffff804c5883: 0 80 8d 54 02 00 00 02 orb $0x2,0x254(%rbp) ffffffff804c588a: 0 a8 02 test $0x2,%al ffffffff804c588c: 0 74 3e je ffffffff804c58cc <tcp_transmit_skb+0x3be> ffffffff804c588e: 0 83 e0 fd and $0xfffffffffffffffd,%eax ffffffff804c5891: 0 88 85 7e 04 00 00 mov %al,0x47e(%rbp) ffffffff804c5897: 0 41 8b 8e b8 00 00 00 mov 0xb8(%r14),%ecx ffffffff804c589e: 0 49 8b 96 d0 00 00 00 mov 0xd0(%r14),%rdx ffffffff804c58a5: 0 8a 44 11 0d mov 0xd(%rcx,%rdx,1),%al ffffffff804c58a9: 0 83 c8 80 
or $0xffffffffffffff80,%eax ffffffff804c58ac: 0 88 44 0a 0d mov %al,0xd(%rdx,%rcx,1) ffffffff804c58b0: 0 41 8b 86 c8 00 00 00 mov 0xc8(%r14),%eax ffffffff804c58b7: 0 49 03 86 d0 00 00 00 add 0xd0(%r14),%rax ffffffff804c58be: 0 66 83 48 0a 08 orw $0x8,0xa(%rax) ffffffff804c58c3: 0 eb 07 jmp ffffffff804c58cc <tcp_transmit_skb+0x3be> ffffffff804c58c5: 0 80 a5 54 02 00 00 fc andb $0xfc,0x254(%rbp) ffffffff804c58cc: 0 f6 85 7e 04 00 00 04 testb $0x4,0x47e(%rbp) ffffffff804c58d3: 0 74 19 je ffffffff804c58ee <tcp_transmit_skb+0x3e0> ffffffff804c58d5: 0 41 8b 8e b8 00 00 00 mov 0xb8(%r14),%ecx ffffffff804c58dc: 0 49 8b 96 d0 00 00 00 mov 0xd0(%r14),%rdx ffffffff804c58e3: 0 8a 44 11 0d mov 0xd(%rcx,%rdx,1),%al ffffffff804c58e7: 0 83 c8 40 or $0x40,%eax ffffffff804c58ea: 0 88 44 0a 0d mov %al,0xd(%rdx,%rcx,1) ffffffff804c58ee: 0 48 83 7c 24 28 00 cmpq $0x0,0x28(%rsp) ffffffff804c58f4: 9425 74 26 je ffffffff804c591c <tcp_transmit_skb+0x40e> ffffffff804c58f6: 0 48 8b 85 b8 05 00 00 mov 0x5b8(%rbp),%rax ffffffff804c58fd: 0 81 a5 fc 00 00 00 ff andl $0xffff,0xfc(%rbp) ffffffff804c5904: 0 ff 00 00 ffffffff804c5907: 0 4d 89 f0 mov %r14,%r8 ffffffff804c590a: 0 48 8b 74 24 28 mov 0x28(%rsp),%rsi ffffffff804c590f: 0 48 8b 7c 24 20 mov 0x20(%rsp),%rdi ffffffff804c5914: 0 31 c9 xor %ecx,%ecx ffffffff804c5916: 0 48 89 ea mov %rbp,%rdx ffffffff804c5919: 0 ff 50 08 callq *0x8(%rax) ffffffff804c591c: 0 48 8b 85 68 03 00 00 mov 0x368(%rbp),%rax ffffffff804c5923: 2344 41 8b 76 68 mov 0x68(%r14),%esi ffffffff804c5927: 1 4c 89 f2 mov %r14,%rdx ffffffff804c592a: 0 48 89 ef mov %rbp,%rdi ffffffff804c592d: 486 ff 50 08 callq *0x8(%rax) ffffffff804c5930: 44 48 8b 0c 24 mov (%rsp),%rcx ffffffff804c5934: 836 f6 41 24 10 testb $0x10,0x24(%rcx) ffffffff804c5938: 0 74 4f je ffffffff804c5989 <tcp_transmit_skb+0x47b> ffffffff804c593a: 75 41 8b 96 c8 00 00 00 mov 0xc8(%r14),%edx ffffffff804c5941: 8600 49 8b 86 d0 00 00 00 mov 0xd0(%r14),%rax ffffffff804c5948: 1667 8b 44 10 08 mov 0x8(%rax,%rdx,1),%eax 
ffffffff804c594c: 13 8a 95 81 03 00 00 mov 0x381(%rbp),%dl ffffffff804c5952: 24 84 d2 test %dl,%dl ffffffff804c5954: 429 74 25 je ffffffff804c597b <tcp_transmit_skb+0x46d> ffffffff804c5956: 0 0f b7 c8 movzwl %ax,%ecx ffffffff804c5959: 3 0f b6 c2 movzbl %dl,%eax ffffffff804c595c: 0 39 c1 cmp %eax,%ecx ffffffff804c595e: 0 72 13 jb ffffffff804c5973 <tcp_transmit_skb+0x465> ffffffff804c5960: 0 c6 85 81 03 00 00 00 movb $0x0,0x381(%rbp) ffffffff804c5967: 1 c7 85 84 03 00 00 0a movl $0xa,0x384(%rbp) ffffffff804c596e: 0 00 00 00 ffffffff804c5971: 0 eb 08 jmp ffffffff804c597b <tcp_transmit_skb+0x46d> ffffffff804c5973: 1 28 ca sub %cl,%dl ffffffff804c5975: 0 88 95 81 03 00 00 mov %dl,0x381(%rbp) ffffffff804c597b: 11 c6 85 80 03 00 00 00 movb $0x0,0x380(%rbp) ffffffff804c5982: 4553 c6 85 83 03 00 00 00 movb $0x0,0x383(%rbp) ffffffff804c5989: 714 45 39 6e 68 cmp %r13d,0x68(%r14) ffffffff804c598d: 1 0f 84 e2 00 00 00 je ffffffff804c5a75 <tcp_transmit_skb+0x567> ffffffff804c5993: 288 83 3d e6 2a 3f 00 00 cmpl $0x0,0x3f2ae6(%rip) # ffffffff808b8480 <sysctl_tcp_slow_start_after_idle> ffffffff804c599a: 247 48 8b 05 df 3e 3f 00 mov 0x3f3edf(%rip),%rax # ffffffff808b9880 <jiffies> ffffffff804c59a1: 711 41 89 c7 mov %eax,%r15d ffffffff804c59a4: 0 0f 84 ad 00 00 00 je ffffffff804c5a57 <tcp_transmit_skb+0x549> ffffffff804c59aa: 159 83 bd 74 04 00 00 00 cmpl $0x0,0x474(%rbp) ffffffff804c59b1: 311 0f 85 a0 00 00 00 jne ffffffff804c5a57 <tcp_transmit_skb+0x549> ffffffff804c59b7: 0 44 8b ad 0c 04 00 00 mov 0x40c(%rbp),%r13d ffffffff804c59be: 183 44 29 e8 sub %r13d,%eax ffffffff804c59c1: 475 3b 85 58 03 00 00 cmp 0x358(%rbp),%eax ffffffff804c59c7: 54 0f 86 8a 00 00 00 jbe ffffffff804c5a57 <tcp_transmit_skb+0x549> ffffffff804c59cd: 0 48 8b 75 78 mov 0x78(%rbp),%rsi ffffffff804c59d1: 1 48 8b 05 a8 3e 3f 00 mov 0x3f3ea8(%rip),%rax # ffffffff808b9880 <jiffies> ffffffff804c59d8: 0 48 89 ef mov %rbp,%rdi ffffffff804c59db: 0 48 89 44 24 08 mov %rax,0x8(%rsp) ffffffff804c59e0: 0 e8 9c 92 ff ff 
callq  ffffffff804bec81 <tcp_init_cwnd>
ffffffff804c59e5:    0  be 01 00 00 00        mov    $0x1,%esi
ffffffff804c59ea:    0  48 89 ef              mov    %rbp,%rdi
ffffffff804c59ed:    0  41 89 c4              mov    %eax,%r12d
ffffffff804c59f0:    0  8b 9d ac 04 00 00     mov    0x4ac(%rbp),%ebx
ffffffff804c59f6:    0  e8 5e f0 ff ff        callq  ffffffff804c4a59 <tcp_ca_event>
ffffffff804c59fb:    0  48 89 ef              mov    %rbp,%rdi
ffffffff804c59fe:    0  e8 6d f0 ff ff        callq  ffffffff804c4a70 <tcp_current_ssthresh>
ffffffff804c5a03:    0  89 85 a8 04 00 00     mov    %eax,0x4a8(%rbp)
ffffffff804c5a09:    4  8b 85 58 03 00 00     mov    0x358(%rbp),%eax
ffffffff804c5a0f:    0  41 39 dc              cmp    %ebx,%r12d
ffffffff804c5a12:    0  8b 54 24 08           mov    0x8(%rsp),%edx
ffffffff804c5a16:    0  89 d9                 mov    %ebx,%ecx
ffffffff804c5a18:    0  41 0f 46 cc           cmovbe %r12d,%ecx
ffffffff804c5a1c:    0  89 c6                 mov    %eax,%esi
ffffffff804c5a1e:    0  44 29 ea              sub    %r13d,%edx
ffffffff804c5a21:    0  f7 de                 neg    %esi
ffffffff804c5a23:    0  29 c2                 sub    %eax,%edx
ffffffff804c5a25:    0  89 d8                 mov    %ebx,%eax
ffffffff804c5a27:    0  eb 02                 jmp    ffffffff804c5a2b <tcp_transmit_skb+0x51d>
ffffffff804c5a29:    0  d1 e8                 shr    %eax
ffffffff804c5a2b:    0  85 d2                 test   %edx,%edx
ffffffff804c5a2d:    1  7e 06                 jle    ffffffff804c5a35 <tcp_transmit_skb+0x527>
ffffffff804c5a2f:    0  01 f2                 add    %esi,%edx
ffffffff804c5a31:    0  39 c8                 cmp    %ecx,%eax
ffffffff804c5a33:    0  77 f4                 ja     ffffffff804c5a29 <tcp_transmit_skb+0x51b>
ffffffff804c5a35:    0  39 c8                 cmp    %ecx,%eax
ffffffff804c5a37:    1  0f 43 c8              cmovae %eax,%ecx
ffffffff804c5a3a:    0  89 8d ac 04 00 00     mov    %ecx,0x4ac(%rbp)
ffffffff804c5a40:    0  48 8b 05 39 3e 3f 00  mov    0x3f3e39(%rip),%rax   # ffffffff808b9880 <jiffies>
ffffffff804c5a47:    0  c7 85 b8 04 00 00 00  movl   $0x0,0x4b8(%rbp)
ffffffff804c5a4e:    0  00 00 00
ffffffff804c5a51:    0  89 85 bc 04 00 00     mov    %eax,0x4bc(%rbp)
ffffffff804c5a57:  173  44 89 bd 0c 04 00 00  mov    %r15d,0x40c(%rbp)
ffffffff804c5a5e: 5224  44 2b bd 90 03 00 00  sub    0x390(%rbp),%r15d
ffffffff804c5a65:  478  44 3b bd 84 03 00 00  cmp    0x384(%rbp),%r15d
ffffffff804c5a6c:    0  73 07                 jae    ffffffff804c5a75 <tcp_transmit_skb+0x567>
ffffffff804c5a6e:   38  c6 85 82 03 00 00 01  movb   $0x1,0x382(%rbp)
ffffffff804c5a75:  452  48 8b 14 24           mov    (%rsp),%rdx
ffffffff804c5a79:  312  8b 42 1c              mov    0x1c(%rdx),%eax
ffffffff804c5a7c:   33  39 85 fc 03 00 00     cmp    %eax,0x3fc(%rbp)
ffffffff804c5a82: 4768  78 05                 js     ffffffff804c5a89 <tcp_transmit_skb+0x57b>
ffffffff804c5a84:    0  39 42 18              cmp    %eax,0x18(%rdx)
ffffffff804c5a87:   20  75 37                 jne    ffffffff804c5ac0 <tcp_transmit_skb+0x5b2>
ffffffff804c5a89:   30  65 48 8b 04 25 10 00  mov    %gs:0x10,%rax
ffffffff804c5a90:    0  00 00
ffffffff804c5a92: 1059  8b 80 48 e0 ff ff     mov    -0x1fb8(%rax),%eax
ffffffff804c5a98:   21  65 8b 14 25 24 00 00  mov    %gs:0x24,%edx
ffffffff804c5a9f:    0  00
ffffffff804c5aa0:   14  89 d2                 mov    %edx,%edx
ffffffff804c5aa2:  471  30 c0                 xor    %al,%al
ffffffff804c5aa4:    3  66 83 f8 01           cmp    $0x1,%ax
ffffffff804c5aa8:   21  48 19 c0              sbb    %rax,%rax
ffffffff804c5aab:  433  83 e0 08              and    $0x8,%eax
ffffffff804c5aae:    2  48 8b 80 98 16 ab 80  mov    -0x7f54e968(%rax),%rax
ffffffff804c5ab5:   16  48 f7 d0              not    %rax
ffffffff804c5ab8:  457  48 8b 04 d0           mov    (%rax,%rdx,8),%rax
ffffffff804c5abc:    3  48 ff 40 58           incq   0x58(%rax)
ffffffff804c5ac0:   20  48 8b 85 68 03 00 00  mov    0x368(%rbp),%rax
ffffffff804c5ac7:  424  31 f6                 xor    %esi,%esi
ffffffff804c5ac9:    2  4c 89 f7              mov    %r14,%rdi
ffffffff804c5acc:   20  ff 10                 callq  *(%rax)
ffffffff804c5ace:    0  85 c0                 test   %eax,%eax
ffffffff804c5ad0: 9596  89 c3                 mov    %eax,%ebx
ffffffff804c5ad2:    0  7e 18                 jle    ffffffff804c5aec <tcp_transmit_skb+0x5de>
ffffffff804c5ad4:    0  be 01 00 00 00        mov    $0x1,%esi
ffffffff804c5ad9:    0  48 89 ef              mov    %rbp,%rdi
ffffffff804c5adc:    0  e8 d9 91 ff ff        callq  ffffffff804becba <tcp_enter_cwr>
ffffffff804c5ae1:    0  83 fb 02              cmp    $0x2,%ebx
ffffffff804c5ae4:    0  b8 00 00 00 00        mov    $0x0,%eax
ffffffff804c5ae9:    0  0f 44 d8              cmove  %eax,%ebx
ffffffff804c5aec:  457  48 83 c4 38           add    $0x38,%rsp
ffffffff804c5af0: 1473  89 d8                 mov    %ebx,%eax
ffffffff804c5af2:    0  5b                    pop    %rbx
ffffffff804c5af3:  480  5d                    pop    %rbp
ffffffff804c5af4:    0  41 5c                 pop    %r12
ffffffff804c5af6:    0  41 5d                 pop    %r13
ffffffff804c5af8:  449  41 5e                 pop    %r14
ffffffff804c5afa:    0  41 5f                 pop    %r15
ffffffff804c5afc:    0  c3                    retq

looks like spread-out overhead with no particular bad spike. Just called a lot.

	Ingo

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
2008-11-17 18:49 ` Ingo Molnar
` (13 preceding siblings ...)
2008-11-17 22:14 ` tcp_transmit_skb() - " Ingo Molnar
@ 2008-11-17 22:19 ` Ingo Molnar
14 siblings, 0 replies; 191+ messages in thread
From: Ingo Molnar @ 2008-11-17 22:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, David Miller, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, Stephen Hemminger

* Ingo Molnar <mingo@elte.hu> wrote:

> 100.000000 total
> ................
> 1.385125 tcp_sendmsg

this too is spread out, no spikes i noticed. The subsequent functions
seem to be spread out pretty evenly, with no particular spikes visible.

	Ingo

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
2008-11-17 17:08 ` Ingo Molnar
2008-11-17 17:25 ` Ingo Molnar
@ 2008-11-17 19:36 ` David Miller
1 sibling, 0 replies; 191+ messages in thread
From: David Miller @ 2008-11-17 19:36 UTC (permalink / raw)
To: mingo
Cc: dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, torvalds, shemminger

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 18:08:44 +0100

> Mike Galbraith has been spending months trying to pin down all the
> issues.

Yes, Mike has been doing tireless good work.

Another thing I noticed is that because all of the scheduler core
operations are now function pointer callbacks, the call chain is deeper
for core operations like wake_up(). Much of it used to be completely
inlined into try_to_wake_up().

With the addition of the RB tree stuff, that adds yet another
unavoidable depth of function call.

wake_up() is usually at the deepest part of the call chain, so this is
a big deal.

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
2008-11-17 16:11 ` Ingo Molnar
2008-11-17 16:35 ` Eric Dumazet
@ 2008-11-17 19:31 ` David Miller
2008-11-17 19:47 ` Linus Torvalds
2008-11-17 22:47 ` Ingo Molnar
1 sibling, 2 replies; 191+ messages in thread
From: David Miller @ 2008-11-17 19:31 UTC (permalink / raw)
To: mingo
Cc: dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, torvalds

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 17 Nov 2008 17:11:35 +0100

> Ouch, +4% from a oneliner networking change? That's a _huge_ speedup
> compared to the things we were after in scheduler land.

The scheduler has accounted for at least 10% of the tbench
regressions at this point, what are you talking about?

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
2008-11-17 19:31 ` David Miller
@ 2008-11-17 19:47 ` Linus Torvalds
2008-11-17 19:51 ` David Miller
2008-11-17 19:53 ` Ingo Molnar
2008-11-17 22:47 ` Ingo Molnar
1 sibling, 2 replies; 191+ messages in thread
From: Linus Torvalds @ 2008-11-17 19:47 UTC (permalink / raw)
To: David Miller
Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra

On Mon, 17 Nov 2008, David Miller wrote:
>
> The scheduler has accounted for at least 10% of the tbench
> regressions at this point, what are you talking about?

I'm wondering if you're not looking at totally different issues.

For example, if I recall correctly, David had a big hit on the hrtimers.
And I wonder if perhaps Ingo's numbers are without hrtimers or something?

The other possibility is that it's just a sparc suckiness issue, that
simply doesn't show up on x86.

	Linus

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 19:47 ` Linus Torvalds @ 2008-11-17 19:51 ` David Miller 2008-11-17 19:53 ` Ingo Molnar 1 sibling, 0 replies; 191+ messages in thread From: David Miller @ 2008-11-17 19:51 UTC (permalink / raw) To: torvalds Cc: mingo, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 17 Nov 2008 11:47:24 -0800 (PST) > For example, if I recall correctly, David had a big hit on the hrtimers. That got fixed, the HRTIMER bits are now disabled. > The other possibility is that it's just a sparc suckiness issue, that > simply doesn't show up on x86. Could be and I intend to measure that to find out. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 19:47 ` Linus Torvalds 2008-11-17 19:51 ` David Miller @ 2008-11-17 19:53 ` Ingo Molnar 1 sibling, 0 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-17 19:53 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 17 Nov 2008, David Miller wrote: > > > > The scheduler has accounted for at least %10 of the tbench > > regressions at this point, what are you talking about? > > I'm wondering if you're not looking at totally different issues. > > For example, if I recall correctly, David had a big hit on the > hrtimers. And I wonder if perhaps Ingo's numbers are without > hrtimers or something? hrtimers should not be an issue anymore since this commit: | commit 0c4b83da58ec2e96ce9c44c211d6eac5f9dae478 | Author: Ingo Molnar <mingo@elte.hu> | Date: Mon Oct 20 14:27:43 2008 +0200 | | sched: disable the hrtick for now | | David Miller reported that hrtick update overhead has tripled the | wakeup overhead on Sparc64. | | That is too much - disable the HRTICK feature for now by default, | until a faster implementation is found. | | Reported-by: David Miller <davem@davemloft.net> | Acked-by: Peter Zijlstra <peterz@infradead.org> | Signed-off-by: Ingo Molnar <mingo@elte.hu> Which was included in v2.6.28-rc1 already. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
2008-11-17 19:31 ` David Miller
2008-11-17 19:47 ` Linus Torvalds
@ 2008-11-17 22:47 ` Ingo Molnar
1 sibling, 0 replies; 191+ messages in thread
From: Ingo Molnar @ 2008-11-17 22:47 UTC (permalink / raw)
To: David Miller
Cc: dada1, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, torvalds

* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 17 Nov 2008 17:11:35 +0100
>
> > Ouch, +4% from a oneliner networking change? That's a _huge_ speedup
> > compared to the things we were after in scheduler land.
>
> The scheduler has accounted for at least 10% of the tbench
> regressions at this point, what are you talking about?

yeah, you are probably right when it comes to task migration policy
impact - that can have effects in that range. (and that, you have to
accept, is a fundamentally hard and fragile job to get right, as it
involves observing the past and predicting the future out of it - at
1.3 million events per second)

So above i was just talking about straight scheduling code overhead.
(that cannot have been +10% of the total - as the whole scheduler only
takes 7% total - TLB flush and FPU restore overhead included. Even the
hrtimer bits were about 1% of the total.)

	Ingo

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 11:01 ` Ingo Molnar 2008-11-17 11:20 ` Eric Dumazet @ 2008-11-17 19:21 ` David Miller 2008-11-17 19:48 ` Linus Torvalds 1 sibling, 1 reply; 191+ messages in thread From: David Miller @ 2008-11-17 19:21 UTC (permalink / raw) To: mingo Cc: rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra, torvalds From: Ingo Molnar <mingo@elte.hu> Date: Mon, 17 Nov 2008 12:01:19 +0100 > The scheduler's overhead barely even registers on a 16-way x86 system > i'm running tbench on. Here's the NMI profile during 64 threads tbench > on a 16-way x86 box with an v2.6.28-rc5 kernel [config attached]: Try a non-NMI profile. It's the whole of the try_to_wake_up() path that's the problem. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 19:21 ` David Miller @ 2008-11-17 19:48 ` Linus Torvalds 2008-11-17 19:52 ` David Miller 0 siblings, 1 reply; 191+ messages in thread From: Linus Torvalds @ 2008-11-17 19:48 UTC (permalink / raw) To: David Miller Cc: mingo, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra On Mon, 17 Nov 2008, David Miller wrote: > From: Ingo Molnar <mingo@elte.hu> > Date: Mon, 17 Nov 2008 12:01:19 +0100 > > > The scheduler's overhead barely even registers on a 16-way x86 system > > i'm running tbench on. Here's the NMI profile during 64 threads tbench > > on a 16-way x86 box with an v2.6.28-rc5 kernel [config attached]: > > Try a non-NMI profile. > > It's the whole of the try_to_wake_up() path that's the problem. David, that makes no sense. A NMI profile is going to be a _lot_ more accurate than a non-NMI one. Asking somebody to do a clearly inferior profile to get "better numbers" is insane. We've asked _you_ to do NMI profiling, it shouldn't be the other way around. Linus ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 19:48 ` Linus Torvalds @ 2008-11-17 19:52 ` David Miller 2008-11-17 19:57 ` Linus Torvalds 0 siblings, 1 reply; 191+ messages in thread From: David Miller @ 2008-11-17 19:52 UTC (permalink / raw) To: torvalds Cc: mingo, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 17 Nov 2008 11:48:33 -0800 (PST) > We've asked _you_ to do NMI profiling, it shouldn't be the other way > around. I wasn't able to on these systems, so instead I did cycle level evaluation of the parts that have to run with interrupts disabled. And as a result I found that wake_up() is now 4 times slower than it was in 2.6.22, I even analyzed this for every single kernel release till now. It could be a sparc specific issue, because the call chain is deeper and we eat a lot more register window spills onto the stack. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 19:52 ` David Miller @ 2008-11-17 19:57 ` Linus Torvalds 2008-11-17 20:18 ` David Miller 0 siblings, 1 reply; 191+ messages in thread From: Linus Torvalds @ 2008-11-17 19:57 UTC (permalink / raw) To: David Miller Cc: mingo, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra On Mon, 17 Nov 2008, David Miller wrote: > > And as a result I found that wake_up() is now 4 times slower than it > was in 2.6.22, I even analyzed this for every single kernel release > till now. ..and that's the one where you then pointed to hrtimers, and now you claim that was fixed? At least I haven't seen any new analysis since then. > It could be a sparc specific issue, because the call chain is deeper > and we eat a lot more register window spills onto the stack. Oh, easily. In-order machines tend to have serious problems with indirect function calls in particular. x86, in contrast, tends to not even notice, especially if the indirect function is fairly static per call-site, and predicts well. There is a reason my machine is 15-20 times faster than yours. Linus ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
2008-11-17 19:57 ` Linus Torvalds
@ 2008-11-17 20:18 ` David Miller
0 siblings, 0 replies; 191+ messages in thread
From: David Miller @ 2008-11-17 20:18 UTC (permalink / raw)
To: torvalds
Cc: mingo, rjw, linux-kernel, kernel-testers, cl, efault, a.p.zijlstra

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 17 Nov 2008 11:57:55 -0800 (PST)

> On Mon, 17 Nov 2008, David Miller wrote:
> > And as a result I found that wake_up() is now 4 times slower than it
> > was in 2.6.22, I even analyzed this for every single kernel release
> > till now.
>
> ..and that's the one where you then pointed to hrtimers, and now you claim
> that was fixed?

That was a huge increase going from 2.6.26 to 2.6.27, and has been
fixed. The rest of the gradual release-to-release cost increase,
however, remains.

> At least I haven't seen any new analysis since then.

I will find time to make it after I get back from Portland.

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-17 9:06 ` Ingo Molnar 2008-11-17 9:14 ` David Miller @ 2008-11-19 19:43 ` Christoph Lameter 2008-11-19 20:14 ` Ingo Molnar 2008-11-20 23:52 ` Christoph Lameter 1 sibling, 2 replies; 191+ messages in thread From: Christoph Lameter @ 2008-11-19 19:43 UTC (permalink / raw) To: Ingo Molnar Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Kernel Testers List, Mike Galbraith, Peter Zijlstra On Mon, 17 Nov 2008, Ingo Molnar wrote: > Christoph, as per the recent analysis of Mike: > > http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html > > all scheduler components of this regression have been eliminated. > > In fact his numbers show that scheduler speedups since 2.6.22 have > offset and hidden most other sources of tbench regression. (i.e. the > scheduler portion got 5% faster, hence it was able to offset a > slowdown of 5% in other areas of the kernel that tbench triggers) Ok will rerun the tests tomorrow. Just got back from SC08 need some time to catch up. Looks like a lot of work was done on this issue. Thanks! ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-19 19:43 ` Christoph Lameter @ 2008-11-19 20:14 ` Ingo Molnar 2008-11-20 23:52 ` Christoph Lameter 1 sibling, 0 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-19 20:14 UTC (permalink / raw) To: Christoph Lameter Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Kernel Testers List, Mike Galbraith, Peter Zijlstra * Christoph Lameter <cl@linux-foundation.org> wrote: > On Mon, 17 Nov 2008, Ingo Molnar wrote: > > > Christoph, as per the recent analysis of Mike: > > > > http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html > > > > all scheduler components of this regression have been eliminated. > > > > In fact his numbers show that scheduler speedups since 2.6.22 have > > offset and hidden most other sources of tbench regression. (i.e. the > > scheduler portion got 5% faster, hence it was able to offset a > > slowdown of 5% in other areas of the kernel that tbench triggers) > > Ok will rerun the tests tomorrow. Just got back from SC08 need some > time to catch up. > > Looks like a lot of work was done on this issue. Thanks! You might also want to try net-next: [remote "net-next"] url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git fetch = +refs/heads/*:refs/remotes/net-next/* Some good stuff is in there too, impacting this workload. Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-19 19:43 ` Christoph Lameter 2008-11-19 20:14 ` Ingo Molnar @ 2008-11-20 23:52 ` Christoph Lameter 2008-11-21 8:30 ` Ingo Molnar 1 sibling, 1 reply; 191+ messages in thread From: Christoph Lameter @ 2008-11-20 23:52 UTC (permalink / raw) To: Ingo Molnar Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Kernel Testers List, Mike Galbraith, Peter Zijlstra hmmm... Well we are almost there. 2.6.22: Throughput 2526.15 MB/sec 8 procs 2.6.28-rc5: Throughput 2486.2 MB/sec 8 procs 8p Dell 1950 and the number of processors specified on the tbench command line. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-20 23:52 ` Christoph Lameter @ 2008-11-21 8:30 ` Ingo Molnar 2008-11-21 8:51 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-21 8:30 UTC (permalink / raw) To: Christoph Lameter Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Kernel Testers List, Mike Galbraith, Peter Zijlstra, David S. Miller * Christoph Lameter <cl@linux-foundation.org> wrote: > hmmm... Well we are almost there. > > 2.6.22: > > Throughput 2526.15 MB/sec 8 procs > > 2.6.28-rc5: > > Throughput 2486.2 MB/sec 8 procs > > 8p Dell 1950 and the number of processors specified on the tbench > command line. And with net-next we might even be able to get past that magic limit? net-next is linus-latest plus the latest and greatest networking bits: $ cat .git/config [remote "net-next"] url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git fetch = +refs/heads/*:refs/remotes/net-next/* ... so might be worth a test. Just to satisfy our curiosity and to possibly close the entry :-) Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
2008-11-21 8:30 ` Ingo Molnar
@ 2008-11-21 8:51 ` Eric Dumazet
2008-11-21 9:05 ` David Miller
2008-11-21 9:18 ` [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 Ingo Molnar
2008-11-21 9:03 ` David Miller
2008-11-21 16:11 ` Christoph Lameter
2 siblings, 2 replies; 191+ messages in thread
From: Eric Dumazet @ 2008-11-21 8:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Christoph Lameter, Rafael J. Wysocki, Linux Kernel Mailing List, Kernel Testers List, Mike Galbraith, Peter Zijlstra, David S. Miller

Ingo Molnar wrote:

> * Christoph Lameter <cl@linux-foundation.org> wrote:
>
>> hmmm... Well we are almost there.
>>
>> 2.6.22:
>>
>> Throughput 2526.15 MB/sec 8 procs
>>
>> 2.6.28-rc5:
>>
>> Throughput 2486.2 MB/sec 8 procs
>>
>> 8p Dell 1950 and the number of processors specified on the tbench
>> command line.
>
> And with net-next we might even be able to get past that magic limit?
> net-next is linus-latest plus the latest and greatest networking bits:
>
> $ cat .git/config
>
> [remote "net-next"]
>	url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>	fetch = +refs/heads/*:refs/remotes/net-next/*
>
> ... so might be worth a test. Just to satisfy our curiosity and to
> possibly close the entry :-)

Well, bits in net-next are new stuff for 2.6.29, not really regression
fixes, but yes, they should give nice tbench speedups.

Now, I wish sockets and pipes did not go through the dcache - not a
tbench affair of course, but real workloads...

running 8 processes on a 8 way machine doing a

for (;;)
	close(socket(AF_INET, SOCK_STREAM, 0));

is slow as hell, we hit so many contended cache lines ...

ticket spin locks are slower in this case (dcache_lock for example
is taken twice when we allocate a socket(), once in d_alloc(), another
one in d_instantiate())

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-21 8:51 ` Eric Dumazet @ 2008-11-21 9:05 ` David Miller 2008-11-21 12:51 ` Eric Dumazet 2008-11-21 9:18 ` [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 Ingo Molnar 1 sibling, 1 reply; 191+ messages in thread From: David Miller @ 2008-11-21 9:05 UTC (permalink / raw) To: dada1; +Cc: mingo, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra From: Eric Dumazet <dada1@cosmosbay.com> Date: Fri, 21 Nov 2008 09:51:32 +0100 > Now, I wish sockets and pipes not going through dcache, not tbench affair > of course but real workloads... > > running 8 processes on a 8 way machine doing a > > for (;;) > close(socket(AF_INET, SOCK_STREAM, 0)); > > is slow as hell, we hit so many contended cache lines ... > > ticket spin locks are slower in this case (dcache_lock for example > is taken twice when we allocate a socket(), once in d_alloc(), another one > in d_instantiate()) As you of course know, this used to be a ton worse. At least now these things are unhashed. :) ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28
2008-11-21 9:05 ` David Miller
@ 2008-11-21 12:51 ` Eric Dumazet
2008-11-21 15:13 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Eric Dumazet
0 siblings, 1 reply; 191+ messages in thread
From: Eric Dumazet @ 2008-11-21 12:51 UTC (permalink / raw)
To: David Miller
Cc: mingo, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra

David Miller wrote:

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 21 Nov 2008 09:51:32 +0100
>
>> Now, I wish sockets and pipes not going through dcache, not tbench affair
>> of course but real workloads...
>>
>> running 8 processes on a 8 way machine doing a
>>
>> for (;;)
>>	close(socket(AF_INET, SOCK_STREAM, 0));
>>
>> is slow as hell, we hit so many contended cache lines ...
>>
>> ticket spin locks are slower in this case (dcache_lock for example
>> is taken twice when we allocate a socket(), once in d_alloc(), another one
>> in d_instantiate())
>
> As you of course know, this used to be a ton worse. At least now
> these things are unhashed. :)

Well, this is dust compared to what we currently have.

To allocate a socket we:

0) Do the usual file manipulation (pretty scalable these days)
   (but the recent drop_file_write_access() and co slow it down a bit)

1) allocate an inode with new_inode()
   This function:
   - locks inode_lock,
   - dirties the nr_inodes counter
   - dirties the inode_in_use list (for sockets, I doubt it is useful)
   - dirties the superblock s_inodes list
   - dirties the last_ino counter
   All these are in different cache lines of course.

2) allocate a dentry
   d_alloc() takes dcache_lock,
   inserts the dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
   and dirties nr_dentry

3) d_instantiate() the dentry (dcache_lock taken again)

4) init_file() -> atomic_inc on sock_mnt->refcount (in case we want to
   umount this vfs ...)

At close() time, we must undo all of this. It's even more expensive because
of the _atomic_dec_and_lock() that stresses a lot, and because of the two
cache lines that are touched when an element is deleted from a list.

for (i = 0; i < 1000*1000; i++)
	close(socket(AF_INET, SOCK_STREAM, 0));

Cost if run on one cpu:

real	0m1.561s
user	0m0.092s
sys	0m1.469s

If run on 8 CPUS:

real	0m27.496s
user	0m0.657s
sys	3m39.092s

CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100 000
samples  cum. samples  %        cum. %     symbol name
164211   164211        10.9678  10.9678    init_file
155663   319874        10.3969  21.3647    d_alloc
147596   467470         9.8581  31.2228    _atomic_dec_and_lock
92993    560463         6.2111  37.4339    inet_create
73495    633958         4.9088  42.3427    kmem_cache_alloc
46353    680311         3.0960  45.4387    dentry_iput
46042    726353         3.0752  48.5139    tcp_close
42784    769137         2.8576  51.3715    kmem_cache_free
37074    806211         2.4762  53.8477    wake_up_inode
36375    842586         2.4295  56.2772    tcp_v4_init_sock
35212    877798         2.3518  58.6291    inotify_d_instantiate
33199    910997         2.2174  60.8465    sysenter_past_esp
31161    942158         2.0813  62.9277    d_instantiate
31000    973158         2.0705  64.9983    generic_forget_inode
28020    1001178        1.8715  66.8698    vfs_dq_drop
19007    1020185        1.2695  68.1393    __copy_from_user_ll
17513    1037698        1.1697  69.3090    new_inode
16957    1054655        1.1326  70.4415    __init_timer
16897    1071552        1.1286  71.5701    discard_slab
16115    1087667        1.0763  72.6464    d_kill
15542    1103209        1.0381  73.6845    __percpu_counter_add
13562    1116771        0.9058  74.5903    __slab_free
13276    1130047        0.8867  75.4771    __fput
12423    1142470        0.8297  76.3068    new_slab
11976    1154446        0.7999  77.1067    tcp_v4_destroy_sock
10889    1165335        0.7273  77.8340    inet_csk_destroy_sock
10516    1175851        0.7024  78.5364    alloc_inode
9979     1185830        0.6665  79.2029    sock_attach_fd
7980     1193810        0.5330  79.7359    drop_file_write_access
7609     1201419        0.5082  80.2441    alloc_fd
7584     1209003        0.5065  80.7506    sock_init_data
7164     1216167        0.4785  81.2291    add_partial
7107     1223274        0.4747  81.7038    sys_close
6997     1230271        0.4673  82.1711    mwait_idle

^ permalink raw reply	[flat|nested] 191+ messages in thread
* [PATCH] fs: pipe/sockets/anon dentries should not have a parent
2008-11-21 12:51 ` Eric Dumazet
@ 2008-11-21 15:13 ` Eric Dumazet
2008-11-21 15:21 ` Ingo Molnar
2008-11-21 15:36 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Christoph Hellwig
0 siblings, 2 replies; 191+ messages in thread
From: Eric Dumazet @ 2008-11-21 15:13 UTC (permalink / raw)
To: David Miller, mingo
Cc: cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, Linux Netdev List

[-- Attachment #1: Type: text/plain, Size: 5668 bytes --]

Eric Dumazet wrote:

> David Miller wrote:
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Fri, 21 Nov 2008 09:51:32 +0100
>>
>>> Now, I wish sockets and pipes not going through dcache, not tbench
>>> affair of course but real workloads...
>>>
>>> running 8 processes on a 8 way machine doing a
>>> for (;;)
>>>	close(socket(AF_INET, SOCK_STREAM, 0));
>>>
>>> is slow as hell, we hit so many contended cache lines ...
>>>
>>> ticket spin locks are slower in this case (dcache_lock for example
>>> is taken twice when we allocate a socket(), once in d_alloc(),
>>> another one in d_instantiate())
>>
>> As you of course know, this used to be a ton worse. At least now
>> these things are unhashed. :)
>
> Well, this is dust compared to what we currently have.
>
> To allocate a socket we:
> 0) Do the usual file manipulation (pretty scalable these days)
>    (but the recent drop_file_write_access() and co slow it down a bit)
> 1) allocate an inode with new_inode()
>    This function:
>    - locks inode_lock,
>    - dirties the nr_inodes counter
>    - dirties the inode_in_use list (for sockets, I doubt it is useful)
>    - dirties the superblock s_inodes list
>    - dirties the last_ino counter
>    All these are in different cache lines of course.
> 2) allocate a dentry
>    d_alloc() takes dcache_lock,
>    inserts the dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
>    and dirties nr_dentry
> 3) d_instantiate() the dentry (dcache_lock taken again)
> 4) init_file() -> atomic_inc on sock_mnt->refcount (in case we want to
>    umount this vfs ...)
>
> At close() time, we must undo all of this. It's even more expensive because
> of the _atomic_dec_and_lock() that stresses a lot, and because of the two
> cache lines that are touched when an element is deleted from a list.
>
> for (i = 0; i < 1000*1000; i++)
>	close(socket(AF_INET, SOCK_STREAM, 0));
>
> Cost if run on one cpu:
>
> real	0m1.561s
> user	0m0.092s
> sys	0m1.469s
>
> If run on 8 CPUS:
>
> real	0m27.496s
> user	0m0.657s
> sys	3m39.092s

[PATCH] fs: pipe/sockets/anon dentries should not have a parent

Linking pipe/sockets/anon dentries to one root 'parent' has no functional
impact at all, but a scalability one.

We can avoid touching a cache line at allocation stage (inside d_alloc(),
no need to touch root->d_count), but also at freeing time (in d_kill,
decrementing d_count). We avoid an expensive atomic_dec_and_lock() call
on the root dentry.

If we correct dnotify_parent() and inotify_d_instantiate() to take into
account a NULL d_parent, we can call d_alloc() with a NULL parent instead
of the root dentry.

Before patch, time to run 8 million close(socket()) calls on 8 CPUS was:

real	0m27.496s
user	0m0.657s
sys	3m39.092s

After patch:

real	0m23.997s
user	0m0.682s
sys	3m11.193s

Old oprofile:
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
164257   164257        11.0245  11.0245    init_file
155488   319745        10.4359  21.4604    d_alloc
151887   471632        10.1942  31.6547    _atomic_dec_and_lock
91620    563252         6.1493  37.8039    inet_create
74245    637497         4.9831  42.7871    kmem_cache_alloc
46702    684199         3.1345  45.9216    dentry_iput
46186    730385         3.0999  49.0215    tcp_close
42824    773209         2.8742  51.8957    kmem_cache_free
37275    810484         2.5018  54.3975    wake_up_inode
36553    847037         2.4533  56.8508    tcp_v4_init_sock
35661    882698         2.3935  59.2443    inotify_d_instantiate
32998    915696         2.2147  61.4590    sysenter_past_esp
31442    947138         2.1103  63.5693    d_instantiate
31303    978441         2.1010  65.6703    generic_forget_inode
27533    1005974        1.8479  67.5183    vfs_dq_drop
24237    1030211        1.6267  69.1450    sock_attach_fd
19290    1049501        1.2947  70.4397    __copy_from_user_ll

New oprofile:
CPU: Core 2, speed 3000.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
147287   147287        10.3984  10.3984    new_inode
144884   292171        10.2287  20.6271    inet_create
93670    385841         6.6131  27.2402    init_file
89852    475693         6.3435  33.5837    wake_up_inode
80910    556603         5.7122  39.2959    kmem_cache_alloc
53588    610191         3.7833  43.0792    _atomic_dec_and_lock
44341    654532         3.1305  46.2096    generic_forget_inode
38710    693242         2.7329  48.9425    kmem_cache_free
37605    730847         2.6549  51.5974    tcp_v4_init_sock
37228    768075         2.6283  54.2257    d_alloc
34085    802160         2.4064  56.6321    tcp_close
32550    834710         2.2980  58.9301    sysenter_past_esp
25931    860641         1.8307  60.7608    vfs_dq_drop
24458    885099         1.7267  62.4875    d_kill
22015    907114         1.5542  64.0418    dentry_iput
18877    925991         1.3327  65.3745    __copy_from_user_ll
17873    943864         1.2618  66.6363    mwait_idle

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c |    2 +-
 fs/dnotify.c     |    2 +-
 fs/inotify.c     |    2 +-
 fs/pipe.c        |    2 +-
 net/socket.c     |    2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

[-- Attachment #2: null_parent.patch --]
[-- Type: text/plain, Size: 2076 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..22cce87 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -92,7 +92,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc(NULL, &this);
 	if (!dentry)
 		goto err_put_unused_fd;
diff --git a/fs/dnotify.c b/fs/dnotify.c
index 676073b..66066a3 100644
--- a/fs/dnotify.c
+++ b/fs/dnotify.c
@@ -173,7 +173,7 @@ void dnotify_parent(struct dentry *dentry, unsigned long event)
 	spin_lock(&dentry->d_lock);
 	parent = dentry->d_parent;
-	if (parent->d_inode->i_dnotify_mask & event) {
+	if (parent && parent->d_inode->i_dnotify_mask & event) {
 		dget(parent);
 		spin_unlock(&dentry->d_lock);
 		__inode_dir_notify(parent->d_inode, event);
diff --git a/fs/inotify.c b/fs/inotify.c
index 7bbed1b..9f051bb 100644
--- a/fs/inotify.c
+++ b/fs/inotify.c
@@ -270,7 +270,7 @@ void inotify_d_instantiate(struct dentry *entry, struct inode *inode)
 	spin_lock(&entry->d_lock);
 	parent = entry->d_parent;
-	if (parent->d_inode && inotify_inode_watched(parent->d_inode))
+	if (parent && parent->d_inode && inotify_inode_watched(parent->d_inode))
 		entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
 	spin_unlock(&entry->d_lock);
 }
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4b961bc 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -926,7 +926,7 @@ struct file *create_write_pipe(int flags)
 		goto err;
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc(NULL, &name);
 	if (!dentry)
 		goto err_inode;
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b84de7d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -373,7 +373,7 @@ static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc(NULL, &name);
 	if (unlikely(!dentry))
 		return -ENOMEM;

^ permalink raw reply related	[flat|nested] 191+ messages in thread
* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent 2008-11-21 15:13 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Eric Dumazet @ 2008-11-21 15:21 ` Ingo Molnar 2008-11-21 15:28 ` Eric Dumazet 2008-11-21 15:36 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Christoph Hellwig 1 sibling, 1 reply; 191+ messages in thread From: Ingo Molnar @ 2008-11-21 15:21 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, Linux Netdev List * Eric Dumazet <dada1@cosmosbay.com> wrote: > Before patch, time to run 8 million close(socket()) calls on 8 > CPUS was : > > real 0m27.496s > user 0m0.657s > sys 3m39.092s > > After patch : > > real 0m23.997s > user 0m0.682s > sys 3m11.193s cool :-) What would it take to get it down to: >> Cost if run on one cpu : >> >> real 0m1.561s >> user 0m0.092s >> sys 0m1.469s i guess asking for a wall-clock cost of 1.561/8 would be too much? :) Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent 2008-11-21 15:21 ` Ingo Molnar @ 2008-11-21 15:28 ` Eric Dumazet 2008-11-21 15:34 ` Ingo Molnar 0 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-21 15:28 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, Linux Netdev List Ingo Molnar wrote: > * Eric Dumazet <dada1@cosmosbay.com> wrote: > >> Before patch, time to run 8 million close(socket()) calls on 8 >> CPUS was : >> >> real 0m27.496s >> user 0m0.657s >> sys 3m39.092s >> >> After patch : >> >> real 0m23.997s >> user 0m0.682s >> sys 3m11.193s > > cool :-) > > What would it take to get it down to: > >>> Cost if run on one cpu : >>> >>> real 0m1.561s >>> user 0m0.092s >>> sys 0m1.469s > > i guess asking for a wall-clock cost of 1.561/8 would be too much? :) > It might be possible, depending on the level of hackery I am allowed to inject in fs/dcache.c and fs/inode.c :) wall cost of 1.56 (each cpu runs one loop of one million iterations) ^ permalink raw reply [flat|nested] 191+ messages in thread
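The workload being timed in this exchange (each CPU running a loop of one million close(socket()) calls) is easy to reproduce. Below is a minimal user-space sketch of one CPU's loop; the function name is illustrative, since Eric's actual socket8 harness is not reproduced in the archive:

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* One CPU's share of the socket8 benchmark: allocate and immediately
 * release a TCP socket, exercising the file/dentry/inode fast paths.
 * Returns 0 on success, -1 if socket() ever fails. */
int run_socket_loop(long iterations)
{
    for (long i = 0; i < iterations; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        close(fd);
    }
    return 0;
}
```

The 8-CPU figures quoted above come from running eight such loops in parallel (one process per CPU) and timing the run with time(1).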
* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent 2008-11-21 15:28 ` Eric Dumazet @ 2008-11-21 15:34 ` Ingo Molnar 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet ` (5 more replies) 0 siblings, 6 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-21 15:34 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, Linux Netdev List * Eric Dumazet <dada1@cosmosbay.com> wrote: > Ingo Molnar wrote: >> * Eric Dumazet <dada1@cosmosbay.com> wrote: >> >>> Before patch, time to run 8 million close(socket()) calls on 8 >>> CPUS was : >>> >>> real 0m27.496s >>> user 0m0.657s >>> sys 3m39.092s >>> >>> After patch : >>> >>> real 0m23.997s >>> user 0m0.682s >>> sys 3m11.193s >> >> cool :-) >> >> What would it take to get it down to: >> >>>> Cost if run on one cpu : >>>> >>>> real 0m1.561s >>>> user 0m0.092s >>>> sys 0m1.469s >> >> i guess asking for a wall-clock cost of 1.561/8 would be too much? :) >> > > It might be possible, depending on the level of hackery I am allowed > to inject in fs/dcache.c and fs/inode.c :) I think being able to open+close sockets in a scalable way is an undisputed prime-time workload on Linux. The numbers you showed look horrible. Once you can show how much faster it could go via hacks, it should only be a matter of time to achieve that safely and cleanly. > wall cost of 1.56 (each cpu runs one loop of one million iterations) (indeed.) Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-21 15:34 ` Ingo Molnar @ 2008-11-26 23:27 ` Eric Dumazet 2008-11-27 1:37 ` Christoph Lameter ` (8 more replies) 2008-11-26 23:30 ` [PATCH 1/6] fs: Introduce a per_cpu nr_dentry Eric Dumazet ` (4 subsequent siblings) 5 siblings, 9 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-26 23:27 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig Hi all Short summary : Nice speedups for allocation/deallocation of sockets/pipes (From 27.5 seconds to 1.6 seconds) Long version : To allocate a socket or a pipe we : 0) Do the usual file table manipulation (pretty scalable these days, but it would be faster if 'struct files' were using SLAB_DESTROY_BY_RCU, avoiding the call_rcu() cache killer) 1) allocate an inode with new_inode() This function : - locks inode_lock, - dirties the nr_inodes counter - dirties the inode_in_use list (for sockets/pipes, this is useless) - dirties superblock s_inodes. - dirties the last_ino counter All these are in different cache lines, unfortunately. 2) allocate a dentry d_alloc() takes dcache_lock, inserts the dentry on its parent's list (dirtying sock_mnt->mnt_sb->s_root) and dirties nr_dentry 3) d_instantiate() the dentry (dcache_lock taken again) 4) init_file() -> atomic_inc() on sock_mnt->refcount At close() time, we must undo these things. It's even more expensive because of the _atomic_dec_and_lock() that stresses a lot, and because of the two cache lines that are touched when an element is deleted from a list (previous and next items) This is really bad, since sockets/pipes don't need to be visible in the dcache or in a per-super-block inode list. This patch series gets rid of all contended cache lines for sockets, pipes and anonymous fds (signalfd, timerfd, ...)
Sample program : for (i = 0; i < 1000000; i++) close(socket(AF_INET, SOCK_STREAM, 0)); Cost if one cpu runs the program : real 1.561s user 0.092s sys 1.469s Cost if 8 processes are launched on an 8 CPU machine (benchmark named socket8) : real 27.496s <<<< !!!! >>>> user 0.657s sys 3m39.092s Oprofile results (for the 8 process run, 3 times): CPU: Core 2, speed 3000.03 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. % symbol name 3347352 3347352 28.0232 28.0232 _atomic_dec_and_lock 3301428 6648780 27.6388 55.6620 d_instantiate 2971130 9619910 24.8736 80.5355 d_alloc 241318 9861228 2.0203 82.5558 init_file 146190 10007418 1.2239 83.7797 __slab_free 144149 10151567 1.2068 84.9864 inotify_d_instantiate 143971 10295538 1.2053 86.1917 inet_create 137168 10432706 1.1483 87.3401 new_inode 117549 10550255 0.9841 88.3242 add_partial 110795 10661050 0.9275 89.2517 generic_drop_inode 107137 10768187 0.8969 90.1486 kmem_cache_alloc 94029 10862216 0.7872 90.9358 tcp_close 82837 10945053 0.6935 91.6293 dput 67486 11012539 0.5650 92.1943 dentry_iput 57751 11070290 0.4835 92.6778 iput 54327 11124617 0.4548 93.1326 tcp_v4_init_sock 49921 11174538 0.4179 93.5505 sysenter_past_esp 47616 11222154 0.3986 93.9491 kmem_cache_free 30792 11252946 0.2578 94.2069 clear_inode 27540 11280486 0.2306 94.4375 copy_from_user 26509 11306995 0.2219 94.6594 init_timer 26363 11333358 0.2207 94.8801 discard_slab 25284 11358642 0.2117 95.0918 __fput 22482 11381124 0.1882 95.2800 __percpu_counter_add 20369 11401493 0.1705 95.4505 sock_alloc 18501 11419994 0.1549 95.6054 inet_csk_destroy_sock 17923 11437917 0.1500 95.7555 sys_close This patch series avoids all contended cache lines and makes this "bench" pretty fast. 
New cost if run on one cpu : real 1.325s (instead of 1.561s) user 0.091s sys 1.234s If run on 8 CPUS : real 2.229s <<<< instead of 27.496s >>> user 0.695s sys 16.903s Oprofile results (for the 8 process run, 3 times): CPU: Core 2, speed 2999.74 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. % symbol name 143791 143791 11.7849 11.7849 __slab_free 128404 272195 10.5238 22.3087 add_partial 99150 371345 8.1262 30.4349 kmem_cache_alloc 52031 423376 4.2644 34.6993 sysenter_past_esp 47752 471128 3.9137 38.6130 kmem_cache_free 47429 518557 3.8872 42.5002 tcp_close 34376 552933 2.8174 45.3176 __percpu_counter_add 29046 581979 2.3806 47.6982 copy_from_user 28249 610228 2.3152 50.0134 init_timer 26220 636448 2.1490 52.1624 __slab_alloc 23402 659850 1.9180 54.0803 discard_slab 20560 680410 1.6851 55.7654 __call_rcu 18288 698698 1.4989 57.2643 d_alloc 16425 715123 1.3462 58.6104 get_empty_filp 16237 731360 1.3308 59.9412 __fput 15729 747089 1.2891 61.2303 alloc_fd 15021 762110 1.2311 62.4614 alloc_inode 14690 776800 1.2040 63.6654 sys_close 14666 791466 1.2020 64.8674 inet_create 13638 805104 1.1178 65.9852 dput 12503 817607 1.0247 67.0099 iput_special 12231 829838 1.0024 68.0123 lock_sock_nested 12210 842048 1.0007 69.0130 fd_install 12137 854185 0.9947 70.0078 d_alloc_special 12058 866243 0.9883 70.9960 sock_init_data 11200 877443 0.9179 71.9140 release_sock 11114 888557 0.9109 72.8248 inotify_d_instantiate The last point is about SLUB being hit hard, unless we use slub_min_order=3 at boot, or we use Christoph Lameter patch (struct file RCU optimizations) http://thread.gmane.org/gmane.linux.kernel/418615 If we boot machine with slub_min_order=3, SLUB overhead disappears. 
New cost if run on one cpu : real 1.307s user 0.094s sys 1.214s If run on 8 CPUS : real 1.625s <<<< instead of 27.496s or 2.229s >>> user 0.771s sys 12.061s Oprofile results (for the 8 process run, 3 times): CPU: Core 2, speed 3000.05 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. % symbol name 108005 108005 11.0758 11.0758 kmem_cache_alloc 52023 160028 5.3349 16.4107 sysenter_past_esp 47363 207391 4.8570 21.2678 tcp_close 45430 252821 4.6588 25.9266 kmem_cache_free 36566 289387 3.7498 29.6764 __percpu_counter_add 36085 325472 3.7005 33.3769 __slab_free 29185 354657 2.9929 36.3698 copy_from_user 28210 382867 2.8929 39.2627 init_timer 25663 408530 2.6317 41.8944 d_alloc_special 22360 430890 2.2930 44.1874 cap_file_alloc_security 19237 450127 1.9727 46.1601 __call_rcu 19097 469224 1.9584 48.1185 d_alloc 16962 486186 1.7394 49.8580 alloc_fd 16315 502501 1.6731 51.5311 __fput 16102 518603 1.6512 53.1823 get_empty_filp 14954 533557 1.5335 54.7158 inet_create 14468 548025 1.4837 56.1995 alloc_inode 14198 562223 1.4560 57.6555 sys_close 13905 576128 1.4259 59.0814 dput 12262 588390 1.2575 60.3389 lock_sock_nested 12203 600593 1.2514 61.5903 sock_attach_fd 12147 612740 1.2457 62.8360 iput_special 12049 624789 1.2356 64.0716 fd_install 12033 636822 1.2340 65.3056 sock_init_data 11999 648821 1.2305 66.5361 release_sock 11231 660052 1.1517 67.6878 inotify_d_instantiate 11068 671120 1.1350 68.8228 inet_csk_destroy_sock This patch series contains 6 patches, against the net-next-2.6 tree (because this tree already contains network improvements on this subject, but it should apply to other trees) [PATCH 1/6] fs: Introduce a per_cpu nr_dentry Adding a per_cpu nr_dentry avoids cache line ping pongs between cpus to maintain this metric. We centralize decrements of nr_dentry in d_free(), and increments in d_alloc(). 
d_alloc() can avoid taking dcache_lock if parent is NULL [PATCH 2/6] fs: Introduce special dentries for pipes, socket, anon fd Sockets, pipes and anonymous fds have interesting properties. Like other files, they use a dentry and an inode. But dentries for these kinds of files are not hashed into the dcache, since there is no way someone can look up such a file in the vfs tree. (/proc/{pid}/fd/{number} uses a different mechanism) Still, allocating and freeing such dentries is expensive, because we currently take dcache_lock inside d_alloc(), d_instantiate(), and dput(). This lock is very contended on SMP machines. This patch defines a new DCACHE_SPECIAL flag, to mark a dentry as a special one (for sockets, pipes, anonymous fd), and a new d_alloc_special(const struct qstr *name, struct inode *inode) method, called by the three subsystems. Internally, dput() can take a fast path to dput_special() for special dentries. Differences between a special dentry and a normal one are : 1) A special dentry has the DCACHE_SPECIAL flag 2) A special dentry's parent is itself. This avoids taking a reference on the 'root' dentry, shared by too many dentries. 3) They are not hashed into the global hash table 4) Their d_alias list is empty Internally, dput() can avoid an expensive atomic_dec_and_lock() for special dentries. (socket8 bench result : from 27.5s to 25.5s) [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator new_inode() dirties a contended cache line to get inode numbers. Solve this problem by providing each cpu with a per_cpu variable, fed by the shared last_ino, but only once every 1024 allocations. This reduces contention on the shared last_ino. Note : the last_ino_get() method must be called with preemption disabled. 
(socket8 bench result : 25.5s to 25s, almost no difference, but this is because inode_lock cost is too heavy for the moment) [PATCH 4/6] fs: Introduce a per_cpu nr_inodes Avoids cache line ping pongs between cpus and prepares the next patch, because updates of nr_inodes don't need inode_lock anymore. (socket8 bench result : 25s to 20.5s) [PATCH 5/6] fs: Introduce special inodes The goal of this patch is to not touch inode_lock for socket/pipe/anonfd inode allocation/freeing. In new_inode(), we test if the super block has the MS_SPECIAL flag set. If so, we don't put the inode on the "inode_in_use" list nor the "sb->s_inodes" list. As inode_lock was taken only to protect these lists, we avoid it as well. Using iput_special() from dput_special() avoids taking inode_lock at freeing time. This patch has a very noticeable effect, because we avoid dirtying three contended cache lines in new_inode(), and five cache lines in iput() Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we really need a different flag. (socket8 bench result : from 20.5s to 2.94s) [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs This function sets a flag (MNT_SPECIAL) on the vfsmount, to avoid refcounting on permanent system mounts. Use this function for sockets, pipes, anonymous fds. (socket8 bench result : from 2.94s to 2.23s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- Overall diffstat : fs/anon_inodes.c | 19 +----- fs/dcache.c | 106 ++++++++++++++++++++++++++++++++------- fs/fs-writeback.c | 2 fs/inode.c | 101 +++++++++++++++++++++++++++++++------ fs/pipe.c | 28 +--------- fs/super.c | 9 +++ include/linux/dcache.h | 2 include/linux/fs.h | 8 ++ include/linux/mount.h | 5 + kernel/sysctl.c | 6 +- mm/page-writeback.c | 2 net/socket.c | 27 +-------- 12 files changed, 212 insertions(+), 103 deletions(-) ^ permalink raw reply [flat|nested] 191+ messages in thread
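Patches 1/6 and 4/6 rely on the same trick: give each CPU a private counter slot and fold it into the shared value only occasionally, so the shared cache line is rarely dirtied. A user-space sketch of that batching scheme (an array indexed by CPU id stands in for real per_cpu data, and the batch size is illustrative):

```c
#include <stdatomic.h>

#define NR_CPUS       8
#define COUNTER_BATCH 32  /* fold threshold, illustrative */

struct pc_counter {
    atomic_long shared;   /* formerly contended line, now rarely touched */
    long local[NR_CPUS];  /* one cache-hot slot per CPU */
};

/* Add delta to this CPU's slot; spill into the shared counter
 * only when the local value grows beyond the batch threshold. */
void pc_add(struct pc_counter *c, int cpu, long delta)
{
    long v = c->local[cpu] + delta;

    if (v >= COUNTER_BATCH || v <= -COUNTER_BATCH) {
        atomic_fetch_add(&c->shared, v);
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;
    }
}

/* Approximate read: shared value plus all local slots. */
long pc_read(struct pc_counter *c)
{
    long sum = atomic_load(&c->shared);

    for (int i = 0; i < NR_CPUS; i++)
        sum += c->local[i];
    return sum;
}
```

This is essentially what the kernel's percpu_counter already provides, which is why the v2 series posted later in the thread switches nr_dentry and nr_inodes to percpu_counter instead of open-coding the batching.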
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet @ 2008-11-27 1:37 ` Christoph Lameter 2008-11-27 6:27 ` Eric Dumazet 2008-11-27 9:39 ` Christoph Hellwig ` (7 subsequent siblings) 8 siblings, 1 reply; 191+ messages in thread From: Christoph Lameter @ 2008-11-27 1:37 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Hellwig On Thu, 27 Nov 2008, Eric Dumazet wrote: > The last point is about SLUB being hit hard, unless we > use slub_min_order=3 at boot, or we use Christoph Lameter's > patch (struct file RCU optimizations) > http://thread.gmane.org/gmane.linux.kernel/418615 > > If we boot the machine with slub_min_order=3, SLUB overhead disappears. I'd rather not be that drastic. Did you try increasing slub_min_objects instead? Try 40-100. If we find the right number then we should update the tuning to make sure that it picks the right slab page sizes. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-27 1:37 ` Christoph Lameter @ 2008-11-27 6:27 ` Eric Dumazet 2008-11-27 14:44 ` Christoph Lameter 0 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-27 6:27 UTC (permalink / raw) To: Christoph Lameter Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Hellwig Christoph Lameter wrote: > On Thu, 27 Nov 2008, Eric Dumazet wrote: > >> The last point is about SLUB being hit hard, unless we >> use slub_min_order=3 at boot, or we use Christoph Lameter's >> patch (struct file RCU optimizations) >> http://thread.gmane.org/gmane.linux.kernel/418615 >> >> If we boot the machine with slub_min_order=3, SLUB overhead disappears. > > > I'd rather not be that drastic. Did you try increasing slub_min_objects > instead? Try 40-100. If we find the right number then we should update > the tuning to make sure that it picks the right slab page sizes. > > 4096/192 = 21 with slub_min_objects=22 : # cat /sys/kernel/slab/filp/order 1 # time ./socket8 real 0m1.725s user 0m0.685s sys 0m12.955s with slub_min_objects=45 : # cat /sys/kernel/slab/filp/order 2 # time ./socket8 real 0m1.652s user 0m0.694s sys 0m12.367s with slub_min_objects=80 : # cat /sys/kernel/slab/filp/order 3 # time ./socket8 real 0m1.642s user 0m0.719s sys 0m12.315s I would say slub_min_objects=45 is the optimal value on 32bit arches to get acceptable performance on this workload (order=2 for filp kmem_cache) Note : SLAB here is disastrous, but you already knew that :) real 0m8.128s user 0m0.748s sys 1m3.467s ^ permalink raw reply [flat|nested] 191+ messages in thread
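The orders Eric observes follow from simple slab geometry: with a ~192-byte struct file, an order-0 (4 KB) page holds 4096/192 = 21 objects, so slub_min_objects=22 forces order 1 and 45 forces order 2. A simplified model of that calculation (it ignores SLUB's fill-fraction and waste heuristics, so it can pick a lower order than SLUB actually does, e.g. for slub_min_objects=80):

```c
#define PAGE_SIZE_BYTES 4096u  /* x86 page size, as in the example above */

/* Objects of obj_size bytes that fit in a slab of 2^order pages. */
unsigned int objects_per_slab(unsigned int obj_size, unsigned int order)
{
    return (PAGE_SIZE_BYTES << order) / obj_size;
}

/* Smallest order whose slab holds at least min_objects objects,
 * mirroring the effect of the slub_min_objects boot parameter. */
unsigned int order_for_min_objects(unsigned int obj_size,
                                   unsigned int min_objects)
{
    unsigned int order = 0;

    while (objects_per_slab(obj_size, order) < min_objects)
        order++;
    return order;
}
```

With obj_size = 192, this reproduces the /sys/kernel/slab/filp/order values Eric reports for slub_min_objects=22 and 45.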
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-27 6:27 ` Eric Dumazet @ 2008-11-27 14:44 ` Christoph Lameter 0 siblings, 0 replies; 191+ messages in thread From: Christoph Lameter @ 2008-11-27 14:44 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Hellwig, Pekka Enberg On Thu, 27 Nov 2008, Eric Dumazet wrote: > with slub_min_objects=45 : > > # cat /sys/kernel/slab/filp/order > 2 > # time ./socket8 > real 0m1.652s > user 0m0.694s > sys 0m12.367s That may be a good value. How many processors do you have? Look at calculate_order() in mm/slub.c: if (!min_objects) min_objects = 4 * (fls(nr_cpu_ids) + 1); We could increase the scaling factor there, or start with a minimum of 20 objects? Try min_objects = 20 + 4 * (fls(nr_cpu_ids) + 1); > I would say slub_min_objects=45 is the optimal value on 32bit arches to > get acceptable performance on this workload (order=2 for filp kmem_cache) > > Note : SLAB here is disastrous, but you already knew that :) It's good though to have examples where the queue management gets in the way of performance. ^ permalink raw reply [flat|nested] 191+ messages in thread
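The two formulas are easy to compare numerically: with a kernel-style fls() (position of the highest set bit), an 8-CPU machine gets min_objects = 4 * (fls(8) + 1) = 20 from the current code, and 40 from the proposed variant — closer to, but still below, the ~45 Eric found optimal. A quick user-space check (fls_ mimics the kernel's fls(); __builtin_clz is a GCC/Clang builtin):

```c
/* Kernel-style fls(): fls(0) == 0, fls(1) == 1, fls(8) == 4. */
unsigned int fls_(unsigned int x)
{
    return x ? 32u - (unsigned int)__builtin_clz(x) : 0u;
}

/* Current SLUB default, as in calculate_order() in mm/slub.c. */
unsigned int min_objects_current(unsigned int nr_cpu_ids)
{
    return 4 * (fls_(nr_cpu_ids) + 1);
}

/* Christoph's proposed variant with a floor of 20 extra objects. */
unsigned int min_objects_proposed(unsigned int nr_cpu_ids)
{
    return 20 + 4 * (fls_(nr_cpu_ids) + 1);
}
```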
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet 2008-11-27 1:37 ` Christoph Lameter @ 2008-11-27 9:39 ` Christoph Hellwig 2008-11-28 18:03 ` Ingo Molnar ` (6 subsequent siblings) 8 siblings, 0 replies; 191+ messages in thread From: Christoph Hellwig @ 2008-11-27 9:39 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig As I told you before, you absolutely must include the fsdevel list and the VFS maintainer for a patchset like this. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet 2008-11-27 1:37 ` Christoph Lameter 2008-11-27 9:39 ` Christoph Hellwig @ 2008-11-28 18:03 ` Ingo Molnar 2008-11-28 18:47 ` Peter Zijlstra 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet ` (5 subsequent siblings) 8 siblings, 1 reply; 191+ messages in thread From: Ingo Molnar @ 2008-11-28 18:03 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig * Eric Dumazet <dada1@cosmosbay.com> wrote: > Hi all > > Short summary : Nice speedups for allocation/deallocation of sockets/pipes > (From 27.5 seconds to 1.6 second) Wow, that's incredibly impressive! :-) Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-28 18:03 ` Ingo Molnar @ 2008-11-28 18:47 ` Peter Zijlstra 2008-11-29 6:38 ` Christoph Hellwig 0 siblings, 1 reply; 191+ messages in thread From: Peter Zijlstra @ 2008-11-28 18:47 UTC (permalink / raw) To: Ingo Molnar Cc: Eric Dumazet, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig On Fri, 2008-11-28 at 19:03 +0100, Ingo Molnar wrote: > * Eric Dumazet <dada1@cosmosbay.com> wrote: > > > Hi all > > > > Short summary : Nice speedups for allocation/deallocation of sockets/pipes > > (From 27.5 seconds to 1.6 seconds) > > Wow, that's incredibly impressive! :-) Yeah, we got a similar speedup on -rt by pushing those super-block files lists into per-cpu lists and doing crazy locking on them. Of course avoiding them altogether, like done here, is a nicer option, but it is sadly not a possibility for regular files (until hch gets around to removing the need for the list). ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-28 18:47 ` Peter Zijlstra @ 2008-11-29 6:38 ` Christoph Hellwig 2008-11-29 8:07 ` Eric Dumazet 0 siblings, 1 reply; 191+ messages in thread From: Christoph Hellwig @ 2008-11-29 6:38 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Eric Dumazet, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig On Fri, Nov 28, 2008 at 07:47:56PM +0100, Peter Zijlstra wrote: > > Wow, that's incredibly impressive! :-) > > Yeah, we got a similar speedup on -rt by pushing those super-block files > list into per-cpu lists and doing crazy locking on them. > > Of course avoiding them all together, like done here is a nicer option > but is sadly not a possibility for regular files (until hch gets around > to removing the need for the list). We should have finished this long ago, thanks for the reminder. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-29 6:38 ` Christoph Hellwig @ 2008-11-29 8:07 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-29 8:07 UTC (permalink / raw) To: Christoph Hellwig Cc: Peter Zijlstra, Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter Christoph Hellwig wrote: > On Fri, Nov 28, 2008 at 07:47:56PM +0100, Peter Zijlstra wrote: >>> Wow, that's incredibly impressive! :-) >> Yeah, we got a similar speedup on -rt by pushing those super-block files >> lists into per-cpu lists and doing crazy locking on them. >> >> Of course avoiding them altogether, like done here, is a nicer option, >> but it is sadly not a possibility for regular files (until hch gets around >> to removing the need for the list). > > We should have finished this long ago, thanks for the reminder. > > inode_in_use could be percpu, at least. Or just zap it, since we never have to scan it. ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet ` (2 preceding siblings ...) 2008-11-28 18:03 ` Ingo Molnar @ 2008-11-29 8:43 ` Eric Dumazet 2008-12-11 22:38 ` [PATCH v3 0/7] " Eric Dumazet ` (7 more replies) 2008-11-29 8:43 ` [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry Eric Dumazet ` (4 subsequent siblings) 8 siblings, 8 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-29 8:43 UTC (permalink / raw) To: Ingo Molnar, Christoph Hellwig Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro Hi all Short summary : Nice speedups for allocation/deallocation of sockets/pipes (From 27.5 seconds to 2.9 seconds (2.3 seconds with SLUB tweaks)) Long version : For this second version, I removed the mntput()/mntget() optimization since most reviewers are not convinced it is useful. This is a four-line patch that can be reconsidered later. I chose the name SINGLE instead of SPECIAL to name isolated dentries (for sockets, pipes, anonymous fd) that have no parent and no relationship in the vfs. Thanks all To allocate a socket or a pipe we : 0) Do the usual file table manipulation (pretty scalable these days, but it would be faster if 'struct files' were using SLAB_DESTROY_BY_RCU, avoiding the call_rcu() cache killer) 1) allocate an inode with new_inode() This function : - locks inode_lock, - dirties the nr_inodes counter - dirties the inode_in_use list (for sockets/pipes, this is useless) - dirties superblock s_inodes. - dirties the last_ino counter All these are in different cache lines, unfortunately. 
2) allocate a dentry d_alloc() takes dcache_lock, inserts the dentry on its parent's list (dirtying sock_mnt->mnt_sb->s_root) and dirties nr_dentry 3) d_instantiate() the dentry (dcache_lock taken again) 4) init_file() -> atomic_inc() on sock_mnt->refcount At close() time, we must undo these things. It's even more expensive because of the _atomic_dec_and_lock() that stresses a lot, and because of the two cache lines that are touched when an element is deleted from a list (previous and next items) This is really bad, since sockets/pipes don't need to be visible in the dcache or in a per-super-block inode list. This patch series gets rid of all but one contended cache line for sockets, pipes and anonymous fds (signalfd, timerfd, ...) Sample program : for (i = 0; i < 1000000; i++) close(socket(AF_INET, SOCK_STREAM, 0)); Cost if one cpu runs the program : real 1.561s user 0.092s sys 1.469s Cost if 8 processes are launched on an 8 CPU machine (benchmark named socket8) : real 27.496s <<<< !!!! >>>> user 0.657s sys 3m39.092s Oprofile results (for the 8 process run, 3 times): CPU: Core 2, speed 3000.03 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. 
% symbol name 3347352 3347352 28.0232 28.0232 _atomic_dec_and_lock 3301428 6648780 27.6388 55.6620 d_instantiate 2971130 9619910 24.8736 80.5355 d_alloc 241318 9861228 2.0203 82.5558 init_file 146190 10007418 1.2239 83.7797 __slab_free 144149 10151567 1.2068 84.9864 inotify_d_instantiate 143971 10295538 1.2053 86.1917 inet_create 137168 10432706 1.1483 87.3401 new_inode 117549 10550255 0.9841 88.3242 add_partial 110795 10661050 0.9275 89.2517 generic_drop_inode 107137 10768187 0.8969 90.1486 kmem_cache_alloc 94029 10862216 0.7872 90.9358 tcp_close 82837 10945053 0.6935 91.6293 dput 67486 11012539 0.5650 92.1943 dentry_iput 57751 11070290 0.4835 92.6778 iput 54327 11124617 0.4548 93.1326 tcp_v4_init_sock 49921 11174538 0.4179 93.5505 sysenter_past_esp 47616 11222154 0.3986 93.9491 kmem_cache_free 30792 11252946 0.2578 94.2069 clear_inode 27540 11280486 0.2306 94.4375 copy_from_user 26509 11306995 0.2219 94.6594 init_timer 26363 11333358 0.2207 94.8801 discard_slab 25284 11358642 0.2117 95.0918 __fput 22482 11381124 0.1882 95.2800 __percpu_counter_add 20369 11401493 0.1705 95.4505 sock_alloc 18501 11419994 0.1549 95.6054 inet_csk_destroy_sock 17923 11437917 0.1500 95.7555 sys_close This patch series avoids all contended cache lines and makes this "bench" pretty fast. New cost if run on one cpu : real 1.325s (instead of 1.561s) user 0.091s sys 1.234s If run on 8 CPUS : real 0m2.971s user 0m0.726s sys 0m21.310s CPU: Core 2, speed 3000.04 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. 
% symbol name 189772 189772 12.7205 12.7205 _atomic_dec_and_lock 140467 330239 9.4155 22.1360 __slab_free 128210 458449 8.5940 30.7300 add_partial 121578 580027 8.1494 38.8794 kmem_cache_alloc 72626 652653 4.8681 43.7475 init_file 62720 715373 4.2041 47.9517 __percpu_counter_add 51632 767005 3.4609 51.4126 sysenter_past_esp 49196 816201 3.2976 54.7102 tcp_close 47933 864134 3.2130 57.9231 kmem_cache_free 29628 893762 1.9860 59.9091 copy_from_user 28443 922205 1.9065 61.8157 init_timer 25602 947807 1.7161 63.5318 __slab_alloc 22139 969946 1.4840 65.0158 discard_slab 20428 990374 1.3693 66.3851 __call_rcu 18174 1008548 1.2182 67.6033 alloc_fd 17643 1026191 1.1826 68.7859 __fput 17374 1043565 1.1646 69.9505 d_alloc 17196 1060761 1.1527 71.1031 sys_close 17024 1077785 1.1411 72.2442 inet_create 15208 1092993 1.0194 73.2636 alloc_inode 12201 1105194 0.8178 74.0815 fd_install 12167 1117361 0.8156 74.8970 lock_sock_nested 12123 1129484 0.8126 75.7096 get_empty_filp 11648 1141132 0.7808 76.4904 release_sock 11509 1152641 0.7715 77.2619 dput 11335 1163976 0.7598 78.0216 sock_init_data 11038 1175014 0.7399 78.7615 inet_csk_destroy_sock 10880 1185894 0.7293 79.4908 drop_file_write_access 10083 1195977 0.6759 80.1667 inotify_d_instantiate 9216 1205193 0.6178 80.7844 local_bh_enable_ip 8881 1214074 0.5953 81.3797 sysenter_do_call 8759 1222833 0.5871 81.9668 setup_object 8489 1231322 0.5690 82.5359 iput_single So we now hit mntput()/mntget() and SLUB. The last point is about SLUB being hit hard, unless we use slub_min_order=3 (or slub_min_objects=45) at boot, or we use Christoph Lameter patch (struct file RCU optimizations) http://thread.gmane.org/gmane.linux.kernel/418615 If we boot machine with slub_min_order=3, SLUB overhead disappears. If run on 8 CPUS : real 0m2.315s user 0m0.752s sys 0m17.324s CPU: Core 2, speed 3000.15 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. 
samples % cum. % symbol name 199409 199409 15.6440 15.6440 _atomic_dec_and_lock (mntput()) 141606 341015 11.1092 26.7532 kmem_cache_alloc 76071 417086 5.9679 32.7211 init_file 70595 487681 5.5383 38.2595 __percpu_counter_add 51595 539276 4.0477 42.3072 sysenter_past_esp 49313 588589 3.8687 46.1759 tcp_close 45503 634092 3.5698 49.7457 kmem_cache_free 41413 675505 3.2489 52.9946 __slab_free 29911 705416 2.3466 55.3412 copy_from_user 28979 734395 2.2735 57.6146 init_timer 22251 756646 1.7456 59.3602 get_empty_filp 19942 776588 1.5645 60.9247 __call_rcu 18348 794936 1.4394 62.3642 __fput 18328 813264 1.4379 63.8020 alloc_fd 17395 830659 1.3647 65.1667 sys_close 17301 847960 1.3573 66.5240 d_alloc 16570 864530 1.2999 67.8239 inet_create 15522 880052 1.2177 69.0417 alloc_inode 13185 893237 1.0344 70.0761 setup_object 12359 905596 0.9696 71.0456 fd_install 12275 917871 0.9630 72.0086 lock_sock_nested 11924 929795 0.9355 72.9441 release_sock 11790 941585 0.9249 73.8690 sock_init_data 11310 952895 0.8873 74.7563 dput 10924 963819 0.8570 75.6133 drop_file_write_access 10903 974722 0.8554 76.4687 inet_csk_destroy_sock 10184 984906 0.7990 77.2676 inotify_d_instantiate 9372 994278 0.7353 78.0029 local_bh_enable_ip 8901 1003179 0.6983 78.7012 sysenter_do_call 8569 1011748 0.6723 79.3735 iput_single 8194 1019942 0.6428 80.0163 inet_release This patch series contains 5 patches, against the net-next-2.6 tree (because this tree already contains network improvements on this subject, but it should apply to other trees) [PATCH 1/5] fs: Use a percpu_counter to track nr_dentry Adding a percpu_counter nr_dentry avoids cache line ping pongs between cpus to maintain this metric, and dcache_lock is no longer needed to protect dentry_stat.nr_dentry. We centralize nr_dentry updates at the right place : - increments in d_alloc() - decrements in d_free() d_alloc() can avoid taking dcache_lock if parent is NULL (socket8 bench result : 27.5s to 25s) [PATCH 2/5] fs: Use a percpu_counter to track nr_inodes 
Avoids cache line ping pongs between cpus and prepares the next patch, because updates of nr_inodes no longer need inode_lock.

(socket8 bench result : no difference at this point)

[PATCH 3/5] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing inode numbers. Solve this by giving each cpu a per_cpu variable, fed from the shared last_ino, but only once every 1024 allocations. This reduces contention on the shared last_ino and yields the same spread of inode numbers as before (same wraparound after 2^32 allocations).

(socket8 bench result : no difference)

[PATCH 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties. Like other files, they use a dentry and an inode. But dentries for these kinds of files are not hashed into the dcache, since there is no way someone can look up such a file in the vfs tree. (/proc/{pid}/fd/{number} uses a different mechanism.)

Still, allocating and freeing such dentries is expensive, because we currently take dcache_lock inside d_alloc(), d_instantiate(), and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as a single one (for sockets, pipes, anonymous fd), and a new d_alloc_single(const struct qstr *name, struct inode *inode) method, called by the three subsystems. Internally, dput() can take a fast path to dput_single() for SINGLE dentries. No more atomic_dec_and_lock() for such dentries.

Differences between a SINGLE dentry and a normal one are:
1) A SINGLE dentry has the DCACHE_SINGLE flag
2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED). This avoids taking a reference on the sb 'root' dentry, shared by too many dentries.
3) They are not hashed into the global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)

[PATCH 5/5] fs: new_inode_single() and iput_single()

The goal of this patch is to avoid touching inode_lock for socket/pipe/anonfd inode allocation and freeing. SINGLE dentries are attached to inodes that don't need to be linked into a list of inodes, be it "inode_in_use" or "sb->s_inodes". As inode_lock was taken only to protect these lists, we can avoid taking it as well. Using iput_single() from dput_single() avoids taking inode_lock at freeing time.

This patch has a very noticeable effect, because we avoid dirtying three contended cache lines in new_inode(), and five cache lines in iput().

(socket8 bench result : from 19.9s to 2.3s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
Overall diffstat :

 fs/anon_inodes.c | 18 ------
 fs/dcache.c | 100 ++++++++++++++++++++++++++++++--------
 fs/fs-writeback.c | 2
 fs/inode.c | 101 +++++++++++++++++++++++++++++++--------
 fs/pipe.c | 25 +--------
 include/linux/dcache.h | 9 +++
 include/linux/fs.h | 17 ++++++
 kernel/sysctl.c | 6 +-
 mm/page-writeback.c | 2
 net/socket.c | 26 +--------
 10 files changed, 200 insertions(+), 106 deletions(-)

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH v3 0/7] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet @ 2008-12-11 22:38 ` Eric Dumazet 2008-12-11 22:38 ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet ` (6 subsequent siblings) 7 siblings, 0 replies; 191+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, Kernel Testers List <kernel-testers@vger.kernel.org>, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Hi Andrew

Since v2 of this patch series got no new feedback, maybe it is time for -mm inclusion for a while?

In this third version I added the last two patches: one initially from Christoph Lameter, and one to avoid dirtying mnt->mnt_count on hardwired filesystems. Many thanks to Christoph and Paul for the SLAB_DESTROY_BY_RCU work done on "struct file".

Thank you

Short summary: nice speedups for allocation/deallocation of sockets/pipes (from 27.5 seconds to 1.62 s, on an 8-cpu machine)

Long version:

To allocate a socket or a pipe we:

0) Do the usual file table manipulation (pretty scalable these days, but it would be faster if 'struct file' used SLAB_DESTROY_BY_RCU and avoided the call_rcu() cache killer). This point is addressed by the 6th patch.

1) Allocate an inode with new_inode(). This function:
- locks inode_lock,
- dirties the nr_inodes counter,
- dirties the inode_in_use list (useless for sockets/pipes),
- dirties the superblock's s_inodes list,
- dirties the last_ino counter.
All of these are in different cache lines, unfortunately.

2) Allocate a dentry. d_alloc() takes dcache_lock, inserts the dentry on its parent's list (dirtying sock_mnt->mnt_sb->s_root) and dirties nr_dentry.

3) d_instantiate() the dentry (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount

At close() time, we must undo all of this.
It's even more expensive because of the _atomic_dec_and_lock() that stresses those lines a lot, and because deleting an element from a list touches two more cache lines (the previous and next items).

This is really bad, since sockets/pipes don't need to be visible in the dcache or in a per-superblock inode list. This patch series gets rid of all but one contended cache line for sockets, pipes and anonymous fds (signalfd, timerfd, ...).

socketallocbench is a very simple program (attached to this mail) that loops:

for (i = 0; i < 1000000; i++)
	close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program:

real 1.561s
user 0.092s
sys 1.469s

Cost if 8 processes are launched on an 8-cpu machine (socketallocbench -n 8):

real 27.496s <<<< !!!! >>>>
user 0.657s
sys 3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum.
% symbol name
3347352 3347352 28.0232 28.0232 _atomic_dec_and_lock
3301428 6648780 27.6388 55.6620 d_instantiate
2971130 9619910 24.8736 80.5355 d_alloc
241318 9861228 2.0203 82.5558 init_file
146190 10007418 1.2239 83.7797 __slab_free
144149 10151567 1.2068 84.9864 inotify_d_instantiate
143971 10295538 1.2053 86.1917 inet_create
137168 10432706 1.1483 87.3401 new_inode
117549 10550255 0.9841 88.3242 add_partial
110795 10661050 0.9275 89.2517 generic_drop_inode
107137 10768187 0.8969 90.1486 kmem_cache_alloc
94029 10862216 0.7872 90.9358 tcp_close
82837 10945053 0.6935 91.6293 dput
67486 11012539 0.5650 92.1943 dentry_iput
57751 11070290 0.4835 92.6778 iput
54327 11124617 0.4548 93.1326 tcp_v4_init_sock
49921 11174538 0.4179 93.5505 sysenter_past_esp
47616 11222154 0.3986 93.9491 kmem_cache_free
30792 11252946 0.2578 94.2069 clear_inode
27540 11280486 0.2306 94.4375 copy_from_user
26509 11306995 0.2219 94.6594 init_timer
26363 11333358 0.2207 94.8801 discard_slab
25284 11358642 0.2117 95.0918 __fput
22482 11381124 0.1882 95.2800 __percpu_counter_add
20369 11401493 0.1705 95.4505 sock_alloc
18501 11419994 0.1549 95.6054 inet_csk_destroy_sock
17923 11437917 0.1500 95.7555 sys_close

This patch series avoids all contended cache lines and makes this "bench" pretty fast.

New cost if run on one cpu:

real 1.245s (instead of 1.561s)
user 0.074s
sys 1.161s

If run on 8 CPUs:

real 1.624s
user 0.580s
sys 12.296s

In oprofile, we can finally see network functions at the front of the expensive items (with the exception of kmem_cache_[z]alloc(), because it has to clear 192 bytes of file structures; this takes half of its time).

CPU: Core 2, speed 3000.09 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum.
% symbol name
176586 176586 10.9376 10.9376 kmem_cache_alloc
169838 346424 10.5196 21.4572 tcp_close
105331 451755 6.5241 27.9813 tcp_v4_init_sock
105146 556901 6.5126 34.4939 tcp_v4_destroy_sock
83307 640208 5.1600 39.6539 sysenter_past_esp
80241 720449 4.9701 44.6239 inet_csk_destroy_sock
74263 794712 4.5998 49.2237 kmem_cache_free
56806 851518 3.5185 52.7422 __percpu_counter_add
48619 900137 3.0114 55.7536 copy_from_user
44803 944940 2.7751 58.5287 init_timer
28539 973479 1.7677 60.2964 d_alloc
27795 1001274 1.7216 62.0180 alloc_fd
26747 1028021 1.6567 63.6747 __fput
24312 1052333 1.5059 65.1805 sys_close
24205 1076538 1.4992 66.6798 inet_create
22409 1098947 1.3880 68.0677 alloc_inode
21359 1120306 1.3230 69.3907 release_sock
19865 1140171 1.2304 70.6211 fd_install
19472 1159643 1.2061 71.8272 lock_sock_nested
18956 1178599 1.1741 73.0013 sock_init_data
17301 1195900 1.0716 74.0729 drop_file_write_access
17113 1213013 1.0600 75.1329 inotify_d_instantiate
16384 1229397 1.0148 76.1477 dput
15173 1244570 0.9398 77.0875 local_bh_enable_ip
15017 1259587 0.9301 78.0176 local_bh_enable
13354 1272941 0.8271 78.8448 __sock_create
13139 1286080 0.8138 79.6586 inet_release
13062 1299142 0.8090 80.4676 sysenter_do_call
11935 1311077 0.7392 81.2069 iput_single

This patch series contains 7 patches, against the linux-2.6 tree, plus one patch in -mm (fs: filp_cachep can be static in fs/file_table.c).

[PATCH 1/7] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs between cpus to maintain this metric, and dcache_lock is no longer needed to protect dentry_stat.nr_dentry.

We centralize nr_dentry updates at the right place:
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if the parent is NULL.

("socketallocbench -n 8" bench result : 27.5s to 25s)

[PATCH 2/7] fs: Use a percpu_counter to track nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch, because updates of nr_inodes don't
need inode_lock anymore.

("socketallocbench -n 8" bench result : no difference at this point)

[PATCH 3/7] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing inode numbers. Solve this by giving each cpu a per_cpu variable, fed from the shared last_ino, but only once every 1024 allocations. This reduces contention on the shared last_ino and yields the same spread of inode numbers as before (same wraparound after 2^32 allocations).

("socketallocbench -n 8" result : no difference)

[PATCH 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties. Like other files, they use a dentry and an inode. But dentries for these kinds of files are not hashed into the dcache, since there is no way someone can look up such a file in the vfs tree. (/proc/{pid}/fd/{number} uses a different mechanism.)

Still, allocating and freeing such dentries is expensive, because we currently take dcache_lock inside d_alloc(), d_instantiate(), and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as a single one (for sockets, pipes, anonymous fd), and a new d_alloc_single(const struct qstr *name, struct inode *inode) method, called by the three subsystems. Internally, dput() can take a fast path to dput_single() for SINGLE dentries. No more atomic_dec_and_lock() for such dentries.

Differences between a SINGLE dentry and a normal one are:
1) A SINGLE dentry has the DCACHE_SINGLE flag
2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED). This avoids taking a reference on the sb 'root' dentry, shared by too many dentries.
3) They are not hashed into the global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)

[PATCH 5/7] fs: new_inode_single() and iput_single()

The goal of this patch is to avoid touching inode_lock for socket/pipe/anonfd inode allocation and freeing.
SINGLE dentries are attached to inodes that don't need to be linked into a list of inodes, be it "inode_in_use" or "sb->s_inodes". As inode_lock was taken only to protect these lists, we can avoid taking it as well. Using iput_single() from dput_single() avoids taking inode_lock at freeing time.

This patch has a very noticeable effect, because we avoid dirtying three contended cache lines in new_inode(), and five cache lines in iput().

("socketallocbench -n 8" result : from 19.9s to 3.01s)

[PATCH 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

From: Christoph Lameter <cl@linux-foundation.org>

Currently we schedule RCU frees for each file we free separately. That has several drawbacks against the earlier file handling (e.g. in 2.6.5), which did not require RCU callbacks:

1. An excessive number of RCU callbacks can be generated, causing long RCU queues that in turn cause long latencies. We hit SLUB page allocation more often than necessary.

2. The cache-hot object is not preserved between free and realloc. A close followed by another open is very fast with the RCU-less approach, because the last freed object is returned by the slab allocator while still cache hot. RCU free means that the object is not immediately available again; the new object is cache cold, and open/close performance tests therefore show a significant degradation with the RCU implementation.

One solution to this problem is to move the RCU freeing into the slab allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation time. The slab allocator will then do RCU frees only when it is necessary to dispose of slabs of objects (rare). With that approach we can cut out the RCU overhead significantly.

However, under SLAB_DESTROY_BY_RCU the slab allocator may return the object for another use even before the RCU period has expired. This means there is the (unlikely) possibility that the object is switched under us in sections protected by rcu_read_lock() and rcu_read_unlock().
So we need to verify that we have acquired the correct object after establishing a stable object reference (incrementing the refcounter does that).

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

("socketallocbench -n 8" result : from 3.01s to 2.20s)

[PATCH 7/7] fs: MS_NOREFCOUNT

Some filesystems are hardwired into the kernel, and mntput()/mntget() on them hit a contended cache line. We define a new superblock flag, MS_NOREFCOUNT, that is set on the socket, pipe and anonymous fd superblocks. mntput()/mntget() become no-ops on these filesystems.

("socketallocbench -n 8" result : from 2.20s to 1.64s)

cat socketallocbench.c

/*
 * socketallocbench benchmark
 *
 * Usage : socketallocbench [-n procs] [-l loops]
 */
#include <sys/socket.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/wait.h>

void dowork(int loops)
{
	int i;

	for (i = 0; i < loops; i++)
		close(socket(AF_INET, SOCK_STREAM, 0));
}

int main(int argc, char *argv[])
{
	int i;
	int n = 1;
	int loops = 1000000;
	pid_t *pidtable;

	while ((i = getopt(argc, argv, "n:l:")) != EOF) {
		if (i == 'n')
			n = atoi(optarg);
		if (i == 'l')
			loops = atoi(optarg);
	}
	pidtable = malloc(n * sizeof(pid_t));
	for (i = 1; i < n; i++) {
		pidtable[i] = fork();
		if (pidtable[i] == 0) {
			dowork(loops);
			_exit(0);
		}
		if (pidtable[i] == -1) {
			perror("fork");
			n = i;
			break;
		}
	}
	dowork(loops);
	for (i = 1; i < n; i++) {
		int status;

		wait(&status);
	}
	return 0;
}

^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet 2008-12-11 22:38 ` [PATCH v3 0/7] " Eric Dumazet @ 2008-12-11 22:38 ` Eric Dumazet 2007-07-24 1:24 ` Nick Piggin 2008-12-16 21:04 ` Paul E. McKenney 2008-12-11 22:39 ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet ` (5 subsequent siblings) 7 siblings, 2 replies; 191+ messages in thread From: Eric Dumazet @ 2008-12-11 22:38 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney Adding a percpu_counter nr_dentry avoids cache line ping pongs between cpus to maintain this metric, and dcache_lock is no more needed to protect dentry_stat.nr_dentry We centralize nr_dentry updates at the right place : - increments in d_alloc() - decrements in d_free() d_alloc() can avoid taking dcache_lock if parent is NULL ("socketallocbench -n8" result : 27.5s to 25s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/dcache.c | 49 +++++++++++++++++++++++++------------------ include/linux/fs.h | 2 + kernel/sysctl.c | 2 - 3 files changed, 32 insertions(+), 21 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index fa1ba03..f463a81 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly; static unsigned int d_hash_mask __read_mostly; static unsigned int d_hash_shift __read_mostly; static struct hlist_head *dentry_hashtable __read_mostly; +static struct percpu_counter nr_dentry; /* Statistics gathering. 
*/ struct dentry_stat_t dentry_stat = { .age_limit = 45, }; +/* + * Handle nr_dentry sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry); + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void __d_free(struct dentry *dentry) { WARN_ON(!list_empty(&dentry->d_alias)); @@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head) } /* - * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry - * inside dcache_lock. + * no dcache_lock, please. */ static void d_free(struct dentry *dentry) { @@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry) __d_free(dentry); else call_rcu(&dentry->d_u.d_rcu, d_callback); + percpu_counter_dec(&nr_dentry); } /* @@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry) struct dentry *parent; list_del(&dentry->d_u.d_child); - dentry_stat.nr_dentry--; /* For d_free, below */ /*drops the locks, at that point nobody can reach this dentry */ dentry_iput(dentry); if (IS_ROOT(dentry)) @@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb) static void shrink_dcache_for_umount_subtree(struct dentry *dentry) { struct dentry *parent; - unsigned detached = 0; BUG_ON(!IS_ROOT(dentry)); @@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) } list_del(&dentry->d_u.d_child); - detached++; inode = dentry->d_inode; if (inode) { @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) * otherwise we ascend to the parent and move to the * next sibling if there is one */ if (!parent) - goto out; + return; dentry = parent; @@ -705,11 +721,6 @@ static void 
shrink_dcache_for_umount_subtree(struct dentry *dentry) dentry = list_entry(dentry->d_subdirs.next, struct dentry, d_u.d_child); } -out: - /* several dentries were freed, need to correct nr_dentry */ - spin_lock(&dcache_lock); - dentry_stat.nr_dentry -= detached; - spin_unlock(&dcache_lock); } /* @@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) dentry->d_flags = DCACHE_UNHASHED; spin_lock_init(&dentry->d_lock); dentry->d_inode = NULL; - dentry->d_parent = NULL; - dentry->d_sb = NULL; dentry->d_op = NULL; dentry->d_fsdata = NULL; dentry->d_mounted = 0; @@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) if (parent) { dentry->d_parent = dget(parent); dentry->d_sb = parent->d_sb; + spin_lock(&dcache_lock); + list_add(&dentry->d_u.d_child, &parent->d_subdirs); + spin_unlock(&dcache_lock); } else { + dentry->d_parent = NULL; + dentry->d_sb = NULL; INIT_LIST_HEAD(&dentry->d_u.d_child); } - - spin_lock(&dcache_lock); - if (parent) - list_add(&dentry->d_u.d_child, &parent->d_subdirs); - dentry_stat.nr_dentry++; - spin_unlock(&dcache_lock); - + percpu_counter_inc(&nr_dentry); return dentry; } @@ -2282,6 +2290,7 @@ static void __init dcache_init(void) { int loop; + percpu_counter_init(&nr_dentry, 0); /* * A constructor could be added for stable state like the lists, * but it is probably not worth it because of the cache nature diff --git a/include/linux/fs.h b/include/linux/fs.h index 4a853ef..114cb65 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata) struct ctl_table; int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 
3d56fe7..777bee7 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = { .data = &dentry_stat, .maxlen = 6*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_dentry, }, { .ctl_name = FS_OVERFLOWUID, ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry 2008-12-11 22:38 ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet @ 2007-07-24 1:24 ` Nick Piggin 2008-12-16 21:04 ` Paul E. McKenney 1 sibling, 0 replies; 191+ messages in thread From: Nick Piggin @ 2007-07-24 1:24 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney On Friday 12 December 2008 09:38, Eric Dumazet wrote: > Adding a percpu_counter nr_dentry avoids cache line ping pongs > between cpus to maintain this metric, and dcache_lock is > no more needed to protect dentry_stat.nr_dentry > > We centralize nr_dentry updates at the right place : > - increments in d_alloc() > - decrements in d_free() > > d_alloc() can avoid taking dcache_lock if parent is NULL > > ("socketallocbench -n8" result : 27.5s to 25s) Seems like a good idea. > @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct > dentry *dentry) * otherwise we ascend to the parent and move to the > * next sibling if there is one */ > if (!parent) > - goto out; > + return; > > dentry = parent; > Andrew doesn't like return from middle of function. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry 2008-12-11 22:38 ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet 2007-07-24 1:24 ` Nick Piggin @ 2008-12-16 21:04 ` Paul E. McKenney 1 sibling, 0 replies; 191+ messages in thread From: Paul E. McKenney @ 2008-12-16 21:04 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro On Thu, Dec 11, 2008 at 11:38:56PM +0100, Eric Dumazet wrote: > Adding a percpu_counter nr_dentry avoids cache line ping pongs > between cpus to maintain this metric, and dcache_lock is > no more needed to protect dentry_stat.nr_dentry > > We centralize nr_dentry updates at the right place : > - increments in d_alloc() > - decrements in d_free() > > d_alloc() can avoid taking dcache_lock if parent is NULL > > ("socketallocbench -n8" result : 27.5s to 25s) Looks good! (At least once I realised that nr_dentry was global rather than per-dentry!!!) Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > fs/dcache.c | 49 +++++++++++++++++++++++++------------------ > include/linux/fs.h | 2 + > kernel/sysctl.c | 2 - > 3 files changed, 32 insertions(+), 21 deletions(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index fa1ba03..f463a81 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly; > static unsigned int d_hash_mask __read_mostly; > static unsigned int d_hash_shift __read_mostly; > static struct hlist_head *dentry_hashtable __read_mostly; > +static struct percpu_counter nr_dentry; > > /* Statistics gathering. 
*/ > struct dentry_stat_t dentry_stat = { > .age_limit = 45, > }; > > +/* > + * Handle nr_dentry sysctl > + */ > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) > +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos) > +{ > + dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry); > + return proc_dointvec(table, write, filp, buffer, lenp, ppos); > +} > +#else > +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos) > +{ > + return -ENOSYS; > +} > +#endif > + > static void __d_free(struct dentry *dentry) > { > WARN_ON(!list_empty(&dentry->d_alias)); > @@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head) > } > > /* > - * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry > - * inside dcache_lock. > + * no dcache_lock, please. > */ > static void d_free(struct dentry *dentry) > { > @@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry) > __d_free(dentry); > else > call_rcu(&dentry->d_u.d_rcu, d_callback); > + percpu_counter_dec(&nr_dentry); > } > > /* > @@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry) > struct dentry *parent; > > list_del(&dentry->d_u.d_child); > - dentry_stat.nr_dentry--; /* For d_free, below */ > /*drops the locks, at that point nobody can reach this dentry */ > dentry_iput(dentry); > if (IS_ROOT(dentry)) > @@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb) > static void shrink_dcache_for_umount_subtree(struct dentry *dentry) > { > struct dentry *parent; > - unsigned detached = 0; > > BUG_ON(!IS_ROOT(dentry)); > > @@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) > } > > list_del(&dentry->d_u.d_child); > - detached++; > > inode = dentry->d_inode; > if (inode) { > @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) > * otherwise we ascend to the parent and 
move to the > * next sibling if there is one */ > if (!parent) > - goto out; > + return; > > dentry = parent; > > @@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) > dentry = list_entry(dentry->d_subdirs.next, > struct dentry, d_u.d_child); > } > -out: > - /* several dentries were freed, need to correct nr_dentry */ > - spin_lock(&dcache_lock); > - dentry_stat.nr_dentry -= detached; > - spin_unlock(&dcache_lock); > } > > /* > @@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) > dentry->d_flags = DCACHE_UNHASHED; > spin_lock_init(&dentry->d_lock); > dentry->d_inode = NULL; > - dentry->d_parent = NULL; > - dentry->d_sb = NULL; > dentry->d_op = NULL; > dentry->d_fsdata = NULL; > dentry->d_mounted = 0; > @@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) > if (parent) { > dentry->d_parent = dget(parent); > dentry->d_sb = parent->d_sb; > + spin_lock(&dcache_lock); > + list_add(&dentry->d_u.d_child, &parent->d_subdirs); > + spin_unlock(&dcache_lock); > } else { > + dentry->d_parent = NULL; > + dentry->d_sb = NULL; > INIT_LIST_HEAD(&dentry->d_u.d_child); > } > - > - spin_lock(&dcache_lock); > - if (parent) > - list_add(&dentry->d_u.d_child, &parent->d_subdirs); > - dentry_stat.nr_dentry++; > - spin_unlock(&dcache_lock); > - > + percpu_counter_inc(&nr_dentry); > return dentry; > } > > @@ -2282,6 +2290,7 @@ static void __init dcache_init(void) > { > int loop; > > + percpu_counter_init(&nr_dentry, 0); > /* > * A constructor could be added for stable state like the lists, > * but it is probably not worth it because of the cache nature > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 4a853ef..114cb65 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata) > struct ctl_table; > int proc_nr_files(struct ctl_table *table, int write, struct file *filp, > void __user 
*buffer, size_t *lenp, loff_t *ppos); > +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos); > > int get_filesystem_list(char * buf); > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 3d56fe7..777bee7 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = { > .data = &dentry_stat, > .maxlen = 6*sizeof(int), > .mode = 0444, > - .proc_handler = &proc_dointvec, > + .proc_handler = &proc_nr_dentry, > }, > { > .ctl_name = FS_OVERFLOWUID, ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet 2008-12-11 22:38 ` [PATCH v3 0/7] " Eric Dumazet 2008-12-11 22:38 ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet @ 2008-12-11 22:39 ` Eric Dumazet 2007-07-24 1:30 ` Nick Piggin 2008-12-16 21:10 ` Paul E. McKenney 2008-12-11 22:39 ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet ` (4 subsequent siblings) 7 siblings, 2 replies; 191+ messages in thread From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney Avoids cache line ping pongs between cpus and prepare next patch, because updates of nr_inodes dont need inode_lock anymore. (socket8 bench result : no difference at this point) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/fs-writeback.c | 2 +- fs/inode.c | 39 +++++++++++++++++++++++++++++++-------- include/linux/fs.h | 3 +++ kernel/sysctl.c | 4 ++-- mm/page-writeback.c | 2 +- 5 files changed, 38 insertions(+), 12 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index d0ff0b8..b591cdd 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait) unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS); wbc.nr_to_write = nr_dirty + nr_unstable + - (inodes_stat.nr_inodes - inodes_stat.nr_unused) + + (get_nr_inodes() - inodes_stat.nr_unused) + nr_dirty + nr_unstable; wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ sync_sb_inodes(sb, &wbc); diff --git a/fs/inode.c b/fs/inode.c index 0487ddb..f94f889 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex); * Statistics gathering.. 
 */
 struct inodes_stat_t inodes_stat;
+static struct percpu_counter nr_inodes;

 static struct kmem_cache * inode_cachep __read_mostly;

+int get_nr_inodes(void)
+{
+	return percpu_counter_sum_positive(&nr_inodes);
+}
+
+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	inodes_stat.nr_inodes = get_nr_inodes();
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
+}
+#endif
+
 static void wake_up_inode(struct inode *inode)
 {
 	/*
@@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head)
 		destroy_inode(inode);
 		nr_disposed++;
 	}
-	spin_lock(&inode_lock);
-	inodes_stat.nr_inodes -= nr_disposed;
-	spin_unlock(&inode_lock);
+	percpu_counter_sub(&nr_inodes, nr_disposed);
 }

 /*
@@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb)

 	inode = alloc_inode(sb);
 	if (inode) {
+		percpu_counter_inc(&nr_inodes);
 		spin_lock(&inode_lock);
-		inodes_stat.nr_inodes++;
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
 		inode->i_ino = ++last_ino;
@@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
 		if (set(inode, data))
 			goto set_failed;

-		inodes_stat.nr_inodes++;
+		percpu_counter_inc(&nr_inodes);
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
 		hlist_add_head(&inode->i_hash, head);
@@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
 		old = find_inode_fast(sb, head, ino);
 		if (!old) {
 			inode->i_ino = ino;
-			inodes_stat.nr_inodes++;
+			percpu_counter_inc(&nr_inodes);
 			list_add(&inode->i_list, &inode_in_use);
 			list_add(&inode->i_sb_list, &sb->s_inodes);
 			hlist_add_head(&inode->i_hash, head);
@@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode)
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
-	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	percpu_counter_dec(&nr_inodes);

 	security_inode_delete(inode);

@@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode)
 		list_del_init(&inode->i_list);
 		list_del_init(&inode->i_sb_list);
 		inode->i_state |= I_FREEING;
-		inodes_stat.nr_inodes--;
 		spin_unlock(&inode_lock);
+		percpu_counter_dec(&nr_inodes);
 		if (inode->i_data.nrpages)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
@@ -1394,6 +1416,7 @@ void __init inode_init(void)
 {
 	int loop;

+	percpu_counter_init(&nr_inodes, 0);
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
 					 sizeof(struct inode),
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 114cb65..a789346 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -47,6 +47,7 @@ struct inodes_stat_t {
 	int dummy[5];		/* padding for sysctl ABI compatibility */
 };
 extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);

 extern int leases_enable, lease_break_time;

@@ -2219,6 +2220,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
 int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
 		   void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
+		   void __user *buffer, size_t *lenp, loff_t *ppos);

 int get_filesystem_list(char * buf);

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 777bee7..b705f3a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1205,7 +1205,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 2*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.ctl_name	= FS_STATINODE,
@@ -1213,7 +1213,7 @@ static struct ctl_table fs_table[] = {
 		.data		= &inodes_stat,
 		.maxlen		= 7*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_inodes,
 	},
 	{
 		.procname	= "file-nr",
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2970e35..a71a922 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
 	next_jif = start_jif + dirty_writeback_interval;
 	nr_to_write = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			(get_nr_inodes() - inodes_stat.nr_unused);
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;

^ permalink raw reply related	[flat|nested] 191+ messages in thread
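The batching behaviour this patch relies on can be sketched in userland. The model below is a deliberately simplified, single-threaded miniature of the kernel's percpu_counter — the names (`pcpu_counter`, `BATCH`, the explicit `cpu` index) are illustrative, not the kernel API: hot-path updates touch only a per-CPU slot, the shared total is dirtied only when a local delta reaches the batch size, and an exact read has to fold in every per-CPU delta, which is why get_nr_inodes() calls percpu_counter_sum_positive() rather than reading the shared count.

```c
#include <assert.h>
#include <string.h>

#define NR_CPUS 4
#define BATCH   32

/* Illustrative stand-in for struct percpu_counter */
struct pcpu_counter {
	long count;		/* shared total, rarely written */
	long local[NR_CPUS];	/* per-CPU deltas, hot path */
};

static void pcpu_init(struct pcpu_counter *c)
{
	memset(c, 0, sizeof(*c));
}

static void pcpu_add(struct pcpu_counter *c, int cpu, long amount)
{
	long v = c->local[cpu] + amount;

	if (v >= BATCH || v <= -BATCH) {
		/* delta reached the batch size: fold into the shared total */
		c->count += v;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = v;
	}
}

/* Exact sum, as percpu_counter_sum_positive() does: walk every
 * per-CPU delta and clamp negative results to zero. */
static long pcpu_sum_positive(const struct pcpu_counter *c)
{
	long sum = c->count;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum > 0 ? sum : 0;
}
```

The trade-off is visible directly: the shared `count` field stays untouched (no cache-line bouncing) until a CPU accumulates a full batch, while exact reads cost a walk over all CPUs.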
* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes 2008-12-11 22:39 ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet @ 2007-07-24 1:30 ` Nick Piggin 2008-12-12 5:11 ` Eric Dumazet 2008-12-16 21:10 ` Paul E. McKenney 1 sibling, 1 reply; 191+ messages in thread From: Nick Piggin @ 2007-07-24 1:30 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney On Friday 12 December 2008 09:39, Eric Dumazet wrote: > Avoids cache line ping pongs between cpus and prepare next patch, > because updates of nr_inodes dont need inode_lock anymore. > > (socket8 bench result : no difference at this point) Looks good. But.... If we never actually need fast access to the approximate total, (which seems to apply to this and the previous patch) we could use something much simpler which does not have the spinlock or all this batching stuff that percpu counters have. I'd prefer that because it will be faster in a straight line... (BTW. percpu counters can't be used in interrupt context? That's nice.) ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes 2007-07-24 1:30 ` Nick Piggin @ 2008-12-12 5:11 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-12-12 5:11 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney Nick Piggin a écrit : > On Friday 12 December 2008 09:39, Eric Dumazet wrote: >> Avoids cache line ping pongs between cpus and prepare next patch, >> because updates of nr_inodes dont need inode_lock anymore. >> >> (socket8 bench result : no difference at this point) > > Looks good. > > But.... If we never actually need fast access to the approximate > total, (which seems to apply to this and the previous patch) we > could use something much simpler which does not have the spinlock > or all this batching stuff that percpu counters have. I'd prefer > that because it will be faster in a straight line... Well, using a non batching mode could be real easy, just call __percpu_counter_add(&counter, inc, 1<<30); Or define a new percpu_counter_fastadd(&counter, inc); percpu_counter are nice because handle the CPU hotplug problem, if we want to use for_each_online_cpu() instead of for_each_possible_cpu(). > > (BTW. percpu counters can't be used in interrupt context? That's > nice.) > > Not sure why you said this. I would like to have a irqsafe percpu_counter, I was preparing such a patch because we need it for net-next ^ permalink raw reply [flat|nested] 191+ messages in thread
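The "much simpler" structure Nick alludes to can be sketched as plain per-CPU slots with no shared total and no batching at all (a hypothetical userland sketch, not code from this thread): writes stay strictly local with no lock and no fold-in, and the price moves entirely to readers, who pay an O(NR_CPUS) sum on every query.

```c
#include <assert.h>

#define NR_CPUS 4

/* One slot per CPU; no shared total, no batch threshold, no lock. */
static long counts[NR_CPUS];

static void simple_inc(int cpu) { counts[cpu]++; }
static void simple_dec(int cpu) { counts[cpu]--; }

/* Readers always traverse every slot. */
static long simple_sum(void)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += counts[cpu];
	return sum;
}
```

This is faster "in a straight line" than a batched percpu_counter, but it gives up the cheap approximate total and, as Eric notes, a real implementation still has to decide how to handle CPU hotplug (for_each_possible_cpu vs. for_each_online_cpu).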
* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes 2008-12-11 22:39 ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet 2007-07-24 1:30 ` Nick Piggin @ 2008-12-16 21:10 ` Paul E. McKenney 1 sibling, 0 replies; 191+ messages in thread From: Paul E. McKenney @ 2008-12-16 21:10 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro On Thu, Dec 11, 2008 at 11:39:10PM +0100, Eric Dumazet wrote: > Avoids cache line ping pongs between cpus and prepare next patch, > because updates of nr_inodes dont need inode_lock anymore. > > (socket8 bench result : no difference at this point) I do like this per-CPU counter infrastructure! One small comment change noted below. Other than that: Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > fs/fs-writeback.c | 2 +- > fs/inode.c | 39 +++++++++++++++++++++++++++++++-------- > include/linux/fs.h | 3 +++ > kernel/sysctl.c | 4 ++-- > mm/page-writeback.c | 2 +- > 5 files changed, 38 insertions(+), 12 deletions(-) > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index d0ff0b8..b591cdd 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait) > unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS); > > wbc.nr_to_write = nr_dirty + nr_unstable + > - (inodes_stat.nr_inodes - inodes_stat.nr_unused) + > + (get_nr_inodes() - inodes_stat.nr_unused) + > nr_dirty + nr_unstable; > wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ > sync_sb_inodes(sb, &wbc); > diff --git a/fs/inode.c b/fs/inode.c > index 0487ddb..f94f889 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -96,9 +96,33 @@ static 
DEFINE_MUTEX(iprune_mutex); > * Statistics gathering.. > */ > struct inodes_stat_t inodes_stat; > +static struct percpu_counter nr_inodes; > > static struct kmem_cache * inode_cachep __read_mostly; > > +int get_nr_inodes(void) > +{ > + return percpu_counter_sum_positive(&nr_inodes); > +} > + > +/* > + * Handle nr_dentry sysctl That would be "nr_inode", right? > + */ > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) > +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos) > +{ > + inodes_stat.nr_inodes = get_nr_inodes(); > + return proc_dointvec(table, write, filp, buffer, lenp, ppos); > +} > +#else > +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos) > +{ > + return -ENOSYS; > +} > +#endif > + > static void wake_up_inode(struct inode *inode) > { > /* > @@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head) > destroy_inode(inode); > nr_disposed++; > } > - spin_lock(&inode_lock); > - inodes_stat.nr_inodes -= nr_disposed; > - spin_unlock(&inode_lock); > + percpu_counter_sub(&nr_inodes, nr_disposed); > } > > /* > @@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb) > > inode = alloc_inode(sb); > if (inode) { > + percpu_counter_inc(&nr_inodes); > spin_lock(&inode_lock); > - inodes_stat.nr_inodes++; > list_add(&inode->i_list, &inode_in_use); > list_add(&inode->i_sb_list, &sb->s_inodes); > inode->i_ino = ++last_ino; > @@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h > if (set(inode, data)) > goto set_failed; > > - inodes_stat.nr_inodes++; > + percpu_counter_inc(&nr_inodes); > list_add(&inode->i_list, &inode_in_use); > list_add(&inode->i_sb_list, &sb->s_inodes); > hlist_add_head(&inode->i_hash, head); > @@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he > old = find_inode_fast(sb, head, ino); > if (!old) 
{ > inode->i_ino = ino; > - inodes_stat.nr_inodes++; > + percpu_counter_inc(&nr_inodes); > list_add(&inode->i_list, &inode_in_use); > list_add(&inode->i_sb_list, &sb->s_inodes); > hlist_add_head(&inode->i_hash, head); > @@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode) > list_del_init(&inode->i_list); > list_del_init(&inode->i_sb_list); > inode->i_state |= I_FREEING; > - inodes_stat.nr_inodes--; > spin_unlock(&inode_lock); > + percpu_counter_dec(&nr_inodes); > > security_inode_delete(inode); > > @@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode) > list_del_init(&inode->i_list); > list_del_init(&inode->i_sb_list); > inode->i_state |= I_FREEING; > - inodes_stat.nr_inodes--; > spin_unlock(&inode_lock); > + percpu_counter_dec(&nr_inodes); > if (inode->i_data.nrpages) > truncate_inode_pages(&inode->i_data, 0); > clear_inode(inode); > @@ -1394,6 +1416,7 @@ void __init inode_init(void) > { > int loop; > > + percpu_counter_init(&nr_inodes, 0); > /* inode slab cache */ > inode_cachep = kmem_cache_create("inode_cache", > sizeof(struct inode), > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 114cb65..a789346 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -47,6 +47,7 @@ struct inodes_stat_t { > int dummy[5]; /* padding for sysctl ABI compatibility */ > }; > extern struct inodes_stat_t inodes_stat; > +extern int get_nr_inodes(void); > > extern int leases_enable, lease_break_time; > > @@ -2219,6 +2220,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp, > void __user *buffer, size_t *lenp, loff_t *ppos); > int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, > void __user *buffer, size_t *lenp, loff_t *ppos); > +int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos); > > int get_filesystem_list(char * buf); > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 777bee7..b705f3a 
100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1205,7 +1205,7 @@ static struct ctl_table fs_table[] = { > .data = &inodes_stat, > .maxlen = 2*sizeof(int), > .mode = 0444, > - .proc_handler = &proc_dointvec, > + .proc_handler = &proc_nr_inodes, > }, > { > .ctl_name = FS_STATINODE, > @@ -1213,7 +1213,7 @@ static struct ctl_table fs_table[] = { > .data = &inodes_stat, > .maxlen = 7*sizeof(int), > .mode = 0444, > - .proc_handler = &proc_dointvec, > + .proc_handler = &proc_nr_inodes, > }, > { > .procname = "file-nr", > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 2970e35..a71a922 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg) > next_jif = start_jif + dirty_writeback_interval; > nr_to_write = global_page_state(NR_FILE_DIRTY) + > global_page_state(NR_UNSTABLE_NFS) + > - (inodes_stat.nr_inodes - inodes_stat.nr_unused); > + (get_nr_inodes() - inodes_stat.nr_unused); > while (nr_to_write > 0) { > wbc.more_io = 0; > wbc.encountered_congestion = 0; ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator
  2008-11-29  8:43 ` [PATCH v2 0/5] " Eric Dumazet
                     ` (2 preceding siblings ...)
  2008-12-11 22:39 ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
@ 2008-12-11 22:39 ` Eric Dumazet
  2007-07-24  1:34   ` Nick Piggin
  2008-12-16 21:26   ` Paul E. McKenney
  2008-12-11 22:39 ` [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
                     ` (3 subsequent siblings)

  7 siblings, 2 replies; 191+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
    linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List,
    Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter,
    linux-fsdevel, Al Viro, Paul E. McKenney

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing each CPU with a per_cpu variable,
fed by the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino and gives the same
spread of inode numbers as before.
(same wraparound after 2^32 allocations)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/inode.c |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index f94f889..dc8e72a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -556,6 +556,36 @@ repeat:
 	return node ? inode : NULL;
 }

+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ */
+static DEFINE_PER_CPU(int, last_ino);
+
+static int last_ino_get(void)
+{
+	static atomic_t shared_last_ino;
+	int *p = &get_cpu_var(last_ino);
+	int res = *p;
+
+	if (unlikely((res & 1023) == 0))
+		res = atomic_add_return(1024, &shared_last_ino) - 1024;
+
+	*p = ++res;
+	put_cpu_var(last_ino);
+	return res;
+}
+#else
+static int last_ino_get(void)
+{
+	static int last_ino;
+
+	return ++last_ino;
+}
+#endif
+
 /**
  * new_inode - obtain an inode
  * @sb: superblock
@@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
-	static unsigned int last_ino;
 	struct inode * inode;

 	spin_lock_prefetch(&inode_lock);

@@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
 	inode = alloc_inode(sb);
 	if (inode) {
 		percpu_counter_inc(&nr_inodes);
+		inode->i_state = 0;
+		inode->i_ino = last_ino_get();
 		spin_lock(&inode_lock);
 		list_add(&inode->i_list, &inode_in_use);
 		list_add(&inode->i_sb_list, &sb->s_inodes);
-		inode->i_ino = ++last_ino;
-		inode->i_state = 0;
 		spin_unlock(&inode_lock);
 	}
 	return inode;

^ permalink raw reply related	[flat|nested] 191+ messages in thread
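The allocator's arithmetic can be checked with a userland model. This is hypothetical illustration code, not the patch itself: an explicit `cpu` argument stands in for get_cpu_var()/put_cpu_var(), and a plain variable stands in for the atomic_t. Each CPU hands out numbers from a private range and touches the shared counter only once per RANGE allocations, which also reproduces the range layout Paul observes in review (first CPU gets [1:1024], second [1025:2048], and so on).

```c
#include <assert.h>

#define NR_CPUS 2
#define RANGE   1024

static unsigned int shared_last_ino;	/* stands in for the atomic_t */
static unsigned int last_ino[NR_CPUS];	/* stands in for the per-CPU variable */

static unsigned int last_ino_get(int cpu)
{
	unsigned int res = last_ino[cpu];

	if ((res & (RANGE - 1)) == 0) {
		/* private range exhausted: grab a fresh block of RANGE
		 * numbers; only this path dirties the shared counter
		 * (atomic_add_return() in the real patch) */
		shared_last_ino += RANGE;
		res = shared_last_ino - RANGE;
	}
	last_ino[cpu] = ++res;
	return res;
}
```

Note the refill condition keys on the low bits of the last value handed out, so the shared counter is written exactly once per 1024 allocations per CPU, and wraparound behaviour is unchanged from the old global `++last_ino`.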
* Re: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator 2008-12-11 22:39 ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet @ 2007-07-24 1:34 ` Nick Piggin 2008-12-16 21:26 ` Paul E. McKenney 1 sibling, 0 replies; 191+ messages in thread From: Nick Piggin @ 2007-07-24 1:34 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney On Friday 12 December 2008 09:39, Eric Dumazet wrote: > new_inode() dirties a contended cache line to get increasing > inode numbers. > > Solve this problem by providing to each cpu a per_cpu variable, > feeded by the shared last_ino, but once every 1024 allocations. > > This reduce contention on the shared last_ino, and give same > spreading ino numbers than before. > (same wraparound after 2^32 allocations) I don't suppose this would cause any filesystems to do silly things? Seems like a good idea, if you could just add a #define instead of 1024. > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > fs/inode.c | 35 ++++++++++++++++++++++++++++++++--- > 1 files changed, 32 insertions(+), 3 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index f94f889..dc8e72a 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -556,6 +556,36 @@ repeat: > return node ? inode : NULL; > } > > +#ifdef CONFIG_SMP > +/* > + * Each cpu owns a range of 1024 numbers. > + * 'shared_last_ino' is dirtied only once out of 1024 allocations, > + * to renew the exhausted range. 
> + */ > +static DEFINE_PER_CPU(int, last_ino); > + > +static int last_ino_get(void) > +{ > + static atomic_t shared_last_ino; > + int *p = &get_cpu_var(last_ino); > + int res = *p; > + > + if (unlikely((res & 1023) == 0)) > + res = atomic_add_return(1024, &shared_last_ino) - 1024; > + > + *p = ++res; > + put_cpu_var(last_ino); > + return res; > +} > +#else > +static int last_ino_get(void) > +{ > + static int last_ino; > + > + return ++last_ino; > +} > +#endif > + > /** > * new_inode - obtain an inode > * @sb: superblock > @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb) > * error if st_ino won't fit in target struct field. Use 32bit counter > * here to attempt to avoid that. > */ > - static unsigned int last_ino; > struct inode * inode; > > spin_lock_prefetch(&inode_lock); > @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb) > inode = alloc_inode(sb); > if (inode) { > percpu_counter_inc(&nr_inodes); > + inode->i_state = 0; > + inode->i_ino = last_ino_get(); > spin_lock(&inode_lock); > list_add(&inode->i_list, &inode_in_use); > list_add(&inode->i_sb_list, &sb->s_inodes); > - inode->i_ino = ++last_ino; > - inode->i_state = 0; > spin_unlock(&inode_lock); > } > return inode; ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator 2008-12-11 22:39 ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet 2007-07-24 1:34 ` Nick Piggin @ 2008-12-16 21:26 ` Paul E. McKenney 1 sibling, 0 replies; 191+ messages in thread From: Paul E. McKenney @ 2008-12-16 21:26 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro On Thu, Dec 11, 2008 at 11:39:18PM +0100, Eric Dumazet wrote: > new_inode() dirties a contended cache line to get increasing > inode numbers. > > Solve this problem by providing to each cpu a per_cpu variable, > feeded by the shared last_ino, but once every 1024 allocations. > > This reduce contention on the shared last_ino, and give same > spreading ino numbers than before. > (same wraparound after 2^32 allocations) One question below, but just a clarification. Works correctly as is, though a bit strangely. Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > fs/inode.c | 35 ++++++++++++++++++++++++++++++++--- > 1 files changed, 32 insertions(+), 3 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index f94f889..dc8e72a 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -556,6 +556,36 @@ repeat: > return node ? inode : NULL; > } > > +#ifdef CONFIG_SMP > +/* > + * Each cpu owns a range of 1024 numbers. > + * 'shared_last_ino' is dirtied only once out of 1024 allocations, > + * to renew the exhausted range. 
> + */ > +static DEFINE_PER_CPU(int, last_ino); > + > +static int last_ino_get(void) > +{ > + static atomic_t shared_last_ino; > + int *p = &get_cpu_var(last_ino); > + int res = *p; > + > + if (unlikely((res & 1023) == 0)) > + res = atomic_add_return(1024, &shared_last_ino) - 1024; > + > + *p = ++res; So the first CPU gets the range [1:1024], the second [1025:2048], and so on, eventually wrapping to [4294966273:0]. Is that the intent? (I don't see a problem with this, just seems a bit strange.) > + put_cpu_var(last_ino); > + return res; > +} > +#else > +static int last_ino_get(void) > +{ > + static int last_ino; > + > + return ++last_ino; > +} > +#endif > + > /** > * new_inode - obtain an inode > * @sb: superblock > @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb) > * error if st_ino won't fit in target struct field. Use 32bit counter > * here to attempt to avoid that. > */ > - static unsigned int last_ino; > struct inode * inode; > > spin_lock_prefetch(&inode_lock); > @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb) > inode = alloc_inode(sb); > if (inode) { > percpu_counter_inc(&nr_inodes); > + inode->i_state = 0; > + inode->i_ino = last_ino_get(); > spin_lock(&inode_lock); > list_add(&inode->i_list, &inode_in_use); > list_add(&inode->i_sb_list, &sb->s_inodes); > - inode->i_ino = ++last_ino; > - inode->i_state = 0; > spin_unlock(&inode_lock); > } > return inode; ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd
  2008-11-29  8:43 ` [PATCH v2 0/5] " Eric Dumazet
                     ` (3 preceding siblings ...)
  2008-12-11 22:39 ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
@ 2008-12-11 22:39 ` Eric Dumazet
  2008-12-16 21:40   ` Paul E. McKenney
  2008-12-11 22:40 ` [PATCH v3 5/7] fs: new_inode_single() and iput_single() Eric Dumazet
                     ` (2 subsequent siblings)

  7 siblings, 1 reply; 191+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
    linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List,
    Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter,
    linux-fsdevel, Al Viro, Paul E. McKenney

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries is expensive, because we
currently take dcache_lock inside d_alloc(), d_instantiate(), and dput().
This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock() for such dentries.

Differences between a SINGLE dentry and a normal one are:

1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
   This avoids taking a reference on the sb 'root' dentry, shared
   by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

("socketallocbench -n 8" bench result : from 25s to 19.9s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |   16 ------------
 fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
 fs/pipe.c              |   23 +----------------
 include/linux/dcache.h |    9 ++++++
 net/socket.c           |   24 +-----------------
 5 files changed, 65 insertions(+), 58 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..8bf83cb 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
 			     mnt);
 }

-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * We faked vfs to believe the dentry was hashed when we created it.
-	 * Now we restore the flag so that dput() will work correctly.
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 1;
-}
-
 static struct file_system_type anon_inode_fs_type = {
 	.name		= "anon_inodefs",
 	.get_sb		= anon_inodefs_get_sb,
 	.kill_sb	= kill_anon_super,
 };
 static struct dentry_operations anon_inodefs_dentry_operations = {
-	.d_delete	= anon_inodefs_delete_dentry,
 };

 /**
@@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_single(&this, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;

@@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);

 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);

 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index f463a81..af3bfb3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
  */

 /*
+ * special version of dput() for pipes/sockets/anon.
+ * These dentries are not present in hash table, we can avoid
+ * taking/dirtying dcache_lock
+ */
+static void dput_single(struct dentry *dentry)
+{
+	struct inode *inode;
+
+	if (!atomic_dec_and_test(&dentry->d_count))
+		return;
+	inode = dentry->d_inode;
+	if (inode)
+		iput(inode);
+	d_free(dentry);
+}
+
+/*
  * dput - release a dentry
  * @dentry: dentry to release
  *
@@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
 {
 	if (!dentry)
 		return;
+	/*
+	 * single dentries (sockets/pipes/anon) fast path
+	 */
+	if (dentry->d_flags & DCACHE_SINGLE)
+		return dput_single(dentry);

 repeat:
 	if (atomic_read(&dentry->d_count) == 1)
@@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }

+/**
+ * d_alloc_single - allocate SINGLE dentry
+ * @name: dentry name, given in a qstr structure
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate an SINGLE dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory.
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - their d_alias list is empty
+ */
+struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
+{
+	struct dentry *entry;
+
+	entry = d_alloc(NULL, name);
+	if (entry) {
+		entry->d_sb = inode->i_sb;
+		entry->d_parent = entry;
+		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
+		entry->d_inode = inode;
+		fsnotify_d_instantiate(entry, inode);
+		security_d_instantiate(entry, inode);
+	}
+	return entry;
+}
+
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4de6dd5 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
 }

 static struct vfsmount *pipe_mnt __read_mostly;
-static int pipefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}

 /*
  * pipefs_dname() is called from d_path().
@@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
 }

 static struct dentry_operations pipefs_dentry_operations = {
-	.d_delete	= pipefs_delete_dentry,
 	.d_dname	= pipefs_dname,
 };

@@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };

 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
 		goto err;

 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, inode);
 	if (!dentry)
 		goto err_inode;

 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);

 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..ca8d269 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -176,6 +176,14 @@ d_iput:	no		no		no       yes
 #define DCACHE_UNHASHED			0x0010
 #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
+#define DCACHE_SINGLE			0x0040
+	/*
+	 * socket, pipe or anonymous fd dentry
+	 * - SINGLE dentries have themselves as a parent.
+	 * - SINGLE dentries are not hashed into global hash table
+	 * - Their d_alias list is empty
+	 * - They dont need dcache_lock synchronization
+	 */

 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
@@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
 extern void shrink_dcache_parent(struct dentry *);
 extern void shrink_dcache_for_umount(struct super_block *);
 extern int d_invalidate(struct dentry *);
+extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);

 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
diff --git a/net/socket.c b/net/socket.c
index 92764d8..353c928 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -308,18 +308,6 @@ static struct file_system_type sock_fs_type = {
 	.kill_sb =	kill_anon_super,
 };

-static int sockfs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
-
 /*
  * sockfs_dname() is called from d_path().
  */
@@ -330,7 +318,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
 }

 static struct dentry_operations sockfs_dentry_operations = {
-	.d_delete = sockfs_delete_dentry,
 	.d_dname = sockfs_dname,
 };

@@ -372,20 +359,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };

-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;

 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));

 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 191+ messages in thread
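The dput() fast path this patch adds can be modelled in userland. All names below are illustrative, not the kernel's: the point is only that an object flagged as "single" was never published in any shared lookup structure, so releasing it needs nothing more than a local reference-count decrement — the global-lock path is reserved for objects someone else might find.

```c
#include <assert.h>
#include <stdlib.h>

enum { OBJ_SINGLE = 0x40 };	/* stands in for DCACHE_SINGLE */

struct obj {
	int refcount;		/* stands in for d_count */
	int flags;
};

static int global_lock_takes;	/* counts dcache_lock-style acquisitions */

/* Slow path: the shared-structure case must serialize on the global
 * lock, as atomic_dec_and_lock(&dentry->d_count, &dcache_lock) does. */
static void put_slow(struct obj *o)
{
	global_lock_takes++;
	if (--o->refcount == 0)
		free(o);
}

static void put(struct obj *o)
{
	if (o->flags & OBJ_SINGLE) {
		/* unhashed, parentless object: nobody else can look it
		 * up, so a plain decrement is enough — no global lock */
		if (--o->refcount == 0)
			free(o);
		return;
	}
	put_slow(o);
}
```

The model makes the benchmark result plausible: for sockets and pipes, every release skips the contended lock entirely, which is where the 25s to 19.9s improvement on "socketallocbench -n 8" comes from.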
* Re: [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd 2008-12-11 22:39 ` [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet @ 2008-12-16 21:40 ` Paul E. McKenney 0 siblings, 0 replies; 191+ messages in thread From: Paul E. McKenney @ 2008-12-16 21:40 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro On Thu, Dec 11, 2008 at 11:39:38PM +0100, Eric Dumazet wrote: > Sockets, pipes and anonymous fds have interesting properties. > > Like other files, they use a dentry and an inode. > > But dentries for these kind of files are not hashed into dcache, > since there is no way someone can lookup such a file in the vfs tree. > (/proc/{pid}/fd/{number} uses a different mechanism) > > Still, allocating and freeing such dentries are expensive processes, > because we currently take dcache_lock inside d_alloc(), d_instantiate(), > and dput(). This lock is very contended on SMP machines. > > This patch defines a new DCACHE_SINGLE flag, to mark a dentry as > a single one (for sockets, pipes, anonymous fd), and a new > d_alloc_single(const struct qstr *name, struct inode *inode) > method, called by the three subsystems. > > Internally, dput() can take a fast path to dput_single() for > SINGLE dentries. No more atomic_dec_and_lock() > for such dentries. > > > Differences betwen an SINGLE dentry and a normal one are : > > 1) SINGLE dentry has the DCACHE_SINGLE flag > 2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED) > This to avoid taking a reference on sb 'root' dentry, shared > by too many dentries. > 3) They are not hashed into global hash table (DCACHE_UNHASHED) > 4) Their d_alias list is empty > > ("socketallocbench -n 8" bench result : from 25s to 19.9s) Acked-by: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > fs/anon_inodes.c | 16 ------------ > fs/dcache.c | 51 +++++++++++++++++++++++++++++++++++++++ > fs/pipe.c | 23 +---------------- > include/linux/dcache.h | 9 ++++++ > net/socket.c | 24 +----------------- > 5 files changed, 65 insertions(+), 58 deletions(-) > > diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c > index 3662dd4..8bf83cb 100644 > --- a/fs/anon_inodes.c > +++ b/fs/anon_inodes.c > @@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags, > mnt); > } > > -static int anon_inodefs_delete_dentry(struct dentry *dentry) > -{ > - /* > - * We faked vfs to believe the dentry was hashed when we created it. > - * Now we restore the flag so that dput() will work correctly. > - */ > - dentry->d_flags |= DCACHE_UNHASHED; > - return 1; > -} > - > static struct file_system_type anon_inode_fs_type = { > .name = "anon_inodefs", > .get_sb = anon_inodefs_get_sb, > .kill_sb = kill_anon_super, > }; > static struct dentry_operations anon_inodefs_dentry_operations = { > - .d_delete = anon_inodefs_delete_dentry, > }; > > /** > @@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops, > this.name = name; > this.len = strlen(name); > this.hash = 0; > - dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this); > + dentry = d_alloc_single(&this, anon_inode_inode); > if (!dentry) > goto err_put_unused_fd; > > @@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops, > atomic_inc(&anon_inode_inode->i_count); > > dentry->d_op = &anon_inodefs_dentry_operations; > - /* Do not publish this dentry inside the global dentry hash table */ > - dentry->d_flags &= ~DCACHE_UNHASHED; > - d_instantiate(dentry, anon_inode_inode); > > error = -ENFILE; > file = alloc_file(anon_inode_mnt, dentry, > diff --git a/fs/dcache.c b/fs/dcache.c > index f463a81..af3bfb3 100644 > --- a/fs/dcache.c > +++ 
b/fs/dcache.c > @@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry) > */ > > /* > + * special version of dput() for pipes/sockets/anon. > + * These dentries are not present in hash table, we can avoid > + * taking/dirtying dcache_lock > + */ > +static void dput_single(struct dentry *dentry) > +{ > + struct inode *inode; > + > + if (!atomic_dec_and_test(&dentry->d_count)) > + return; > + inode = dentry->d_inode; > + if (inode) > + iput(inode); > + d_free(dentry); > +} > + > +/* > * dput - release a dentry > * @dentry: dentry to release > * > @@ -234,6 +251,11 @@ void dput(struct dentry *dentry) > { > if (!dentry) > return; > + /* > + * single dentries (sockets/pipes/anon) fast path > + */ > + if (dentry->d_flags & DCACHE_SINGLE) > + return dput_single(dentry); > > repeat: > if (atomic_read(&dentry->d_count) == 1) > @@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode) > return res; > } > > +/** > + * d_alloc_single - allocate SINGLE dentry > + * @name: dentry name, given in a qstr structure > + * @inode: inode to allocate the dentry for > + * > + * Allocate an SINGLE dentry for the inode given. The inode is > + * instantiated and returned. %NULL is returned if there is insufficient > + * memory. > + * - SINGLE dentries have themselves as a parent. 
> + * - SINGLE dentries are not hashed into global hash table > + * - their d_alias list is empty > + */ > +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode) > +{ > + struct dentry *entry; > + > + entry = d_alloc(NULL, name); > + if (entry) { > + entry->d_sb = inode->i_sb; > + entry->d_parent = entry; > + entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED; > + entry->d_inode = inode; > + fsnotify_d_instantiate(entry, inode); > + security_d_instantiate(entry, inode); > + } > + return entry; > +} > + > + > static inline struct hlist_head *d_hash(struct dentry *parent, > unsigned long hash) > { > diff --git a/fs/pipe.c b/fs/pipe.c > index 7aea8b8..4de6dd5 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode) > } > > static struct vfsmount *pipe_mnt __read_mostly; > -static int pipefs_delete_dentry(struct dentry *dentry) > -{ > - /* > - * At creation time, we pretended this dentry was hashed > - * (by clearing DCACHE_UNHASHED bit in d_flags) > - * At delete time, we restore the truth : not hashed. > - * (so that dput() can proceed correctly) > - */ > - dentry->d_flags |= DCACHE_UNHASHED; > - return 0; > -} > > /* > * pipefs_dname() is called from d_path(). 
> @@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen) > } > > static struct dentry_operations pipefs_dentry_operations = { > - .d_delete = pipefs_delete_dentry, > .d_dname = pipefs_dname, > }; > > @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags) > struct inode *inode; > struct file *f; > struct dentry *dentry; > - struct qstr name = { .name = "" }; > + static const struct qstr name = { .name = "" }; > > err = -ENFILE; > inode = get_pipe_inode(); > @@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags) > goto err; > > err = -ENOMEM; > - dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name); > + dentry = d_alloc_single(&name, inode); > if (!dentry) > goto err_inode; > > dentry->d_op = &pipefs_dentry_operations; > - /* > - * We dont want to publish this dentry into global dentry hash table. > - * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED > - * This permits a working /proc/$pid/fd/XXX on pipes > - */ > - dentry->d_flags &= ~DCACHE_UNHASHED; > - d_instantiate(dentry, inode); > > err = -ENFILE; > f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops); > diff --git a/include/linux/dcache.h b/include/linux/dcache.h > index a37359d..ca8d269 100644 > --- a/include/linux/dcache.h > +++ b/include/linux/dcache.h > @@ -176,6 +176,14 @@ d_iput: no no no yes > #define DCACHE_UNHASHED 0x0010 > > #define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched */ > +#define DCACHE_SINGLE 0x0040 > + /* > + * socket, pipe or anonymous fd dentry > + * - SINGLE dentries have themselves as a parent. 
> + * - SINGLE dentries are not hashed into global hash table > + * - Their d_alias list is empty > + * - They dont need dcache_lock synchronization > + */ > > extern spinlock_t dcache_lock; > extern seqlock_t rename_lock; > @@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *); > extern void shrink_dcache_parent(struct dentry *); > extern void shrink_dcache_for_umount(struct super_block *); > extern int d_invalidate(struct dentry *); > +extern struct dentry *d_alloc_single(const struct qstr *, struct inode *); > > /* only used at mount-time */ > extern struct dentry * d_alloc_root(struct inode *); > diff --git a/net/socket.c b/net/socket.c > index 92764d8..353c928 100644 > --- a/net/socket.c > +++ b/net/socket.c > @@ -308,18 +308,6 @@ static struct file_system_type sock_fs_type = { > .kill_sb = kill_anon_super, > }; > > -static int sockfs_delete_dentry(struct dentry *dentry) > -{ > - /* > - * At creation time, we pretended this dentry was hashed > - * (by clearing DCACHE_UNHASHED bit in d_flags) > - * At delete time, we restore the truth : not hashed. > - * (so that dput() can proceed correctly) > - */ > - dentry->d_flags |= DCACHE_UNHASHED; > - return 0; > -} > - > /* > * sockfs_dname() is called from d_path(). 
> */ > @@ -330,7 +318,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen) > } > > static struct dentry_operations sockfs_dentry_operations = { > - .d_delete = sockfs_delete_dentry, > .d_dname = sockfs_dname, > }; > > @@ -372,20 +359,13 @@ static int sock_alloc_fd(struct file **filep, int flags) > static int sock_attach_fd(struct socket *sock, struct file *file, int flags) > { > struct dentry *dentry; > - struct qstr name = { .name = "" }; > + static const struct qstr name = { .name = "" }; > > - dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name); > + dentry = d_alloc_single(&name, SOCK_INODE(sock)); > if (unlikely(!dentry)) > return -ENOMEM; > > dentry->d_op = &sockfs_dentry_operations; > - /* > - * We dont want to push this dentry into global dentry hash table. > - * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED > - * This permits a working /proc/$pid/fd/XXX on sockets > - */ > - dentry->d_flags &= ~DCACHE_UNHASHED; > - d_instantiate(dentry, SOCK_INODE(sock)); > > sock->file = file; > init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE, > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH v3 5/7] fs: new_inode_single() and iput_single() 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet ` (4 preceding siblings ...) 2008-12-11 22:39 ` [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet @ 2008-12-11 22:40 ` Eric Dumazet 2008-12-16 21:41 ` Paul E. McKenney 2008-12-11 22:40 ` [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU Eric Dumazet 2008-12-11 22:41 ` [PATCH v3 7/7] fs: MS_NOREFCOUNT Eric Dumazet 7 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-12-11 22:40 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org, Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney The goal of this patch is to not touch inode_lock for socket/pipe/anonfd inode allocation and freeing. SINGLE dentries are attached to inodes that don't need to be linked into a list of inodes ("inode_in_use" or "sb->s_inodes"). As inode_lock was taken only to protect these lists, we avoid taking it as well. Using iput_single() from dput_single() avoids taking inode_lock at freeing time. 
This patch has a very noticeable effect, because we avoid dirtying of three contended cache lines in new_inode(), and five cache lines in iput() ("socketallocbench -n 8" result : from 19.9s to 3.01s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/anon_inodes.c | 2 +- fs/dcache.c | 2 +- fs/inode.c | 29 ++++++++++++++++++++--------- fs/pipe.c | 2 +- include/linux/fs.h | 12 +++++++++++- net/socket.c | 2 +- 6 files changed, 35 insertions(+), 14 deletions(-) diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 8bf83cb..89fd36d 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd); */ static struct inode *anon_inode_mkinode(void) { - struct inode *inode = new_inode(anon_inode_mnt->mnt_sb); + struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb); if (!inode) return ERR_PTR(-ENOMEM); diff --git a/fs/dcache.c b/fs/dcache.c index af3bfb3..3363853 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry) return; inode = dentry->d_inode; if (inode) - iput(inode); + iput_single(inode); d_free(dentry); } diff --git a/fs/inode.c b/fs/inode.c index dc8e72a..0fdfe1b 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode) kmem_cache_free(inode_cachep, (inode)); } +void iput_single(struct inode *inode) +{ + if (atomic_dec_and_test(&inode->i_count)) { + destroy_inode(inode); + percpu_counter_dec(&nr_inodes); + } +} /* * These are initializations that only need to be done @@ -587,8 +594,9 @@ static int last_ino_get(void) #endif /** - * new_inode - obtain an inode + * __new_inode - obtain an inode * @sb: superblock + * @single: if true, dont link new inode in a list * * Allocates a new inode for given superblock. The default gfp_mask * for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE. 
@@ -598,7 +606,7 @@ static int last_ino_get(void) * newly created inode's mapping * */ -struct inode *new_inode(struct super_block *sb) +struct inode *__new_inode(struct super_block *sb, int single) { /* * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW @@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb) */ struct inode * inode; - spin_lock_prefetch(&inode_lock); - inode = alloc_inode(sb); if (inode) { percpu_counter_inc(&nr_inodes); inode->i_state = 0; inode->i_ino = last_ino_get(); - spin_lock(&inode_lock); - list_add(&inode->i_list, &inode_in_use); - list_add(&inode->i_sb_list, &sb->s_inodes); - spin_unlock(&inode_lock); + if (single) { + INIT_LIST_HEAD(&inode->i_list); + INIT_LIST_HEAD(&inode->i_sb_list); + } else { + spin_lock(&inode_lock); + list_add(&inode->i_list, &inode_in_use); + list_add(&inode->i_sb_list, &sb->s_inodes); + spin_unlock(&inode_lock); + } } return inode; } -EXPORT_SYMBOL(new_inode); +EXPORT_SYMBOL(__new_inode); void unlock_new_inode(struct inode *inode) { diff --git a/fs/pipe.c b/fs/pipe.c index 4de6dd5..8c51a0d 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = { static struct inode * get_pipe_inode(void) { - struct inode *inode = new_inode(pipe_mnt->mnt_sb); + struct inode *inode = new_inode_single(pipe_mnt->mnt_sb); struct pipe_inode_info *pipe; if (!inode) diff --git a/include/linux/fs.h b/include/linux/fs.h index a789346..a702d81 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1899,7 +1899,17 @@ extern void __iget(struct inode * inode); extern void iget_failed(struct inode *); extern void clear_inode(struct inode *); extern void destroy_inode(struct inode *); -extern struct inode *new_inode(struct super_block *); +extern struct inode *__new_inode(struct super_block *, int); +static inline struct inode *new_inode(struct super_block *sb) +{ + return __new_inode(sb, 0); +} +static inline struct inode 
*new_inode_single(struct super_block *sb) +{ + return __new_inode(sb, 1); +} +extern void iput_single(struct inode *); + extern int should_remove_suid(struct dentry *); extern int file_remove_suid(struct file *); diff --git a/net/socket.c b/net/socket.c index 353c928..4017409 100644 --- a/net/socket.c +++ b/net/socket.c @@ -464,7 +464,7 @@ static struct socket *sock_alloc(void) struct inode *inode; struct socket *sock; - inode = new_inode(sock_mnt->mnt_sb); + inode = new_inode_single(sock_mnt->mnt_sb); if (!inode) return NULL; ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH v3 5/7] fs: new_inode_single() and iput_single() 2008-12-11 22:40 ` [PATCH v3 5/7] fs: new_inode_single() and iput_single() Eric Dumazet @ 2008-12-16 21:41 ` Paul E. McKenney 0 siblings, 0 replies; 191+ messages in thread From: Paul E. McKenney @ 2008-12-16 21:41 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro On Thu, Dec 11, 2008 at 11:40:07PM +0100, Eric Dumazet wrote: > Goal of this patch is to not touch inode_lock for socket/pipes/anonfd > inodes allocation/freeing. > > SINGLE dentries are attached to inodes that dont need to be linked > in a list of inodes, being "inode_in_use" or "sb->s_inodes" > As inode_lock was taken only to protect these lists, we avoid taking it > as well. > > Using iput_single() from dput_single() avoids taking inode_lock > at freeing time. > > This patch has a very noticeable effect, because we avoid dirtying of > three contended cache lines in new_inode(), and five cache lines in iput() > > ("socketallocbench -n 8" result : from 19.9s to 3.01s) Nice! Acked-by: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > fs/anon_inodes.c | 2 +- > fs/dcache.c | 2 +- > fs/inode.c | 29 ++++++++++++++++++++--------- > fs/pipe.c | 2 +- > include/linux/fs.h | 12 +++++++++++- > net/socket.c | 2 +- > 6 files changed, 35 insertions(+), 14 deletions(-) > > diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c > index 8bf83cb..89fd36d 100644 > --- a/fs/anon_inodes.c > +++ b/fs/anon_inodes.c > @@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd); > */ > static struct inode *anon_inode_mkinode(void) > { > - struct inode *inode = new_inode(anon_inode_mnt->mnt_sb); > + struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb); > > if (!inode) > return ERR_PTR(-ENOMEM); > diff --git a/fs/dcache.c b/fs/dcache.c > index af3bfb3..3363853 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry) > return; > inode = dentry->d_inode; > if (inode) > - iput(inode); > + iput_single(inode); > d_free(dentry); > } > > diff --git a/fs/inode.c b/fs/inode.c > index dc8e72a..0fdfe1b 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode) > kmem_cache_free(inode_cachep, (inode)); > } > > +void iput_single(struct inode *inode) > +{ > + if (atomic_dec_and_test(&inode->i_count)) { > + destroy_inode(inode); > + percpu_counter_dec(&nr_inodes); > + } > +} > > /* > * These are initializations that only need to be done > @@ -587,8 +594,9 @@ static int last_ino_get(void) > #endif > > /** > - * new_inode - obtain an inode > + * __new_inode - obtain an inode > * @sb: superblock > + * @single: if true, dont link new inode in a list > * > * Allocates a new inode for given superblock. The default gfp_mask > * for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE. 
> @@ -598,7 +606,7 @@ static int last_ino_get(void) > * newly created inode's mapping > * > */ > -struct inode *new_inode(struct super_block *sb) > +struct inode *__new_inode(struct super_block *sb, int single) > { > /* > * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW > @@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb) > */ > struct inode * inode; > > - spin_lock_prefetch(&inode_lock); > - > inode = alloc_inode(sb); > if (inode) { > percpu_counter_inc(&nr_inodes); > inode->i_state = 0; > inode->i_ino = last_ino_get(); > - spin_lock(&inode_lock); > - list_add(&inode->i_list, &inode_in_use); > - list_add(&inode->i_sb_list, &sb->s_inodes); > - spin_unlock(&inode_lock); > + if (single) { > + INIT_LIST_HEAD(&inode->i_list); > + INIT_LIST_HEAD(&inode->i_sb_list); > + } else { > + spin_lock(&inode_lock); > + list_add(&inode->i_list, &inode_in_use); > + list_add(&inode->i_sb_list, &sb->s_inodes); > + spin_unlock(&inode_lock); > + } > } > return inode; > } > > -EXPORT_SYMBOL(new_inode); > +EXPORT_SYMBOL(__new_inode); > > void unlock_new_inode(struct inode *inode) > { > diff --git a/fs/pipe.c b/fs/pipe.c > index 4de6dd5..8c51a0d 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = { > > static struct inode * get_pipe_inode(void) > { > - struct inode *inode = new_inode(pipe_mnt->mnt_sb); > + struct inode *inode = new_inode_single(pipe_mnt->mnt_sb); > struct pipe_inode_info *pipe; > > if (!inode) > diff --git a/include/linux/fs.h b/include/linux/fs.h > index a789346..a702d81 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1899,7 +1899,17 @@ extern void __iget(struct inode * inode); > extern void iget_failed(struct inode *); > extern void clear_inode(struct inode *); > extern void destroy_inode(struct inode *); > -extern struct inode *new_inode(struct super_block *); > +extern struct inode *__new_inode(struct super_block *, int); > +static inline 
struct inode *new_inode(struct super_block *sb) > +{ > + return __new_inode(sb, 0); > +} > +static inline struct inode *new_inode_single(struct super_block *sb) > +{ > + return __new_inode(sb, 1); > +} > +extern void iput_single(struct inode *); > + > extern int should_remove_suid(struct dentry *); > extern int file_remove_suid(struct file *); > > diff --git a/net/socket.c b/net/socket.c > index 353c928..4017409 100644 > --- a/net/socket.c > +++ b/net/socket.c > @@ -464,7 +464,7 @@ static struct socket *sock_alloc(void) > struct inode *inode; > struct socket *sock; > > - inode = new_inode(sock_mnt->mnt_sb); > + inode = new_inode_single(sock_mnt->mnt_sb); > if (!inode) > return NULL; > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet ` (5 preceding siblings ...) 2008-12-11 22:40 ` [PATCH v3 5/7] fs: new_inode_single() and iput_single() Eric Dumazet @ 2008-12-11 22:40 ` Eric Dumazet 2007-07-24 1:13 ` Nick Piggin 2008-12-11 22:41 ` [PATCH v3 7/7] fs: MS_NOREFCOUNT Eric Dumazet 7 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-12-11 22:40 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney From: Christoph Lameter <cl@linux-foundation.org> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU Currently we schedule RCU frees for each file we free separately. That has several drawbacks against the earlier file handling (in 2.6.5 f.e.), which did not require RCU callbacks: 1. Excessive number of RCU callbacks can be generated causing long RCU queues that in turn cause long latencies. We hit SLUB page allocation more often than necessary. 2. The cache hot object is not preserved between free and realloc. A close followed by another open is very fast with the RCUless approach because the last freed object is returned by the slab allocator that is still cache hot. RCU free means that the object is not immediately available again. The new object is cache cold and therefore open/close performance tests show a significant degradation with the RCU implementation. One solution to this problem is to move the RCU freeing into the Slab allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation time. The slab allocator will do RCU frees only when it is necessary to dispose of slabs of objects (rare). So with that approach we can cut out the RCU overhead significantly. 
However, the slab allocator may return the object for another use even before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means there is the (unlikely) possibility that the object is going to be switched under us in sections protected by rcu_read_lock() and rcu_read_unlock(). So we need to verify that we have acquired the correct object after establishing a stable object reference (incrementing the refcounter does that). Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- Documentation/filesystems/files.txt | 21 ++++++++++++++-- fs/file_table.c | 33 ++++++++++++++++++-------- include/linux/fs.h | 5 --- 3 files changed, 42 insertions(+), 17 deletions(-) diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644 --- a/Documentation/filesystems/files.txt +++ b/Documentation/filesystems/files.txt @@ -78,13 +78,28 @@ the fdtable structure - that look-up may race with the last put() operation on the file structure. This is avoided using atomic_long_inc_not_zero() on ->f_count : + As file structures are allocated with SLAB_DESTROY_BY_RCU, + they can also be freed before a RCU grace period, and reused, + but still as a struct file. + It is necessary to check again after getting + a stable reference (ie after atomic_long_inc_not_zero()), + that fcheck_files(files, fd) points to the same file. rcu_read_lock(); file = fcheck_files(files, fd); if (file) { - if (atomic_long_inc_not_zero(&file->f_count)) + if (atomic_long_inc_not_zero(&file->f_count)) { *fput_needed = 1; - else + /* + * Now we have a stable reference to an object. + * Check if other threads freed file and reallocated it. 
+ */ + if (file != fcheck_files(files, fd)) { + *fput_needed = 0; + put_filp(file); + file = NULL; + } + } else /* Didn't get the reference, someone's freed */ file = NULL; } @@ -95,6 +110,8 @@ the fdtable structure - atomic_long_inc_not_zero() detects if refcounts is already zero or goes to zero during increment. If it does, we fail fget()/fget_light(). + The second call to fcheck_files(files, fd) checks that this filp + was not freed, then reused by an other thread. 6. Since both fdtable and file structures can be looked up lock-free, they must be installed using rcu_assign_pointer() diff --git a/fs/file_table.c b/fs/file_table.c index a46e880..3e9259d 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly; static struct percpu_counter nr_files __cacheline_aligned_in_smp; -static inline void file_free_rcu(struct rcu_head *head) -{ - struct file *f = container_of(head, struct file, f_u.fu_rcuhead); - kmem_cache_free(filp_cachep, f); -} - static inline void file_free(struct file *f) { percpu_counter_dec(&nr_files); file_check_state(f); - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); + kmem_cache_free(filp_cachep, f); } /* @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd) rcu_read_unlock(); return NULL; } + /* + * Now we have a stable reference to an object. + * Check if other threads freed file and re-allocated it. + */ + if (unlikely(file != fcheck_files(files, fd))) { + put_filp(file); + file = NULL; + } } rcu_read_unlock(); @@ -333,9 +335,19 @@ struct file *fget_light(unsigned int fd, int *fput_needed) rcu_read_lock(); file = fcheck_files(files, fd); if (file) { - if (atomic_long_inc_not_zero(&file->f_count)) + if (atomic_long_inc_not_zero(&file->f_count)) { *fput_needed = 1; - else + /* + * Now we have a stable reference to an object. + * Check if other threads freed this file and + * re-allocated it. 
+ */ + if (unlikely(file != fcheck_files(files, fd))) { + *fput_needed = 0; + put_filp(file); + file = NULL; + } + } else /* Didn't get the reference, someone's freed */ file = NULL; } @@ -402,7 +414,8 @@ void __init files_init(unsigned long mempages) int n; filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, - SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL); + SLAB_HWCACHE_ALIGN | SLAB_DESTROY_BY_RCU | SLAB_PANIC, + NULL); /* * One file with associated inode and dcache is very roughly 1K. diff --git a/include/linux/fs.h b/include/linux/fs.h index a702d81..a1f56d4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -811,13 +811,8 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index) #define FILE_MNT_WRITE_RELEASED 2 struct file { - /* - * fu_list becomes invalid after file_free is called and queued via - * fu_rcuhead for RCU freeing - */ union { struct list_head fu_list; - struct rcu_head fu_rcuhead; } f_u; struct path f_path; #define f_dentry f_path.dentry ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2008-12-11 22:40 ` [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU Eric Dumazet @ 2007-07-24 1:13 ` Nick Piggin 2008-12-12 2:50 ` Nick Piggin 2008-12-12 4:45 ` Eric Dumazet 0 siblings, 2 replies; 191+ messages in thread From: Nick Piggin @ 2007-07-24 1:13 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney On Friday 12 December 2008 09:40, Eric Dumazet wrote: > From: Christoph Lameter <cl@linux-foundation.org> > > [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU > > Currently we schedule RCU frees for each file we free separately. That has > several drawbacks against the earlier file handling (in 2.6.5 f.e.), which > did not require RCU callbacks: > > 1. Excessive number of RCU callbacks can be generated causing long RCU > queues that in turn cause long latencies. We hit SLUB page allocation > more often than necessary. > > 2. The cache hot object is not preserved between free and realloc. A close > followed by another open is very fast with the RCUless approach because > the last freed object is returned by the slab allocator that is > still cache hot. RCU free means that the object is not immediately > available again. The new object is cache cold and therefore open/close > performance tests show a significant degradation with the RCU > implementation. > > One solution to this problem is to move the RCU freeing into the Slab > allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation > time. The slab allocator will do RCU frees only when it is necessary > to dispose of slabs of objects (rare). So with that approach we can cut > out the RCU overhead significantly. 
> > However, the slab allocator may return the object for another use even > before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means > there is the (unlikely) possibility that the object is going to be > switched under us in sections protected by rcu_read_lock() and > rcu_read_unlock(). So we need to verify that we have acquired the correct > object after establishing a stable object reference (incrementing the > refcounter does that). > > > Signed-off-by: Christoph Lameter <cl@linux-foundation.org> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > --- > Documentation/filesystems/files.txt | 21 ++++++++++++++-- > fs/file_table.c | 33 ++++++++++++++++++-------- > include/linux/fs.h | 5 --- > 3 files changed, 42 insertions(+), 17 deletions(-) > > diff --git a/Documentation/filesystems/files.txt > b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644 > --- a/Documentation/filesystems/files.txt > +++ b/Documentation/filesystems/files.txt > @@ -78,13 +78,28 @@ the fdtable structure - > that look-up may race with the last put() operation on the > file structure. This is avoided using atomic_long_inc_not_zero() > on ->f_count : > + As file structures are allocated with SLAB_DESTROY_BY_RCU, > + they can also be freed before a RCU grace period, and reused, > + but still as a struct file. > + It is necessary to check again after getting > + a stable reference (ie after atomic_long_inc_not_zero()), > + that fcheck_files(files, fd) points to the same file. > > rcu_read_lock(); > file = fcheck_files(files, fd); > if (file) { > - if (atomic_long_inc_not_zero(&file->f_count)) > + if (atomic_long_inc_not_zero(&file->f_count)) { > *fput_needed = 1; > - else > + /* > + * Now we have a stable reference to an object. > + * Check if other threads freed file and reallocated it. 
> + */ > + if (file != fcheck_files(files, fd)) { > + *fput_needed = 0; > + put_filp(file); > + file = NULL; > + } > + } else > /* Didn't get the reference, someone's freed */ > file = NULL; > } > @@ -95,6 +110,8 @@ the fdtable structure - > atomic_long_inc_not_zero() detects if refcounts is already zero or > goes to zero during increment. If it does, we fail > fget()/fget_light(). > + The second call to fcheck_files(files, fd) checks that this filp > + was not freed, then reused by an other thread. > > 6. Since both fdtable and file structures can be looked up > lock-free, they must be installed using rcu_assign_pointer() > diff --git a/fs/file_table.c b/fs/file_table.c > index a46e880..3e9259d 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly; > > static struct percpu_counter nr_files __cacheline_aligned_in_smp; > > -static inline void file_free_rcu(struct rcu_head *head) > -{ > - struct file *f = container_of(head, struct file, f_u.fu_rcuhead); > - kmem_cache_free(filp_cachep, f); > -} > - > static inline void file_free(struct file *f) > { > percpu_counter_dec(&nr_files); > file_check_state(f); > - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); > + kmem_cache_free(filp_cachep, f); > } > > /* > @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd) > rcu_read_unlock(); > return NULL; > } > + /* > + * Now we have a stable reference to an object. > + * Check if other threads freed file and re-allocated it. > + */ > + if (unlikely(file != fcheck_files(files, fd))) { > + put_filp(file); > + file = NULL; > + } This is a non-trivial change, because that put_filp may drop the last reference to the file. So now we have the case where we free the file from a context in which it had never been allocated. From a quick glance through the callchains, I can't see an obvious problem. 
But it needs to have documentation in put_filp, or at least a mention in the changelog, and it should also be cc'ed to the security lists. Also, it adds code and cost to the get/put path in return for an improvement in the free path. get/put is the more common path, but it is a small loss for a big improvement. So it might be worth it. But it is not justified by your microbenchmark. Do we have a more useful case that it helps? ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2007-07-24 1:13 ` Nick Piggin @ 2008-12-12 2:50 ` Nick Piggin 2008-12-12 4:45 ` Eric Dumazet 1 sibling, 0 replies; 191+ messages in thread From: Nick Piggin @ 2008-12-12 2:50 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney On Tuesday 24 July 2007 11:13, Nick Piggin wrote: > On Friday 12 December 2008 09:40, Eric Dumazet wrote: > > From: Christoph Lameter <cl@linux-foundation.org> > > > > [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU > > > > Currently we schedule RCU frees for each file we free separately. That > > has several drawbacks against the earlier file handling (in 2.6.5 f.e.), > > which did not require RCU callbacks: > > > > 1. Excessive number of RCU callbacks can be generated causing long RCU > > queues that in turn cause long latencies. We hit SLUB page allocation > > more often than necessary. > > > > 2. The cache hot object is not preserved between free and realloc. A > > close followed by another open is very fast with the RCUless approach > > because the last freed object is returned by the slab allocator that is > > still cache hot. RCU free means that the object is not immediately > > available again. The new object is cache cold and therefore open/close > > performance tests show a significant degradation with the RCU > > implementation. > > > > One solution to this problem is to move the RCU freeing into the Slab > > allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation > > time. The slab allocator will do RCU frees only when it is necessary > > to dispose of slabs of objects (rare). So with that approach we can cut > > out the RCU overhead significantly. 
> > > > However, the slab allocator may return the object for another use even > > before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means > > there is the (unlikely) possibility that the object is going to be > > switched under us in sections protected by rcu_read_lock() and > > rcu_read_unlock(). So we need to verify that we have acquired the correct > > object after establishing a stable object reference (incrementing the > > refcounter does that). > > > > > > Signed-off-by: Christoph Lameter <cl@linux-foundation.org> > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > --- > > Documentation/filesystems/files.txt | 21 ++++++++++++++-- > > fs/file_table.c | 33 ++++++++++++++++++-------- > > include/linux/fs.h | 5 --- > > 3 files changed, 42 insertions(+), 17 deletions(-) > > > > diff --git a/Documentation/filesystems/files.txt > > b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644 > > --- a/Documentation/filesystems/files.txt > > +++ b/Documentation/filesystems/files.txt > > @@ -78,13 +78,28 @@ the fdtable structure - > > that look-up may race with the last put() operation on the > > file structure. This is avoided using atomic_long_inc_not_zero() > > on ->f_count : > > + As file structures are allocated with SLAB_DESTROY_BY_RCU, > > + they can also be freed before a RCU grace period, and reused, > > + but still as a struct file. > > + It is necessary to check again after getting > > + a stable reference (ie after atomic_long_inc_not_zero()), > > + that fcheck_files(files, fd) points to the same file. > > > > rcu_read_lock(); > > file = fcheck_files(files, fd); > > if (file) { > > - if (atomic_long_inc_not_zero(&file->f_count)) > > + if (atomic_long_inc_not_zero(&file->f_count)) { > > *fput_needed = 1; > > - else > > + /* > > + * Now we have a stable reference to an object. > > + * Check if other threads freed file and reallocated it. 
> > + */ > > + if (file != fcheck_files(files, fd)) { > > + *fput_needed = 0; > > + put_filp(file); > > + file = NULL; > > + } > > + } else > > /* Didn't get the reference, someone's freed */ > > file = NULL; > > } > > @@ -95,6 +110,8 @@ the fdtable structure - > > atomic_long_inc_not_zero() detects if refcounts is already zero or > > goes to zero during increment. If it does, we fail > > fget()/fget_light(). > > + The second call to fcheck_files(files, fd) checks that this filp > > + was not freed, then reused by an other thread. > > > > 6. Since both fdtable and file structures can be looked up > > lock-free, they must be installed using rcu_assign_pointer() > > diff --git a/fs/file_table.c b/fs/file_table.c > > index a46e880..3e9259d 100644 > > --- a/fs/file_table.c > > +++ b/fs/file_table.c > > @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly; > > > > static struct percpu_counter nr_files __cacheline_aligned_in_smp; > > > > -static inline void file_free_rcu(struct rcu_head *head) > > -{ > > - struct file *f = container_of(head, struct file, f_u.fu_rcuhead); > > - kmem_cache_free(filp_cachep, f); > > -} > > - > > static inline void file_free(struct file *f) > > { > > percpu_counter_dec(&nr_files); > > file_check_state(f); > > - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); > > + kmem_cache_free(filp_cachep, f); > > } > > > > /* > > @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd) > > rcu_read_unlock(); > > return NULL; > > } > > + /* > > + * Now we have a stable reference to an object. > > + * Check if other threads freed file and re-allocated it. > > + */ > > + if (unlikely(file != fcheck_files(files, fd))) { > > + put_filp(file); > > + file = NULL; > > + } > > This is a non-trivial change, because that put_filp may drop the last > reference to the file. So now we have the case where we free the file > from a context in which it had never been allocated. 
> > From a quick glance through the callchains, I can't see an obvious > problem. But it needs to have documentation in put_filp, or at least > a mention in the changelog, and also cc'ed to the security lists. > > Also, it adds code and cost to the get/put path in return for > improvement in the free path. get/put is the more common path, but > it is a small loss for a big improvement. So it might be worth it. But > it is not justified by your microbenchmark. Do we have a more useful > case that it helps? Sorry, my clock screwed up and I didn't notice :( ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2007-07-24 1:13 ` Nick Piggin 2008-12-12 2:50 ` Nick Piggin @ 2008-12-12 4:45 ` Eric Dumazet 2008-12-12 16:48 ` Eric Dumazet 2008-12-13 1:41 ` Christoph Lameter 1 sibling, 2 replies; 191+ messages in thread From: Eric Dumazet @ 2008-12-12 4:45 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney Nick Piggin a écrit : > On Friday 12 December 2008 09:40, Eric Dumazet wrote: >> From: Christoph Lameter <cl@linux-foundation.org> >> >> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU >> >> Currently we schedule RCU frees for each file we free separately. That has >> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which >> did not require RCU callbacks: >> >> 1. Excessive number of RCU callbacks can be generated causing long RCU >> queues that in turn cause long latencies. We hit SLUB page allocation >> more often than necessary. >> >> 2. The cache hot object is not preserved between free and realloc. A close >> followed by another open is very fast with the RCUless approach because >> the last freed object is returned by the slab allocator that is >> still cache hot. RCU free means that the object is not immediately >> available again. The new object is cache cold and therefore open/close >> performance tests show a significant degradation with the RCU >> implementation. >> >> One solution to this problem is to move the RCU freeing into the Slab >> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation >> time. The slab allocator will do RCU frees only when it is necessary >> to dispose of slabs of objects (rare). So with that approach we can cut >> out the RCU overhead significantly. 
>> >> However, the slab allocator may return the object for another use even >> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means >> there is the (unlikely) possibility that the object is going to be >> switched under us in sections protected by rcu_read_lock() and >> rcu_read_unlock(). So we need to verify that we have acquired the correct >> object after establishing a stable object reference (incrementing the >> refcounter does that). >> >> >> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> >> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> >> --- >> Documentation/filesystems/files.txt | 21 ++++++++++++++-- >> fs/file_table.c | 33 ++++++++++++++++++-------- >> include/linux/fs.h | 5 --- >> 3 files changed, 42 insertions(+), 17 deletions(-) >> >> diff --git a/Documentation/filesystems/files.txt >> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644 >> --- a/Documentation/filesystems/files.txt >> +++ b/Documentation/filesystems/files.txt >> @@ -78,13 +78,28 @@ the fdtable structure - >> that look-up may race with the last put() operation on the >> file structure. This is avoided using atomic_long_inc_not_zero() >> on ->f_count : >> + As file structures are allocated with SLAB_DESTROY_BY_RCU, >> + they can also be freed before a RCU grace period, and reused, >> + but still as a struct file. >> + It is necessary to check again after getting >> + a stable reference (ie after atomic_long_inc_not_zero()), >> + that fcheck_files(files, fd) points to the same file. >> >> rcu_read_lock(); >> file = fcheck_files(files, fd); >> if (file) { >> - if (atomic_long_inc_not_zero(&file->f_count)) >> + if (atomic_long_inc_not_zero(&file->f_count)) { >> *fput_needed = 1; >> - else >> + /* >> + * Now we have a stable reference to an object. >> + * Check if other threads freed file and reallocated it. 
>> + */ >> + if (file != fcheck_files(files, fd)) { >> + *fput_needed = 0; >> + put_filp(file); >> + file = NULL; >> + } >> + } else >> /* Didn't get the reference, someone's freed */ >> file = NULL; >> } >> @@ -95,6 +110,8 @@ the fdtable structure - >> atomic_long_inc_not_zero() detects if refcounts is already zero or >> goes to zero during increment. If it does, we fail >> fget()/fget_light(). >> + The second call to fcheck_files(files, fd) checks that this filp >> + was not freed, then reused by an other thread. >> >> 6. Since both fdtable and file structures can be looked up >> lock-free, they must be installed using rcu_assign_pointer() >> diff --git a/fs/file_table.c b/fs/file_table.c >> index a46e880..3e9259d 100644 >> --- a/fs/file_table.c >> +++ b/fs/file_table.c >> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly; >> >> static struct percpu_counter nr_files __cacheline_aligned_in_smp; >> >> -static inline void file_free_rcu(struct rcu_head *head) >> -{ >> - struct file *f = container_of(head, struct file, f_u.fu_rcuhead); >> - kmem_cache_free(filp_cachep, f); >> -} >> - >> static inline void file_free(struct file *f) >> { >> percpu_counter_dec(&nr_files); >> file_check_state(f); >> - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); >> + kmem_cache_free(filp_cachep, f); >> } >> >> /* >> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd) >> rcu_read_unlock(); >> return NULL; >> } >> + /* >> + * Now we have a stable reference to an object. >> + * Check if other threads freed file and re-allocated it. >> + */ >> + if (unlikely(file != fcheck_files(files, fd))) { >> + put_filp(file); >> + file = NULL; >> + } > > This is a non-trivial change, because that put_filp may drop the last > reference to the file. So now we have the case where we free the file > from a context in which it had never been allocated. If we got at this point, we : Found a non NULL pointer in our fd table. 
Then, another thread came and closed the file while we had not yet added our reference. This file was freed (kmem_cache_free(filp_cachep, file)). This file was reused and inserted into another thread's fd table. We added our reference to the refcount. We checked if this file is still ours (in our fd table). We found this file is no longer the file we wanted. Calling put_filp() here is our only choice to safely remove the reference on a truly allocated file. At this point the file is a truly allocated file, but no longer ours. Unfortunately we added a reference to it: we must release it. If the other thread already called put_filp() because it wanted to close its new file, we must see f_count going to zero, and we must call __fput(), to perform all the relevant file cleanup ourselves. > > From a quick glance through the callchains, I can't see an obvious > problem. But it needs to have documentation in put_filp, or at least > a mention in the changelog, and also cc'ed to the security lists. I see your point. But currently, any thread can be "releasing the last reference on a file"; that is not always the thread that called close(fd). We extend this to "any thread of any process", so it might have a security effect; you are absolutely right. > > Also, it adds code and cost to the get/put path in return for > improvement in the free path. get/put is the more common path, but > it is a small loss for a big improvement. So it might be worth it. But > it is not justified by your microbenchmark. Do we have a more useful > case that it helps? Any real world program that opens and closes files, or better said, that closes and opens files :) sizeof(struct file) is 192 bytes. That's three cache lines. Being able to reuse a hot "struct file" avoids three cache line misses. That's about 120 ns. 
Then, using call_rcu() is also a latency killer, since we explicitly say: I don't want to free this file right now, I delegate this job to another layer in two or three milliseconds (or more). A final point is that SLUB doesn't need to allocate or free a slab in many cases. (This is probably why Christoph needed this patch in 2006 :) ) In my case, I need all these patches to speed up http servers. They obviously open and close many files per second. The added code has a cost of less than 3 ns, but I suspect we can cut it to less than 1 ns. We preferred, with Christoph and Paul, to keep the patch as short as possible to focus on essential points. :c0287656: mov -0x14(%ebp),%esi :c0287659: mov -0x24(%ebp),%edi :c028765c: mov 0x4(%esi),%eax :c028765f: cmp (%eax),%edi :c0287661: jb c0287678 <fget+0xc8> :c0287663: mov %ebx,%eax :c0287665: xor %ebx,%ebx :c0287667: call c0287450 <put_filp> :c028766c: jmp c02875ec <fget+0x3c> :c0287671: lea 0x0(%esi,%eiz,1),%esi :c0287678: mov 0x4(%eax),%edi :c028767b: add %edi,-0x10(%ebp) :c028767e: mov -0x10(%ebp),%edx 1 8.8e-05 :c0287681: mov (%edx),%eax :c0287683: cmp %eax,%ebx :c0287685: je c02875ec <fget+0x3c> :c028768b: jmp c0287663 <fget+0xb3> We could avoid doing the full test, because there is no way files->max_fds could become lower under us, nor fdt itself, nor fdt->fd. So instead of using this function twice: static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd) { struct file * file = NULL; struct fdtable *fdt = files_fdtable(files); if (fd < fdt->max_fds) file = rcu_dereference(fdt->fd[fd]); return file; } we could use the attached patch. This becomes a matter of three instructions, including a 99.99% predicted branch: c0287646: 8b 03 mov (%ebx),%eax c0287648: 39 45 e4 cmp %eax,-0x1c(%ebp) c028764b: 74 a1 je c02875ee <fget+0x3e> c028764d: 8b 45 e4 mov -0x1c(%ebp),%eax c0287650: e8 fb fd ff ff call c0287450 <put_filp> c0287655: 31 c0 xor %eax,%eax c0287657: eb 98 jmp c02875f1 <fget+0x41> At the time 
Christoph sent his patch (in 2006), nobody cared, because we had no benchmark or real world workload that demonstrated the gain of his patch, only intuitions. We had too many contended cache lines that slowed down the whole process. SLAB_DESTROY_BY_RCU is a must on current hardware, where memory cache line miss costs become really problematic. This patch series clearly demonstrates it. Thanks Nick for your feedback and comments. Eric [PATCH] fs: optimize fget() & fget_light() Instead of calling fcheck_files() a second time, we can take into account that we already did part of the job, in an RCU read-locked section. We need a struct file **filp pointer so that we only dereference it a second time. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/file_table.c | 23 +++++++++++++++++------ 1 files changed, 17 insertions(+), 6 deletions(-) diff --git a/fs/file_table.c b/fs/file_table.c index 3e9259d..4bc019f 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -289,11 +289,16 @@ void __fput(struct file *file) struct file *fget(unsigned int fd) { - struct file *file; + struct file *file = NULL, **filp; struct files_struct *files = current->files; + struct fdtable *fdt; rcu_read_lock(); - file = fcheck_files(files, fd); + fdt = files_fdtable(files); + if (likely(fd < fdt->max_fds)) { + filp = &fdt->fd[fd]; + file = rcu_dereference(*filp); + } if (file) { if (!atomic_long_inc_not_zero(&file->f_count)) { /* File object ref couldn't be taken */ @@ -304,7 +309,7 @@ struct file *fget(unsigned int fd) * Now we have a stable reference to an object. * Check if other threads freed file and re-allocated it. 
*/ - if (unlikely(file != fcheck_files(files, fd))) { + if (unlikely(file != rcu_dereference(*filp))) { put_filp(file); file = NULL; } @@ -325,15 +330,21 @@ EXPORT_SYMBOL(fget); */ struct file *fget_light(unsigned int fd, int *fput_needed) { - struct file *file; + struct file *file, **filp; struct files_struct *files = current->files; + struct fdtable *fdt; *fput_needed = 0; if (likely((atomic_read(&files->count) == 1))) { file = fcheck_files(files, fd); } else { rcu_read_lock(); - file = fcheck_files(files, fd); + fdt = files_fdtable(files); + file = NULL; + if (likely(fd < fdt->max_fds)) { + filp = &fdt->fd[fd]; + file = rcu_dereference(*filp); + } if (file) { if (atomic_long_inc_not_zero(&file->f_count)) { *fput_needed = 1; @@ -342,7 +353,7 @@ struct file *fget_light(unsigned int fd, int *fput_needed) * Check if other threads freed this file and * re-allocated it. */ - if (unlikely(file != fcheck_files(files, fd))) { + if (unlikely(file != rcu_dereference(*filp))) { *fput_needed = 0; put_filp(file); file = NULL; ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2008-12-12 4:45 ` Eric Dumazet @ 2008-12-12 16:48 ` Eric Dumazet 2008-12-13 2:07 ` Christoph Lameter 2008-12-13 1:41 ` Christoph Lameter 1 sibling, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-12-12 16:48 UTC (permalink / raw) To: Christoph Lameter, Paul E. McKenney Cc: Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel, Al Viro Eric Dumazet a écrit : > Nick Piggin a écrit : >> On Friday 12 December 2008 09:40, Eric Dumazet wrote: >>> From: Christoph Lameter <cl@linux-foundation.org> >>> >>> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU >>> >>> Currently we schedule RCU frees for each file we free separately. That has >>> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which >>> did not require RCU callbacks: >>> >>> 1. Excessive number of RCU callbacks can be generated causing long RCU >>> queues that in turn cause long latencies. We hit SLUB page allocation >>> more often than necessary. >>> >>> 2. The cache hot object is not preserved between free and realloc. A close >>> followed by another open is very fast with the RCUless approach because >>> the last freed object is returned by the slab allocator that is >>> still cache hot. RCU free means that the object is not immediately >>> available again. The new object is cache cold and therefore open/close >>> performance tests show a significant degradation with the RCU >>> implementation. >>> >>> One solution to this problem is to move the RCU freeing into the Slab >>> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation >>> time. The slab allocator will do RCU frees only when it is necessary >>> to dispose of slabs of objects (rare). 
So with that approach we can cut >>> out the RCU overhead significantly. >>> >>> However, the slab allocator may return the object for another use even >>> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means >>> there is the (unlikely) possibility that the object is going to be >>> switched under us in sections protected by rcu_read_lock() and >>> rcu_read_unlock(). So we need to verify that we have acquired the correct >>> object after establishing a stable object reference (incrementing the >>> refcounter does that). >>> >>> >>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> >>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> >>> --- >>> Documentation/filesystems/files.txt | 21 ++++++++++++++-- >>> fs/file_table.c | 33 ++++++++++++++++++-------- >>> include/linux/fs.h | 5 --- >>> 3 files changed, 42 insertions(+), 17 deletions(-) >>> >>> diff --git a/Documentation/filesystems/files.txt >>> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644 >>> --- a/Documentation/filesystems/files.txt >>> +++ b/Documentation/filesystems/files.txt >>> @@ -78,13 +78,28 @@ the fdtable structure - >>> that look-up may race with the last put() operation on the >>> file structure. This is avoided using atomic_long_inc_not_zero() >>> on ->f_count : >>> + As file structures are allocated with SLAB_DESTROY_BY_RCU, >>> + they can also be freed before a RCU grace period, and reused, >>> + but still as a struct file. >>> + It is necessary to check again after getting >>> + a stable reference (ie after atomic_long_inc_not_zero()), >>> + that fcheck_files(files, fd) points to the same file. >>> >>> rcu_read_lock(); >>> file = fcheck_files(files, fd); >>> if (file) { >>> - if (atomic_long_inc_not_zero(&file->f_count)) >>> + if (atomic_long_inc_not_zero(&file->f_count)) { >>> *fput_needed = 1; >>> - else >>> + /* >>> + * Now we have a stable reference to an object. 
>>> + * Check if other threads freed file and reallocated it. >>> + */ >>> + if (file != fcheck_files(files, fd)) { >>> + *fput_needed = 0; >>> + put_filp(file); >>> + file = NULL; >>> + } >>> + } else >>> /* Didn't get the reference, someone's freed */ >>> file = NULL; >>> } >>> @@ -95,6 +110,8 @@ the fdtable structure - >>> atomic_long_inc_not_zero() detects if refcounts is already zero or >>> goes to zero during increment. If it does, we fail >>> fget()/fget_light(). >>> + The second call to fcheck_files(files, fd) checks that this filp >>> + was not freed, then reused by an other thread. >>> >>> 6. Since both fdtable and file structures can be looked up >>> lock-free, they must be installed using rcu_assign_pointer() >>> diff --git a/fs/file_table.c b/fs/file_table.c >>> index a46e880..3e9259d 100644 >>> --- a/fs/file_table.c >>> +++ b/fs/file_table.c >>> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly; >>> >>> static struct percpu_counter nr_files __cacheline_aligned_in_smp; >>> >>> -static inline void file_free_rcu(struct rcu_head *head) >>> -{ >>> - struct file *f = container_of(head, struct file, f_u.fu_rcuhead); >>> - kmem_cache_free(filp_cachep, f); >>> -} >>> - >>> static inline void file_free(struct file *f) >>> { >>> percpu_counter_dec(&nr_files); >>> file_check_state(f); >>> - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); >>> + kmem_cache_free(filp_cachep, f); >>> } >>> >>> /* >>> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd) >>> rcu_read_unlock(); >>> return NULL; >>> } >>> + /* >>> + * Now we have a stable reference to an object. >>> + * Check if other threads freed file and re-allocated it. >>> + */ >>> + if (unlikely(file != fcheck_files(files, fd))) { >>> + put_filp(file); >>> + file = NULL; >>> + } >> This is a non-trivial change, because that put_filp may drop the last >> reference to the file. So now we have the case where we free the file >> from a context in which it had never been allocated. 
> > If we got at this point, we : > > Found a non NULL pointer in our fd table. > Then, another thread came and closed the file while we had not yet added our reference. > This file was freed (kmem_cache_free(filp_cachep, file)). > This file was reused and inserted into another thread's fd table. > We added our reference to the refcount. > We checked if this file is still ours (in our fd table). > We found this file is no longer the file we wanted. > Calling put_filp() here is our only choice to safely remove the reference on > a truly allocated file. At this point the file is > a truly allocated file, but no longer ours. > Unfortunately we added a reference to it: we must release it. > If the other thread already called put_filp() because it wanted to close its new file, > we must see f_count going to zero, and we must call __fput(), to perform > all the relevant file cleanup ourselves. Reading this mail again, I realise we call put_filp(file), while this should be fput(file) or put_filp(file), we don't know which. Damned, this patch is wrong as is. Christoph, Paul, do you see the problem ? In fget()/fget_light() we don't know if the other thread (the one who re-allocated the file, and tried to close it while we got a reference on the file) had to call put_filp() or fput() to release its own reference. So we call atomic_long_dec_and_test() and cannot take the appropriate action (calling the full __fput() version, or the small one that some places use to 'close' a not-really-opened file). void put_filp(struct file *file) { if (atomic_long_dec_and_test(&file->f_count)) { security_file_free(file); file_kill(file); file_free(file); } } void fput(struct file *file) { if (atomic_long_dec_and_test(&file->f_count)) __fput(file); } I believe put_filp() is only called on slow paths (error cases). Should we just zap it and always call fput() ? ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2008-12-12 16:48 ` Eric Dumazet @ 2008-12-13 2:07 ` Christoph Lameter 2008-12-17 20:25 ` Eric Dumazet 0 siblings, 1 reply; 191+ messages in thread From: Christoph Lameter @ 2008-12-13 2:07 UTC (permalink / raw) To: Eric Dumazet Cc: Paul E. McKenney, Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel, Al Viro On Fri, 12 Dec 2008, Eric Dumazet wrote: > > a truly allocated file. At this point the file is > > a truly allocated file but not anymore ours. Its a valid file. Does ownership matter here? > Reading again this mail I realise we call put_filp(file), while this should > be fput(file) or put_filp(file), we dont know. > > Damned, this patch is wrong as is. > > Christoph, Paul, do you see the problem ? Yes. > In fget()/fget_light() we dont know if the other thread (the one who re-allocated the file, > and tried to close it while we got a reference on file) had to call put_filp() or fput() > to release its own reference. So we call atomic_long_dec_and_test() and cannot > take the appropriate action (calling the full __fput() version or the small one, > that some systems use to 'close' an not really opened file. The difference is mainly that fput() does full processing whereas put_filp() is used when we know that the file was not fully operational. If the checks in __fput are able to handle the put_filp() situation by not releasing resources that were not allocated then we should be fine. > I believe put_filp() is only called on slowpath (error cases). Looks like it. It seems to assume that no dentry is associated. > Should we just zap it and always call fput() ? Only if fput() can handle partially setup files. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2008-12-13 2:07 ` Christoph Lameter @ 2008-12-17 20:25 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-12-17 20:25 UTC (permalink / raw) To: Christoph Lameter Cc: Paul E. McKenney, Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel, Al Viro Christoph Lameter a écrit : > On Fri, 12 Dec 2008, Eric Dumazet wrote: > >>> a truly allocated file. At this point the file is >>> a truly allocated file but not anymore ours. > > Its a valid file. Does ownership matter here? > >> Reading again this mail I realise we call put_filp(file), while this should >> be fput(file) or put_filp(file), we dont know. >> >> Damned, this patch is wrong as is. >> >> Christoph, Paul, do you see the problem ? > > Yes. > >> In fget()/fget_light() we dont know if the other thread (the one who re-allocated the file, >> and tried to close it while we got a reference on file) had to call put_filp() or fput() >> to release its own reference. So we call atomic_long_dec_and_test() and cannot >> take the appropriate action (calling the full __fput() version or the small one, >> that some systems use to 'close' an not really opened file. > > The difference is mainly that fput() does full processing whereas > put_filp() is used when we know that the file was not fully operational. > If the checks in __fput are able to handle the put_filp() situation by not > releasing resources that were not allocated then we should be fine. > >> I believe put_filp() is only called on slowpath (error cases). > > Looks like it. It seems to assume that no dentry is associated. > >> Should we just zap it and always call fput() ? > > Only if fput() can handle partially setup files. 
It can do that if we add a check for a NULL dentry in __fput(), so put_filp() can disappear. But there is a remaining point where we do an atomic_long_dec_and_test(&...->f_count), in fs/aio.c, function __aio_put_req(). This one is tricky :( ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2008-12-12 4:45 ` Eric Dumazet 2008-12-12 16:48 ` Eric Dumazet @ 2008-12-13 1:41 ` Christoph Lameter 1 sibling, 0 replies; 191+ messages in thread From: Christoph Lameter @ 2008-12-13 1:41 UTC (permalink / raw) To: Eric Dumazet Cc: Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel, Al Viro, Paul E. McKenney On Fri, 12 Dec 2008, Eric Dumazet wrote: > > This is a non-trivial change, because that put_filp may drop the last > > reference to the file. So now we have the case where we free the file > > from a context in which it had never been allocated. > > If we got at this point, we : > > Found a non NULL pointer in our fd table. > Then, another thread came, closed the file while we not yet added our reference. > This file was freed (kmem_cache_free(filp_cachep, file)) > This file was reused and inserted on another thread fd table. > We added our reference on refcount. > We checked if this file is still ours (in our fd tab). > We found this file is not anymore the file we wanted. > Calling put_filp() here is our only choice to safely remove the reference on > a truly allocated file. At this point the file is > a truly allocated file but not anymore ours. > Unfortunatly we added a reference on it : we must release it. > If the other thread already called put_filp() because it wanted to close its new file, > we must see f_refcnt going to zero, and we must call __fput(), to perform > all the relevant file cleanup ourself. Correct. That was the idea. > A final point is that SLUB doesnt need to allocate or free a slab in many cases. 
> (This is probably why Christoph needed this patch in 2006 :) ) We needed this patch in 2006 because the AIM9 creat-clo test showed regressions after the rcu free was put in (discovered during the SLES11 verification cycle). All slab allocators do at least defer frees until all objects in the page are freed, if not longer. > In my case, I need all these patches to speed up HTTP servers. > They obviously open and close many files per second. Run AIM9 creat-close tests.... > SLAB_DESTROY_BY_RCU is a must on current hardware, where memory cache line > miss costs become really problematic. This patch series clearly demonstrates > it. Well, the issue becomes more severe as accesses to cold memory become more extensive. Thanks for your work on this. ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH v3 7/7] fs: MS_NOREFCOUNT 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet ` (6 preceding siblings ...) 2008-12-11 22:40 ` [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU Eric Dumazet @ 2008-12-11 22:41 ` Eric Dumazet 7 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-12-11 22:41 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro Some filesystems are hardwired into the kernel, and mntput()/mntget() hit a contended cache line. We define a new superblock flag, MS_NOREFCOUNT, that is set on the socket, pipe and anonymous fd superblocks. mntput()/mntget() become no-ops on these filesystems. ("socketallocbench -n 8" result: from 2.20s to 1.64s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/anon_inodes.c | 1 + fs/pipe.c | 3 ++- include/linux/fs.h | 2 ++ include/linux/mount.h | 8 +++----- net/socket.c | 1 + 5 files changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 89fd36d..de0ec3b 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -158,6 +158,7 @@ static int __init anon_inode_init(void) error = PTR_ERR(anon_inode_mnt); goto err_unregister_filesystem; } + anon_inode_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT; anon_inode_inode = anon_inode_mkinode(); if (IS_ERR(anon_inode_inode)) { error = PTR_ERR(anon_inode_inode); diff --git a/fs/pipe.c b/fs/pipe.c index 8c51a0d..f547432 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1078,7 +1078,8 @@ static int __init init_pipe_fs(void) if (IS_ERR(pipe_mnt)) { err = PTR_ERR(pipe_mnt); unregister_filesystem(&pipe_fs_type); - } + } else + pipe_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT; } return err; } diff --git a/include/linux/fs.h b/include/linux/fs.h index a1f56d4..11b0452 100644 --- a/include/linux/fs.h +++
b/include/linux/fs.h @@ -137,6 +137,8 @@ extern int dir_notify_enable; #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */ #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */ #define MS_I_VERSION (1<<23) /* Update inode I_version field */ + +#define MS_NOREFCOUNT (1<<29) /* kernel static mnt : no refcounting needed */ #define MS_ACTIVE (1<<30) #define MS_NOUSER (1<<31) diff --git a/include/linux/mount.h b/include/linux/mount.h index cab2a85..51418b5 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -14,10 +14,8 @@ #include <linux/nodemask.h> #include <linux/spinlock.h> #include <asm/atomic.h> +#include <linux/fs.h> -struct super_block; -struct vfsmount; -struct dentry; struct mnt_namespace; #define MNT_NOSUID 0x01 @@ -73,7 +71,7 @@ struct vfsmount { static inline struct vfsmount *mntget(struct vfsmount *mnt) { - if (mnt) + if (mnt && !(mnt->mnt_sb->s_flags & MS_NOREFCOUNT)) atomic_inc(&mnt->mnt_count); return mnt; } @@ -87,7 +85,7 @@ extern int __mnt_is_readonly(struct vfsmount *mnt); static inline void mntput(struct vfsmount *mnt) { - if (mnt) { + if (mnt && !(mnt->mnt_sb->s_flags & MS_NOREFCOUNT)) { mnt->mnt_expiry_mark = 0; mntput_no_expire(mnt); } diff --git a/net/socket.c b/net/socket.c index 4017409..2534dbc 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2206,6 +2206,7 @@ static int __init sock_init(void) init_inodecache(); register_filesystem(&sock_fs_type); sock_mnt = kern_mount(&sock_fs_type); + sock_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT; /* The real protocol initialization is performed in later initcalls. */ ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet ` (3 preceding siblings ...) 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet @ 2008-11-29 8:43 ` Eric Dumazet 2008-11-29 8:43 ` [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes Eric Dumazet ` (3 subsequent siblings) 8 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-29 8:43 UTC (permalink / raw) To: Ingo Molnar, Christoph Hellwig Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro [-- Attachment #1: Type: text/plain, Size: 606 bytes --] Adding a percpu_counter nr_dentry avoids cache line ping pongs between cpus maintaining this metric, and dcache_lock is no longer needed to protect dentry_stat.nr_dentry. We centralize nr_dentry updates in the right places: - increments in d_alloc() - decrements in d_free() d_alloc() can avoid taking dcache_lock if parent is NULL (socket8 bench result: 27.5s to 25s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/dcache.c | 49 +++++++++++++++++++++++++------------------ include/linux/fs.h | 2 + kernel/sysctl.c | 2 - 3 files changed, 32 insertions(+), 21 deletions(-) [-- Attachment #2: nr_dentry.patch --] [-- Type: text/plain, Size: 4891 bytes --] diff --git a/fs/dcache.c b/fs/dcache.c index a1d86c7..46d5d1e 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly; static unsigned int d_hash_mask __read_mostly; static unsigned int d_hash_shift __read_mostly; static struct hlist_head *dentry_hashtable __read_mostly; +static struct percpu_counter nr_dentry; /* Statistics gathering.
*/ struct dentry_stat_t dentry_stat = { .age_limit = 45, }; +/* + * Handle nr_dentry sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry); + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void __d_free(struct dentry *dentry) { WARN_ON(!list_empty(&dentry->d_alias)); @@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head) } /* - * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry - * inside dcache_lock. + * no dcache_lock, please. */ static void d_free(struct dentry *dentry) { @@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry) __d_free(dentry); else call_rcu(&dentry->d_u.d_rcu, d_callback); + percpu_counter_dec(&nr_dentry); } /* @@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry) struct dentry *parent; list_del(&dentry->d_u.d_child); - dentry_stat.nr_dentry--; /* For d_free, below */ /*drops the locks, at that point nobody can reach this dentry */ dentry_iput(dentry); if (IS_ROOT(dentry)) @@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb) static void shrink_dcache_for_umount_subtree(struct dentry *dentry) { struct dentry *parent; - unsigned detached = 0; BUG_ON(!IS_ROOT(dentry)); @@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) } list_del(&dentry->d_u.d_child); - detached++; inode = dentry->d_inode; if (inode) { @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) * otherwise we ascend to the parent and move to the * next sibling if there is one */ if (!parent) - goto out; + return; dentry = parent; @@ -705,11 +721,6 @@ static void 
shrink_dcache_for_umount_subtree(struct dentry *dentry) dentry = list_entry(dentry->d_subdirs.next, struct dentry, d_u.d_child); } -out: - /* several dentries were freed, need to correct nr_dentry */ - spin_lock(&dcache_lock); - dentry_stat.nr_dentry -= detached; - spin_unlock(&dcache_lock); } /* @@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) dentry->d_flags = DCACHE_UNHASHED; spin_lock_init(&dentry->d_lock); dentry->d_inode = NULL; - dentry->d_parent = NULL; - dentry->d_sb = NULL; dentry->d_op = NULL; dentry->d_fsdata = NULL; dentry->d_mounted = 0; @@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) if (parent) { dentry->d_parent = dget(parent); dentry->d_sb = parent->d_sb; + spin_lock(&dcache_lock); + list_add(&dentry->d_u.d_child, &parent->d_subdirs); + spin_unlock(&dcache_lock); } else { + dentry->d_parent = NULL; + dentry->d_sb = NULL; INIT_LIST_HEAD(&dentry->d_u.d_child); } - - spin_lock(&dcache_lock); - if (parent) - list_add(&dentry->d_u.d_child, &parent->d_subdirs); - dentry_stat.nr_dentry++; - spin_unlock(&dcache_lock); - + percpu_counter_inc(&nr_dentry); return dentry; } @@ -2282,6 +2290,7 @@ static void __init dcache_init(void) { int loop; + percpu_counter_init(&nr_dentry, 0); /* * A constructor could be added for stable state like the lists, * but it is probably not worth it because of the cache nature diff --git a/include/linux/fs.h b/include/linux/fs.h index 0dcdd94..c5e7aa5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2216,6 +2216,8 @@ static inline void free_secdata(void *secdata) struct ctl_table; int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 
9d048fa..eebddef 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1243,7 +1243,7 @@ static struct ctl_table fs_table[] = { .data = &dentry_stat, .maxlen = 6*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_dentry, }, { .ctl_name = FS_OVERFLOWUID, ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet ` (4 preceding siblings ...) 2008-11-29 8:43 ` [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry Eric Dumazet @ 2008-11-29 8:43 ` Eric Dumazet 2008-11-29 8:44 ` [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator Eric Dumazet ` (2 subsequent siblings) 8 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-29 8:43 UTC (permalink / raw) To: Ingo Molnar, Christoph Hellwig Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro [-- Attachment #1: Type: text/plain, Size: 481 bytes --] Avoids cache line ping pongs between cpus and prepares the next patch, because updates of nr_inodes don't need inode_lock anymore. (socket8 bench result: no difference at this point) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/fs-writeback.c | 2 +- fs/inode.c | 39 +++++++++++++++++++++++++++++++-------- include/linux/fs.h | 3 +++ kernel/sysctl.c | 4 ++-- mm/page-writeback.c | 2 +- 5 files changed, 38 insertions(+), 12 deletions(-) [-- Attachment #2: nr_inodes.patch --] [-- Type: text/plain, Size: 5626 bytes --] diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index d0ff0b8..b591cdd 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait) unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS); wbc.nr_to_write = nr_dirty + nr_unstable + - (inodes_stat.nr_inodes - inodes_stat.nr_unused) + + (get_nr_inodes() - inodes_stat.nr_unused) + nr_dirty + nr_unstable; wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ sync_sb_inodes(sb, &wbc); diff --git a/fs/inode.c b/fs/inode.c index 0487ddb..f94f889 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -96,9 +96,33 @@
static DEFINE_MUTEX(iprune_mutex); * Statistics gathering.. */ struct inodes_stat_t inodes_stat; +static struct percpu_counter nr_inodes; static struct kmem_cache * inode_cachep __read_mostly; +int get_nr_inodes(void) +{ + return percpu_counter_sum_positive(&nr_inodes); +} + +/* + * Handle nr_dentry sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + inodes_stat.nr_inodes = get_nr_inodes(); + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void wake_up_inode(struct inode *inode) { /* @@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head) destroy_inode(inode); nr_disposed++; } - spin_lock(&inode_lock); - inodes_stat.nr_inodes -= nr_disposed; - spin_unlock(&inode_lock); + percpu_counter_sub(&nr_inodes, nr_disposed); } /* @@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb) inode = alloc_inode(sb); if (inode) { + percpu_counter_inc(&nr_inodes); spin_lock(&inode_lock); - inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); inode->i_ino = ++last_ino; @@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h if (set(inode, data)) goto set_failed; - inodes_stat.nr_inodes++; + percpu_counter_inc(&nr_inodes); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); @@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he old = find_inode_fast(sb, head, ino); if (!old) { inode->i_ino = ino; - inodes_stat.nr_inodes++; + percpu_counter_inc(&nr_inodes); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, 
&sb->s_inodes); hlist_add_head(&inode->i_hash, head); @@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode) list_del_init(&inode->i_list); list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; spin_unlock(&inode_lock); + percpu_counter_dec(&nr_inodes); security_inode_delete(inode); @@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode) list_del_init(&inode->i_list); list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; spin_unlock(&inode_lock); + percpu_counter_dec(&nr_inodes); if (inode->i_data.nrpages) truncate_inode_pages(&inode->i_data, 0); clear_inode(inode); @@ -1394,6 +1416,7 @@ void __init inode_init(void) { int loop; + percpu_counter_init(&nr_inodes, 0); /* inode slab cache */ inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode), diff --git a/include/linux/fs.h b/include/linux/fs.h index c5e7aa5..2482977 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -47,6 +47,7 @@ struct inodes_stat_t { int dummy[5]; /* padding for sysctl ABI compatibility */ }; extern struct inodes_stat_t inodes_stat; +extern int get_nr_inodes(void); extern int leases_enable, lease_break_time; @@ -2218,6 +2219,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index eebddef..eebed01 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1202,7 +1202,7 @@ static struct ctl_table fs_table[] = { .data = &inodes_stat, .maxlen = 2*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_inodes, }, { .ctl_name = 
FS_STATINODE, @@ -1210,7 +1210,7 @@ static struct ctl_table fs_table[] = { .data = &inodes_stat, .maxlen = 7*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_inodes, }, { .procname = "file-nr", diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2970e35..a71a922 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg) next_jif = start_jif + dirty_writeback_interval; nr_to_write = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS) + - (inodes_stat.nr_inodes - inodes_stat.nr_unused); + (get_nr_inodes() - inodes_stat.nr_unused); while (nr_to_write > 0) { wbc.more_io = 0; wbc.encountered_congestion = 0; ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet ` (5 preceding siblings ...) 2008-11-29 8:43 ` [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes Eric Dumazet @ 2008-11-29 8:44 ` Eric Dumazet 2008-11-29 8:44 ` [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet 2008-11-29 8:45 ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet 8 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-29 8:44 UTC (permalink / raw) To: Ingo Molnar, Christoph Hellwig Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro [-- Attachment #1: Type: text/plain, Size: 505 bytes --] new_inode() dirties a contended cache line to get increasing inode numbers. Solve this problem by giving each cpu a per_cpu variable, fed from the shared last_ino, but only once every 1024 allocations. This reduces contention on the shared last_ino and gives the same spread of inode numbers as before. (same wraparound after 2^32 allocations) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/inode.c | 35 ++++++++++++++++++++++++++++++++--- 1 files changed, 32 insertions(+), 3 deletions(-) [-- Attachment #2: last_ino.patch --] [-- Type: text/plain, Size: 1511 bytes --] diff --git a/fs/inode.c b/fs/inode.c index f94f889..dc8e72a 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -556,6 +556,36 @@ repeat: return node ? inode : NULL; } +#ifdef CONFIG_SMP +/* + * Each cpu owns a range of 1024 numbers. + * 'shared_last_ino' is dirtied only once out of 1024 allocations, + * to renew the exhausted range.
+ */ +static DEFINE_PER_CPU(int, last_ino); + +static int last_ino_get(void) +{ + static atomic_t shared_last_ino; + int *p = &get_cpu_var(last_ino); + int res = *p; + + if (unlikely((res & 1023) == 0)) + res = atomic_add_return(1024, &shared_last_ino) - 1024; + + *p = ++res; + put_cpu_var(last_ino); + return res; +} +#else +static int last_ino_get(void) +{ + static int last_ino; + + return ++last_ino; +} +#endif + /** * new_inode - obtain an inode * @sb: superblock @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb) * error if st_ino won't fit in target struct field. Use 32bit counter * here to attempt to avoid that. */ - static unsigned int last_ino; struct inode * inode; spin_lock_prefetch(&inode_lock); @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb) inode = alloc_inode(sb); if (inode) { percpu_counter_inc(&nr_inodes); + inode->i_state = 0; + inode->i_ino = last_ino_get(); spin_lock(&inode_lock); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); - inode->i_ino = ++last_ino; - inode->i_state = 0; spin_unlock(&inode_lock); } return inode; ^ permalink raw reply related [flat|nested] 191+ messages in thread
* [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet ` (6 preceding siblings ...) 2008-11-29 8:44 ` [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator Eric Dumazet @ 2008-11-29 8:44 ` Eric Dumazet 2008-11-29 10:38 ` Jörn Engel 2008-11-29 8:45 ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet 8 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-29 8:44 UTC (permalink / raw) To: Ingo Molnar, Christoph Hellwig Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro [-- Attachment #1: Type: text/plain, Size: 1602 bytes --] Sockets, pipes and anonymous fds have interesting properties. Like other files, they use a dentry and an inode. But dentries for these kinds of files are not hashed into the dcache, since there is no way someone can look up such a file in the vfs tree. (/proc/{pid}/fd/{number} uses a different mechanism) Still, allocating and freeing such dentries are expensive processes, because we currently take dcache_lock inside d_alloc(), d_instantiate(), and dput(). This lock is very contended on SMP machines. This patch defines a new DCACHE_SINGLE flag, to mark a dentry as a single one (for sockets, pipes, anonymous fds), and a new d_alloc_single(const struct qstr *name, struct inode *inode) method, called by the three subsystems. Internally, dput() can take a fast path to dput_single() for SINGLE dentries. No more atomic_dec_and_lock() for such dentries. Differences between a SINGLE dentry and a normal one are: 1) a SINGLE dentry has the DCACHE_SINGLE flag 2) a SINGLE dentry's parent is itself (DCACHE_DISCONNECTED) This avoids taking a reference on the sb 'root' dentry, shared by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED) 4) Their d_alias list is empty (socket8 bench result : from 25s to 19.9s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/anon_inodes.c | 16 ------------ fs/dcache.c | 51 +++++++++++++++++++++++++++++++++++++++ fs/pipe.c | 23 +---------------- include/linux/dcache.h | 9 ++++++ net/socket.c | 24 +----------------- 5 files changed, 65 insertions(+), 58 deletions(-) [-- Attachment #2: dcache_single.patch --] [-- Type: text/plain, Size: 7886 bytes --] diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 3662dd4..8bf83cb 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags, mnt); } -static int anon_inodefs_delete_dentry(struct dentry *dentry) -{ - /* - * We faked vfs to believe the dentry was hashed when we created it. - * Now we restore the flag so that dput() will work correctly. - */ - dentry->d_flags |= DCACHE_UNHASHED; - return 1; -} - static struct file_system_type anon_inode_fs_type = { .name = "anon_inodefs", .get_sb = anon_inodefs_get_sb, .kill_sb = kill_anon_super, }; static struct dentry_operations anon_inodefs_dentry_operations = { - .d_delete = anon_inodefs_delete_dentry, }; /** @@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops, this.name = name; this.len = strlen(name); this.hash = 0; - dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this); + dentry = d_alloc_single(&this, anon_inode_inode); if (!dentry) goto err_put_unused_fd; @@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops, atomic_inc(&anon_inode_inode->i_count); dentry->d_op = &anon_inodefs_dentry_operations; - /* Do not publish this dentry inside the global dentry hash table */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, anon_inode_inode); error = -ENFILE; file = alloc_file(anon_inode_mnt, dentry, diff --git a/fs/dcache.c 
b/fs/dcache.c index 46d5d1e..35d4a25 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry) */ /* + * special version of dput() for pipes/sockets/anon. + * These dentries are not present in hash table, we can avoid + * taking/dirtying dcache_lock + */ +static void dput_single(struct dentry *dentry) +{ + struct inode *inode; + + if (!atomic_dec_and_test(&dentry->d_count)) + return; + inode = dentry->d_inode; + if (inode) + iput(inode); + d_free(dentry); +} + +/* * dput - release a dentry * @dentry: dentry to release * @@ -234,6 +251,11 @@ void dput(struct dentry *dentry) { if (!dentry) return; + /* + * single dentries (sockets/pipes/anon) fast path + */ + if (dentry->d_flags & DCACHE_SINGLE) + return dput_single(dentry); repeat: if (atomic_read(&dentry->d_count) == 1) @@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode) return res; } +/** + * d_alloc_single - allocate SINGLE dentry + * @name: dentry name, given in a qstr structure + * @inode: inode to allocate the dentry for + * + * Allocate an SINGLE dentry for the inode given. The inode is + * instantiated and returned. %NULL is returned if there is insufficient + * memory. + * - SINGLE dentries have themselves as a parent. 
+ * - SINGLE dentries are not hashed into global hash table + * - their d_alias list is empty + */ +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode) +{ + struct dentry *entry; + + entry = d_alloc(NULL, name); + if (entry) { + entry->d_sb = inode->i_sb; + entry->d_parent = entry; + entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED; + entry->d_inode = inode; + fsnotify_d_instantiate(entry, inode); + security_d_instantiate(entry, inode); + } + return entry; +} + + static inline struct hlist_head *d_hash(struct dentry *parent, unsigned long hash) { diff --git a/fs/pipe.c b/fs/pipe.c index 7aea8b8..4de6dd5 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode) } static struct vfsmount *pipe_mnt __read_mostly; -static int pipefs_delete_dentry(struct dentry *dentry) -{ - /* - * At creation time, we pretended this dentry was hashed - * (by clearing DCACHE_UNHASHED bit in d_flags) - * At delete time, we restore the truth : not hashed. - * (so that dput() can proceed correctly) - */ - dentry->d_flags |= DCACHE_UNHASHED; - return 0; -} /* * pipefs_dname() is called from d_path(). @@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen) } static struct dentry_operations pipefs_dentry_operations = { - .d_delete = pipefs_delete_dentry, .d_dname = pipefs_dname, }; @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags) struct inode *inode; struct file *f; struct dentry *dentry; - struct qstr name = { .name = "" }; + static const struct qstr name = { .name = "" }; err = -ENFILE; inode = get_pipe_inode(); @@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags) goto err; err = -ENOMEM; - dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name); + dentry = d_alloc_single(&name, inode); if (!dentry) goto err_inode; dentry->d_op = &pipefs_dentry_operations; - /* - * We dont want to publish this dentry into global dentry hash table. 
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED - * This permits a working /proc/$pid/fd/XXX on pipes - */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, inode); err = -ENFILE; f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops); diff --git a/include/linux/dcache.h b/include/linux/dcache.h index a37359d..ca8d269 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -176,6 +176,14 @@ d_iput: no no no yes #define DCACHE_UNHASHED 0x0010 #define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched */ +#define DCACHE_SINGLE 0x0040 + /* + * socket, pipe or anonymous fd dentry + * - SINGLE dentries have themselves as a parent. + * - SINGLE dentries are not hashed into global hash table + * - Their d_alias list is empty + * - They dont need dcache_lock synchronization + */ extern spinlock_t dcache_lock; extern seqlock_t rename_lock; @@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *); extern void shrink_dcache_parent(struct dentry *); extern void shrink_dcache_for_umount(struct super_block *); extern int d_invalidate(struct dentry *); +extern struct dentry *d_alloc_single(const struct qstr *, struct inode *); /* only used at mount-time */ extern struct dentry * d_alloc_root(struct inode *); diff --git a/net/socket.c b/net/socket.c index e9d65ea..231cd66 100644 --- a/net/socket.c +++ b/net/socket.c @@ -307,18 +307,6 @@ static struct file_system_type sock_fs_type = { .kill_sb = kill_anon_super, }; -static int sockfs_delete_dentry(struct dentry *dentry) -{ - /* - * At creation time, we pretended this dentry was hashed - * (by clearing DCACHE_UNHASHED bit in d_flags) - * At delete time, we restore the truth : not hashed. - * (so that dput() can proceed correctly) - */ - dentry->d_flags |= DCACHE_UNHASHED; - return 0; -} - /* * sockfs_dname() is called from d_path(). 
*/ @@ -329,7 +317,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen) } static struct dentry_operations sockfs_dentry_operations = { - .d_delete = sockfs_delete_dentry, .d_dname = sockfs_dname, }; @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags) static int sock_attach_fd(struct socket *sock, struct file *file, int flags) { struct dentry *dentry; - struct qstr name = { .name = "" }; + static const struct qstr name = { .name = "" }; - dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name); + dentry = d_alloc_single(&name, SOCK_INODE(sock)); if (unlikely(!dentry)) return -ENOMEM; dentry->d_op = &sockfs_dentry_operations; - /* - * We dont want to push this dentry into global dentry hash table. - * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED - * This permits a working /proc/$pid/fd/XXX on sockets - */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, SOCK_INODE(sock)); sock->file = file; init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE, ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd 2008-11-29 8:44 ` [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet @ 2008-11-29 10:38 ` Jörn Engel 2008-11-29 11:14 ` Eric Dumazet 0 siblings, 1 reply; 191+ messages in thread From: Jörn Engel @ 2008-11-29 10:38 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro On Sat, 29 November 2008 09:44:23 +0100, Eric Dumazet wrote: > > +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode) > +{ > + struct dentry *entry; > + > + entry = d_alloc(NULL, name); > + if (entry) { > + entry->d_sb = inode->i_sb; > + entry->d_parent = entry; > + entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED; > + entry->d_inode = inode; > + fsnotify_d_instantiate(entry, inode); > + security_d_instantiate(entry, inode); > + } > + return entry; Calling the struct dentry entry had me confused a bit. I believe everyone else (including the code you removed) uses dentry. > @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags) > struct inode *inode; > struct file *f; > struct dentry *dentry; > - struct qstr name = { .name = "" }; > + static const struct qstr name = { .name = "" }; > > err = -ENFILE; > inode = get_pipe_inode(); ... > @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags) > static int sock_attach_fd(struct socket *sock, struct file *file, int flags) > { > struct dentry *dentry; > - struct qstr name = { .name = "" }; > + static const struct qstr name = { .name = "" }; These two could even be combined. And of course I realize that I comment on absolute trivialities. On the whole, I couldn't spot a real problem in your patches.
Jörn -- Public Domain - Free as in Beer General Public - Free as in Speech BSD License - Free as in Enterprise Shared Source - Free as in "Work will make you..." ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd 2008-11-29 10:38 ` Jörn Engel @ 2008-11-29 11:14 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-29 11:14 UTC (permalink / raw) To: Jörn Engel Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro Jörn Engel a écrit : > On Sat, 29 November 2008 09:44:23 +0100, Eric Dumazet wrote: >> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode) >> +{ >> + struct dentry *entry; >> + >> + entry = d_alloc(NULL, name); >> + if (entry) { >> + entry->d_sb = inode->i_sb; >> + entry->d_parent = entry; >> + entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED; >> + entry->d_inode = inode; >> + fsnotify_d_instantiate(entry, inode); >> + security_d_instantiate(entry, inode); >> + } >> + return entry; > > Calling the struct dentry entry had me onfused a bit. I believe > everyone else (including the code you removed) uses dentry. Ah yes, it seems I took it from d_instantiate(), I guess a cleanup patch would be nice. > >> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags) >> struct inode *inode; >> struct file *f; >> struct dentry *dentry; >> - struct qstr name = { .name = "" }; >> + static const struct qstr name = { .name = "" }; >> >> err = -ENFILE; >> inode = get_pipe_inode(); > ... >> @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags) >> static int sock_attach_fd(struct socket *sock, struct file *file, int flags) >> { >> struct dentry *dentry; >> - struct qstr name = { .name = "" }; >> + static const struct qstr name = { .name = "" }; > > These two could even be combined. > > And of course I realize that I comment on absolute trivialities. On the > whole, I couldn't spot a real problem in your patches. 
Well, at least you reviewed it, that's the important point! Thanks Jörn ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH v2 5/5] fs: new_inode_single() and iput_single() 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet ` (7 preceding siblings ...) 2008-11-29 8:44 ` [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet @ 2008-11-29 8:45 ` Eric Dumazet 2008-11-29 11:14 ` Jörn Engel 8 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-29 8:45 UTC (permalink / raw) To: Ingo Molnar, Christoph Hellwig Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro [-- Attachment #1: Type: text/plain, Size: 905 bytes --] The goal of this patch is to avoid touching inode_lock for socket/pipe/anonfd inode allocation/freeing. SINGLE dentries are attached to inodes that don't need to be linked into the global inode lists ("inode_in_use" and "sb->s_inodes"). As inode_lock was taken only to protect these lists, we avoid taking it as well. Using iput_single() from dput_single() avoids taking inode_lock at freeing time. 
This patch has a very noticeable effect, because we avoid dirtying three contended cache lines in new_inode(), and five cache lines in iput() (socket8 bench result: from 19.9s to 2.3s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/anon_inodes.c | 2 +- fs/dcache.c | 2 +- fs/inode.c | 29 ++++++++++++++++++++--------- fs/pipe.c | 2 +- include/linux/fs.h | 12 +++++++++++- net/socket.c | 2 +- 6 files changed, 35 insertions(+), 14 deletions(-) [-- Attachment #2: new_inode_single.patch --] [-- Type: text/plain, Size: 4080 bytes --] diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 8bf83cb..89fd36d 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd); */ static struct inode *anon_inode_mkinode(void) { - struct inode *inode = new_inode(anon_inode_mnt->mnt_sb); + struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb); if (!inode) return ERR_PTR(-ENOMEM); diff --git a/fs/dcache.c b/fs/dcache.c index 35d4a25..3aa9ed5 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry) return; inode = dentry->d_inode; if (inode) - iput(inode); + iput_single(inode); d_free(dentry); } diff --git a/fs/inode.c b/fs/inode.c index dc8e72a..0fdfe1b 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode) kmem_cache_free(inode_cachep, (inode)); } +void iput_single(struct inode *inode) +{ + if (atomic_dec_and_test(&inode->i_count)) { + destroy_inode(inode); + percpu_counter_dec(&nr_inodes); + } +} /* * These are initializations that only need to be done @@ -587,8 +594,9 @@ static int last_ino_get(void) #endif /** - * new_inode - obtain an inode + * __new_inode - obtain an inode * @sb: superblock + * @single: if true, don't link new inode in a list * * Allocates a new inode for given superblock. The default gfp_mask * for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE. 
@@ -598,7 +606,7 @@ static int last_ino_get(void) * newly created inode's mapping * */ -struct inode *new_inode(struct super_block *sb) +struct inode *__new_inode(struct super_block *sb, int single) { /* * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW @@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb) */ struct inode * inode; - spin_lock_prefetch(&inode_lock); - inode = alloc_inode(sb); if (inode) { percpu_counter_inc(&nr_inodes); inode->i_state = 0; inode->i_ino = last_ino_get(); - spin_lock(&inode_lock); - list_add(&inode->i_list, &inode_in_use); - list_add(&inode->i_sb_list, &sb->s_inodes); - spin_unlock(&inode_lock); + if (single) { + INIT_LIST_HEAD(&inode->i_list); + INIT_LIST_HEAD(&inode->i_sb_list); + } else { + spin_lock(&inode_lock); + list_add(&inode->i_list, &inode_in_use); + list_add(&inode->i_sb_list, &sb->s_inodes); + spin_unlock(&inode_lock); + } } return inode; } -EXPORT_SYMBOL(new_inode); +EXPORT_SYMBOL(__new_inode); void unlock_new_inode(struct inode *inode) { diff --git a/fs/pipe.c b/fs/pipe.c index 4de6dd5..8c51a0d 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = { static struct inode * get_pipe_inode(void) { - struct inode *inode = new_inode(pipe_mnt->mnt_sb); + struct inode *inode = new_inode_single(pipe_mnt->mnt_sb); struct pipe_inode_info *pipe; if (!inode) diff --git a/include/linux/fs.h b/include/linux/fs.h index 2482977..b3daffc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1898,7 +1898,17 @@ extern void __iget(struct inode * inode); extern void iget_failed(struct inode *); extern void clear_inode(struct inode *); extern void destroy_inode(struct inode *); -extern struct inode *new_inode(struct super_block *); +extern struct inode *__new_inode(struct super_block *, int); +static inline struct inode *new_inode(struct super_block *sb) +{ + return __new_inode(sb, 0); +} +static inline struct inode 
*new_inode_single(struct super_block *sb) +{ + return __new_inode(sb, 1); +} +extern void iput_single(struct inode *); + extern int should_remove_suid(struct dentry *); extern int file_remove_suid(struct file *); diff --git a/net/socket.c b/net/socket.c index 231cd66..f1e656c 100644 --- a/net/socket.c +++ b/net/socket.c @@ -463,7 +463,7 @@ static struct socket *sock_alloc(void) struct inode *inode; struct socket *sock; - inode = new_inode(sock_mnt->mnt_sb); + inode = new_inode_single(sock_mnt->mnt_sb); if (!inode) return NULL; ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH v2 5/5] fs: new_inode_single() and iput_single() 2008-11-29 8:45 ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet @ 2008-11-29 11:14 ` Jörn Engel 0 siblings, 0 replies; 191+ messages in thread From: Jörn Engel @ 2008-11-29 11:14 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro On Sat, 29 November 2008 09:45:09 +0100, Eric Dumazet wrote: > > +void iput_single(struct inode *inode) > +{ > + if (atomic_dec_and_test(&inode->i_count)) { > + destroy_inode(inode); > + percpu_counter_dec(&nr_inodes); > + } > +} I wonder if it is possible to avoid the atomic_dec_and_test() here, at least in the common case, and combine it with the atomic_dec_and_test() of the dentry. A quick look at fs/inode.c indicates that inode->i_count may never get changed for a SINGLE inode, except during creation or deletion. It might be worthwhile to - remove the conditional from iput_single() and measure whether it makes a difference, - poison SINGLE inodes with some value and - put a BUG_ON() in __iget() that checks for the poison value. I _think_ the BUG_ON() is unnecessary, but at least my brain is not sufficient to convince me. Can inotify somehow get a hold of a socket? Or dquot (how insane would that be?) Jörn -- Mac is for working, Linux is for Networking, Windows is for Solitaire! -- stolen from dc ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 1/6] fs: Introduce a per_cpu nr_dentry 2008-11-21 15:34 ` Ingo Molnar 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet @ 2008-11-26 23:30 ` Eric Dumazet 2008-11-27 9:41 ` Christoph Hellwig 2008-11-26 23:32 ` [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator Eric Dumazet ` (3 subsequent siblings) 5 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-26 23:30 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 469 bytes --] Adding a per_cpu nr_dentry avoids cache line ping pongs between cpus to maintain this metric. We centralize decrements of nr_dentry in d_free(), and increments in d_alloc(). d_alloc() can avoid taking dcache_lock if parent is NULL Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/dcache.c | 55 ++++++++++++++++++++++++++++--------------- include/linux/fs.h | 2 + kernel/sysctl.c | 2 - 3 files changed, 40 insertions(+), 19 deletions(-) [-- Attachment #2: per_cpu_nr_dentry.patch --] [-- Type: text/plain, Size: 4782 bytes --] diff --git a/fs/dcache.c b/fs/dcache.c index a1d86c7..42ed9fc 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -61,12 +61,38 @@ static struct kmem_cache *dentry_cache __read_mostly; static unsigned int d_hash_mask __read_mostly; static unsigned int d_hash_shift __read_mostly; static struct hlist_head *dentry_hashtable __read_mostly; +static DEFINE_PER_CPU(int, nr_dentry); /* Statistics gathering. 
*/ struct dentry_stat_t dentry_stat = { .age_limit = 45, }; +/* + * Handle nr_dentry sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + int cpu; + int counter = 0; + + for_each_possible_cpu(cpu) + counter += per_cpu(nr_dentry, cpu); + if (counter < 0) + counter = 0; + dentry_stat.nr_dentry = counter; + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void __d_free(struct dentry *dentry) { WARN_ON(!list_empty(&dentry->d_alias)); @@ -82,8 +108,7 @@ static void d_callback(struct rcu_head *head) } /* - * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry - * inside dcache_lock. + * no dcache_lock, please. */ static void d_free(struct dentry *dentry) { @@ -94,6 +119,8 @@ static void d_free(struct dentry *dentry) __d_free(dentry); else call_rcu(&dentry->d_u.d_rcu, d_callback); + get_cpu_var(nr_dentry)--; + put_cpu_var(nr_dentry); } /* @@ -172,7 +199,6 @@ static struct dentry *d_kill(struct dentry *dentry) struct dentry *parent; list_del(&dentry->d_u.d_child); - dentry_stat.nr_dentry--; /* For d_free, below */ /*drops the locks, at that point nobody can reach this dentry */ dentry_iput(dentry); if (IS_ROOT(dentry)) @@ -619,7 +645,6 @@ void shrink_dcache_sb(struct super_block * sb) static void shrink_dcache_for_umount_subtree(struct dentry *dentry) { struct dentry *parent; - unsigned detached = 0; BUG_ON(!IS_ROOT(dentry)); @@ -678,7 +703,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) } list_del(&dentry->d_u.d_child); - detached++; inode = dentry->d_inode; if (inode) { @@ -696,7 +720,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) * otherwise we ascend to the parent and move to the * 
next sibling if there is one */ if (!parent) - goto out; + return; dentry = parent; @@ -705,11 +729,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) dentry = list_entry(dentry->d_subdirs.next, struct dentry, d_u.d_child); } -out: - /* several dentries were freed, need to correct nr_dentry */ - spin_lock(&dcache_lock); - dentry_stat.nr_dentry -= detached; - spin_unlock(&dcache_lock); } /* @@ -943,8 +962,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) dentry->d_flags = DCACHE_UNHASHED; spin_lock_init(&dentry->d_lock); dentry->d_inode = NULL; - dentry->d_parent = NULL; - dentry->d_sb = NULL; dentry->d_op = NULL; dentry->d_fsdata = NULL; dentry->d_mounted = 0; @@ -959,15 +976,17 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) if (parent) { dentry->d_parent = dget(parent); dentry->d_sb = parent->d_sb; + spin_lock(&dcache_lock); + list_add(&dentry->d_u.d_child, &parent->d_subdirs); + spin_unlock(&dcache_lock); } else { + dentry->d_parent = NULL; + dentry->d_sb = NULL; INIT_LIST_HEAD(&dentry->d_u.d_child); } - spin_lock(&dcache_lock); - if (parent) - list_add(&dentry->d_u.d_child, &parent->d_subdirs); - dentry_stat.nr_dentry++; - spin_unlock(&dcache_lock); + get_cpu_var(nr_dentry)++; + put_cpu_var(nr_dentry); return dentry; } diff --git a/include/linux/fs.h b/include/linux/fs.h index 0dcdd94..c5e7aa5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2216,6 +2216,8 @@ static inline void free_secdata(void *secdata) struct ctl_table; int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 9d048fa..eebddef 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1243,7 +1243,7 @@ static struct ctl_table 
fs_table[] = { .data = &dentry_stat, .maxlen = 6*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_dentry, }, { .ctl_name = FS_OVERFLOWUID, ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH 1/6] fs: Introduce a per_cpu nr_dentry 2008-11-26 23:30 ` [PATCH 1/6] fs: Introduce a per_cpu nr_dentry Eric Dumazet @ 2008-11-27 9:41 ` Christoph Hellwig 0 siblings, 0 replies; 191+ messages in thread From: Christoph Hellwig @ 2008-11-27 9:41 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig Looks good modulo the exact version of the for_each_cpu loops that the experts in that area can help with. Same for the per_cpu nr_inodes patch. ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator 2008-11-21 15:34 ` Ingo Molnar 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet 2008-11-26 23:30 ` [PATCH 1/6] fs: Introduce a per_cpu nr_dentry Eric Dumazet @ 2008-11-26 23:32 ` Eric Dumazet 2008-11-27 9:46 ` Christoph Hellwig 2008-11-26 23:32 ` [PATCH 4/6] fs: Introduce a per_cpu nr_inodes Eric Dumazet ` (2 subsequent siblings) 5 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 565 bytes --] new_inode() dirties a contended cache line to get inode numbers. Solve this problem by providing each cpu with a per_cpu variable, fed from the shared last_ino, but only once every 1024 allocations. This reduces contention on the shared last_ino. Note: the last_ino_get() method must be called with preemption disabled on SMP. (socket8 bench result: no difference, but this is because the inode_lock cost is too heavy) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/inode.c | 27 +++++++++++++++++++++++++-- 1 files changed, 25 insertions(+), 2 deletions(-) [-- Attachment #2: last_ino.patch --] [-- Type: text/plain, Size: 1308 bytes --] diff --git a/fs/inode.c b/fs/inode.c index 0487ddb..d850050 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -534,6 +534,30 @@ repeat: return node ? inode : NULL; } +#ifdef CONFIG_SMP +/* + * each cpu owns a block of 1024 numbers. 
+ * The global 'last_ino' is dirtied once every 1024 allocations + */ +static DEFINE_PER_CPU(int, cpu_ino_alloc) = {0}; +static int last_ino_get(void) +{ + static atomic_t last_ino; + int *ptr = &__raw_get_cpu_var(cpu_ino_alloc); + + if (unlikely((*ptr & 1023) == 0)) + *ptr = atomic_add_return(1024, &last_ino); + return --(*ptr); +} +#else +static int last_ino_get(void) +{ + static int last_ino; + + return ++last_ino; +} +#endif + /** * new_inode - obtain an inode * @sb: superblock @@ -553,7 +577,6 @@ struct inode *new_inode(struct super_block *sb) * error if st_ino won't fit in target struct field. Use 32bit counter * here to attempt to avoid that. */ - static unsigned int last_ino; struct inode * inode; spin_lock_prefetch(&inode_lock); @@ -564,7 +587,7 @@ struct inode *new_inode(struct super_block *sb) inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); - inode->i_ino = ++last_ino; + inode->i_ino = last_ino_get(); inode->i_state = 0; spin_unlock(&inode_lock); } ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator 2008-11-26 23:32 ` [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator Eric Dumazet @ 2008-11-27 9:46 ` Christoph Hellwig 0 siblings, 0 replies; 191+ messages in thread From: Christoph Hellwig @ 2008-11-27 9:46 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig On Thu, Nov 27, 2008 at 12:32:24AM +0100, Eric Dumazet wrote: > new_inode() dirties a contended cache line to get inode numbers. > > Solve this problem by providing to each cpu a per_cpu variable, > feeded by the shared last_ino, but once every 1024 allocations. > > This reduce contention on the shared last_ino. > > Note : last_ino_get() method must be called with preemption > disabled on SMP. Looks a little clumsy. One idea might be to have a special slab for synthetic inodes using new_inode and only assign it on the first allocation and after that re-use it. ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-21 15:34 ` Ingo Molnar ` (2 preceding siblings ...) 2008-11-26 23:32 ` [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator Eric Dumazet @ 2008-11-26 23:32 ` Eric Dumazet 2008-11-27 9:32 ` Peter Zijlstra 2008-11-26 23:32 ` [PATCH 5/6] fs: Introduce special inodes Eric Dumazet 2008-11-26 23:32 ` [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs Eric Dumazet 5 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 473 bytes --] Avoids cache line ping pongs between cpus and prepares the next patch, because updates of the nr_inodes metric don't need inode_lock anymore. (socket8 bench result: 25s to 20.5s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/fs-writeback.c | 2 - fs/inode.c | 51 +++++++++++++++++++++++++++++++++++------- include/linux/fs.h | 3 ++ kernel/sysctl.c | 4 +-- mm/page-writeback.c | 2 - 5 files changed, 50 insertions(+), 12 deletions(-) [-- Attachment #2: per_cpu_nr_inodes.patch --] [-- Type: text/plain, Size: 5705 bytes --] diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index d0ff0b8..b591cdd 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait) unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS); wbc.nr_to_write = nr_dirty + nr_unstable + - (inodes_stat.nr_inodes - inodes_stat.nr_unused) + + (get_nr_inodes() - inodes_stat.nr_unused) + nr_dirty + nr_unstable; wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ sync_sb_inodes(sb, &wbc); diff --git a/fs/inode.c b/fs/inode.c index d850050..8d8d40e 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex); * Statistics gathering.. 
*/ struct inodes_stat_t inodes_stat; +static DEFINE_PER_CPU(int, nr_inodes); static struct kmem_cache * inode_cachep __read_mostly; +int get_nr_inodes(void) +{ + int cpu; + int counter = 0; + + for_each_possible_cpu(cpu) + counter += per_cpu(nr_inodes, cpu); + if (counter < 0) + counter = 0; + return counter; +} + +/* + * Handle nr_inodes sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + inodes_stat.nr_inodes = get_nr_inodes(); + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void wake_up_inode(struct inode *inode) { /* @@ -306,9 +337,8 @@ static void dispose_list(struct list_head *head) destroy_inode(inode); nr_disposed++; } - spin_lock(&inode_lock); - inodes_stat.nr_inodes -= nr_disposed; - spin_unlock(&inode_lock); + get_cpu_var(nr_inodes) -= nr_disposed; + put_cpu_var(nr_inodes); } /* @@ -584,10 +614,11 @@ struct inode *new_inode(struct super_block *sb) inode = alloc_inode(sb); if (inode) { spin_lock(&inode_lock); - inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); + get_cpu_var(nr_inodes)++; inode->i_ino = last_ino_get(); + put_cpu_var(nr_inodes); inode->i_state = 0; spin_unlock(&inode_lock); } @@ -645,7 +676,8 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h if (set(inode, data)) goto set_failed; - inodes_stat.nr_inodes++; + get_cpu_var(nr_inodes)++; + put_cpu_var(nr_inodes); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); @@ -694,7 +726,8 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he old = find_inode_fast(sb, head, ino); if (!old) { 
inode->i_ino = ino; - inodes_stat.nr_inodes++; + get_cpu_var(nr_inodes)++; + put_cpu_var(nr_inodes); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); @@ -1065,8 +1098,9 @@ void generic_delete_inode(struct inode *inode) list_del_init(&inode->i_list); list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; spin_unlock(&inode_lock); + get_cpu_var(nr_inodes)--; + put_cpu_var(nr_inodes); security_inode_delete(inode); @@ -1116,8 +1150,9 @@ static void generic_forget_inode(struct inode *inode) list_del_init(&inode->i_list); list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; spin_unlock(&inode_lock); + get_cpu_var(nr_inodes)--; + put_cpu_var(nr_inodes); if (inode->i_data.nrpages) truncate_inode_pages(&inode->i_data, 0); clear_inode(inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index c5e7aa5..2482977 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -47,6 +47,7 @@ struct inodes_stat_t { int dummy[5]; /* padding for sysctl ABI compatibility */ }; extern struct inodes_stat_t inodes_stat; +extern int get_nr_inodes(void); extern int leases_enable, lease_break_time; @@ -2218,6 +2219,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index eebddef..eebed01 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1202,7 +1202,7 @@ static struct ctl_table fs_table[] = { .data = &inodes_stat, .maxlen = 2*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_inodes, }, { 
.ctl_name = FS_STATINODE, @@ -1210,7 +1210,7 @@ static struct ctl_table fs_table[] = { .data = &inodes_stat, .maxlen = 7*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_inodes, }, { .procname = "file-nr", diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2970e35..a71a922 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg) next_jif = start_jif + dirty_writeback_interval; nr_to_write = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS) + - (inodes_stat.nr_inodes - inodes_stat.nr_unused); + (get_nr_inodes() - inodes_stat.nr_unused); while (nr_to_write > 0) { wbc.more_io = 0; wbc.encountered_congestion = 0; ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-26 23:32 ` [PATCH 4/6] fs: Introduce a per_cpu nr_inodes Eric Dumazet @ 2008-11-27 9:32 ` Peter Zijlstra 2008-11-27 9:39 ` Peter Zijlstra ` (3 more replies) 0 siblings, 4 replies; 191+ messages in thread From: Peter Zijlstra @ 2008-11-27 9:32 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig, travis On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote: > Avoids cache line ping pongs between cpus and prepare next patch, > because updates of nr_inodes metric dont need inode_lock anymore. > > (socket8 bench result : 25s to 20.5s) > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex); > * Statistics gathering.. > */ > struct inodes_stat_t inodes_stat; > +static DEFINE_PER_CPU(int, nr_inodes); > > static struct kmem_cache * inode_cachep __read_mostly; > > +int get_nr_inodes(void) > +{ > + int cpu; > + int counter = 0; > + > + for_each_possible_cpu(cpu) > + counter += per_cpu(nr_inodes, cpu); > + if (counter < 0) > + counter = 0; > + return counter; > +} It would be good to get a cpu hotplug handler here and move to for_each_online_cpu(). People are wanting distro's to be build with NR_CPUS=4096. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-27 9:32 ` Peter Zijlstra @ 2008-11-27 9:39 ` Peter Zijlstra 2008-11-27 9:48 ` Christoph Hellwig 2008-11-27 10:01 ` Eric Dumazet ` (2 subsequent siblings) 3 siblings, 1 reply; 191+ messages in thread From: Peter Zijlstra @ 2008-11-27 9:39 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig, travis On Thu, 2008-11-27 at 10:33 +0100, Peter Zijlstra wrote: > On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote: > > Avoids cache line ping pongs between cpus and prepare next patch, > > because updates of nr_inodes metric dont need inode_lock anymore. > > > > (socket8 bench result : 25s to 20.5s) > > > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > --- > > > @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex); > > * Statistics gathering.. > > */ > > struct inodes_stat_t inodes_stat; > > +static DEFINE_PER_CPU(int, nr_inodes); > > > > static struct kmem_cache * inode_cachep __read_mostly; > > > > +int get_nr_inodes(void) > > +{ > > + int cpu; > > + int counter = 0; > > + > > + for_each_possible_cpu(cpu) > > + counter += per_cpu(nr_inodes, cpu); > > + if (counter < 0) > > + counter = 0; > > + return counter; > > +} > > It would be good to get a cpu hotplug handler here and move to > for_each_online_cpu(). People are wanting distro's to be build with > NR_CPUS=4096. Also, this trade-off between global vs per_cpu only works if get_nr_inodes() is called significantly less than nr_inodes is changed. With it being called from writeback that might not be true for all workloads. One thing you can do about it is use the regular per-cpu counter stuff, which allows you to do an approximation of the global number (it also does all the hotplug stuff for you already). ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-27 9:39 ` Peter Zijlstra @ 2008-11-27 9:48 ` Christoph Hellwig 0 siblings, 0 replies; 191+ messages in thread From: Christoph Hellwig @ 2008-11-27 9:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Eric Dumazet, Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig, travis On Thu, Nov 27, 2008 at 10:39:31AM +0100, Peter Zijlstra wrote: > With it being called from writeback that might not be true for all > workloads. One thing you can do about it is use the regular per-cpu > counter stuff, which allows you to do an approximation of the global > number (it also does all the hotplug stuff for you already). The way it's used in writeback is utterly stupid and should be fixed :) But otherwise agreed. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-27 9:32 ` Peter Zijlstra 2008-11-27 9:39 ` Peter Zijlstra @ 2008-11-27 10:01 ` Eric Dumazet 2008-11-27 10:07 ` Andi Kleen 2008-11-27 14:46 ` Christoph Lameter 3 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-27 10:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig, travis Peter Zijlstra a écrit : > On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote: >> Avoids cache line ping pongs between cpus and prepare next patch, >> because updates of nr_inodes metric dont need inode_lock anymore. >> >> (socket8 bench result : 25s to 20.5s) >> >> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >> --- > >> @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex); >> * Statistics gathering.. >> */ >> struct inodes_stat_t inodes_stat; >> +static DEFINE_PER_CPU(int, nr_inodes); >> >> static struct kmem_cache * inode_cachep __read_mostly; >> >> +int get_nr_inodes(void) >> +{ >> + int cpu; >> + int counter = 0; >> + >> + for_each_possible_cpu(cpu) >> + counter += per_cpu(nr_inodes, cpu); >> + if (counter < 0) >> + counter = 0; >> + return counter; >> +} > > It would be good to get a cpu hotplug handler here and move to > for_each_online_cpu(). People are wanting distro's to be build with > NR_CPUS=4096. Hum, I guess we can use regular percpu_counter for this... ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-27 9:32 ` Peter Zijlstra 2008-11-27 9:39 ` Peter Zijlstra 2008-11-27 10:01 ` Eric Dumazet @ 2008-11-27 10:07 ` Andi Kleen 2008-11-27 14:46 ` Christoph Lameter 3 siblings, 0 replies; 191+ messages in thread From: Andi Kleen @ 2008-11-27 10:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Eric Dumazet, Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig, travis Peter Zijlstra <a.p.zijlstra@chello.nl> writes: >> >> +int get_nr_inodes(void) >> +{ >> + int cpu; >> + int counter = 0; >> + >> + for_each_possible_cpu(cpu) >> + counter += per_cpu(nr_inodes, cpu); >> + if (counter < 0) >> + counter = 0; >> + return counter; >> +} > > It would be good to get a cpu hotplug handler here and move to > for_each_online_cpu(). People are wanting distro's to be build with > NR_CPUS=4096. Doesn't matter, possible cpus is always only set to what the machine supports. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-27 9:32 ` Peter Zijlstra ` (2 preceding siblings ...) 2008-11-27 10:07 ` Andi Kleen @ 2008-11-27 14:46 ` Christoph Lameter 3 siblings, 0 replies; 191+ messages in thread From: Christoph Lameter @ 2008-11-27 14:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Eric Dumazet, Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Hellwig, travis On Thu, 27 Nov 2008, Peter Zijlstra wrote: > It would be good to get a cpu hotplug handler here and move to > for_each_online_cpu(). People are wanting distro's to be build with > NR_CPUS=4096. NR_CPUS=4096 does not necessarily increase the number of possible cpus. ^ permalink raw reply [flat|nested] 191+ messages in thread
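The per-CPU counting scheme debated in this sub-thread — lock-free per-CPU updates, with a global value obtained by summing all slots and clamping a possibly negative sum at zero, exactly as get_nr_inodes() in the quoted patch does — can be modeled in userspace as follows. This is a minimal sketch, not kernel code: the NR_CPUS_SIM constant and the *_sim-style function names are illustrative assumptions.

```c
#include <assert.h>

/* Hypothetical userspace model of the per-CPU nr_inodes counter.
 * NR_CPUS_SIM stands in for the set of possible CPUs. */
#define NR_CPUS_SIM 4

static int nr_inodes_pcpu[NR_CPUS_SIM];

/* A CPU updates only its own slot: no shared lock, and no cache-line
 * ping-pong between writers (in a real per-CPU implementation each slot
 * lives in per-CPU memory). */
static void nr_inodes_add(int cpu, int delta)
{
    nr_inodes_pcpu[cpu] += delta;
}

/* Readers sum all slots.  An individual slot may be negative (an inode
 * allocated on one CPU can be freed on another), so the sum is clamped
 * at zero, mirroring get_nr_inodes() in the quoted patch. */
static int nr_inodes_read(void)
{
    int cpu, sum = 0;

    for (cpu = 0; cpu < NR_CPUS_SIM; cpu++)
        sum += nr_inodes_pcpu[cpu];
    return sum < 0 ? 0 : sum;
}
```

This is also essentially what the kernel's percpu_counter abstraction provides (plus batching and CPU-hotplug handling), which is why both Peter Zijlstra and Eric Dumazet converge on reusing it above.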
* [PATCH 5/6] fs: Introduce special inodes 2008-11-21 15:34 ` Ingo Molnar ` (3 preceding siblings ...) 2008-11-26 23:32 ` [PATCH 4/6] fs: Introduce a per_cpu nr_inodes Eric Dumazet @ 2008-11-26 23:32 ` Eric Dumazet 2008-11-27 8:20 ` David Miller 2008-11-26 23:32 ` [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs Eric Dumazet 5 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 995 bytes --] Goal of this patch is to not touch inode_lock for socket/pipes/anonfd inodes allocation/freeing. In new_inode(), we test if super block has MS_SPECIAL flag set. If yes, we dont put inode in "inode_in_use" list nor "sb->s_inodes" list As inode_lock was taken only to protect these lists, we avoid it as well Using iput_special() from dput_special() avoids taking inode_lock at freeing time. This patch has a very noticeable effect, because we avoid dirtying of three contended cache lines in new_inode(), and five cache lines in iput() Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we really need a different flag. 
(socket8 bench result : from 20.5s to 2.94s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/anon_inodes.c | 1 + fs/dcache.c | 2 +- fs/inode.c | 25 ++++++++++++++++++------- fs/pipe.c | 3 ++- include/linux/fs.h | 2 ++ net/socket.c | 1 + 6 files changed, 25 insertions(+), 9 deletions(-) [-- Attachment #2: special_inodes.patch --] [-- Type: text/plain, Size: 3551 bytes --] diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 4f20d48..a0212b3 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -158,6 +158,7 @@ static int __init anon_inode_init(void) error = PTR_ERR(anon_inode_mnt); goto err_unregister_filesystem; } + anon_inode_mnt->mnt_sb->s_flags |= MS_SPECIAL; anon_inode_inode = anon_inode_mkinode(); if (IS_ERR(anon_inode_inode)) { error = PTR_ERR(anon_inode_inode); diff --git a/fs/dcache.c b/fs/dcache.c index d73763b..bade7d7 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -239,7 +239,7 @@ static void dput_special(struct dentry *dentry) return; inode = dentry->d_inode; if (inode) - iput(inode); + iput_special(inode); d_free(dentry); } diff --git a/fs/inode.c b/fs/inode.c index 8d8d40e..1bb6553 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -228,6 +228,14 @@ void destroy_inode(struct inode *inode) kmem_cache_free(inode_cachep, (inode)); } +void iput_special(struct inode *inode) +{ + if (atomic_dec_and_test(&inode->i_count)) { + destroy_inode(inode); + get_cpu_var(nr_inodes)--; + put_cpu_var(nr_inodes); + } +} /* * These are initializations that only need to be done @@ -609,18 +617,21 @@ struct inode *new_inode(struct super_block *sb) */ struct inode * inode; - spin_lock_prefetch(&inode_lock); - inode = alloc_inode(sb); if (inode) { - spin_lock(&inode_lock); - list_add(&inode->i_list, &inode_in_use); - list_add(&inode->i_sb_list, &sb->s_inodes); + inode->i_state = 0; + if (sb->s_flags & MS_SPECIAL) { + INIT_LIST_HEAD(&inode->i_list); + INIT_LIST_HEAD(&inode->i_sb_list); + } else { + spin_lock(&inode_lock); + list_add(&inode->i_list, &inode_in_use); + 
list_add(&inode->i_sb_list, &sb->s_inodes); + spin_unlock(&inode_lock); + } get_cpu_var(nr_inodes)--; inode->i_ino = last_ino_get(); put_cpu_var(nr_inodes); - inode->i_state = 0; - spin_unlock(&inode_lock); } return inode; } diff --git a/fs/pipe.c b/fs/pipe.c index 5cc132a..6fca681 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1078,7 +1078,8 @@ static int __init init_pipe_fs(void) if (IS_ERR(pipe_mnt)) { err = PTR_ERR(pipe_mnt); unregister_filesystem(&pipe_fs_type); - } + } else + pipe_mnt->mnt_sb->s_flags |= MS_SPECIAL; } return err; } diff --git a/include/linux/fs.h b/include/linux/fs.h index 2482977..dd0e8a5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -136,6 +136,7 @@ extern int dir_notify_enable; #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */ #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */ #define MS_I_VERSION (1<<23) /* Update inode I_version field */ +#define MS_SPECIAL (1<<24) /* special fs (inodes not in sb->s_inodes) */ #define MS_ACTIVE (1<<30) #define MS_NOUSER (1<<31) @@ -1898,6 +1899,7 @@ extern void __iget(struct inode * inode); extern void iget_failed(struct inode *); extern void clear_inode(struct inode *); extern void destroy_inode(struct inode *); +extern void iput_special(struct inode *inode); extern struct inode *new_inode(struct super_block *); extern int should_remove_suid(struct dentry *); extern int file_remove_suid(struct file *); diff --git a/net/socket.c b/net/socket.c index f41b6c6..4177456 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2205,6 +2205,7 @@ static int __init sock_init(void) init_inodecache(); register_filesystem(&sock_fs_type); sock_mnt = kern_mount(&sock_fs_type); + sock_mnt->mnt_sb->s_flags |= MS_SPECIAL; /* The real protocol initialization is performed in later initcalls. */ ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH 5/6] fs: Introduce special inodes 2008-11-26 23:32 ` [PATCH 5/6] fs: Introduce special inodes Eric Dumazet @ 2008-11-27 8:20 ` David Miller 0 siblings, 0 replies; 191+ messages in thread From: David Miller @ 2008-11-27 8:20 UTC (permalink / raw) To: dada1 Cc: mingo, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, netdev, cl, hch From: Eric Dumazet <dada1@cosmosbay.com> Date: Thu, 27 Nov 2008 00:32:41 +0100 > Goal of this patch is to not touch inode_lock for socket/pipes/anonfd > inodes allocation/freeing. > > In new_inode(), we test if super block has MS_SPECIAL flag set. > If yes, we dont put inode in "inode_in_use" list nor "sb->s_inodes" list > As inode_lock was taken only to protect these lists, we avoid it as well > > Using iput_special() from dput_special() avoids taking inode_lock > at freeing time. > > This patch has a very noticeable effect, because we avoid dirtying of three contended cache lines in new_inode(), and five cache lines > in iput() > > Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we > really need a different flag. > > (socket8 bench result : from 20.5s to 2.94s) > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> No problem with networking part: Acked-by: David S. Miller <davem@davemloft.net> ^ permalink raw reply [flat|nested] 191+ messages in thread
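The core of the MS_SPECIAL shortcut in the patch above is a single branch in new_inode(): inodes of a "special" superblock are never linked into inode_in_use or sb->s_inodes, so the global inode_lock is never taken for them. A standalone model of that decision, with a counter tracking how often the global lock would have been acquired (the struct and names below are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the MS_SPECIAL test in new_inode(). */
#define MS_SPECIAL_SIM (1 << 24)

static int inode_lock_acquisitions; /* times the global lock was taken */

struct sb_sim {
    unsigned long s_flags;
};

/* Returns true when the inode had to be linked into the global lists
 * (and the global inode_lock therefore taken). */
static bool new_inode_sim(const struct sb_sim *sb)
{
    if (sb->s_flags & MS_SPECIAL_SIM)
        return false;           /* private inode: skip inode_lock entirely */
    inode_lock_acquisitions++;  /* spin_lock(&inode_lock) in the real code */
    return true;
}
```

Per-socket and per-pipe, this removes two list insertions and a contended spin_lock/spin_unlock pair from the allocation path, which is where the 20.5s-to-2.94s improvement quoted above comes from.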
* [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-21 15:34 ` Ingo Molnar ` (4 preceding siblings ...) 2008-11-26 23:32 ` [PATCH 5/6] fs: Introduce special inodes Eric Dumazet @ 2008-11-26 23:32 ` Eric Dumazet 2008-11-27 8:21 ` David Miller ` (2 more replies) 5 siblings, 3 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 511 bytes --] This function arms a flag (MNT_SPECIAL) on the vfs, to avoid refcounting on permanent system vfs. Use this function for sockets, pipes, anonymous fds. (socket8 bench result : from 2.94s to 2.23s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/anon_inodes.c | 2 +- fs/pipe.c | 2 +- fs/super.c | 9 +++++++++ include/linux/fs.h | 1 + include/linux/mount.h | 5 +++-- net/socket.c | 2 +- 6 files changed, 16 insertions(+), 5 deletions(-) [-- Attachment #2: mnt_special.patch --] [-- Type: text/plain, Size: 3352 bytes --] diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index a0212b3..42dfe28 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -153,7 +153,7 @@ static int __init anon_inode_init(void) error = register_filesystem(&anon_inode_fs_type); if (error) goto err_exit; - anon_inode_mnt = kern_mount(&anon_inode_fs_type); + anon_inode_mnt = kern_mount_special(&anon_inode_fs_type); if (IS_ERR(anon_inode_mnt)) { error = PTR_ERR(anon_inode_mnt); goto err_unregister_filesystem; diff --git a/fs/pipe.c b/fs/pipe.c index 6fca681..391d4fe 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1074,7 +1074,7 @@ static int __init init_pipe_fs(void) int err = register_filesystem(&pipe_fs_type); if (!err) { - pipe_mnt = kern_mount(&pipe_fs_type); + pipe_mnt = kern_mount_special(&pipe_fs_type); if (IS_ERR(pipe_mnt)) { err = PTR_ERR(pipe_mnt); 
unregister_filesystem(&pipe_fs_type); diff --git a/fs/super.c b/fs/super.c index 400a760..a8e14f7 100644 --- a/fs/super.c +++ b/fs/super.c @@ -982,3 +982,12 @@ struct vfsmount *kern_mount_data(struct file_system_type *type, void *data) } EXPORT_SYMBOL_GPL(kern_mount_data); + +struct vfsmount *kern_mount_special(struct file_system_type *type) +{ + struct vfsmount *res = kern_mount_data(type, NULL); + + if (!IS_ERR(res)) + res->mnt_flags |= MNT_SPECIAL; + return res; +} diff --git a/include/linux/fs.h b/include/linux/fs.h index dd0e8a5..a92544a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1591,6 +1591,7 @@ extern int register_filesystem(struct file_system_type *); extern int unregister_filesystem(struct file_system_type *); extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data); #define kern_mount(type) kern_mount_data(type, NULL) +extern struct vfsmount *kern_mount_special(struct file_system_type *); extern int may_umount_tree(struct vfsmount *); extern int may_umount(struct vfsmount *); extern long do_mount(char *, char *, char *, unsigned long, void *); diff --git a/include/linux/mount.h b/include/linux/mount.h index cab2a85..cb4fa90 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -30,6 +30,7 @@ struct mnt_namespace; #define MNT_SHRINKABLE 0x100 #define MNT_IMBALANCED_WRITE_COUNT 0x200 /* just for debugging */ +#define MNT_SPECIAL 0x400 /* special mount (pipes,sockets,...) 
*/ #define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */ #define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */ @@ -73,7 +74,7 @@ struct vfsmount { static inline struct vfsmount *mntget(struct vfsmount *mnt) { - if (mnt) + if (mnt && !(mnt->mnt_flags & MNT_SPECIAL)) atomic_inc(&mnt->mnt_count); return mnt; } @@ -87,7 +88,7 @@ extern int __mnt_is_readonly(struct vfsmount *mnt); static inline void mntput(struct vfsmount *mnt) { - if (mnt) { + if (mnt && !(mnt->mnt_flags & MNT_SPECIAL)) { mnt->mnt_expiry_mark = 0; mntput_no_expire(mnt); } diff --git a/net/socket.c b/net/socket.c index 4177456..2857d70 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2204,7 +2204,7 @@ static int __init sock_init(void) init_inodecache(); register_filesystem(&sock_fs_type); - sock_mnt = kern_mount(&sock_fs_type); + sock_mnt = kern_mount_special(&sock_fs_type); sock_mnt->mnt_sb->s_flags |= MS_SPECIAL; /* The real protocol initialization is performed in later initcalls. ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-26 23:32 ` [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs Eric Dumazet @ 2008-11-27 8:21 ` David Miller 2008-11-27 9:53 ` Christoph Hellwig 2008-11-28 9:26 ` Al Viro 2 siblings, 0 replies; 191+ messages in thread From: David Miller @ 2008-11-27 8:21 UTC (permalink / raw) To: dada1 Cc: mingo, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, netdev, cl, hch From: Eric Dumazet <dada1@cosmosbay.com> Date: Thu, 27 Nov 2008 00:32:59 +0100 > This function arms a flag (MNT_SPECIAL) on the vfs, to avoid > refcounting on permanent system vfs. > Use this function for sockets, pipes, anonymous fds. > > (socket8 bench result : from 2.94s to 2.23s) > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> For networking bits: Acked-by: David S. Miller <davem@davemloft.net> ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-26 23:32 ` [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs Eric Dumazet 2008-11-27 8:21 ` David Miller @ 2008-11-27 9:53 ` Christoph Hellwig 2008-11-27 10:04 ` Eric Dumazet 2008-11-28 9:26 ` Al Viro 2 siblings, 1 reply; 191+ messages in thread From: Christoph Hellwig @ 2008-11-27 9:53 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: > This function arms a flag (MNT_SPECIAL) on the vfs, to avoid > refcounting on permanent system vfs. > Use this function for sockets, pipes, anonymous fds. special is not a useful name for a flag, by definition everything that needs a flag is special compared to the version that doesn't need a flag. The general idea of skipping the writer counts makes sense, but please give it a descriptive name that explains the not unmountable thing. And please kill your kern_mount wrapper and just set the flag manually. Also I think it should be a superblock flag, not a mount flag as you don't want these to differ for multiple mounts of the same filesystem. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-27 9:53 ` Christoph Hellwig @ 2008-11-27 10:04 ` Eric Dumazet 2008-11-27 10:10 ` Christoph Hellwig 0 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-27 10:04 UTC (permalink / raw) To: Christoph Hellwig Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter Christoph Hellwig a écrit : > On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: >> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid >> refcounting on permanent system vfs. >> Use this function for sockets, pipes, anonymous fds. > > special is not a useful name for a flag, by definition everything that > needs a flag is special compared to the version that doesn't need a > flag. > > The general idea of skipping the writer counts makes sense, but please > give it a descriptive name that explains the not unmountable thing. > And please kill your kern_mount wrapper and just set the flag manually. > > Also I think it should be a superblock flag, not a mount flag as you > don't want these to differ for multiple mounts of the same filesystem. > > Hum.. we have a superblock flag already, but testing it in mntput()/mntget() is going to be a little bit expensive if we add a dereference? if (mnt && mnt->mnt_sb->s_flags & MS_SPECIAL) { ... } ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-27 10:04 ` Eric Dumazet @ 2008-11-27 10:10 ` Christoph Hellwig 0 siblings, 0 replies; 191+ messages in thread From: Christoph Hellwig @ 2008-11-27 10:10 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Hellwig, Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter On Thu, Nov 27, 2008 at 11:04:38AM +0100, Eric Dumazet wrote: > Hum.. we have a superblock flag already, but testing it in mntput()/mntget() > is going to be a litle bit expensive if we add a derefence ? > > if (mnt && mnt->mnt_sb->s_flags & MS_SPECIAL) { > ... > } Well, run a benchmark to see if it makes any difference. And when it does please always set the mount flag from the common mount code when it's set on the superblock, and document that this is the only valid way to set it. ^ permalink raw reply [flat|nested] 191+ messages in thread
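The MNT_SPECIAL trick under discussion amounts to skipping the atomic refcount entirely in mntget()/mntput() for permanent, never-unmounted system mounts. A small userspace model of that pattern (plain ints instead of atomics, and *_sim names, purely for illustration):

```c
#include <assert.h>

/* Illustrative model of the MNT_SPECIAL refcount shortcut. */
#define MNT_SPECIAL 0x400

struct vfsmount_sim {
    int mnt_count;
    int mnt_flags;
};

static struct vfsmount_sim *mntget_sim(struct vfsmount_sim *mnt)
{
    if (mnt && !(mnt->mnt_flags & MNT_SPECIAL))
        mnt->mnt_count++;   /* atomic_inc(&mnt->mnt_count) in the real code */
    return mnt;
}

static void mntput_sim(struct vfsmount_sim *mnt)
{
    if (mnt && !(mnt->mnt_flags & MNT_SPECIAL))
        mnt->mnt_count--;   /* mntput_no_expire() path in the real code */
}
```

The win comes from never dirtying the mount's refcount cache line on these hot paths; the cost, as Christoph Hellwig points out, is that the flag's placement (mount vs. superblock) and its invariants have to be nailed down, since a flagged mount can never be legitimately freed via refcounting again.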
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-26 23:32 ` [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs Eric Dumazet 2008-11-27 8:21 ` David Miller 2008-11-27 9:53 ` Christoph Hellwig @ 2008-11-28 9:26 ` Al Viro 2008-11-28 9:34 ` Al Viro ` (2 more replies) 2 siblings, 3 replies; 191+ messages in thread From: Al Viro @ 2008-11-28 9:26 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth, ink On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: > This function arms a flag (MNT_SPECIAL) on the vfs, to avoid > refcounting on permanent system vfs. > Use this function for sockets, pipes, anonymous fds. IMO that's pushing it past the point of usefulness; unless you can show that this really gives considerable win on pipes et.al. *AND* that it doesn't hurt other loads... dput() part: again, I want to see what happens on other loads; it's probably fine (and win is certainly more than from mntput() change), but... The thing is, atomic_dec_and_lock() in there is often done on dentries with d_count > 1 and that's fairly cheap (and doesn't involve contention on dcache_lock on sane targets). FWIW, unless there's a really good reason to do alpha atomic_dec_and_lock() in a special way, I'd try to compare with if (atomic_add_unless(&dentry->d_count, -1, 1)) return; if (your flag) sod off to special spin_lock(&dcache_lock); if (atomic_dec_and_test(&dentry->d_count)) { spin_unlock(&dcache_lock); return; } the rest as usual As for the alpha... unless I'm misreading the assembler in arch/alpha/lib/dec_and_lock.c, it looks like we have essentially an implementation of atomic_add_unless() in there and one that just might be better than what we've got in arch/alpha/include/asm/atomic.h. 
How about 1: ldl_l x, addr cmpne x, u, y /* y = x != u */ beq y, 3f /* if !y -> bugger off, return 0 */ addl x, a, y stl_c y, addr /* y <- *addr has not changed since ldl_l */ beq y, 2f 3: /* return value is in y */ .subsection 2 /* out of the way */ 2: br 1b .previous for atomic_add_unless() guts? With that we are rid of HAVE_DEC_LOCK and get a uniform implementation of atomic_dec_and_lock() for all targets... AFAICS, that would be static __inline__ int atomic_add_unless(atomic_t *v, int a, int u) { unsigned long temp, res; __asm__ __volatile__( "1: ldl_l %0,%1\n" " cmpne %0,%4,%2\n" " beq %4,3f\n" " addl %0,%3,%4\n" " stl_c %2,%1\n" " beq %2,2f\n" "3:\n" ".subsection 2\n" "2: br 1b\n" ".previous" :"=&r" (temp), "=m" (v->counter), "=&r" (res) :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory"); smp_mb(); return res; } static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u) { unsigned long temp, res; __asm__ __volatile__( "1: ldq_l %0,%1\n" " cmpne %0,%4,%2\n" " beq %4,3f\n" " addq %0,%3,%4\n" " stq_c %2,%1\n" " beq %2,2f\n" "3:\n" ".subsection 2\n" "2: br 1b\n" ".previous" :"=&r" (temp), "=m" (v->counter), "=&r" (res) :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory"); smp_mb(); return res; } Comments? ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-28 9:26 ` Al Viro @ 2008-11-28 9:34 ` Al Viro 2008-11-28 18:02 ` Ingo Molnar 2008-11-28 22:37 ` Eric Dumazet 2 siblings, 0 replies; 191+ messages in thread From: Al Viro @ 2008-11-28 9:34 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth, ink On Fri, Nov 28, 2008 at 09:26:04AM +0000, Al Viro wrote: gyah... That would be > static __inline__ int atomic_add_unless(atomic_t *v, int a, int u) > { > unsigned long temp, res; > __asm__ __volatile__( > "1: ldl_l %0,%1\n" > " cmpne %0,%4,%2\n" " beq %2,3f\n" " addl %0,%3,%2\n" > " stl_c %2,%1\n" > " beq %2,2f\n" > "3:\n" > ".subsection 2\n" > "2: br 1b\n" > ".previous" > :"=&r" (temp), "=m" (v->counter), "=&r" (res) > :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory"); > smp_mb(); > return res; > } > > static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u) > { > unsigned long temp, res; > __asm__ __volatile__( > "1: ldq_l %0,%1\n" > " cmpne %0,%4,%2\n" " beq %2,3f\n" " addq %0,%3,%2\n" > " stq_c %2,%1\n" > " beq %2,2f\n" > "3:\n" > ".subsection 2\n" > "2: br 1b\n" > ".previous" > :"=&r" (temp), "=m" (v->counter), "=&r" (res) > :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory"); > smp_mb(); > return res; > } > > Comments? ^ permalink raw reply [flat|nested] 191+ messages in thread
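The semantics Al Viro's Alpha assembler implements — "add `a` to the counter unless it currently equals `u`, report whether the add happened" — can be expressed portably with a C11 compare-exchange loop. This is a userspace sketch of the same dput() fast path, not the kernel's implementation (the *_sim name is illustrative):

```c
#include <stdatomic.h>

/* Portable model of atomic_add_unless(): add `a` to *v unless *v == u.
 * Returns 1 if the add happened, 0 if the value was already u.
 * This mirrors the ll/sc loop in the quoted Alpha code: load, bail out
 * on the forbidden value, otherwise attempt the store and retry on
 * interference. */
static int atomic_add_unless_sim(atomic_int *v, int a, int u)
{
    int c = atomic_load(v);

    while (c != u) {
        /* On failure, c is reloaded with the current value and we
         * re-test against u before retrying. */
        if (atomic_compare_exchange_weak(v, &c, c + a))
            return 1;   /* added without taking any lock */
    }
    return 0;           /* value == u: caller must take the slow path */
}
```

In the dput() shape sketched above, `atomic_add_unless_sim(&dentry->d_count, -1, 1)` returning 1 is the common case (d_count > 1), and only a return of 0 forces the dcache_lock slow path — which is exactly why the fast path is cheap and contention-free on sane targets.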
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-28 9:26 ` Al Viro 2008-11-28 9:34 ` Al Viro @ 2008-11-28 18:02 ` Ingo Molnar 2008-11-28 18:58 ` Ingo Molnar 2008-11-28 22:20 ` Eric Dumazet 2008-11-28 22:37 ` Eric Dumazet 2 siblings, 2 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-28 18:02 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth, ink * Al Viro <viro@ZenIV.linux.org.uk> wrote: > On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: > > This function arms a flag (MNT_SPECIAL) on the vfs, to avoid > > refcounting on permanent system vfs. > > Use this function for sockets, pipes, anonymous fds. > > IMO that's pushing it past the point of usefulness; unless you can show > that this really gives considerable win on pipes et.al. *AND* that it > doesn't hurt other loads... The numbers look pretty convincing: > > (socket8 bench result : from 2.94s to 2.23s) And i wouldnt expect it to hurt real-filesystem workloads. Here's the contemporary trace of a typical ext3- sys_open(): 0) | sys_open() { 0) | do_sys_open() { 0) | getname() { 0) 0.367 us | kmem_cache_alloc(); 0) | strncpy_from_user(); { 0) | _cond_resched() { 0) | need_resched() { 0) 0.363 us | constant_test_bit(); 0) 1. 47 us | } 0) 1.815 us | } 0) 2.587 us | } 0) 4. 22 us | } 0) | alloc_fd() { 0) 0.480 us | _spin_lock(); 0) 0.487 us | expand_files(); 0) 2.356 us | } 0) | do_filp_open() { 0) | path_lookup_open() { 0) | get_empty_filp() { 0) 0.439 us | kmem_cache_alloc(); 0) | security_file_alloc() { 0) 0.316 us | cap_file_alloc_security(); 0) 1. 
87 us | } 0) 3.189 us | } 0) | do_path_lookup() { 0) 0.366 us | _read_lock(); 0) | path_walk() { 0) | __link_path_walk() { 0) | inode_permission() { 0) | ext3_permission() { 0) 0.441 us | generic_permission(); 0) 1.247 us | } 0) | security_inode_permission() { 0) 0.411 us | cap_inode_permission(); 0) 1.186 us | } 0) 3.555 us | } 0) | do_lookup() { 0) | __d_lookup() { 0) 0.486 us | _spin_lock(); 0) 1.369 us | } 0) 0.442 us | __follow_mount(); 0) 3. 14 us | } 0) | path_to_nameidata() { 0) 0.476 us | dput(); 0) 1.235 us | } 0) | inode_permission() { 0) | ext3_permission() { 0) | generic_permission() { 0) | in_group_p() { 0) 0.410 us | groups_search(); 0) 1.172 us | } 0) 1.994 us | } 0) 2.789 us | } 0) | security_inode_permission() { 0) 0.454 us | cap_inode_permission(); 0) 1.238 us | } 0) 5.262 us | } 0) | do_lookup() { 0) | __d_lookup() { 0) 0.480 us | _spin_lock(); 0) 1.621 us | } 0) 0.456 us | __follow_mount(); 0) 3.215 us | } 0) | path_to_nameidata() { 0) 0.420 us | dput(); 0) 1.193 us | } 0) + 23.551 us | } 0) | path_put() { 0) 0.420 us | dput(); 0) | mntput() { 0) 0.359 us | mntput_no_expire(); 0) 1. 
50 us | } 0) 2.544 us | } 0) + 27.253 us | } 0) + 28.850 us | } 0) + 33.217 us | } 0) | may_open() { 0) | inode_permission() { 0) | ext3_permission() { 0) 0.480 us | generic_permission(); 0) 1.229 us | } 0) | security_inode_permission() { 0) 0.405 us | cap_inode_permission(); 0) 1.196 us | } 0) 3.589 us | } 0) 4.600 us | } 0) | nameidata_to_filp() { 0) | __dentry_open() { 0) | file_move() { 0) 0.470 us | _spin_lock(); 0) 1.243 us | } 0) | security_dentry_open() { 0) 0.344 us | cap_dentry_open(); 0) 1.139 us | } 0) 0.412 us | generic_file_open(); 0) 0.561 us | file_ra_state_init(); 0) 5.714 us | } 0) 6.483 us | } 0) + 46.494 us | } 0) 0.453 us | inotify_dentry_parent_queue_event(); 0) 0.403 us | inotify_inode_queue_event(); 0) | fd_install() { 0) 0.440 us | _spin_lock(); 0) 1.247 us | } 0) | putname() { 0) | kmem_cache_free() { 0) | virt_to_head_page() { 0) 0.369 us | constant_test_bit(); 0) 1. 23 us | } 0) 1.738 us | } 0) 2.422 us | } 0) + 60.560 us | } 0) + 61.368 us | } and here's a sys_close(): 0) | sys_close() { 0) 0.540 us | _spin_lock(); 0) | filp_close() { 0) 0.437 us | dnotify_flush(); 0) 0.401 us | locks_remove_posix(); 0) 0.349 us | fput(); 0) 2.679 us | } 0) 4.452 us | } i'd be surprised to see a flag to show up in that codepath. Eric, does your testing confirm that? Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-28 18:02 ` Ingo Molnar @ 2008-11-28 18:58 ` Ingo Molnar 2008-11-28 22:20 ` Eric Dumazet 1 sibling, 0 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-28 18:58 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth, ink * Ingo Molnar <mingo@elte.hu> wrote: > And i wouldnt expect it to hurt real-filesystem workloads. > > Here's the contemporary trace of a typical ext3- sys_open(): here's a sys_open() that has to touch atime: 0) | sys_open() { 0) | do_sys_open() { 0) | getname() { 0) 0.377 us | kmem_cache_alloc(); 0) | strncpy_from_user() { 0) | _cond_resched() { 0) | need_resched() { 0) 0.353 us | constant_test_bit(); 0) 1. 45 us | } 0) 1.739 us | } 0) 2.492 us | } 0) 3.934 us | } 0) | alloc_fd() { 0) 0.374 us | _spin_lock(); 0) 0.447 us | expand_files(); 0) 2.124 us | } 0) | do_filp_open() { 0) | path_lookup_open() { 0) | get_empty_filp() { 0) 0.689 us | kmem_cache_alloc(); 0) | security_file_alloc() { 0) 0.327 us | cap_file_alloc_security(); 0) 1. 
71 us | } 0) 2.869 us | } 0) | do_path_lookup() { 0) 0.460 us | _read_lock(); 0) | path_walk() { 0) | __link_path_walk() { 0) | inode_permission() { 0) | ext3_permission() { 0) 0.434 us | generic_permission(); 0) 1.191 us | } 0) | security_inode_permission() { 0) 0.400 us | cap_inode_permission(); 0) 1.130 us | } 0) 3.453 us | } 0) | do_lookup() { 0) | __d_lookup() { 0) 0.489 us | _spin_lock(); 0) 1.525 us | } 0) 0.449 us | __follow_mount(); 0) 3.115 us | } 0) | path_to_nameidata() { 0) 0.422 us | dput(); 0) 1.204 us | } 0) | inode_permission() { 0) | ext3_permission() { 0) 0.391 us | generic_permission(); 0) 1.223 us | } 0) | security_inode_permission() { 0) 0.406 us | cap_inode_permission(); 0) 1.189 us | } 0) 3.565 us | } 0) | do_lookup() { 0) | __d_lookup() { 0) 0.527 us | _spin_lock(); 0) 1.633 us | } 0) 0.440 us | __follow_mount(); 0) 3.223 us | } 0) | do_follow_link() { 0) | _cond_resched() { 0) | need_resched() { 0) 0.361 us | constant_test_bit(); 0) 1. 64 us | } 0) 1.749 us | } 0) | security_inode_follow_link() { 0) 0.390 us | cap_inode_follow_link(); 0) 1.260 us | } 0) | touch_atime() { 0) | mnt_want_write() { 0) 0.360 us | _spin_lock(); 0) 1.137 us | } 0) | mnt_drop_write() { 0) 0.348 us | _spin_lock(); 0) 1.102 us | } 0) 3.402 us | } 0) 0.446 us | ext3_follow_link(); 0) | __link_path_walk() { 0) | inode_permission() { 0) | ext3_permission() { 0) | generic_permission() { 0) 4.481 us | } 0) | security_inode_permission() { 0) 0.402 us | cap_inode_permission(); 0) 1.127 us | } 0) 6.747 us | } 0) | do_lookup() { 0) | __d_lookup() { 0) 0.547 us | _spin_lock(); 0) 1.758 us | } 0) 0.465 us | __follow_mount(); 0) 3.368 us | } 0) | path_to_nameidata() { 0) 0.419 us | dput(); 0) 1.203 us | } 0) + 13. 40 us | } 0) | path_put() { 0) 0.429 us | dput(); 0) | mntput() { 0) 0.367 us | mntput_no_expire(); 0) 1.130 us | } 0) 2.660 us | } 0) | path_put() { 0) | dput() { 0) | _cond_resched() { 0) | need_resched() { 0) 0.382 us | constant_test_bit(); 0) 1. 
67 us | } 0) 1.808 us | } 0) 0.399 us | _spin_lock(); 0) 0.452 us | _spin_lock(); 0) 4.270 us | } 0) | mntput() { 0) 0.375 us | mntput_no_expire(); 0) 1. 62 us | } 0) 6.547 us | } 0) + 32.702 us | } 0) + 50.413 us | } 0) | path_put() { 0) 0.421 us | dput(); 0) | mntput() { 0) 0.364 us | mntput_no_expire(); 0) 1. 64 us | } 0) 2.545 us | } 0) + 54.147 us | } 0) + 55.780 us | } 0) + 59.714 us | } 0) | may_open() { 0) | inode_permission() { 0) | ext3_permission() { 0) 0.406 us | generic_permission(); 0) 1.189 us | } 0) | security_inode_permission() { 0) 0.388 us | cap_inode_permission(); 0) 1.175 us | } 0) 3.498 us | } 0) 4.328 us | } 0) | nameidata_to_filp() { 0) | __dentry_open() { 0) | file_move() { 0) 0.361 us | _spin_lock(); 0) 1.102 us | } 0) | security_dentry_open() { 0) 0.356 us | cap_dentry_open(); 0) 1.121 us | } 0) 0.400 us | generic_file_open(); 0) 0.544 us | file_ra_state_init(); 0) 5. 11 us | } 0) 5.709 us | } 0) + 71.181 us | } 0) 0.453 us | inotify_dentry_parent_queue_event(); 0) 0.403 us | inotify_inode_queue_event(); 0) | fd_install() { 0) 0.411 us | _spin_lock(); 0) 1.217 us | } 0) | putname() { 0) | kmem_cache_free() { 0) | virt_to_head_page() { 0) 0.371 us | constant_test_bit(); 0) 1. 47 us | } 0) 1.752 us | } 0) 2.446 us | } 0) + 84.676 us | } 0) + 85.365 us | } Ingo ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-28 18:02 ` Ingo Molnar 2008-11-28 18:58 ` Ingo Molnar @ 2008-11-28 22:20 ` Eric Dumazet 1 sibling, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-28 22:20 UTC (permalink / raw) To: Ingo Molnar Cc: Al Viro, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth, ink Ingo Molnar a écrit : > * Al Viro <viro@ZenIV.linux.org.uk> wrote: > >> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: >>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid >>> refcounting on permanent system vfs. >>> Use this function for sockets, pipes, anonymous fds. >> IMO that's pushing it past the point of usefulness; unless you can show >> that this really gives considerable win on pipes et.al. *AND* that it >> doesn't hurt other loads... > > The numbers look pretty convincing: > >>> (socket8 bench result : from 2.94s to 2.23s) > > And i wouldnt expect it to hurt real-filesystem workloads. > > Here's the contemporary trace of a typical ext3- sys_open(): > > 0) | sys_open() { > 0) | do_sys_open() { > 0) | getname() { > 0) 0.367 us | kmem_cache_alloc(); > 0) | strncpy_from_user(); { > 0) | _cond_resched() { > 0) | need_resched() { > 0) 0.363 us | constant_test_bit(); > 0) 1. 47 us | } > 0) 1.815 us | } > 0) 2.587 us | } > 0) 4. 22 us | } > 0) | alloc_fd() { > 0) 0.480 us | _spin_lock(); > 0) 0.487 us | expand_files(); > 0) 2.356 us | } > 0) | do_filp_open() { > 0) | path_lookup_open() { > 0) | get_empty_filp() { > 0) 0.439 us | kmem_cache_alloc(); > 0) | security_file_alloc() { > 0) 0.316 us | cap_file_alloc_security(); > 0) 1. 
87 us | } > 0) 3.189 us | } > 0) | do_path_lookup() { > 0) 0.366 us | _read_lock(); > 0) | path_walk() { > 0) | __link_path_walk() { > 0) | inode_permission() { > 0) | ext3_permission() { > 0) 0.441 us | generic_permission(); > 0) 1.247 us | } > 0) | security_inode_permission() { > 0) 0.411 us | cap_inode_permission(); > 0) 1.186 us | } > 0) 3.555 us | } > 0) | do_lookup() { > 0) | __d_lookup() { > 0) 0.486 us | _spin_lock(); > 0) 1.369 us | } > 0) 0.442 us | __follow_mount(); > 0) 3. 14 us | } > 0) | path_to_nameidata() { > 0) 0.476 us | dput(); > 0) 1.235 us | } > 0) | inode_permission() { > 0) | ext3_permission() { > 0) | generic_permission() { > 0) | in_group_p() { > 0) 0.410 us | groups_search(); > 0) 1.172 us | } > 0) 1.994 us | } > 0) 2.789 us | } > 0) | security_inode_permission() { > 0) 0.454 us | cap_inode_permission(); > 0) 1.238 us | } > 0) 5.262 us | } > 0) | do_lookup() { > 0) | __d_lookup() { > 0) 0.480 us | _spin_lock(); > 0) 1.621 us | } > 0) 0.456 us | __follow_mount(); > 0) 3.215 us | } > 0) | path_to_nameidata() { > 0) 0.420 us | dput(); > 0) 1.193 us | } > 0) + 23.551 us | } > 0) | path_put() { > 0) 0.420 us | dput(); > 0) | mntput() { > 0) 0.359 us | mntput_no_expire(); > 0) 1. 
50 us | } > 0) 2.544 us | } > 0) + 27.253 us | } > 0) + 28.850 us | } > 0) + 33.217 us | } > 0) | may_open() { > 0) | inode_permission() { > 0) | ext3_permission() { > 0) 0.480 us | generic_permission(); > 0) 1.229 us | } > 0) | security_inode_permission() { > 0) 0.405 us | cap_inode_permission(); > 0) 1.196 us | } > 0) 3.589 us | } > 0) 4.600 us | } > 0) | nameidata_to_filp() { > 0) | __dentry_open() { > 0) | file_move() { > 0) 0.470 us | _spin_lock(); > 0) 1.243 us | } > 0) | security_dentry_open() { > 0) 0.344 us | cap_dentry_open(); > 0) 1.139 us | } > 0) 0.412 us | generic_file_open(); > 0) 0.561 us | file_ra_state_init(); > 0) 5.714 us | } > 0) 6.483 us | } > 0) + 46.494 us | } > 0) 0.453 us | inotify_dentry_parent_queue_event(); > 0) 0.403 us | inotify_inode_queue_event(); > 0) | fd_install() { > 0) 0.440 us | _spin_lock(); > 0) 1.247 us | } > 0) | putname() { > 0) | kmem_cache_free() { > 0) | virt_to_head_page() { > 0) 0.369 us | constant_test_bit(); > 0) 1. 23 us | } > 0) 1.738 us | } > 0) 2.422 us | } > 0) + 60.560 us | } > 0) + 61.368 us | } > > and here's a sys_close(): > > 0) | sys_close() { > 0) 0.540 us | _spin_lock(); > 0) | filp_close() { > 0) 0.437 us | dnotify_flush(); > 0) 0.401 us | locks_remove_posix(); > 0) 0.349 us | fput(); > 0) 2.679 us | } > 0) 4.452 us | } > > i'd be surprised to see a flag to show up in that codepath. Eric, does > your testing confirm that? On a socket/pipe, definitly no, because inode->i_sb->s_flags is not contended. But on a shared inode, it might hurt : offsetof(struct inode, i_count)=0x24 offsetof(struct inode, i_lock)=0x70 offsetof(struct inode, i_sb)=0x9c offsetof(struct inode, i_writecount)=0x144 So i_sb sits in a probably contended cache line I wonder why i_writecount sits so far from i_count, that doesnt make sense. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-28 9:26 ` Al Viro 2008-11-28 9:34 ` Al Viro 2008-11-28 18:02 ` Ingo Molnar @ 2008-11-28 22:37 ` Eric Dumazet 2008-11-28 22:43 ` Eric Dumazet 2 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-28 22:37 UTC (permalink / raw) To: Al Viro Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth, ink Al Viro a écrit : > On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: >> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid >> refcounting on permanent system vfs. >> Use this function for sockets, pipes, anonymous fds. > > IMO that's pushing it past the point of usefulness; unless you can show > that this really gives considerable win on pipes et.al. *AND* that it > doesn't hurt other loads... Well, if this is the last cache line that might be shared, then yes, numbers can talk. But coming from 10 to 1 instead of 0 is OK I guess > > dput() part: again, I want to see what happens on other loads; it's probably > fine (and win is certainly more than from mntput() change), but... The > thing is, atomic_dec_and_lock() in there is often done on dentries with > d_count > 1 and that's fairly cheap (and doesn't involve contention on > dcache_lock on sane targets). > > FWIW, unless there's a really good reason to do alpha atomic_dec_and_lock() > in a special way, I'd try to compare with > if (atomic_add_unless(&dentry->d_count, -1, 1)) > return; I dont know, but *reading* d_count before trying to write it is expensive on modern cpus. Oprofile clearly show that on Intel Core2. Then, *testing* the flag before doing the atomic_something() has the same problem. Or we should put flag in a different cache line. I am lazy (time for a sleep here), maybe we are smart here and use a trick like that already ? 
atomic_t atomic_read_with_write_intent(atomic_t *v)
{
	int val = 0;
	/*
	 * No LOCK prefix here, we only give a write intent hint to cpu
	 */
	asm volatile("xaddl %0, %1"
		     : "+r" (val), "+m" (v->counter)
		     : : "memory");
	return val;
}

> 	if (your flag)
> 		sod off to special
> 	spin_lock(&dcache_lock);
> 	if (atomic_dec_and_test(&dentry->d_count)) {
> 		spin_unlock(&dcache_lock);
> 		return;
> 	}
> 	the rest as usual
>

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-28 22:37 ` Eric Dumazet @ 2008-11-28 22:43 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-28 22:43 UTC (permalink / raw) To: Al Viro Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth, ink Eric Dumazet a écrit : > Al Viro a écrit : >> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: >>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid >>> refcounting on permanent system vfs. >>> Use this function for sockets, pipes, anonymous fds. >> >> IMO that's pushing it past the point of usefulness; unless you can show >> that this really gives considerable win on pipes et.al. *AND* that it >> doesn't hurt other loads... > > Well, if this is the last cache line that might be shared, then yes, > numbers can talk. > But coming from 10 to 1 instead of 0 is OK I guess > >> >> dput() part: again, I want to see what happens on other loads; it's >> probably >> fine (and win is certainly more than from mntput() change), but... The >> thing is, atomic_dec_and_lock() in there is often done on dentries with >> d_count > 1 and that's fairly cheap (and doesn't involve contention on >> dcache_lock on sane targets). >> >> FWIW, unless there's a really good reason to do alpha >> atomic_dec_and_lock() >> in a special way, I'd try to compare with > >> if (atomic_add_unless(&dentry->d_count, -1, 1)) >> return; > > I dont know, but *reading* d_count before trying to write it is expensive > on modern cpus. Oprofile clearly show that on Intel Core2. > > Then, *testing* the flag before doing the atomic_something() has the same > problem. Or we should put flag in a different cache line. > > I am lazy (time for a sleep here), maybe we are smart here and use a > trick like that already ? 
> atomic_t atomic_read_with_write_intent(atomic_t *v)
> {
> 	int val = 0;
> 	/*
> 	 * No LOCK prefix here, we only give a write intent hint to cpu
> 	 */
> 	asm volatile("xaddl %0, %1"
> 		     : "+r" (val), "+m" (v->counter)
> 		     : : "memory");
> 	return val;
> }

Forget it, it's wrong... I really need to sleep :)

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent 2008-11-21 15:13 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Eric Dumazet 2008-11-21 15:21 ` Ingo Molnar @ 2008-11-21 15:36 ` Christoph Hellwig 2008-11-21 17:58 ` [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Eric Dumazet 1 sibling, 1 reply; 191+ messages in thread From: Christoph Hellwig @ 2008-11-21 15:36 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, mingo, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, Linux Netdev List, viro, linux-fsdevel On Fri, Nov 21, 2008 at 04:13:38PM +0100, Eric Dumazet wrote: > [PATCH] fs: pipe/sockets/anon dentries should not have a parent > > Linking pipe/sockets/anon dentries to one root 'parent' has no functional > impact at all, but a scalability one. > > We can avoid touching a cache line at allocation stage (inside d_alloc(), no need > to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count) > We avoid an expensive atomic_dec_and_lock() call on the root dentry. > > If we correct dnotify_parent() and inotify_d_instantiate() to take into account > a NULL d_parent, we can call d_alloc() with a NULL parent instead of root dentry. Sorry folks, but a NULL d_parent is a no-go from the VFS perspective, but you can set d_parent to the dentry itself which is the magic used for root of tree dentries. They should also be marked DCACHE_DISCONNECTED to make sure this is not unexpected. And this kind of stuff really needs to go through -fsdevel. ^ permalink raw reply [flat|nested] 191+ messages in thread
* [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent 2008-11-21 15:36 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Christoph Hellwig @ 2008-11-21 17:58 ` Eric Dumazet 2008-11-21 18:43 ` Matthew Wilcox 0 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-21 17:58 UTC (permalink / raw) To: Christoph Hellwig Cc: David Miller, mingo, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, Linux Netdev List, viro, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 5101 bytes --] Christoph Hellwig a écrit : > On Fri, Nov 21, 2008 at 04:13:38PM +0100, Eric Dumazet wrote: >> [PATCH] fs: pipe/sockets/anon dentries should not have a parent >> >> Linking pipe/sockets/anon dentries to one root 'parent' has no functional >> impact at all, but a scalability one. >> >> We can avoid touching a cache line at allocation stage (inside d_alloc(), no need >> to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count) >> We avoid an expensive atomic_dec_and_lock() call on the root dentry. >> >> If we correct dnotify_parent() and inotify_d_instantiate() to take into account >> a NULL d_parent, we can call d_alloc() with a NULL parent instead of root dentry. > > Sorry folks, but a NULL d_parent is a no-go from the VFS perspective, > but you can set d_parent to the dentry itself which is the magic used > for root of tree dentries. They should also be marked > DCACHE_DISCONNECTED to make sure this is not unexpected. > > And this kind of stuff really needs to go through -fsdevel. Thanks Christoph for your review, sorry for fsdevel being forgotten. d_alloc_root() is not an option here, since we also want such dentries to be unhashed. So here is a second version, with the introduction of a new helper, d_alloc_unhashed(), to be used by pipes, sockets and anon I got even better numbers, probably because dnotify/inotify dont have the NULL d_parent test anymore. 
[PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Linking pipe/sockets/anon dentries to one root 'parent' has no functional impact at all, but a scalability one. We can avoid touching a cache line at allocation stage (inside d_alloc(), no need to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count) We avoid an expensive atomic_dec_and_lock() call on the root dentry. We add d_alloc_unhashed(const char *name, struct inode *inode) helper to be used by pipes/socket/anon. This function is about the same as d_alloc_root() but for unhashed entries. Before patch, time to run 8 * 1 million of close(socket()) calls on 8 CPUS was : real 0m27.496s user 0m0.657s sys 3m39.092s After patch : real 0m23.843s user 0m0.616s sys 3m9.732s Old oprofile : CPU: Core 2, speed 3000.11 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. % symbol name 164257 164257 11.0245 11.0245 init_file 155488 319745 10.4359 21.4604 d_alloc 151887 471632 10.1942 31.6547 _atomic_dec_and_lock 91620 563252 6.1493 37.8039 inet_create 74245 637497 4.9831 42.7871 kmem_cache_alloc 46702 684199 3.1345 45.9216 dentry_iput 46186 730385 3.0999 49.0215 tcp_close 42824 773209 2.8742 51.8957 kmem_cache_free 37275 810484 2.5018 54.3975 wake_up_inode 36553 847037 2.4533 56.8508 tcp_v4_init_sock 35661 882698 2.3935 59.2443 inotify_d_instantiate 32998 915696 2.2147 61.4590 sysenter_past_esp 31442 947138 2.1103 63.5693 d_instantiate 31303 978441 2.1010 65.6703 generic_forget_inode 27533 1005974 1.8479 67.5183 vfs_dq_drop 24237 1030211 1.6267 69.1450 sock_attach_fd 19290 1049501 1.2947 70.4397 __copy_from_user_ll New oprofile : CPU: Core 2, speed 3000.11 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. 
% symbol name 148703 148703 10.8581 10.8581 inet_create 116680 265383 8.5198 19.3779 new_inode 108912 374295 7.9526 27.3306 init_file 82911 457206 6.0541 33.3846 kmem_cache_alloc 65690 522896 4.7966 38.1812 wake_up_inode 53286 576182 3.8909 42.0721 _atomic_dec_and_lock 43814 619996 3.1992 45.2713 generic_forget_inode 41993 661989 3.0663 48.3376 d_alloc 41244 703233 3.0116 51.3492 kmem_cache_free 39244 742477 2.8655 54.2148 tcp_v4_init_sock 37402 779879 2.7310 56.9458 tcp_close 33336 813215 2.4342 59.3800 sysenter_past_esp 28596 841811 2.0880 61.4680 inode_has_buffers 25769 867580 1.8816 63.3496 d_kill 22606 890186 1.6507 65.0003 dentry_iput 20224 910410 1.4767 66.4770 vfs_dq_drop 19800 930210 1.4458 67.9228 __copy_from_user_ll Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/anon_inodes.c | 9 +-------- fs/dcache.c | 31 +++++++++++++++++++++++++++++++ fs/pipe.c | 10 +--------- include/linux/dcache.h | 1 + net/socket.c | 10 +--------- 5 files changed, 35 insertions(+), 26 deletions(-) [-- Attachment #2: d_alloc_unhashed.patch --] [-- Type: text/plain, Size: 4728 bytes --] diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 3662dd4..9fd0515 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -71,7 +71,6 @@ static struct dentry_operations anon_inodefs_dentry_operations = { int anon_inode_getfd(const char *name, const struct file_operations *fops, void *priv, int flags) { - struct qstr this; struct dentry *dentry; struct file *file; int error, fd; @@ -89,10 +88,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops, * using the inode sequence number. 
*/ error = -ENOMEM; - this.name = name; - this.len = strlen(name); - this.hash = 0; - dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this); + dentry = d_alloc_unhashed(name, anon_inode_inode); if (!dentry) goto err_put_unused_fd; @@ -104,9 +100,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops, atomic_inc(&anon_inode_inode->i_count); dentry->d_op = &anon_inodefs_dentry_operations; - /* Do not publish this dentry inside the global dentry hash table */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, anon_inode_inode); error = -ENFILE; file = alloc_file(anon_inode_mnt, dentry, diff --git a/fs/dcache.c b/fs/dcache.c index a1d86c7..a5477fd 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1111,6 +1111,37 @@ struct dentry * d_alloc_root(struct inode * root_inode) return res; } +/** + * d_alloc_unhashed - allocate unhashed dentry + * @inode: inode to allocate the dentry for + * @name: dentry name + * + * Allocate an unhashed dentry for the inode given. The inode is + * instantiated and returned. %NULL is returned if there is insufficient + * memory. Unhashed dentries have themselves as a parent. + */ + +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode) +{ + struct qstr q = { .name = name, .len = strlen(name) }; + struct dentry *res; + + res = d_alloc(NULL, &q); + if (res) { + res->d_sb = inode->i_sb; + res->d_parent = res; + /* + * We dont want to push this dentry into global dentry hash table. 
+ * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED + * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon + */ + res->d_flags &= ~DCACHE_UNHASHED; + res->d_flags |= DCACHE_DISCONNECTED; + d_instantiate(res, inode); + } + return res; +} + static inline struct hlist_head *d_hash(struct dentry *parent, unsigned long hash) { diff --git a/fs/pipe.c b/fs/pipe.c index 7aea8b8..29fcac2 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -918,7 +918,6 @@ struct file *create_write_pipe(int flags) struct inode *inode; struct file *f; struct dentry *dentry; - struct qstr name = { .name = "" }; err = -ENFILE; inode = get_pipe_inode(); @@ -926,18 +925,11 @@ struct file *create_write_pipe(int flags) goto err; err = -ENOMEM; - dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name); + dentry = d_alloc_unhashed("", inode); if (!dentry) goto err_inode; dentry->d_op = &pipefs_dentry_operations; - /* - * We dont want to publish this dentry into global dentry hash table. - * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED - * This permits a working /proc/$pid/fd/XXX on pipes - */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, inode); err = -ENFILE; f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops); diff --git a/include/linux/dcache.h b/include/linux/dcache.h index a37359d..12438d6 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -238,6 +238,7 @@ extern int d_invalidate(struct dentry *); /* only used at mount-time */ extern struct dentry * d_alloc_root(struct inode *); +extern struct dentry * d_alloc_unhashed(const char *, struct inode *); /* <clickety>-<click> the ramfs-type tree */ extern void d_genocide(struct dentry *); diff --git a/net/socket.c b/net/socket.c index e9d65ea..b659b5d 100644 --- a/net/socket.c +++ b/net/socket.c @@ -371,20 +371,12 @@ static int sock_alloc_fd(struct file **filep, int flags) static int sock_attach_fd(struct socket *sock, struct file *file, int flags) { struct dentry 
*dentry; - struct qstr name = { .name = "" }; - dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name); + dentry = d_alloc_unhashed("", SOCK_INODE(sock)); if (unlikely(!dentry)) return -ENOMEM; dentry->d_op = &sockfs_dentry_operations; - /* - * We dont want to push this dentry into global dentry hash table. - * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED - * This permits a working /proc/$pid/fd/XXX on sockets - */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, SOCK_INODE(sock)); sock->file = file; init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE, ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent 2008-11-21 17:58 ` [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Eric Dumazet @ 2008-11-21 18:43 ` Matthew Wilcox 2008-11-23 3:53 ` Eric Dumazet 0 siblings, 1 reply; 191+ messages in thread From: Matthew Wilcox @ 2008-11-21 18:43 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Hellwig, David Miller, mingo, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, Linux Netdev List, viro, linux-fsdevel On Fri, Nov 21, 2008 at 06:58:29PM +0100, Eric Dumazet wrote: > +/** > + * d_alloc_unhashed - allocate unhashed dentry > + * @inode: inode to allocate the dentry for > + * @name: dentry name It's normal to list the parameters in the order they're passed to the function. Not sure if we have a tool that checks for this or not -- Randy? > + * > + * Allocate an unhashed dentry for the inode given. The inode is > + * instantiated and returned. %NULL is returned if there is insufficient > + * memory. Unhashed dentries have themselves as a parent. > + */ > + > +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode) > +{ > + struct qstr q = { .name = name, .len = strlen(name) }; > + struct dentry *res; > + > + res = d_alloc(NULL, &q); > + if (res) { > + res->d_sb = inode->i_sb; > + res->d_parent = res; > + /* > + * We dont want to push this dentry into global dentry hash table. > + * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED > + * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon > + */ Line length ... as checkpatch would have warned you ;-) And there are several other grammatical nitpicks with this comment. Try this: /* * We don't want to put this dentry in the global dentry * hash table, so we pretend the dentry is already hashed * by unsetting DCACHE_UNHASHED. This permits * /proc/$pid/fd/XXX t work for sockets, pipes and * anonymous files (signalfd, timerfd, etc). 
*/ > + res->d_flags &= ~DCACHE_UNHASHED; > + res->d_flags |= DCACHE_DISCONNECTED; Is this really better than: res->d_flags = res->d_flags & ~DCACHE_UNHASHED | DCACHE_DISCONNECTED; Anyway, nice cleanup. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent 2008-11-21 18:43 ` Matthew Wilcox @ 2008-11-23 3:53 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-23 3:53 UTC (permalink / raw) To: Matthew Wilcox Cc: Christoph Hellwig, David Miller, mingo, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, Linux Netdev List, viro, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 5733 bytes --] Matthew Wilcox a écrit : > On Fri, Nov 21, 2008 at 06:58:29PM +0100, Eric Dumazet wrote: >> +/** >> + * d_alloc_unhashed - allocate unhashed dentry >> + * @inode: inode to allocate the dentry for >> + * @name: dentry name > > It's normal to list the parameters in the order they're passed to the > function. Not sure if we have a tool that checks for this or not -- > Randy? Yes, no problem, better to have the same order. > >> + * >> + * Allocate an unhashed dentry for the inode given. The inode is >> + * instantiated and returned. %NULL is returned if there is insufficient >> + * memory. Unhashed dentries have themselves as a parent. >> + */ >> + >> +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode) >> +{ >> + struct qstr q = { .name = name, .len = strlen(name) }; >> + struct dentry *res; >> + >> + res = d_alloc(NULL, &q); >> + if (res) { >> + res->d_sb = inode->i_sb; >> + res->d_parent = res; >> + /* >> + * We dont want to push this dentry into global dentry hash table. >> + * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED >> + * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon >> + */ > > Line length ... as checkpatch would have warned you ;-) > > And there are several other grammatical nitpicks with this comment. Try > this: > > /* > * We don't want to put this dentry in the global dentry > * hash table, so we pretend the dentry is already hashed > * by unsetting DCACHE_UNHASHED. 
This permits > * /proc/$pid/fd/XXX t work for sockets, pipes and > * anonymous files (signalfd, timerfd, etc). > */ Yes, this is better. > >> + res->d_flags &= ~DCACHE_UNHASHED; >> + res->d_flags |= DCACHE_DISCONNECTED; > > Is this really better than: > > res->d_flags = res->d_flags & ~DCACHE_UNHASHED | > DCACHE_DISCONNECTED; Well, I personally prefer the two lines, intention is more readable :) > > Anyway, nice cleanup. > Thanks Matthew, here is an updated version of the patch. [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Linking pipe/sockets/anon dentries to one root 'parent' has no functional impact at all, but a scalability one. We can avoid touching a cache line at allocation stage (inside d_alloc(), no need to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count) We avoid an expensive atomic_dec_and_lock() call on the root dentry. We add d_alloc_unhashed(const char *name, struct inode *inode) helper to be used by pipes/socket/anon. This function is about the same as d_alloc_root() but for unhashed entries. Before patch, time to run 8 * 1 million of close(socket()) calls on 8 CPUS was : real 0m27.496s user 0m0.657s sys 3m39.092s After patch : real 0m23.843s user 0m0.616s sys 3m9.732s Old oprofile : CPU: Core 2, speed 3000.11 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. 
% symbol name 164257 164257 11.0245 11.0245 init_file 155488 319745 10.4359 21.4604 d_alloc 151887 471632 10.1942 31.6547 _atomic_dec_and_lock 91620 563252 6.1493 37.8039 inet_create 74245 637497 4.9831 42.7871 kmem_cache_alloc 46702 684199 3.1345 45.9216 dentry_iput 46186 730385 3.0999 49.0215 tcp_close 42824 773209 2.8742 51.8957 kmem_cache_free 37275 810484 2.5018 54.3975 wake_up_inode 36553 847037 2.4533 56.8508 tcp_v4_init_sock 35661 882698 2.3935 59.2443 inotify_d_instantiate 32998 915696 2.2147 61.4590 sysenter_past_esp 31442 947138 2.1103 63.5693 d_instantiate 31303 978441 2.1010 65.6703 generic_forget_inode 27533 1005974 1.8479 67.5183 vfs_dq_drop 24237 1030211 1.6267 69.1450 sock_attach_fd 19290 1049501 1.2947 70.4397 __copy_from_user_ll New oprofile : CPU: Core 2, speed 3000.11 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. % symbol name 148703 148703 10.8581 10.8581 inet_create 116680 265383 8.5198 19.3779 new_inode 108912 374295 7.9526 27.3306 init_file 82911 457206 6.0541 33.3846 kmem_cache_alloc 65690 522896 4.7966 38.1812 wake_up_inode 53286 576182 3.8909 42.0721 _atomic_dec_and_lock 43814 619996 3.1992 45.2713 generic_forget_inode 41993 661989 3.0663 48.3376 d_alloc 41244 703233 3.0116 51.3492 kmem_cache_free 39244 742477 2.8655 54.2148 tcp_v4_init_sock 37402 779879 2.7310 56.9458 tcp_close 33336 813215 2.4342 59.3800 sysenter_past_esp 28596 841811 2.0880 61.4680 inode_has_buffers 25769 867580 1.8816 63.3496 d_kill 22606 890186 1.6507 65.0003 dentry_iput 20224 910410 1.4767 66.4770 vfs_dq_drop 19800 930210 1.4458 67.9228 __copy_from_user_ll Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/anon_inodes.c | 9 +-------- fs/dcache.c | 33 +++++++++++++++++++++++++++++++++ fs/pipe.c | 10 +--------- include/linux/dcache.h | 1 + net/socket.c | 10 +--------- 5 files changed, 37 insertions(+), 26 deletions(-) [-- Attachment #2: 
d_alloc_unhashed2.patch --] [-- Type: text/plain, Size: 4788 bytes --] diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 3662dd4..9fd0515 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -71,7 +71,6 @@ static struct dentry_operations anon_inodefs_dentry_operations = { int anon_inode_getfd(const char *name, const struct file_operations *fops, void *priv, int flags) { - struct qstr this; struct dentry *dentry; struct file *file; int error, fd; @@ -89,10 +88,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops, * using the inode sequence number. */ error = -ENOMEM; - this.name = name; - this.len = strlen(name); - this.hash = 0; - dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this); + dentry = d_alloc_unhashed(name, anon_inode_inode); if (!dentry) goto err_put_unused_fd; @@ -104,9 +100,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops, atomic_inc(&anon_inode_inode->i_count); dentry->d_op = &anon_inodefs_dentry_operations; - /* Do not publish this dentry inside the global dentry hash table */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, anon_inode_inode); error = -ENFILE; file = alloc_file(anon_inode_mnt, dentry, diff --git a/fs/dcache.c b/fs/dcache.c index a1d86c7..43ef88d 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1111,6 +1111,39 @@ struct dentry * d_alloc_root(struct inode * root_inode) return res; } +/** + * d_alloc_unhashed - allocate unhashed dentry + * @name: dentry name + * @inode: inode to allocate the dentry for + * + * Allocate an unhashed dentry for the inode given. The inode is + * instantiated and returned. %NULL is returned if there is insufficient + * memory. Unhashed dentries have themselves as a parent. 
+ */ + +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode) +{ + struct qstr q = { .name = name, .len = strlen(name) }; + struct dentry *res; + + res = d_alloc(NULL, &q); + if (res) { + res->d_sb = inode->i_sb; + res->d_parent = res; + /* + * We dont want to push this dentry into global dentry + * hash table, so we pretend the dentry is already hashed + * by unsetting DCACHE_UNHASHED. This permits + * /proc/$pid/fd/XXX to work for sockets, pipes, and + * anonymous files (signalfd, timerfd, ...) + */ + res->d_flags &= ~DCACHE_UNHASHED; + res->d_flags |= DCACHE_DISCONNECTED; + d_instantiate(res, inode); + } + return res; +} + static inline struct hlist_head *d_hash(struct dentry *parent, unsigned long hash) { diff --git a/fs/pipe.c b/fs/pipe.c index 7aea8b8..29fcac2 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -918,7 +918,6 @@ struct file *create_write_pipe(int flags) struct inode *inode; struct file *f; struct dentry *dentry; - struct qstr name = { .name = "" }; err = -ENFILE; inode = get_pipe_inode(); @@ -926,18 +925,11 @@ struct file *create_write_pipe(int flags) goto err; err = -ENOMEM; - dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name); + dentry = d_alloc_unhashed("", inode); if (!dentry) goto err_inode; dentry->d_op = &pipefs_dentry_operations; - /* - * We dont want to publish this dentry into global dentry hash table. 
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED - * This permits a working /proc/$pid/fd/XXX on pipes - */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, inode); err = -ENFILE; f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops); diff --git a/include/linux/dcache.h b/include/linux/dcache.h index a37359d..12438d6 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -238,6 +238,7 @@ extern int d_invalidate(struct dentry *); /* only used at mount-time */ extern struct dentry * d_alloc_root(struct inode *); +extern struct dentry * d_alloc_unhashed(const char *, struct inode *); /* <clickety>-<click> the ramfs-type tree */ extern void d_genocide(struct dentry *); diff --git a/net/socket.c b/net/socket.c index e9d65ea..b659b5d 100644 --- a/net/socket.c +++ b/net/socket.c @@ -371,20 +371,12 @@ static int sock_alloc_fd(struct file **filep, int flags) static int sock_attach_fd(struct socket *sock, struct file *file, int flags) { struct dentry *dentry; - struct qstr name = { .name = "" }; - dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name); + dentry = d_alloc_unhashed("", SOCK_INODE(sock)); if (unlikely(!dentry)) return -ENOMEM; dentry->d_op = &sockfs_dentry_operations; - /* - * We dont want to push this dentry into global dentry hash table. - * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED - * This permits a working /proc/$pid/fd/XXX on sockets - */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, SOCK_INODE(sock)); sock->file = file; init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE, ^ permalink raw reply related [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-21 8:51 ` Eric Dumazet 2008-11-21 9:05 ` David Miller @ 2008-11-21 9:18 ` Ingo Molnar 1 sibling, 0 replies; 191+ messages in thread From: Ingo Molnar @ 2008-11-21 9:18 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Lameter, Rafael J. Wysocki, Linux Kernel Mailing List, Kernel Testers List, Mike Galbraith, Peter Zijlstra, David S. Miller * Eric Dumazet <dada1@cosmosbay.com> wrote: > Ingo Molnar a écrit : >> * Christoph Lameter <cl@linux-foundation.org> wrote: >> >>> hmmm... Well we are almost there. >>> >>> 2.6.22: >>> >>> Throughput 2526.15 MB/sec 8 procs >>> >>> 2.6.28-rc5: >>> >>> Throughput 2486.2 MB/sec 8 procs >>> >>> 8p Dell 1950 and the number of processors specified on the tbench >>> command line. >> >> And with net-next we might even be able to get past that magic limit? >> net-next is linus-latest plus the latest and greatest networking bits: >> >> $ cat .git/config >> >> [remote "net-next"] >> url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git >> fetch = +refs/heads/*:refs/remotes/net-next/* >> >> ... so might be worth a test. Just to satisfy our curiosity and to >> possibly close the entry :-) >> > > Well, bits in net-next are new stuff for 2.6.29, not really > regression fixes, but yes, they should give nice tbench speedups. yeah, i know - technically these are lots-of-kernel-releases effects so not bona fide latest-cycle regressions anyway. But it doesnt matter how we call them, we want improvement in these metrics. > Now, I wish sockets and pipes not going through dcache, not tbench > affair of course but real workloads... > > running 8 processes on a 8 way machine doing a > > for (;;) > close(socket(AF_INET, SOCK_STREAM, 0)); > > is slow as hell, we hit so many contended cache lines ... 
> ticket spin locks are slower in this case (dcache_lock for example
> is taken twice when we allocate a socket(), once in d_alloc(),
> another one in d_instantiate())

Hm, weird - since there's no real VFS namespace impact, I fail to see the fundamental need that causes us to hit the dcache_lock. (Perhaps there's none and this is fixable.)

The general concept of mapping sockets to fds is a fundamental and powerful abstraction. There are APIs that also connect them to the VFS namespace (such as unix domain sockets) - but those should be special cases, not impacting normal TCP sockets.

	Ingo
^ permalink raw reply [flat|nested] 191+ messages in thread
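The pathological workload Eric describes (8 processes looping on close(socket(...))) is easy to reproduce from userspace. A minimal sketch of such a microbenchmark (the process and iteration counts are arbitrary choices, and the absolute numbers depend entirely on the kernel and hardware under test):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

/* One worker process: hammer socket()/close() in a tight loop. */
static void worker(long iterations)
{
	for (long i = 0; i < iterations; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd < 0) {
			perror("socket");
			exit(1);
		}
		close(fd);
	}
	exit(0);
}

/* Run nprocs workers in parallel; return elapsed wall time in seconds. */
double run_bench(int nprocs, long iterations)
{
	struct timeval start, end;

	gettimeofday(&start, NULL);
	for (int p = 0; p < nprocs; p++)
		if (fork() == 0)
			worker(iterations);
	for (int p = 0; p < nprocs; p++)
		wait(NULL);
	gettimeofday(&end, NULL);

	return (end.tv_sec - start.tv_sec) +
	       (end.tv_usec - start.tv_usec) / 1e6;
}
```

Timing run_bench(8, N) on an 8-way box across kernel versions gives a rough stand-in for the contended allocate/teardown path being discussed; it says nothing about which lock is hot - that needs a profiler.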
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-21 8:30 ` Ingo Molnar 2008-11-21 8:51 ` Eric Dumazet @ 2008-11-21 9:03 ` David Miller 2008-11-21 16:11 ` Christoph Lameter 2 siblings, 0 replies; 191+ messages in thread From: David Miller @ 2008-11-21 9:03 UTC (permalink / raw) To: mingo; +Cc: cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra From: Ingo Molnar <mingo@elte.hu> Date: Fri, 21 Nov 2008 09:30:44 +0100 > > * Christoph Lameter <cl@linux-foundation.org> wrote: > > > hmmm... Well we are almost there. > > > > 2.6.22: > > > > Throughput 2526.15 MB/sec 8 procs > > > > 2.6.28-rc5: > > > > Throughput 2486.2 MB/sec 8 procs > > > > 8p Dell 1950 and the number of processors specified on the tbench > > command line. > > And with net-next we might even be able to get past that magic limit? > net-next is linus-latest plus the latest and greatest networking bits: In any event I'm happy to toss this from the regression list. My sparc still shows the issues and I'll profile that independently. I'm pretty sure it's the indirect calls and the deeper stack frames (which == 128 bytes of extra stores at each level to save the register window), but I need to prove that first. ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-21 8:30 ` Ingo Molnar 2008-11-21 8:51 ` Eric Dumazet 2008-11-21 9:03 ` David Miller @ 2008-11-21 16:11 ` Christoph Lameter 2008-11-21 18:06 ` Christoph Lameter 2 siblings, 1 reply; 191+ messages in thread From: Christoph Lameter @ 2008-11-21 16:11 UTC (permalink / raw) To: Ingo Molnar Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Kernel Testers List, Mike Galbraith, Peter Zijlstra, David S. Miller On Fri, 21 Nov 2008, Ingo Molnar wrote: > > 2.6.22: > > Throughput 2526.15 MB/sec 8 procs > > 2.6.28-rc5: > > Throughput 2486.2 MB/sec 8 procs > > > > 8p Dell 1950 and the number of processors specified on the tbench > > command line. > > ... so might be worth a test. Just to satisfy our curiosity and to > possibly close the entry :-) Ahh.. Wow.... net-next gets us: Throughput 2685.17 MB/sec 8 procs ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-21 16:11 ` Christoph Lameter @ 2008-11-21 18:06 ` Christoph Lameter 2008-11-21 18:16 ` Eric Dumazet 0 siblings, 1 reply; 191+ messages in thread From: Christoph Lameter @ 2008-11-21 18:06 UTC (permalink / raw) To: Ingo Molnar Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Kernel Testers List, Mike Galbraith, Peter Zijlstra, David S. Miller

AIM9 results:

                   TCP        UDP
2.6.22        104868.00  489970.03
2.6.28-rc5    110007.00  518640.00
net-next      108207.00  514790.00

net-next loses here for some reason against 2.6.28-rc5. But the numbers are better than 2.6.22 in any case.
^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-21 18:06 ` Christoph Lameter @ 2008-11-21 18:16 ` Eric Dumazet 2008-11-21 18:19 ` Eric Dumazet 0 siblings, 1 reply; 191+ messages in thread From: Eric Dumazet @ 2008-11-21 18:16 UTC (permalink / raw) To: Christoph Lameter Cc: Ingo Molnar, Rafael J. Wysocki, Linux Kernel Mailing List, Kernel Testers List, Mike Galbraith, Peter Zijlstra, David S. Miller

Christoph Lameter wrote:
> AIM9 results:
>                  TCP        UDP
> 2.6.22      104868.00  489970.03
> 2.6.28-rc5  110007.00  518640.00
> net-next    108207.00  514790.00
>
> net-next loses here for some reason against 2.6.28-rc5. But the numbers
> are better than 2.6.22 in any case.

I found that on current net-next, running oprofile in the background can give better bench results. That's really curious... no?

So the single loop on close(socket()), on all my 8 cpus, is almost 10% faster if oprofile is running (20 secs instead of 23 secs).
^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 2008-11-21 18:16 ` Eric Dumazet @ 2008-11-21 18:19 ` Eric Dumazet 0 siblings, 0 replies; 191+ messages in thread From: Eric Dumazet @ 2008-11-21 18:19 UTC (permalink / raw) To: Christoph Lameter Cc: Ingo Molnar, Rafael J. Wysocki, Linux Kernel Mailing List, Kernel Testers List, Mike Galbraith, Peter Zijlstra, David S. Miller

Eric Dumazet wrote:
> Christoph Lameter wrote:
>> AIM9 results:
>>                  TCP        UDP
>> 2.6.22      104868.00  489970.03
>> 2.6.28-rc5  110007.00  518640.00
>> net-next    108207.00  514790.00
>>
>> net-next loses here for some reason against 2.6.28-rc5. But the numbers
>> are better than 2.6.22 in any case.
>
> I found that on current net-next, running oprofile in the background can
> give better bench results. That's really curious... no?
>
> So the single loop on close(socket()), on all my 8 cpus, is almost 10%
> faster if oprofile is running (20 secs instead of 23 secs).

Oh well, that's normal: when a cpu is interrupted by an NMI and distracted by oprofile code, it doesn't fight with other cpus over dcache_lock and other contended cache lines...
^ permalink raw reply [flat|nested] 191+ messages in thread
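The effect Eric points at - every cpu doing atomic updates to the same cache line, so any "distraction" reduces the fighting - has a simple userspace analogue: threads hammering one shared atomic counter. A hedged sketch (thread and iteration counts are arbitrary; timing it with and without a small per-iteration delay inserted into hammer() is one way to observe the counter-intuitive speedup he describes):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define ITERS    100000L

/* One shared counter: all threads bounce the same cache line. */
static atomic_long shared_counter;

/* Each thread performs ITERS atomic increments on the shared line. */
static void *hammer(void *arg)
{
	(void)arg;
	for (long i = 0; i < ITERS; i++)
		atomic_fetch_add(&shared_counter, 1);
	return NULL;
}

/* Run NTHREADS hammering threads to completion; return the final count. */
long run_contention_demo(void)
{
	pthread_t t[NTHREADS];

	atomic_store(&shared_counter, 0);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, hammer, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&shared_counter);
}
```

This only models the cache-line bouncing, not dcache_lock itself; in the kernel case the contended line is the lock word plus the dentry/inode accounting the socket path touches.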
* [Bug #11664] acpi errors and random freeze on sony vaio sr 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (2 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-16 17:40 ` [Bug #11698] 2.6.27-rc7, freezes with > 1 s2ram cycle Rafael J. Wysocki ` (13 subsequent siblings) 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Kernel Testers List, Giovanni Pellerano, ykzhao, Zhang Rui This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11664 Subject : acpi errors and random freeze on sony vaio sr Submitter : Giovanni Pellerano <giovanni.pellerano@gmail.com> Date : 2008-09-28 03:48 (50 days old) Patch : http://marc.info/?l=linux-acpi&m=122514341319748&w=4 ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11698] 2.6.27-rc7, freezes with > 1 s2ram cycle 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (3 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11664] acpi errors and random freeze on sony vaio sr Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-16 17:40 ` [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr Rafael J. Wysocki ` (12 subsequent siblings) 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Kernel Testers List, Rafael J. Wysocki, Soeren Sonnenburg This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11698 Subject : 2.6.27-rc7, freezes with > 1 s2ram cycle Submitter : Soeren Sonnenburg <kernel@nn7.de> Date : 2008-09-29 11:29 (49 days old) References : http://marc.info/?l=linux-kernel&m=122268780926859&w=4 Handled-By : Rafael J. Wysocki <rjw@sisk.pl> ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (4 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11698] 2.6.27-rc7, freezes with > 1 s2ram cycle Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-17 16:19 ` Randy Dunlap 2008-11-16 17:40 ` [Bug #11569] Panic stop CPUs regression Rafael J. Wysocki ` (11 subsequent siblings) 17 siblings, 1 reply; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Kernel Testers List, James Bottomley, Miller, Mike (OS Dev), rdunlap This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404 Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr Submitter : rdunlap <randy.dunlap@oracle.com> Date : 2008-08-21 5:52 (88 days old) References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4 http://marc.info/?l=linux-kernel&m=121932889105368&w=4 Handled-By : Miller, Mike (OS Dev) <Mike.Miller@hp.com> James Bottomley <James.Bottomley@hansenpartnership.com> ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr 2008-11-16 17:40 ` [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr Rafael J. Wysocki @ 2008-11-17 16:19 ` Randy Dunlap 0 siblings, 0 replies; 191+ messages in thread From: Randy Dunlap @ 2008-11-17 16:19 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux Kernel Mailing List, Kernel Testers List, James Bottomley, Miller, Mike (OS Dev) Rafael J. Wysocki wrote: > This message has been generated automatically as a part of a report > of regressions introduced between 2.6.26 and 2.6.27. > > The following bug entry is on the current list of known regressions > introduced between 2.6.26 and 2.6.27. Please verify if it still should > be listed and let me know (either way). > Nothing has changed. IMO that means leave the bug as is (alive). > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404 > Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr > Submitter : rdunlap <randy.dunlap@oracle.com> > Date : 2008-08-21 5:52 (88 days old) > References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4 > http://marc.info/?l=linux-kernel&m=121932889105368&w=4 > Handled-By : Miller, Mike (OS Dev) <Mike.Miller@hp.com> > James Bottomley <James.Bottomley@hansenpartnership.com> -- ~Randy ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11569] Panic stop CPUs regression 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (5 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-16 17:40 ` [Bug #11543] kernel panic: softlockup in tick_periodic() ??? Rafael J. Wysocki ` (10 subsequent siblings) 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Andi Kleen This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11569 Subject : Panic stop CPUs regression Submitter : Andi Kleen <andi@firstfloor.org> Date : 2008-09-02 13:49 (76 days old) References : http://marc.info/?l=linux-kernel&m=122036356127282&w=4 ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11543] kernel panic: softlockup in tick_periodic() ??? 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (6 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11569] Panic stop CPUs regression Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-16 17:40 ` [Bug #11836] Scheduler on C2D CPU and latest 2.6.27 kernel Rafael J. Wysocki ` (9 subsequent siblings) 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Kernel Testers List, Cyrill Gorcunov, Ingo Molnar, Joshua Hoblitt, Thomas Gleixner This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543 Subject : kernel panic: softlockup in tick_periodic() ??? Submitter : Joshua Hoblitt <j_kernel@hoblitt.com> Date : 2008-09-11 16:46 (67 days old) References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4 Handled-By : Thomas Gleixner <tglx@linutronix.de> Cyrill Gorcunov <gorcunov@gmail.com> Ingo Molnar <mingo@elte.hu> Cyrill Gorcunov <gorcunov@gmail.com> ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11836] Scheduler on C2D CPU and latest 2.6.27 kernel 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (7 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11543] kernel panic: softlockup in tick_periodic() ??? Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-16 17:40 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki ` (8 subsequent siblings) 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Kernel Testers List, Chris Snook, Zdenek Kabelac This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11836 Subject : Scheduler on C2D CPU and latest 2.6.27 kernel Submitter : Zdenek Kabelac <zdenek.kabelac@gmail.com> Date : 2008-10-21 9:59 (27 days old) References : http://marc.info/?l=linux-kernel&m=122458320502371&w=4 Handled-By : Chris Snook <csnook@redhat.com> ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11805] mounting XFS produces a segfault 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (8 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11836] Scheduler on C2D CPU and latest 2.6.27 kernel Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-17 14:44 ` Christoph Hellwig 2008-11-16 17:40 ` [Bug #11795] ks959-sir dongle no longer works under 2.6.27 (REGRESSION) Rafael J. Wysocki ` (7 subsequent siblings) 17 siblings, 1 reply; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Dave Chinner, Tiago Maluta This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11805 Subject : mounting XFS produces a segfault Submitter : Tiago Maluta <maluta_tiago@yahoo.com.br> Date : 2008-10-21 18:00 (27 days old) Handled-By : Dave Chinner <dgc@sgi.com> Patch : http://bugzilla.kernel.org/attachment.cgi?id=18397&action=view ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11805] mounting XFS produces a segfault 2008-11-16 17:40 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki @ 2008-11-17 14:44 ` Christoph Hellwig 0 siblings, 0 replies; 191+ messages in thread From: Christoph Hellwig @ 2008-11-17 14:44 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux Kernel Mailing List, Kernel Testers List, Dave Chinner, Tiago Maluta On Sun, Nov 16, 2008 at 06:40:58PM +0100, Rafael J. Wysocki wrote: > This message has been generated automatically as a part of a report > of regressions introduced between 2.6.26 and 2.6.27. > > The following bug entry is on the current list of known regressions > introduced between 2.6.26 and 2.6.27. Please verify if it still should > be listed and let me know (either way). The patch for this is both in mainline and -stable. > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11805 > Subject : mounting XFS produces a segfault > Submitter : Tiago Maluta <maluta_tiago@yahoo.com.br> > Date : 2008-10-21 18:00 (27 days old) > Handled-By : Dave Chinner <dgc@sgi.com> And that email address for Dave is severely outdated. ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11795] ks959-sir dongle no longer works under 2.6.27 (REGRESSION) 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (9 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-16 17:40 ` [Bug #11865] WOL for E100 Doesn't Work Anymore Rafael J. Wysocki ` (6 subsequent siblings) 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Kernel Testers List, Alex Villacis Lasso, Samuel Ortiz, Vasily This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11795 Subject : ks959-sir dongle no longer works under 2.6.27 (REGRESSION) Submitter : Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> Date : 2008-10-20 10:49 (28 days old) Handled-By : Samuel Ortiz <samuel@sortiz.org> ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11865] WOL for E100 Doesn't Work Anymore 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (10 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11795] ks959-sir dongle no longer works under 2.6.27 (REGRESSION) Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-16 17:40 ` [Bug #11843] usb hdd problems with 2.6.27.2 Rafael J. Wysocki ` (5 subsequent siblings) 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Rafael J. Wysocki, roger This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11865 Subject : WOL for E100 Doesn't Work Anymore Submitter : roger <rogerx@sdf.lonestar.org> Date : 2008-10-26 21:56 (22 days old) Handled-By : Rafael J. Wysocki <rjw@sisk.pl> Patch : http://bugzilla.kernel.org/attachment.cgi?id=18646&action=view ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11843] usb hdd problems with 2.6.27.2 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (11 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11865] WOL for E100 Doesn't Work Anymore Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-16 21:37 ` Luciano Rocha 2008-11-16 17:40 ` [Bug #11876] RCU hang on cpu re-hotplug with 2.6.27rc8 Rafael J. Wysocki ` (4 subsequent siblings) 17 siblings, 1 reply; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Kernel Testers List, Alan Stern, Luciano Rocha, Tim Wright This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11843 Subject : usb hdd problems with 2.6.27.2 Submitter : Luciano Rocha <luciano@eurotux.com> Date : 2008-10-22 16:22 (26 days old) References : http://marc.info/?l=linux-kernel&m=122469318102679&w=4 Handled-By : Luciano Rocha <luciano@eurotux.com> Patch : http://bugzilla.kernel.org/show_bug.cgi?id=11843#c26 ^ permalink raw reply [flat|nested] 191+ messages in thread
* Re: [Bug #11843] usb hdd problems with 2.6.27.2 2008-11-16 17:40 ` [Bug #11843] usb hdd problems with 2.6.27.2 Rafael J. Wysocki @ 2008-11-16 21:37 ` Luciano Rocha 0 siblings, 0 replies; 191+ messages in thread From: Luciano Rocha @ 2008-11-16 21:37 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linux Kernel Mailing List, Kernel Testers List, Alan Stern, Tim Wright On Sun, Nov 16, 2008 at 06:40:59PM +0100, Rafael J. Wysocki wrote: > This message has been generated automatically as a part of a report > of regressions introduced between 2.6.26 and 2.6.27. > > The following bug entry is on the current list of known regressions > introduced between 2.6.26 and 2.6.27. Please verify if it still should > be listed and let me know (either way). > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11843 > Subject : usb hdd problems with 2.6.27.2 > Submitter : Luciano Rocha <luciano@eurotux.com> > Date : 2008-10-22 16:22 (26 days old) > References : http://marc.info/?l=linux-kernel&m=122469318102679&w=4 > Handled-By : Luciano Rocha <luciano@eurotux.com> > Patch : http://bugzilla.kernel.org/show_bug.cgi?id=11843#c26 What does "Handled-By" mean? The patches were created by Alan Stern <stern@rowland.harvard.edu>, I just tested them. Regards, Luciano Rocha -- Luciano Rocha <luciano@eurotux.com> Eurotux Informática, S.A. <http://www.eurotux.com/> ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11876] RCU hang on cpu re-hotplug with 2.6.27rc8 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (12 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11843] usb hdd problems with 2.6.27.2 Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-16 17:40 ` [Bug #11886] without serial console system doesn't poweroff Rafael J. Wysocki ` (3 subsequent siblings) 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Kernel Testers List, Andi Kleen, Paul E. McKenney This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11876 Subject : RCU hang on cpu re-hotplug with 2.6.27rc8 Submitter : Andi Kleen <andi@firstfloor.org> Date : 2008-10-06 23:28 (42 days old) References : http://marc.info/?l=linux-kernel&m=122333610602399&w=2 Handled-By : Paul E. McKenney <paulmck@linux.vnet.ibm.com> ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11886] without serial console system doesn't poweroff 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (13 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11876] RCU hang on cpu re-hotplug with 2.6.27rc8 Rafael J. Wysocki @ 2008-11-16 17:40 ` Rafael J. Wysocki 2008-11-16 17:41 ` [Bug #12039] Regression: USB/DVB 2.6.26.8 --> 2.6.27.6 Rafael J. Wysocki ` (2 subsequent siblings) 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:40 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Kernel Testers List, Daniel Smolik, Rafael J. Wysocki This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11886 Subject : without serial console system doesn't poweroff Submitter : Daniel Smolik <marvin@mydatex.cz> Date : 2008-10-29 04:06 (19 days old) ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #12039] Regression: USB/DVB 2.6.26.8 --> 2.6.27.6 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (14 preceding siblings ...) 2008-11-16 17:40 ` [Bug #11886] without serial console system doesn't poweroff Rafael J. Wysocki @ 2008-11-16 17:41 ` Rafael J. Wysocki 2008-11-16 17:41 ` [Bug #11983] iwlagn: wrong command queue 31, command id 0x0 Rafael J. Wysocki 2008-11-16 17:41 ` [Bug #12048] Regression in bonding between 2.6.26.8 and 2.6.27.6 Rafael J. Wysocki 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:41 UTC (permalink / raw) To: Linux Kernel Mailing List; +Cc: Kernel Testers List, David This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12039 Subject : Regression: USB/DVB 2.6.26.8 --> 2.6.27.6 Submitter : David <david@unsolicited.net> Date : 2008-11-14 20:20 (3 days old) References : http://marc.info/?l=linux-kernel&m=122669568022274&w=4 ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11983] iwlagn: wrong command queue 31, command id 0x0 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (15 preceding siblings ...) 2008-11-16 17:41 ` [Bug #12039] Regression: USB/DVB 2.6.26.8 --> 2.6.27.6 Rafael J. Wysocki @ 2008-11-16 17:41 ` Rafael J. Wysocki 2008-11-16 17:41 ` [Bug #12048] Regression in bonding between 2.6.26.8 and 2.6.27.6 Rafael J. Wysocki 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:41 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Kernel Testers List, Luis R. Rodriguez, Marcel Holtmann, Matt Mackall, reinette chatre This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11983 Subject : iwlagn: wrong command queue 31, command id 0x0 Submitter : Matt Mackall <mpm@selenic.com> Date : 2008-11-06 4:16 (11 days old) References : http://marc.info/?l=linux-kernel&m=122598672815803&w=4 http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1703 Handled-By : reinette chatre <reinette.chatre@intel.com> ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #12048] Regression in bonding between 2.6.26.8 and 2.6.27.6 2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki ` (16 preceding siblings ...) 2008-11-16 17:41 ` [Bug #11983] iwlagn: wrong command queue 31, command id 0x0 Rafael J. Wysocki @ 2008-11-16 17:41 ` Rafael J. Wysocki 17 siblings, 0 replies; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-16 17:41 UTC (permalink / raw) To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Jesper Krogh This message has been generated automatically as a part of a report of regressions introduced between 2.6.26 and 2.6.27. The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12048 Subject : Regression in bonding between 2.6.26.8 and 2.6.27.6 Submitter : Jesper Krogh <jesper@krogh.cc> Date : 2008-11-16 9:41 (1 days old) References : http://marc.info/?l=linux-kernel&m=122682977001048&w=4 ^ permalink raw reply [flat|nested] 191+ messages in thread
* 2.6.28-rc3-git6: Reported regressions 2.6.26 -> 2.6.27 @ 2008-11-09 19:40 Rafael J. Wysocki 2008-11-09 19:43 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki 0 siblings, 1 reply; 191+ messages in thread From: Rafael J. Wysocki @ 2008-11-09 19:40 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Andrew Morton, Natalie Protasevich, Kernel Testers List This message contains a list of some regressions introduced between 2.6.26 and 2.6.27, for which there are no fixes in the mainline I know of. If any of them have been fixed already, please let me know. If you know of any other unresolved regressions introduced between 2.6.26 and 2.6.27, please let me know either and I'll add them to the list. Also, please let me know if any of the entries below are invalid. Each entry from the list will be sent additionally in an automatic reply to this message with CCs to the people involved in reporting and handling the issue. Listed regressions statistics: Date Total Pending Unresolved ---------------------------------------- 2008-11-09 196 28 23 2008-11-02 195 34 28 2008-10-26 190 34 29 2008-10-04 181 41 33 2008-09-27 173 35 28 2008-09-21 169 45 36 2008-09-15 163 46 32 2008-09-12 163 51 38 2008-09-07 150 43 33 2008-08-30 135 48 36 2008-08-23 122 48 40 2008-08-16 103 47 37 2008-08-10 80 52 31 2008-08-02 47 31 20 Unresolved regressions ---------------------- Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11983 Subject : iwlagn: wrong command queue 31, command id 0x0 Submitter : Matt Mackall <mpm@selenic.com> Date : 2008-11-06 4:16 (4 days old) References : http://marc.info/?l=linux-kernel&m=122598672815803&w=4 http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1703 Handled-By : reinette chatre <reinette.chatre@intel.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11892 Subject : Battery information and status disappearing and wrong thermal status. 
Submitter : Mark <makalsky@gmail.com> Date : 2008-10-29 15:33 (12 days old) Handled-By : Alexey Starikovskiy <astarikovskiy@suse.de> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11876 Subject : RCU hang on cpu re-hotplug with 2.6.27rc8 Submitter : Andi Kleen <andi@firstfloor.org> Date : 2008-10-06 23:28 (35 days old) References : http://marc.info/?l=linux-kernel&m=122333610602399&w=2 Handled-By : Paul E. McKenney <paulmck@linux.vnet.ibm.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11843 Subject : usb hdd problems with 2.6.27.2 Submitter : Luciano Rocha <luciano@eurotux.com> Date : 2008-10-22 16:22 (19 days old) References : http://marc.info/?l=linux-kernel&m=122469318102679&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11836 Subject : Scheduler on C2D CPU and latest 2.6.27 kernel Submitter : Zdenek Kabelac <zdenek.kabelac@gmail.com> Date : 2008-10-21 9:59 (20 days old) References : http://marc.info/?l=linux-kernel&m=122458320502371&w=4 Handled-By : Chris Snook <csnook@redhat.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11830 Subject : disk statistics issue in 2.6.27 Submitter : Miquel van Smoorenburg <mikevs@xs4all.net> Date : 2008-10-19 11:31 (22 days old) First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=427e59f09fdba387547106de7bab980b7fff77be References : http://marc.info/?l=linux-kernel&m=122441671421326&w=4 Handled-By : Jens Axboe <jens.axboe@oracle.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11820 Subject : 2.6.27: 0 MHz CPU and wrong system time on AMD Geode system Submitter : Antipov Dmitry <dmantipov@yandex.ru> Date : 2008-10-15 6:39 (26 days old) References : http://marc.info/?l=linux-kernel&m=122405421010969&w=4 Handled-By : Jordan Crouse <jordan.crouse@amd.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11795 Subject : ks959-sir dongle no longer works under 2.6.27 (REGRESSION) Submitter : Alex Villacis Lasso 
<avillaci@ceibo.fiec.espol.edu.ec> Date : 2008-10-20 10:49 (21 days old) Handled-By : Samuel Ortiz <samuel@sortiz.org> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11698 Subject : 2.6.27-rc7, freezes with > 1 s2ram cycle Submitter : Soeren Sonnenburg <kernel@nn7.de> Date : 2008-09-29 11:29 (42 days old) References : http://marc.info/?l=linux-kernel&m=122268780926859&w=4 Handled-By : Rafael J. Wysocki <rjw@sisk.pl> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11608 Subject : 2.6.27-rc6 BUG: unable to handle kernel paging request Submitter : John Daiker <daikerjohn@gmail.com> Date : 2008-09-16 23:00 (55 days old) References : http://marc.info/?l=linux-kernel&m=122160611517267&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11607 Subject : 2.6.27-rc6 Bug in tty_chars_in_buffer Submitter : John Daiker <daikerjohn@gmail.com> Date : 2008-09-15 2:26 (56 days old) References : http://marc.info/?l=linux-kernel&m=122144565514490&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11569 Subject : Panic stop CPUs regression Submitter : Andi Kleen <andi@firstfloor.org> Date : 2008-09-02 13:49 (69 days old) References : http://marc.info/?l=linux-kernel&m=122036356127282&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543 Subject : kernel panic: softlockup in tick_periodic() ??? Submitter : Joshua Hoblitt <j_kernel@hoblitt.com> Date : 2008-09-11 16:46 (60 days old) References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4 Handled-By : Thomas Gleixner <tglx@linutronix.de> Cyrill Gorcunov <gorcunov@gmail.com> Ingo Molnar <mingo@elte.hu> Cyrill Gorcunov <gorcunov@gmail.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11476 Subject : failure to associate after resume from suspend to ram Submitter : Michael S. 
Tsirkin <m.s.tsirkin@gmail.com> Date : 2008-09-01 13:33 (70 days old) References : http://marc.info/?l=linux-kernel&m=122028529415108&w=4 Handled-By : Zhu Yi <yi.zhu@intel.com> Dan Williams <dcbw@redhat.com> Jouni Malinen <j@w1.fi> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11407 Subject : suspend: unable to handle kernel paging request Submitter : Vegard Nossum <vegard.nossum@gmail.com> Date : 2008-08-21 17:28 (81 days old) References : http://marc.info/?l=linux-kernel&m=121933974928881&w=4 Handled-By : Rafael J. Wysocki <rjw@sisk.pl> Pekka Enberg <penberg@cs.helsinki.fi> Pavel Machek <pavel@suse.cz> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404 Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr Submitter : rdunlap <randy.dunlap@oracle.com> Date : 2008-08-21 5:52 (81 days old) References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4 http://marc.info/?l=linux-kernel&m=121932889105368&w=4 Handled-By : Miller, Mike (OS Dev) <Mike.Miller@hp.com> James Bottomley <James.Bottomley@hansenpartnership.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11380 Subject : lockdep warning: cpu_add_remove_lock at:cpu_maps_update_begin+0x14/0x16 Submitter : Ingo Molnar <mingo@elte.hu> Date : 2008-08-20 6:44 (82 days old) References : http://marc.info/?l=linux-kernel&m=121921480931970&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11340 Subject : LTP overnight run resulted in unusable box Submitter : Alexey Dobriyan <adobriyan@gmail.com> Date : 2008-08-13 9:24 (89 days old) References : http://marc.info/?l=linux-kernel&m=121861951902949&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308 Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28 Submitter : Christoph Lameter <cl@linux-foundation.org> Date : 2008-08-11 18:36 (91 days old) References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4 http://marc.info/?l=linux-kernel&m=122125737421332&w=4 Bug-Entry : 
http://bugzilla.kernel.org/show_bug.cgi?id=11264 Subject : Invalid op opcode in kernel/workqueue Submitter : Jean-Luc Coulon <jean.luc.coulon@gmail.com> Date : 2008-08-07 04:18 (95 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11215 Subject : INFO: possible recursive locking detected ps2_command Submitter : Zdenek Kabelac <zdenek.kabelac@gmail.com> Date : 2008-07-31 9:41 (102 days old) References : http://marc.info/?l=linux-kernel&m=121749737011637&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11209 Subject : 2.6.27-rc1 process time accounting Submitter : Lukas Hejtmanek <xhejtman@ics.muni.cz> Date : 2008-07-31 10:43 (102 days old) References : http://marc.info/?l=linux-kernel&m=121750102917490&w=4 http://lkml.org/lkml/2008/9/30/199 http://marc.info/?l=linux-kernel&m=122470441624295&w=4 Handled-By : Peter Zijlstra <a.p.zijlstra@chello.nl> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11207 Subject : VolanoMark regression with 2.6.27-rc1 Submitter : Zhang, Yanmin <yanmin_zhang@linux.intel.com> Date : 2008-07-31 3:20 (102 days old) References : http://marc.info/?l=linux-kernel&m=121747464114335&w=4 Handled-By : Zhang, Yanmin <yanmin_zhang@linux.intel.com> Peter Zijlstra <a.p.zijlstra@chello.nl> Dhaval Giani <dhaval@linux.vnet.ibm.com> Miao Xie <miaox@cn.fujitsu.com> Regressions with patches ------------------------ Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11865 Subject : WOL for E100 Doesn't Work Anymore Submitter : roger <rogerx@sdf.lonestar.org> Date : 2008-10-26 21:56 (15 days old) Handled-By : Rafael J. 
Wysocki <rjw@sisk.pl> Patch : http://bugzilla.kernel.org/attachment.cgi?id=18646&action=view Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11829 Subject : Kernel 2.6.26.5 -> 2.6.27.2 [USB REGRESSION] (USB -> D_STATE) Submitter : Justin Piszcz <jpiszcz@lucidpixels.com> Date : 2008-10-19 11:26 (22 days old) References : http://marc.info/?l=linux-kernel&m=122441560120027&w=4 Handled-By : Alan Stern <stern@rowland.harvard.edu> Mike Isely <isely@isely.net> Patch : http://linuxtv.org/hg/~mcisely/pvrusb2/rev/0bb411d8d2e4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11805 Subject : mounting XFS produces a segfault Submitter : Tiago Maluta <maluta_tiago@yahoo.com.br> Date : 2008-10-21 18:00 (20 days old) Handled-By : Dave Chinner <dgc@sgi.com> Patch : http://bugzilla.kernel.org/attachment.cgi?id=18397&action=view Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11664 Subject : acpi errors and random freeze on sony vaio sr Submitter : Giovanni Pellerano <giovanni.pellerano@gmail.com> Date : 2008-09-28 03:48 (43 days old) Patch : http://marc.info/?l=linux-acpi&m=122514341319748&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11505 Subject : oltp ~10% regression with 2.6.27-rc5 on stoakley machine Submitter : Lin Ming <ming.m.lin@intel.com> Date : 2008-09-04 7:06 (67 days old) References : http://marc.info/?l=linux-kernel&m=122051202202373&w=4 http://marc.info/?t=122089704700005&r=1&w=4 Handled-By : Peter Zijlstra <a.p.zijlstra@chello.nl> Gregory Haskins <ghaskins@novell.com> Ingo Molnar <mingo@elte.hu> Patch : http://marc.info/?l=linux-kernel&m=122194673932703&w=4 For details, please visit the bug entries and follow the links given in references. As you can see, there is a Bugzilla entry for each of the listed regressions. 
There also is a Bugzilla entry used for tracking the regressions introduced between 2.6.26 and 2.6.27, unresolved as well as resolved, at: http://bugzilla.kernel.org/show_bug.cgi?id=11167 Please let me know if there are any Bugzilla entries that should be added to the list in there. Thanks, Rafael ^ permalink raw reply [flat|nested] 191+ messages in thread
* [Bug #11805] mounting XFS produces a segfault
  2008-11-09 19:40 2.6.28-rc3-git6: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki
  @ 2008-11-09 19:43 ` Rafael J. Wysocki
  0 siblings, 0 replies; 191+ messages in thread

From: Rafael J. Wysocki @ 2008-11-09 19:43 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Dave Chinner, Tiago Maluta

This message has been generated automatically as part of a report of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify whether it should still be listed and let me know (either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11805
Subject : mounting XFS produces a segfault
Submitter : Tiago Maluta <maluta_tiago@yahoo.com.br>
Date : 2008-10-21 18:00 (20 days old)
Handled-By : Dave Chinner <dgc@sgi.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18397&action=view
* 2.6.28-rc2-git7: Reported regressions 2.6.26 -> 2.6.27
  @ 2008-11-02 16:47 Rafael J. Wysocki
  2008-11-02 16:49 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki
  0 siblings, 1 reply; 191+ messages in thread

From: Rafael J. Wysocki @ 2008-11-02 16:47 UTC (permalink / raw)
To: Linux Kernel Mailing List
Cc: Andrew Morton, Natalie Protasevich, Kernel Testers List

This message contains a list of some regressions introduced between 2.6.26 and 2.6.27 for which I know of no fixes in the mainline. If any of them have already been fixed, please let me know. If you know of any other unresolved regressions introduced between 2.6.26 and 2.6.27, please let me know as well and I'll add them to the list. Also, please let me know if any of the entries below are invalid.

Each entry from the list will additionally be sent in an automatic reply to this message, with CCs to the people involved in reporting and handling the issue.

Listed regressions statistics:

  Date        Total  Pending  Unresolved
  ----------------------------------------
  2008-11-02    195       34          28
  2008-10-26    190       34          29
  2008-10-04    181       41          33
  2008-09-27    173       35          28
  2008-09-21    169       45          36
  2008-09-15    163       46          32
  2008-09-12    163       51          38
  2008-09-07    150       43          33
  2008-08-30    135       48          36
  2008-08-23    122       48          40
  2008-08-16    103       47          37
  2008-08-10     80       52          31
  2008-08-02     47       31          20

Unresolved regressions
----------------------

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11876
Subject : RCU hang on cpu re-hotplug with 2.6.27rc8
Submitter : Andi Kleen <andi@firstfloor.org>
Date : 2008-10-06 23:28 (28 days old)
References : http://marc.info/?l=linux-kernel&m=122333610602399&w=2
Handled-By : Paul E.
McKenney <paulmck@linux.vnet.ibm.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11843 Subject : usb hdd problems with 2.6.27.2 Submitter : Luciano Rocha <luciano@eurotux.com> Date : 2008-10-22 16:22 (12 days old) References : http://marc.info/?l=linux-kernel&m=122469318102679&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11836 Subject : Scheduler on C2D CPU and latest 2.6.27 kernel Submitter : Zdenek Kabelac <zdenek.kabelac@gmail.com> Date : 2008-10-21 9:59 (13 days old) References : http://marc.info/?l=linux-kernel&m=122458320502371&w=4 Handled-By : Chris Snook <csnook@redhat.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11832 Subject : 2.6.27: "irq 18: nobody cared" on Toshiba Satellite A100 Submitter : M. Vefa Bicakci <bicave@superonline.com> Date : 2008-10-19 14:06 (15 days old) References : http://marc.info/?l=linux-kernel&m=122442552100406&w=4 Handled-By : Stefan Assmann <sassmann@suse.de> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11830 Subject : disk statistics issue in 2.6.27 Submitter : Miquel van Smoorenburg <mikevs@xs4all.net> Date : 2008-10-19 11:31 (15 days old) References : http://marc.info/?l=linux-kernel&m=122441671421326&w=4 Handled-By : Jens Axboe <jens.axboe@oracle.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11820 Subject : 2.6.27: 0 MHz CPU and wrong system time on AMD Geode system Submitter : Antipov Dmitry <dmantipov@yandex.ru> Date : 2008-10-15 6:39 (19 days old) References : http://marc.info/?l=linux-kernel&m=122405421010969&w=4 Handled-By : Jordan Crouse <jordan.crouse@amd.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11795 Subject : ks959-sir dongle no longer works under 2.6.27 (REGRESSION) Submitter : Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> Date : 2008-10-20 10:49 (14 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11699 Subject : 2.6.27-rc-7: BUG: scheduling while atomic, c1e_idle+0x98/0xe0 Submitter : Prakash Punnoor 
<prakash@punnoor.de> Date : 2008-09-28 17:45 (36 days old) References : http://marc.info/?l=linux-kernel&m=122262403415629&w=4 Handled-By : Thomas Gleixner <tglx@linutronix.de> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11698 Subject : 2.6.27-rc7, freezes with > 1 s2ram cycle Submitter : Soeren Sonnenburg <kernel@nn7.de> Date : 2008-09-29 11:29 (35 days old) References : http://marc.info/?l=linux-kernel&m=122268780926859&w=4 Handled-By : Rafael J. Wysocki <rjw@sisk.pl> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11664 Subject : acpi errors and random freeze on sony vaio sr Submitter : Giovanni Pellerano <giovanni.pellerano@gmail.com> Date : 2008-09-28 03:48 (36 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11608 Subject : 2.6.27-rc6 BUG: unable to handle kernel paging request Submitter : John Daiker <daikerjohn@gmail.com> Date : 2008-09-16 23:00 (48 days old) References : http://marc.info/?l=linux-kernel&m=122160611517267&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11607 Subject : 2.6.27-rc6 Bug in tty_chars_in_buffer Submitter : John Daiker <daikerjohn@gmail.com> Date : 2008-09-15 2:26 (49 days old) References : http://marc.info/?l=linux-kernel&m=122144565514490&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11569 Subject : Panic stop CPUs regression Submitter : Andi Kleen <andi@firstfloor.org> Date : 2008-09-02 13:49 (62 days old) References : http://marc.info/?l=linux-kernel&m=122036356127282&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543 Subject : kernel panic: softlockup in tick_periodic() ??? 
Submitter : Joshua Hoblitt <j_kernel@hoblitt.com> Date : 2008-09-11 16:46 (53 days old) References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4 Handled-By : Thomas Gleixner <tglx@linutronix.de> Cyrill Gorcunov <gorcunov@gmail.com> Ingo Molnar <mingo@elte.hu> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11476 Subject : failure to associate after resume from suspend to ram Submitter : Michael S. Tsirkin <m.s.tsirkin@gmail.com> Date : 2008-09-01 13:33 (63 days old) References : http://marc.info/?l=linux-kernel&m=122028529415108&w=4 Handled-By : Zhu Yi <yi.zhu@intel.com> Dan Williams <dcbw@redhat.com> Jouni Malinen <j@w1.fi> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11407 Subject : suspend: unable to handle kernel paging request Submitter : Vegard Nossum <vegard.nossum@gmail.com> Date : 2008-08-21 17:28 (74 days old) References : http://marc.info/?l=linux-kernel&m=121933974928881&w=4 Handled-By : Rafael J. Wysocki <rjw@sisk.pl> Pekka Enberg <penberg@cs.helsinki.fi> Pavel Machek <pavel@suse.cz> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404 Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr Submitter : rdunlap <randy.dunlap@oracle.com> Date : 2008-08-21 5:52 (74 days old) References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4 http://marc.info/?l=linux-kernel&m=121932889105368&w=4 Handled-By : Miller, Mike (OS Dev) <Mike.Miller@hp.com> James Bottomley <James.Bottomley@hansenpartnership.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11380 Subject : lockdep warning: cpu_add_remove_lock at:cpu_maps_update_begin+0x14/0x16 Submitter : Ingo Molnar <mingo@elte.hu> Date : 2008-08-20 6:44 (75 days old) References : http://marc.info/?l=linux-kernel&m=121921480931970&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11340 Subject : LTP overnight run resulted in unusable box Submitter : Alexey Dobriyan <adobriyan@gmail.com> Date : 2008-08-13 9:24 (82 days old) References : 
http://marc.info/?l=linux-kernel&m=121861951902949&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308 Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28 Submitter : Christoph Lameter <cl@linux-foundation.org> Date : 2008-08-11 18:36 (84 days old) References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4 http://marc.info/?l=linux-kernel&m=122125737421332&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11272 Subject : BUG: parport_serial in 2.6.27-rc1 for NetMos Technology PCI 9835 Submitter : Jaswinder Singh <jaswinderlinux@gmail.com> Date : 2008-08-05 15:12 (90 days old) References : http://marc.info/?l=linux-kernel&m=121794900319776&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11271 Subject : BUG: fealnx in 2.6.27-rc1 Submitter : Jaswinder Singh <jaswinderlinux@gmail.com> Date : 2008-08-05 14:58 (90 days old) References : http://marc.info/?l=linux-netdev&m=121794762016830&w=4 http://lkml.org/lkml/2008/8/10/98 Handled-By : Francois Romieu <romieu@fr.zoreil.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11264 Subject : Invalid op opcode in kernel/workqueue Submitter : Jean-Luc Coulon <jean.luc.coulon@gmail.com> Date : 2008-08-07 04:18 (88 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11220 Subject : Screen stays black after resume Submitter : Nico Schottelius <nico@schottelius.org> Date : 2008-07-31 21:05 (95 days old) References : http://marc.info/?l=linux-kernel&m=121753882422899&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11215 Subject : INFO: possible recursive locking detected ps2_command Submitter : Zdenek Kabelac <zdenek.kabelac@gmail.com> Date : 2008-07-31 9:41 (95 days old) References : http://marc.info/?l=linux-kernel&m=121749737011637&w=4 Handled-By : Peter Zijlstra <a.p.zijlstra@chello.nl> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11210 Subject : IRQ routing badness Submitter : Kumar Gala <galak@kernel.crashing.org> Date : 
2008-07-31 18:53 (95 days old) References : http://marc.info/?l=linux-ide&m=121753059307310&w=4 Handled-By : Kumar Gala <galak@kernel.crashing.org> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11209 Subject : 2.6.27-rc1 process time accounting Submitter : Lukas Hejtmanek <xhejtman@ics.muni.cz> Date : 2008-07-31 10:43 (95 days old) References : http://marc.info/?l=linux-kernel&m=121750102917490&w=4 http://lkml.org/lkml/2008/9/30/199 http://marc.info/?l=linux-kernel&m=122470441624295&w=4 Handled-By : Peter Zijlstra <a.p.zijlstra@chello.nl> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11207 Subject : VolanoMark regression with 2.6.27-rc1 Submitter : Zhang, Yanmin <yanmin_zhang@linux.intel.com> Date : 2008-07-31 3:20 (95 days old) References : http://marc.info/?l=linux-kernel&m=121747464114335&w=4 Handled-By : Zhang, Yanmin <yanmin_zhang@linux.intel.com> Peter Zijlstra <a.p.zijlstra@chello.nl> Dhaval Giani <dhaval@linux.vnet.ibm.com> Miao Xie <miaox@cn.fujitsu.com> Regressions with patches ------------------------ Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11907 Subject : NVRAM being corrupted on ppc64 preventing boot Submitter : Mel Gorman <mel@csn.ul.ie> Date : 2008-10-30 14:26 (4 days old) References : http://marc.info/?l=linux-kernel&m=122537727204584&w=4 Handled-By : Paul Mackerras <paulus@samba.org> Mel Gorman <mel@csn.ul.ie> Patch : http://marc.info/?l=linux-kernel&m=122547833412996&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11904 Subject : upstream regression (IO-APIC?) 
Submitter : Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Date : 2008-10-30 0:00 (4 days old) References : http://marc.info/?l=linux-kernel&m=122532510328618&w=4 Patch : http://marc.info/?l=linux-kernel&m=122563711522315&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11829 Subject : Kernel 2.6.26.5 -> 2.6.27.2 [USB REGRESSION] (USB -> D_STATE) Submitter : Justin Piszcz <jpiszcz@lucidpixels.com> Date : 2008-10-19 11:26 (15 days old) References : http://marc.info/?l=linux-kernel&m=122441560120027&w=4 Handled-By : Alan Stern <stern@rowland.harvard.edu> Mike Isely <isely@isely.net> Patch : http://linuxtv.org/hg/~mcisely/pvrusb2/rev/0bb411d8d2e4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11805 Subject : mounting XFS produces a segfault Submitter : Tiago Maluta <maluta_tiago@yahoo.com.br> Date : 2008-10-21 18:00 (13 days old) Handled-By : Dave Chinner <dgc@sgi.com> Patch : http://bugzilla.kernel.org/attachment.cgi?id=18397&action=view Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11550 Subject : pnp: Huge number of "io resource overlap" messages Submitter : Frans Pop <elendil@planet.nl> Date : 2008-09-09 10:50 (55 days old) References : http://marc.info/?l=linux-kernel&m=122095745403793&w=4 Handled-By : Rene Herman <rene.herman@keyaccess.nl> Bjorn Helgaas <bjorn.helgaas@hp.com> Patch : http://marc.info/?l=linux-kernel&m=122246533505643&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11505 Subject : oltp ~10% regression with 2.6.27-rc5 on stoakley machine Submitter : Lin Ming <ming.m.lin@intel.com> Date : 2008-09-04 7:06 (60 days old) References : http://marc.info/?l=linux-kernel&m=122051202202373&w=4 http://marc.info/?t=122089704700005&r=1&w=4 Handled-By : Peter Zijlstra <a.p.zijlstra@chello.nl> Gregory Haskins <ghaskins@novell.com> Ingo Molnar <mingo@elte.hu> Patch : http://marc.info/?l=linux-kernel&m=122194673932703&w=4 For details, please visit the bug entries and follow the links given in references. 
As you can see, there is a Bugzilla entry for each of the listed regressions. There is also a Bugzilla entry used for tracking the regressions introduced between 2.6.26 and 2.6.27, unresolved as well as resolved, at:

http://bugzilla.kernel.org/show_bug.cgi?id=11167

Please let me know if there are any Bugzilla entries that should be added to the list in there.

Thanks,
Rafael
* [Bug #11805] mounting XFS produces a segfault
  2008-11-02 16:47 2.6.28-rc2-git7: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki
  @ 2008-11-02 16:49 ` Rafael J. Wysocki
  2008-11-03 9:32 ` Christoph Hellwig
  0 siblings, 1 reply; 191+ messages in thread

From: Rafael J. Wysocki @ 2008-11-02 16:49 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Dave Chinner, Tiago Maluta

This message has been generated automatically as part of a report of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions introduced between 2.6.26 and 2.6.27. Please verify whether it should still be listed and let me know (either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11805
Subject : mounting XFS produces a segfault
Submitter : Tiago Maluta <maluta_tiago@yahoo.com.br>
Date : 2008-10-21 18:00 (13 days old)
Handled-By : Dave Chinner <dgc@sgi.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18397&action=view
* Re: [Bug #11805] mounting XFS produces a segfault
  2008-11-02 16:49 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki
  @ 2008-11-03 9:32 ` Christoph Hellwig
  0 siblings, 0 replies; 191+ messages in thread

From: Christoph Hellwig @ 2008-11-03 9:32 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linux Kernel Mailing List, Kernel Testers List, Dave Chinner, Tiago Maluta

On Sun, Nov 02, 2008 at 05:49:36PM +0100, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.26 and 2.6.27.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.26 and 2.6.27. Please verify if it still should
> be listed and let me know (either way).

The patch is in the XFS development tree. Given that the maintainers didn't manage to get it into the merge window, it might make sense to submit this fix individually.

Dave, what do you think about sending this patch out directly to Linus?
* 2.6.28-rc1-git1: Reported regressions from 2.6.27 @ 2008-10-25 20:02 Rafael J. Wysocki 2008-10-25 20:06 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki 0 siblings, 1 reply; 191+ messages in thread From: Rafael J. Wysocki @ 2008-10-25 20:02 UTC (permalink / raw) To: Linux Kernel Mailing List Cc: Adrian Bunk, Andrew Morton, Linus Torvalds, Natalie Protasevich, Kernel Testers List, Network Development, Linux ACPI, Linux PM List, Linux SCSI List This message contains a list of some regressions from 2.6.27, for which there are no fixes in the mainline I know of. If any of them have been fixed already, please let me know. If you know of any other unresolved regressions from 2.6.27, please let me know either and I'll add them to the list. Also, please let me know if any of the entries below are invalid. Each entry from the list will be sent additionally in an automatic reply to this message with CCs to the people involved in reporting and handling the issue. Listed regressions statistics: Date Total Pending Unresolved ---------------------------------------- 2008-10-25 26 25 20 Unresolved regressions ---------------------- Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11854 Subject : v2.6.28-rc1: readlink /proc/*/exe returns uninitialized data to userspace Submitter : Vegard Nossum <vegard.nossum@gmail.com> Date : 2008-10-25 17:14 (1 days old) References : http://marc.info/?l=linux-kernel&m=122495490201663&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11853 Subject : Random problems with 2.6.28-rc1-git1 Submitter : Mikko C. 
<mikko.cal@gmail.com> Date : 2008-10-25 14:53 (1 days old) References : http://marc.info/?l=linux-kernel&m=122494643521893&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11852 Subject : v2.6.28-rc1: Regression in ext3/jbd Submitter : Vegard Nossum <vegard.nossum@gmail.com> Date : 2008-10-25 11:22 (1 days old) References : http://marc.info/?l=linux-kernel&m=122493379905812&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11849 Subject : default IRQ affinity change in v2.6.27 (breaking several SMP PPC based systems) Submitter : Kumar Gala <galak@kernel.crashing.org> Date : 2008-10-24 12:45 (2 days old) References : http://marc.info/?l=linux-kernel&m=122485245924125&w=4 Handled-By : Chris Snook <csnook@redhat.com> Scott Wood <scottwood@freescale.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11848 Subject : 2.6.28-rc1: NCQ devices connected to PMP are still dead Submitter : Petr Vandrovec <vandrove@vc.cvut.cz> Date : 2008-10-25 11:17 (1 days old) References : http://marc.info/?l=linux-kernel&m=122493470907259&w=4 Handled-By : Jens Axboe <jens.axboe@oracle.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11845 Subject : Suspend regression on Lenovo x60 Submitter : Jens Axboe <jens.axboe@oracle.com> Date : 2008-10-24 18:02 (2 days old) References : http://marc.info/?l=linux-next&m=122487159829507&w=4 Handled-By : Rafael J. 
Wysocki <rjw@sisk.pl> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11841 Subject : plenty of line "ACPI: EC: non-query interrupt received, switching to interrupt mode" in dmesg Submitter : François Valenduc <francois.valenduc@tvcablenet.be> Date : 2008-10-25 10:29 (1 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11838 Subject : general protection fault: from release_blocks_on_commit Submitter : Eric Paris <eparis@redhat.com> Date : 2008-10-21 14:03 (5 days old) References : http://lkml.org/lkml/2008/10/21/248 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11835 Subject : 2.6.27-git8: max1111_read_channel and corgi_ssp_ads7846_putget missing Submitter : Pavel Machek <pavel@suse.cz> Date : 2008-10-20 12:10 (6 days old) References : http://marc.info/?l=linux-kernel&m=122450475101945&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11834 Subject : iwl3945: if I leave my machine running overnight, wifi will not work in the morning Submitter : Pavel Machek <pavel@suse.cz> Date : 2008-10-19 21:40 (7 days old) References : http://marc.info/?l=linux-kernel&m=122445440206101&w=4 Handled-By : reinette chatre <reinette.chatre@intel.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11833 Subject : 2.6.27-git8 compile error in drivers/mfd/wm8350-core.c Submitter : Tilman Schmidt <tilman@imap.cc> Date : 2008-10-19 18:17 (7 days old) References : http://marc.info/?l=linux-kernel&m=122444027920472&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11828 Subject : Linux 2.6.27-git3: no SD card reader Submitter : J.A. 
Magallón <jamagallon@ono.com> Date : 2008-10-14 0:54 (12 days old) References : http://marc.info/?l=linux-kernel&m=122394573904699&w=4 Handled-By : Pierre Ossman <drzeus-list@drzeus.cx> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11827 Subject : suspend broken on powerbook5,6 Submitter : Johannes Berg <johannes@sipsolutions.net> Date : 2008-10-25 05:07 (1 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11826 Subject : extreme slowness of IO stuff using 2.6.28-rc1 Submitter : Yves-Alexis Perez <corsac@debian.org> Date : 2008-10-25 04:25 (1 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11824 Subject : raw1394: possible deadlock if accessed by multithreaded app Submitter : Stefan Richter <stefan-r-bz@s5r6.in-berlin.de> Date : 2008-10-25 02:08 (1 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11822 Subject : ACPI Warning (nspredef-0858): _SB_.PCI0.LPC_.EC__.BAT0._BIF: Return Package type mismatch at index 9 - found Buffer, expected String [20080926] Submitter : Len Brown <len.brown@intel.com> Date : 2008-10-25 01:26 (1 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11806 Subject : iwl3945 fails with microcode error Submitter : Johannes Berg <johannes@sipsolutions.net> Date : 2008-10-22 02:36 (4 days old) References : http://marc.info/?l=linux-kernel&m=122450235730661&w=4 Handled-By : Reinette Chatre <reinette.chatre@intel.com> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11805 Subject : mounting XFS produces a segfault Submitter : Tiago Maluta <maluta_tiago@yahoo.com.br> Date : 2008-10-21 18:00 (5 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11799 Subject : xorg can not start up with stolen memory Submitter : arrow zhang <arrow.ebd@gmail.com> Date : 2008-10-21 06:08 (5 days old) Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11798 Subject : [Radeon Xpress 1100 IGP] DRI fail on recent kernel updates Submitter : Gu Rui <chaos.proton@gmail.com> Date : 
2008-10-21 04:28 (5 days old) Regressions with patches ------------------------ Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11851 Subject : wmifinfo dockapp takes 100% of cpu Submitter : Carlos R. Mafra <crmafra2@gmail.com> Date : 2008-10-25 9:40 (1 days old) References : http://marc.info/?l=linux-kernel&m=122492763030453&w=4 Handled-By : Arjan van de Ven <arjan@linux.intel.com> Marcin Slusarz <marcin.slusarz@gmail.com> Patch : http://marc.info/?l=linux-kernel&m=122496409612709&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11847 Subject : 2.6.28-rc1 fails building on allnoconfig Submitter : Matt Mackall <mpm@selenic.com> Date : 2008-10-24 17:09 (2 days old) References : http://marc.info/?l=linux-kernel&m=122487097728241&w=4 http://marc.info/?l=linux-pci&m=122485633732409&w=4 Handled-By : Fenghua Yu <fenghua.yu@intel.com> Patch : http://marc.info/?l=linux-kernel&m=122487160129531&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11846 Subject : commit a802dd0e breaks console keyboard input Submitter : walt <w41ter@gmail.com> Date : 2008-10-24 0:09 (2 days old) References : http://marc.info/?l=linux-kernel&m=122480701528944&w=4 Handled-By : Heiko Carstens <heiko.carstens@de.ibm.com> Patch : http://marc.info/?l=linux-kernel&m=122487763507703&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11844 Subject : ext3: fix ext3_dx_readdir hash collision handling Submitter : Markus Trippelsdorf <markus@trippelsdorf.de> Date : 2008-10-25 11:56 (1 days old) References : http://marc.info/?l=linux-kernel&m=122493582908742&w=4 Handled-By : Theodore Tso <tytso@mit.edu> Patch : http://marc.info/?l=linux-kernel&m=122493582908742&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11825 Subject : appletouch broken Submitter : Johannes Berg <johannes@sipsolutions.net> Date : 2008-10-25 03:55 (1 days old) References : http://marc.info/?l=linux-kernel&m=122433436930451&w=4 Handled-By : Jiri Slaby <jirislaby@gmail.com> Jiri Kosina 
<jkosina@suse.cz>
Patch      : http://marc.info/?l=linux-kernel&m=122442584100873&w=4

For details, please visit the bug entries and follow the links given in references.

As you can see, there is a Bugzilla entry for each of the listed regressions. There also is a Bugzilla entry used for tracking the regressions from 2.6.27, unresolved as well as resolved, at:

http://bugzilla.kernel.org/show_bug.cgi?id=11808

Please let me know if there are any Bugzilla entries that should be added to the list in there.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 191+ messages in thread
* [Bug #11805] mounting XFS produces a segfault
  2008-10-25 20:02 2.6.28-rc1-git1: Reported regressions from 2.6.27 Rafael J. Wysocki
@ 2008-10-25 20:06 ` Rafael J. Wysocki
  2008-10-26  0:08   ` Dave Chinner
  0 siblings, 1 reply; 191+ messages in thread
From: Rafael J. Wysocki @ 2008-10-25 20:06 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: Kernel Testers List, Tiago Maluta

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11805
Subject    : mounting XFS produces a segfault
Submitter  : Tiago Maluta <maluta_tiago@yahoo.com.br>
Date       : 2008-10-21 18:00 (5 days old)

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11805] mounting XFS produces a segfault
  2008-10-25 20:06 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki
@ 2008-10-26  0:08   ` Dave Chinner
  2008-10-26 11:14     ` Rafael J. Wysocki
  0 siblings, 1 reply; 191+ messages in thread
From: Dave Chinner @ 2008-10-26 0:08 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linux Kernel Mailing List, Kernel Testers List, Tiago Maluta

On Sat, Oct 25, 2008 at 10:06:44PM +0200, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.27. Please verify if it still should be listed and let me know
> (either way).
>
>
> Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11805
> Subject    : mounting XFS produces a segfault
> Submitter  : Tiago Maluta <maluta_tiago@yahoo.com.br>
> Date       : 2008-10-21 18:00 (5 days old)

Ah - this was reported as a 2.6.26 -> 2.6.27 regression, not a
.27 -> .28-rcX regression.

Even so, it's not obviously an XFS regression, as the problem is
that alloc_pages(GFP_KERNEL) is the new failure on .27. The fact
that XFS never handled the allocation failure is not a new bug
or regression - it has never caught failures during log
allocation...

So really, if you want to look for a regression here, it is the
change of behaviour in the VM leading to a memory allocation failure
where it has never, ever previously failed...

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 191+ messages in thread
* Re: [Bug #11805] mounting XFS produces a segfault
  2008-10-26  0:08 ` Dave Chinner
@ 2008-10-26 11:14   ` Rafael J. Wysocki
  0 siblings, 0 replies; 191+ messages in thread
From: Rafael J. Wysocki @ 2008-10-26 11:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: Linux Kernel Mailing List, Kernel Testers List, Tiago Maluta

On Sunday, 26 of October 2008, Dave Chinner wrote:
> On Sat, Oct 25, 2008 at 10:06:44PM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.27. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11805
> > Subject    : mounting XFS produces a segfault
> > Submitter  : Tiago Maluta <maluta_tiago@yahoo.com.br>
> > Date       : 2008-10-21 18:00 (5 days old)
>
> Ah - this was reported as a 2.6.26 -> 2.6.27 regression, not a
> .27 -> .28-rcX regression.
>
> Even so, it's not obviously an XFS regression as the problem is
> that alloc_pages(GFP_KERNEL) is the new failure on .27. The fact
> that XFS never handled the allocation failure is not a new bug
> or regression - it has never caught failures during log
> allocation...
>
> So really, if you want to look for a regression here, it is the
> change of behaviour in the VM leading to a memory allocation failure
> where it has never, ever previously failed...

OK, I moved it to the list of regressions introduced between .26 and .27.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 191+ messages in thread
end of thread, other threads: [~2008-12-17 20:27 UTC | newest]

Thread overview: 191+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --

2008-11-16 17:38 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki
2008-11-16 17:38 ` [Bug #11207] VolanoMark regression with 2.6.27-rc1 Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11215] INFO: possible recursive locking detected ps2_command Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 Rafael J. Wysocki
2008-11-17 9:06 ` Ingo Molnar
2008-11-17 9:14 ` David Miller
2008-11-17 11:01 ` Ingo Molnar
2008-11-17 11:20 ` Eric Dumazet
2008-11-17 16:11 ` Ingo Molnar
2008-11-17 16:35 ` Eric Dumazet
2008-11-17 17:08 ` Ingo Molnar
2008-11-17 17:25 ` Ingo Molnar
2008-11-17 17:33 ` Eric Dumazet
2008-11-17 17:38 ` Linus Torvalds
2008-11-17 17:42 ` Eric Dumazet
2008-11-17 18:23 ` Ingo Molnar
2008-11-17 18:33 ` Linus Torvalds
2008-11-17 18:49 ` Ingo Molnar
2008-11-17 19:30 ` Eric Dumazet
2008-11-17 19:39 ` David Miller
2008-11-17 19:43 ` Eric Dumazet
2008-11-17 19:55 ` Linus Torvalds
2008-11-17 20:16 ` David Miller
2008-11-17 20:30 ` Linus Torvalds
2008-11-17 20:58 ` David Miller
2008-11-18 9:44 ` Nick Piggin
2008-11-18 15:58 ` Linus Torvalds
2008-11-19 4:31 ` Nick Piggin
2008-11-20 9:14 ` David Miller
2008-11-20 9:06 ` David Miller
2008-11-18 12:29 ` Mike Galbraith
2008-11-17 19:57 ` Ingo Molnar
2008-11-17 20:20 ` (avc_has_perm_noaudit()) " Ingo Molnar
2008-11-17 20:32 ` ip_queue_xmit(): " Ingo Molnar
2008-11-17 20:57 ` Eric Dumazet
2008-11-18 9:12 ` Nick Piggin
2008-11-17 20:47 ` Ingo Molnar
2008-11-17 20:56 ` Eric Dumazet
2008-11-17 20:55 ` skb_release_head_state(): " Ingo Molnar
2008-11-17 21:01 ` David Miller
2008-11-17 21:04 ` Eric Dumazet
2008-11-17 21:34 ` Linus Torvalds
2008-11-17 21:38 ` Ingo Molnar
2008-11-17 21:09 ` tcp_ack(): " Ingo Molnar
2008-11-17 21:19 ` tcp_recvmsg(): " Ingo Molnar
2008-11-17 21:26 ` eth_type_trans(): " Ingo Molnar
2008-11-17 21:40 ` Eric Dumazet
2008-11-17 23:41 ` Eric Dumazet
2008-11-18 0:01 ` Linus Torvalds
2008-11-18 8:35 ` Eric Dumazet
2008-11-17 21:52 ` Linus Torvalds
2008-11-18 5:16 ` David Miller
2008-11-18 5:35 ` Eric Dumazet
2008-11-18 7:00 ` David Miller
2008-11-18 8:30 ` Ingo Molnar
2008-11-18 8:49 ` Eric Dumazet
2008-11-17 21:35 ` __inet_lookup_established(): " Ingo Molnar
2008-11-17 22:14 ` Eric Dumazet
2008-11-17 21:59 ` system_call() - " Ingo Molnar
2008-11-17 22:09 ` Linus Torvalds
2008-11-17 22:08 ` Ingo Molnar
2008-11-17 22:15 ` Eric Dumazet
2008-11-17 22:26 ` Ingo Molnar
2008-11-17 22:39 ` Eric Dumazet
2008-11-18 5:23 ` David Miller
2008-11-18 8:45 ` Ingo Molnar
2008-11-17 22:14 ` tcp_transmit_skb() - " Ingo Molnar
2008-11-17 22:19 ` Ingo Molnar
2008-11-17 19:36 ` David Miller
2008-11-17 19:31 ` David Miller
2008-11-17 19:47 ` Linus Torvalds
2008-11-17 19:51 ` David Miller
2008-11-17 19:53 ` Ingo Molnar
2008-11-17 22:47 ` Ingo Molnar
2008-11-17 19:21 ` David Miller
2008-11-17 19:48 ` Linus Torvalds
2008-11-17 19:52 ` David Miller
2008-11-17 19:57 ` Linus Torvalds
2008-11-17 20:18 ` David Miller
2008-11-19 19:43 ` Christoph Lameter
2008-11-19 20:14 ` Ingo Molnar
2008-11-20 23:52 ` Christoph Lameter
2008-11-21 8:30 ` Ingo Molnar
2008-11-21 8:51 ` Eric Dumazet
2008-11-21 9:05 ` David Miller
2008-11-21 12:51 ` Eric Dumazet
2008-11-21 15:13 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Eric Dumazet
2008-11-21 15:21 ` Ingo Molnar
2008-11-21 15:28 ` Eric Dumazet
2008-11-21 15:34 ` Ingo Molnar
2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
2008-11-27 1:37 ` Christoph Lameter
2008-11-27 6:27 ` Eric Dumazet
2008-11-27 14:44 ` Christoph Lameter
2008-11-27 9:39 ` Christoph Hellwig
2008-11-28 18:03 ` Ingo Molnar
2008-11-28 18:47 ` Peter Zijlstra
2008-11-29 6:38 ` Christoph Hellwig
2008-11-29 8:07 ` Eric Dumazet
2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet
2008-12-11 22:38 ` [PATCH v3 0/7] " Eric Dumazet
2008-12-11 22:38 ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
2007-07-24 1:24 ` Nick Piggin
2008-12-16 21:04 ` Paul E. McKenney
2008-12-11 22:39 ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
2007-07-24 1:30 ` Nick Piggin
2008-12-12 5:11 ` Eric Dumazet
2008-12-16 21:10 ` Paul E. McKenney
2008-12-11 22:39 ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
2007-07-24 1:34 ` Nick Piggin
2008-12-16 21:26 ` Paul E. McKenney
2008-12-11 22:39 ` [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
2008-12-16 21:40 ` Paul E. McKenney
2008-12-11 22:40 ` [PATCH v3 5/7] fs: new_inode_single() and iput_single() Eric Dumazet
2008-12-16 21:41 ` Paul E. McKenney
2008-12-11 22:40 ` [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU Eric Dumazet
2007-07-24 1:13 ` Nick Piggin
2008-12-12 2:50 ` Nick Piggin
2008-12-12 4:45 ` Eric Dumazet
2008-12-12 16:48 ` Eric Dumazet
2008-12-13 2:07 ` Christoph Lameter
2008-12-17 20:25 ` Eric Dumazet
2008-12-13 1:41 ` Christoph Lameter
2008-12-11 22:41 ` [PATCH v3 7/7] fs: MS_NOREFCOUNT Eric Dumazet
2008-11-29 8:43 ` [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
2008-11-29 8:43 ` [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes Eric Dumazet
2008-11-29 8:44 ` [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
2008-11-29 8:44 ` [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
2008-11-29 10:38 ` Jörn Engel
2008-11-29 11:14 ` Eric Dumazet
2008-11-29 8:45 ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet
2008-11-29 11:14 ` Jörn Engel
2008-11-26 23:30 ` [PATCH 1/6] fs: Introduce a per_cpu nr_dentry Eric Dumazet
2008-11-27 9:41 ` Christoph Hellwig
2008-11-26 23:32 ` [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator Eric Dumazet
2008-11-27 9:46 ` Christoph Hellwig
2008-11-26 23:32 ` [PATCH 4/6] fs: Introduce a per_cpu nr_inodes Eric Dumazet
2008-11-27 9:32 ` Peter Zijlstra
2008-11-27 9:39 ` Peter Zijlstra
2008-11-27 9:48 ` Christoph Hellwig
2008-11-27 10:01 ` Eric Dumazet
2008-11-27 10:07 ` Andi Kleen
2008-11-27 14:46 ` Christoph Lameter
2008-11-26 23:32 ` [PATCH 5/6] fs: Introduce special inodes Eric Dumazet
2008-11-27 8:20 ` David Miller
2008-11-26 23:32 ` [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs Eric Dumazet
2008-11-27 8:21 ` David Miller
2008-11-27 9:53 ` Christoph Hellwig
2008-11-27 10:04 ` Eric Dumazet
2008-11-27 10:10 ` Christoph Hellwig
2008-11-28 9:26 ` Al Viro
2008-11-28 9:34 ` Al Viro
2008-11-28 18:02 ` Ingo Molnar
2008-11-28 18:58 ` Ingo Molnar
2008-11-28 22:20 ` Eric Dumazet
2008-11-28 22:37 ` Eric Dumazet
2008-11-28 22:43 ` Eric Dumazet
2008-11-21 15:36 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Christoph Hellwig
2008-11-21 17:58 ` [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Eric Dumazet
2008-11-21 18:43 ` Matthew Wilcox
2008-11-23 3:53 ` Eric Dumazet
2008-11-21 9:18 ` [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28 Ingo Molnar
2008-11-21 9:03 ` David Miller
2008-11-21 16:11 ` Christoph Lameter
2008-11-21 18:06 ` Christoph Lameter
2008-11-21 18:16 ` Eric Dumazet
2008-11-21 18:19 ` Eric Dumazet
2008-11-16 17:40 ` [Bug #11664] acpi errors and random freeze on sony vaio sr Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11698] 2.6.27-rc7, freezes with > 1 s2ram cycle Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr Rafael J. Wysocki
2008-11-17 16:19 ` Randy Dunlap
2008-11-16 17:40 ` [Bug #11569] Panic stop CPUs regression Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11543] kernel panic: softlockup in tick_periodic() ??? Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11836] Scheduler on C2D CPU and latest 2.6.27 kernel Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki
2008-11-17 14:44 ` Christoph Hellwig
2008-11-16 17:40 ` [Bug #11795] ks959-sir dongle no longer works under 2.6.27 (REGRESSION) Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11865] WOL for E100 Doesn't Work Anymore Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11843] usb hdd problems with 2.6.27.2 Rafael J. Wysocki
2008-11-16 21:37 ` Luciano Rocha
2008-11-16 17:40 ` [Bug #11876] RCU hang on cpu re-hotplug with 2.6.27rc8 Rafael J. Wysocki
2008-11-16 17:40 ` [Bug #11886] without serial console system doesn't poweroff Rafael J. Wysocki
2008-11-16 17:41 ` [Bug #12039] Regression: USB/DVB 2.6.26.8 --> 2.6.27.6 Rafael J. Wysocki
2008-11-16 17:41 ` [Bug #11983] iwlagn: wrong command queue 31, command id 0x0 Rafael J. Wysocki
2008-11-16 17:41 ` [Bug #12048] Regression in bonding between 2.6.26.8 and 2.6.27.6 Rafael J. Wysocki

-- strict thread matches above, loose matches on Subject: below --

2008-11-09 19:40 2.6.28-rc3-git6: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki
2008-11-09 19:43 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki
2008-11-02 16:47 2.6.28-rc2-git7: Reported regressions 2.6.26 -> 2.6.27 Rafael J. Wysocki
2008-11-02 16:49 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki
2008-11-03 9:32 ` Christoph Hellwig
2008-10-25 20:02 2.6.28-rc1-git1: Reported regressions from 2.6.27 Rafael J. Wysocki
2008-10-25 20:06 ` [Bug #11805] mounting XFS produces a segfault Rafael J. Wysocki
2008-10-26 0:08 ` Dave Chinner
2008-10-26 11:14 ` Rafael J. Wysocki