url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm3/

Includes a SARD update from Rick.  The SARD disk accounting is pretty
much final now.

I moved the remaining disk accounting numbers (pgpgin, pgpgout) out of
/proc/stat, and this will confuse vmstat.  Again.  Updated versions are
at http://surriel.com/procps, but they're not up to date enough.  To get
a current procps, grab the cygnus CVS (instructions are at Rik's site)
and then apply
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm3/vmstat.patch

Since 2.5.38-mm2:

-ide-block-fix-1.patch
  Merged (Jens)

-ext3-htree.patch
+ext3-dxdir.patch
  Switch to Ted's ext3-htree patch.

-might_sleep.patch
-unbreak-writeback-mode.patch
-queue-congestion.patch
-nonblocking-ext2-preread.patch
-nonblocking-pdflush.patch
-nonblocking-vm.patch
-set_page_dirty-locking-fix.patch
-prepare_to_wait.patch
-vm-wakeups.patch
-sync-helper.patch
-slabasap.patch
  Merged

+misc.patch
  A comment fix.

+topology_fixes.patch
  Some topology API fixlets from Matthew

+dio-bio-add-page.patch
  Convert direct-io.c to use bio_add_page().  (Badari)
  It will now build BIOs as large as the device supports.

+dio-bio-fixes.patch
  Some alterations to the above.

-read-latency.patch
  "I have to say, that elevator thing is the ugliest code I've seen in
  a long while." -- Linus

+deadline-update.patch
  Latest deadline scheduler fixes from Jens.

+akpm-deadline.patch
  Expose the deadline scheduler tunables in /proc/sys/vm, and set the
  default fifo_batch to 16.
linus.patch
  cset-1.579.3.4-to-1.605.1.31.txt.gz

ide-high-1.patch

scsi_hack.patch
  Fix block-highmem for scsi

ext3-dxdir.patch

spin-lock-check.patch
  spinlock/rwlock checking infrastructure

rd-cleanup.patch
  Cleanup and fix the ramdisk driver (doesn't work right yet)

misc.patch
  misc fixes

write-deadlock.patch
  Fix the generic_file_write-from-same-mmapped-page deadlock

buddyinfo.patch
  Add /proc/buddyinfo - stats on the free pages pool

free_area.patch
  Remove struct free_area_struct and free_area_t, use `struct free_area'

per-node-kswapd.patch
  Per-node kswapd instance

topology-api.patch
  Simple topology API

topology_fixes.patch
  topology-api cleanups

radix_tree_gang_lookup.patch
  radix tree gang lookup

truncate_inode_pages.patch
  truncate/invalidate_inode_pages rewrite

proc_vmstat.patch
  Move the vm accounting out of /proc/stat

kswapd-reclaim-stats.patch
  Add kswapd_steal to /proc/vmstat

iowait.patch
  I/O wait statistics

sard.patch
  SARD disk accounting

dio-bio-add-page.patch
  Use bio_add_page() in direct-io.c

dio-bio-fixes.patch
  dio-bio-add-page fixes

remove-gfp_nfs.patch
  remove GFP_NFS

tcp-wakeups.patch
  Use fast wakeups in TCP/IPV4

swapoff-deadlock.patch
  Fix a tmpfs swapoff deadlock

dirty-and-uptodate.patch
  page state cleanup

shmem_rename.patch
  shmem_rename() directory link count fix

dirent-size.patch
  tmpfs: show a non-zero size for directories

tmpfs-trivia.patch
  tmpfs: small fixlets

per-zone-vm.patch
  separate the kswapd and direct reclaim code paths

swsusp-feature.patch
  add shrink_all_memory() for swsusp

adaptec-fix.patch
  partial fix for aic7xxx error recovery

remove-page-virtual.patch
  remove page->virtual for !WANT_PAGE_VIRTUAL

dirty-memory-clamp.patch
  sterner dirty-memory clamping

mempool-wakeup-fix.patch
  Fix for stuck tasks in mempool_alloc()

remove-write_mapping_buffers.patch
  Remove write_mapping_buffers

buffer_boundary-scheduling.patch
  IO scheduling for indirect blocks

ll_rw_block-cleanup.patch
  cleanup ll_rw_block()

lseek-ext2_readdir.patch
  remove lock_kernel() from ext2_readdir()

discontig-no-contig_page_data.patch
  undefine contig_page_data for discontigmem

per-node-zone_normal.patch
  ia32 NUMA: per-node ZONE_NORMAL

alloc_pages_node-cleanup.patch
  alloc_pages_node cleanup

read_barrier_depends.patch
  extended barrier primitives

rcu_ltimer.patch
  RCU core

dcache_rcu.patch
  Use RCU for dcache

deadline-update.patch
  deadline scheduler updates

akpm-deadline.patch
On Thu, Sep 26, 2002 at 07:59:21AM +0000, Andrew Morton wrote:
> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm3/
>
> Includes a SARD update from Rick.  The SARD disk accounting is
> pretty much final now.
>
> read_barrier_depends.patch
>   extended barrier primitives
>
> rcu_ltimer.patch
>   RCU core
>
> dcache_rcu.patch
>   Use RCU for dcache

Hi Andrew,

Updated 2.5.38 RCU core and dcache_rcu patches are now available at
http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473

The differences since earlier versions are -

rcu_ltimer - call_rcu() fixed for preemption, plus the earlier fix I
             had sent to you.
read_barrier_depends - fixes a list_for_each_rcu macro compilation
             error.
dcache_rcu - uses list_add_rcu in d_rehash and list_for_each_rcu in
             d_lookup, making the read_barrier_depends() fix I had sent
             to you earlier unnecessary.

Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
On Thu, Sep 26, 2002 at 05:54:45PM +0530, Dipankar Sarma wrote:
> Updated 2.5.38 RCU core and dcache_rcu patches are now available
> at http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473
> The differences since earlier versions are -
> rcu_ltimer - call_rcu() fixed for preemption and the earlier fix I had sent
> to you.
> read_barrier_depends - fixes list_for_each_rcu macro compilation error.
> dcache_rcu - uses list_add_rcu in d_rehash and list_for_each_rcu in d_lookup
> making the read_barrier_depends() fix I had sent to you
> earlier unnecessary.
Is there an update to the files_struct stuff too? I'm seeing large
overheads there also.
Thanks,
Bill
On Thu, Sep 26, 2002 at 05:29:09AM -0700, William Lee Irwin III wrote:
> On Thu, Sep 26, 2002 at 05:54:45PM +0530, Dipankar Sarma wrote:
> > Updated 2.5.38 RCU core and dcache_rcu patches are now available
> > at http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473
>
> Is there an update to the files_struct stuff too? I'm seeing large
> overheads there also.

files_struct_rcu is not in the mm kernels, but I will upload the most
recent version to the same download directory in LSE.  I would be
interested in the fget() profile count change with that patch.

Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
On Thu, Sep 26, 2002 at 05:29:09AM -0700, William Lee Irwin III wrote:
>> Is there an update to the files_struct stuff too? I'm seeing large
>> overheads there also.

On Thu, Sep 26, 2002 at 06:10:52PM +0530, Dipankar Sarma wrote:
> files_struct_rcu is not in mm kernels, but I will upload the most
> recent version to the same download directory in LSE.
> I would be interested in fget() profile count change with that patch.

In my experience fget() is large even on UP kernels.  For instance, a
profile from a long-running interactive load on a UP box (my home
machine):

228542527 total                      169.5902
216163353 default_idle           4503403.1875
   850707 number                     781.8998
   829885 handle_IRQ_event          8644.6354
   687351 proc_getdata              1227.4125
   454401 system_call               8114.3036
   446452 csum_partial_copy_generic 1800.2097
   330157 tcp_sendmsg                 76.4252
   300022 vsnprintf                  284.1117
   271134 __generic_copy_to_user    3389.1750
   237151 fget                      3705.4844
   222390 proc_pid_stat              308.8750
   210759 fput                       878.1625
   186408 tcp_ioctl                  314.8784
   179146 sys_ioctl                  238.2261
   177419 do_softirq                1232.0764
   167881 kmem_cache_free           1165.8403
   154854 skb_clone                  387.1350
   149377 d_lookup                   444.5744
   139131 kmem_cache_alloc           668.8990
   138638 kfree                      866.4875
   132555 sys_write                  637.2837

This is only aggravated by cacheline bouncing on SMP.  The reductions
in system cpu time will doubtless be beneficial for all.

Thanks,
Bill
On Thu, Sep 26, 2002 at 05:42:44AM -0700, William Lee Irwin III wrote:
> This is only aggravated by cacheline bouncing on SMP. The reductions
> of system cpu time will doubtless be beneficial for all.

On SMP, I would have thought that only sharing the fd table while
cloning tasks (CLONE_FILES) affects performance, by bouncing the rwlock
cache line.  Are there a lot of common workloads where this happens?

Anyway, the files_struct_rcu patch for 2.5.38 is up at
http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473

Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
On Thu, Sep 26, 2002 at 05:42:44AM -0700, William Lee Irwin III wrote:
>> This is only aggravated by cacheline bouncing on SMP. The reductions
>> of system cpu time will doubtless be beneficial for all.

On Thu, Sep 26, 2002 at 06:35:58PM +0530, Dipankar Sarma wrote:
> On SMP, I would have thought that only sharing the fd table while
> cloning tasks (CLONE_FILES) affects performance by bouncing the
> rwlock cache line. Are there a lot of common workloads where this
> happens ?
> Anyway the files_struct_rcu patch for 2.5.38 is up at
> http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473

It looks very unusual, but it is very real.  Some of my prior profile
results show this.  I'll run a before/after profile with this either
tonight or tomorrow night (it's 6:06AM PST here -- tonight is unlikely).

Cheers,
Bill
On Thu, 26 Sep 2002, William Lee Irwin III wrote:
> In my experience fget() is large even on UP kernels. For instance, a UP
> profile from a long-running interactive load UP box (my home machine):
I can affirm that;
6124639 total 4.1414
4883005 default_idle 101729.2708
380218 ata_input_data 1697.4018
242647 ata_output_data 1083.2455
35989 do_select 60.7922
34931 unix_poll 218.3187
33561 schedule 52.4391
29823 do_softirq 155.3281
27021 fget 422.2031
25270 sock_poll 526.4583
18224 preempt_schedule 379.6667
17895 sys_select 15.5339
17741 __generic_copy_from_user 184.8021
15397 __generic_copy_to_user 240.5781
13214 fput 55.0583
13088 add_wait_queue 163.6000
12637 system_call 225.6607
--
function.linuxpower.ca
On Thu, Sep 26, 2002 at 09:29:36AM -0400, Zwane Mwaikambo wrote:
> I can affirm that;
> 6124639 total 4.1414
> 4883005 default_idle 101729.2708
> 380218 ata_input_data 1697.4018
> 242647 ata_output_data 1083.2455
> 35989 do_select 60.7922
> 34931 unix_poll 218.3187
> 33561 schedule 52.4391
> 29823 do_softirq 155.3281
> 27021 fget 422.2031
> 25270 sock_poll 526.4583
Interesting, can you narrow down the poll overheads any? No immediate
needs (read as: leave your box up, but watch for it when you can),
but I'd be interested in knowing if it's fd chunk or poll table setup
overhead.
Thanks,
Bill
On Thu, 26 Sep 2002, William Lee Irwin III wrote:
> On Thu, Sep 26, 2002 at 09:29:36AM -0400, Zwane Mwaikambo wrote:
> > I can affirm that;
> > 6124639 total 4.1414
> > 4883005 default_idle 101729.2708
> > 380218 ata_input_data 1697.4018
> > 242647 ata_output_data 1083.2455
> > 35989 do_select 60.7922
> > 34931 unix_poll 218.3187
> > 33561 schedule 52.4391
> > 29823 do_softirq 155.3281
> > 27021 fget 422.2031
> > 25270 sock_poll 526.4583
>
> Interesting, can you narrow down the poll overheads any? No immediate
> needs (read as: leave your box up, but watch for it when you can),
> but I'd be interested in knowing if it's fd chunk or poll table setup
> overhead.
Sure, i'm pretty sure i know which application is doing that so i can
reproduce easily enough.
Zwane
--
function.linuxpower.ca
On Thu, Sep 26, 2002 at 06:39:19AM -0700, William Lee Irwin III wrote:
> Interesting, can you narrow down the poll overheads any? No immediate
> needs (read as: leave your box up, but watch for it when you can),
> but I'd be interested in knowing if it's fd chunk or poll table setup
> overhead.

Hmm.. I don't see this by just leaving the box up (and a few
interactive commands) (4CPU P3, 2.5.38-vanilla) -

8744695 default_idle             136635.8594
   4371 __rdtsc_delay               136.5938
  22793 do_softirq                  118.7135
   1734 serial_in                    21.6750
    261 .text.lock.serio             13.7368
8777715 total                         6.2461
    422 tasklet_hi_action             2.0288
    106 bh_action                     1.3250
     46 system_call                   1.0455
     56 __generic_copy_to_user        0.8750
    575 timer_bh                      0.8168
     70 __cpu_up                      0.7292
     57 cpu_idle                      0.5089
     24 __const_udelay                0.3750
     35 mdio_read                     0.3646
    120 probe_irq_on                  0.3571
    134 page_remove_rmap              0.3102
    108 page_add_rmap                 0.3068
     18 find_get_page                 0.2812
    189 do_wp_page                    0.2513
      7 fput                          0.2188
     27 pte_alloc_one                 0.1875
    135 __free_pages_ok               0.1834
      2 syscall_call                  0.1818
     11 pgd_alloc                     0.1719
     11 __free_pages                  0.1719
     65 i8042_interrupt               0.1693
      8 __wake_up                     0.1667
     16 find_vma                      0.1667
     15 serial_out                    0.1562
     15 radix_tree_lookup             0.1339
     17 kmem_cache_free               0.1328
     17 get_page_state                0.1328
     62 zap_pte_range                 0.1292
      6 mdio_sync                     0.1250
      3 ret_from_intr                 0.1250
      2 cap_inode_permission_lite     0.1250
      2 cap_file_permission           0.1250
     49 do_anonymous_page             0.1178
      9 lru_cache_add                 0.1125
      9 fget                          0.1125

What application were you all running ?

Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
On Fri, Sep 27, 2002 at 01:57:43PM +0530, Dipankar Sarma wrote:
> What application were you all running ?
> Thanks
Basically, the workload on my "desktop" system consists of numerous ssh
sessions in and out of the machine, half a dozen IRC clients, xmms,
Mozilla, and X overhead.
Cheers,
Bill
On Fri, Sep 27, 2002 at 02:20:20AM -0700, William Lee Irwin III wrote:
> On Fri, Sep 27, 2002 at 01:57:43PM +0530, Dipankar Sarma wrote:
> > What application were you all running ?
>
> Basically, the workload on my "desktop" system consists of numerous ssh
> sessions in and out of the machine, half a dozen IRC clients, xmms,
> Mozilla, and X overhead.

Ok, from a relatively idle system (4CPU) running an SMP kernel -

18 fget 0.2250
 0  0.00 c013d460: push   %ebx
 0  0.00 c013d461: mov    $0xffffe000,%edx
 0  0.00 c013d466: mov    %eax,%ecx
 0  0.00 c013d468: and    %esp,%edx
 0  0.00 c013d46a: mov    (%edx),%eax
 1  5.56 c013d46c: mov    0x674(%eax),%ebx
 1  5.56 c013d472: lea    0x4(%ebx),%eax
 0  0.00 c013d475: lock subl $0x1,(%eax)
 3 16.67 c013d479: js     c013d61b <.text.lock.file_table+0x30>
 0  0.00 c013d47f: mov    (%edx),%eax
 1  5.56 c013d481: mov    0x674(%eax),%edx
 0  0.00 c013d487: xor    %eax,%eax
 0  0.00 c013d489: cmp    0x8(%edx),%ecx
 0  0.00 c013d48c: jae    c013d494 <fget+0x34>
 0  0.00 c013d48e: mov    0x14(%edx),%eax
 0  0.00 c013d491: mov    (%eax,%ecx,4),%eax
 0  0.00 c013d494: test   %eax,%eax
 0  0.00 c013d496: je     c013d49c <fget+0x3c>
 0  0.00 c013d498: lock incl 0x14(%eax)
 0  0.00 c013d49c: lock incl 0x4(%ebx)
 5 27.78 c013d4a0: pop    %ebx
 0  0.00 c013d4a1: ret
 7 38.89 c013d4a2: lea    0x0(%esi,1),%esi

I tried an SMP kernel on 1 CPU -

15 fget 0.1875
 0  0.00 c013d460: push   %ebx
 2 13.33 c013d461: mov    $0xffffe000,%edx
 0  0.00 c013d466: mov    %eax,%ecx
 0  0.00 c013d468: and    %esp,%edx
 0  0.00 c013d46a: mov    (%edx),%eax
 0  0.00 c013d46c: mov    0x674(%eax),%ebx
 0  0.00 c013d472: lea    0x4(%ebx),%eax
 0  0.00 c013d475: lock subl $0x1,(%eax)
 3 20.00 c013d479: js     c013d61b <.text.lock.file_table+0x30>
 0  0.00 c013d47f: mov    (%edx),%eax
 0  0.00 c013d481: mov    0x674(%eax),%edx
 0  0.00 c013d487: xor    %eax,%eax
 0  0.00 c013d489: cmp    0x8(%edx),%ecx
 0  0.00 c013d48c: jae    c013d494 <fget+0x34>
 0  0.00 c013d48e: mov    0x14(%edx),%eax
 0  0.00 c013d491: mov    (%eax,%ecx,4),%eax
 0  0.00 c013d494: test   %eax,%eax
 0  0.00 c013d496: je     c013d49c <fget+0x3c>
 0  0.00 c013d498: lock incl 0x14(%eax)
 0  0.00 c013d49c: lock incl 0x4(%ebx)
 4 26.67 c013d4a0: pop    %ebx
 0  0.00 c013d4a1: ret
 6 40.00 c013d4a2: lea    0x0(%esi,1),%esi

The counts are off by one.

With a UP kernel, I see that fget() cost is negligible.  So it is most
likely the atomic operations for rwlock acquisition/release in fget()
that are adding to its cost.  Unless of course my sampling is too
sparse.  Please try running the files_struct_rcu patch, where fget() is
lock-free, and let me know what you see.

Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
>> > What application were you all running ?

Kernel compile on NUMA-Q looks like this:

125673 total
 82183 default_idle
  6134 do_anonymous_page
  4431 page_remove_rmap
  2345 page_add_rmap
  2288 d_lookup
  1921 vm_enough_memory
  1883 __generic_copy_from_user
  1566 file_read_actor
  1381 .text.lock.file_table <-------------
  1168 find_get_page
  1116 get_empty_filp

Presumably that's the same thing?  Interestingly, if I look back at
previous results, I see it's about twice the cost in -mm as it is in
mainline, not sure why ... at least against 2.5.37 virgin it was.

> Please try running the files_struct_rcu patch where fget() is lockfree
> and let me know what you see.

Will do ... if you tell me where it is ;-)

M.
On Fri, Sep 27, 2002 at 08:04:31AM -0700, Martin J. Bligh wrote:
> >> > What application were you all running ?
>
> Kernel compile on NUMA-Q looks like this:
>
> 125673 total
>  82183 default_idle
>   2288 d_lookup
>   1921 vm_enough_memory
>   1883 __generic_copy_from_user
>   1566 file_read_actor
>   1381 .text.lock.file_table <-------------

More likely, this is contention for the files_lock.  Do you have any
lockmeter data?  That should give us more information.  If so,
files_struct_rcu isn't likely to help.

>   1168 find_get_page
>   1116 get_empty_filp
>
> Presumably that's the same thing? Interestingly, if I look back at
> previous results, I see it's about twice the cost in -mm as it is
> in mainline, not sure why ... at least against 2.5.37 virgin it was.

Not sure why it shows up more in -mm, but likely because -mm has a lot
less contention on other locks like the dcache_lock.

> > Please try running the files_struct_rcu patch where fget() is lockfree
> > and let me know what you see.
>
> Will do ... if you tell me where it is ;-)

Oh, the usual place -
http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473

I wish sourceforge FRS continued to allow direct links to patches.

Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
On Fri, Sep 27, 2002 at 10:44:24PM +0530, Dipankar Sarma wrote:
> Not sure why it shows up more in -mm, but likely because -mm has
> lot less contention on other locks like dcache_lock.
Well, the profile I posted was an interactive UP workload, and it's
fairly high there. Trimming cycles off this is good for everyone.
Small SMP boxen (dual?) used similarly will probably see additional
gains as the number of locked operations in fget() will be reduced.
There's clearly no contention or cacheline bouncing in my workloads as
none of them have tasks sharing file tables, nor is anything else
messing with the cachelines.
Cheers,
Bill
On Fri, 27 Sep 2002, William Lee Irwin III wrote:
> On Fri, Sep 27, 2002 at 01:57:43PM +0530, Dipankar Sarma wrote:
> > What application were you all running ?
> > Thanks
>
> Basically, the workload on my "desktop" system consists of numerous ssh
> sessions in and out of the machine, half a dozen IRC clients, xmms,
> Mozilla, and X overhead.
That box is my development/main box, i run a lot of xterms, xmms, network
applications (ssh, browsers, irc etc...). Heavy simulator usage (i believe
thats where the poll stuff comes from, due to its virtual ethernet
interface) all done in X and the box is also local NFS server for the
various testboxes i have (heavy I/O, disk load) as well as kernel
compiles.
Zwane
--
function.linuxpower.ca
On Fri, 27 Sep 2002, Dipankar Sarma wrote:
> The counts are off by one.
>
> With a UP kernel, I see that fget() cost is negligible.
> So it is most likely the atomic operations for rwlock
> acquisition/release in fget() that are adding to its cost.
> Unless of course my sampling is too sparse.

Mine is a UP box, not an SMP kernel, although preempt is enabled;

0xc013d370 <fget>:    push   %ebx
0xc013d371 <fget+1>:  mov    %eax,%ecx
0xc013d373 <fget+3>:  mov    $0xffffe000,%edx
0xc013d378 <fget+8>:  and    %esp,%edx
0xc013d37a <fget+10>: incl   0x4(%edx)
0xc013d37d <fget+13>: xor    %ebx,%ebx
0xc013d37f <fget+15>: mov    0x554(%edx),%eax
0xc013d385 <fget+21>: cmp    0x8(%eax),%ecx
0xc013d388 <fget+24>: jae    0xc013d390 <fget+32>
0xc013d38a <fget+26>: mov    0x14(%eax),%eax
0xc013d38d <fget+29>: mov    (%eax,%ecx,4),%ebx
0xc013d390 <fget+32>: test   %ebx,%ebx
0xc013d392 <fget+34>: je     0xc013d397 <fget+39>
0xc013d394 <fget+36>: incl   0x14(%ebx)
0xc013d397 <fget+39>: decl   0x4(%edx)
0xc013d39a <fget+42>: mov    0x14(%edx),%eax
0xc013d39d <fget+45>: cmp    %eax,0x4(%edx)
0xc013d3a0 <fget+48>: jge    0xc013d3a7 <fget+55>
0xc013d3a2 <fget+50>: call   0xc01179b0 <preempt_schedule>
0xc013d3a7 <fget+55>: mov    %ebx,%eax
0xc013d3a9 <fget+57>: pop    %ebx
0xc013d3aa <fget+58>: ret
0xc013d3ab <fget+59>: nop
0xc013d3ac <fget+60>: lea    0x0(%esi,1),%esi

> Please try running the files_struct_rcu patch where fget() is lockfree
> and let me know what you see.

Lock acquisition/release should be painless on this system, no?

Zwane
--
function.linuxpower.ca
> On Fri, 27 Sep 2002, Dipankar Sarma wrote:
>> The counts are off by one.
>> With a UP kernel, I see that fget() cost is negligible.
>> So it is most likely the atomic operations for rwlock
>> acquisition/release in fget() that are adding to its cost.
>> Unless of course my sampling is too sparse.

On Sat, Sep 28, 2002 at 12:35:30AM -0400, Zwane Mwaikambo wrote:
> Mine is a UP box, not an SMP kernel, although preempt is enabled;
> 0xc013d370 <fget>:    push   %ebx
> 0xc013d371 <fget+1>:  mov    %eax,%ecx
> 0xc013d373 <fget+3>:  mov    $0xffffe000,%edx
> 0xc013d378 <fget+8>:  and    %esp,%edx
> 0xc013d37a <fget+10>: incl   0x4(%edx)

Do you have instruction-level profiles to show where the cost is on UP?

Thanks,
Bill
On Fri, 27 Sep 2002, William Lee Irwin III wrote:
> On Sat, Sep 28, 2002 at 12:35:30AM -0400, Zwane Mwaikambo wrote:
> > Mine is a UP box not an SMP kernel, although preempt is enabled;
> > 0xc013d370 <fget>: push %ebx
> > 0xc013d371 <fget+1>: mov %eax,%ecx
> > 0xc013d373 <fget+3>: mov $0xffffe000,%edx
> > 0xc013d378 <fget+8>: and %esp,%edx
> > 0xc013d37a <fget+10>: incl 0x4(%edx)
>
> Do you have instruction-level profiles to show where the cost is on UP?
Unfortunately no, i was lucky to remember to even be running profile=n on
this box.
--
function.linuxpower.ca
On Sat, Sep 28, 2002 at 12:54:39AM -0400, Zwane Mwaikambo wrote:
> On Fri, 27 Sep 2002, William Lee Irwin III wrote:
> > On Sat, Sep 28, 2002 at 12:35:30AM -0400, Zwane Mwaikambo wrote:
> > > Mine is a UP box, not an SMP kernel, although preempt is enabled;
> > > 0xc013d370 <fget>:    push   %ebx
> > > 0xc013d371 <fget+1>:  mov    %eax,%ecx
> > > 0xc013d373 <fget+3>:  mov    $0xffffe000,%edx
> > > 0xc013d378 <fget+8>:  and    %esp,%edx
> > > 0xc013d37a <fget+10>: incl   0x4(%edx)
> >
> > Do you have instruction-level profiles to show where the cost is on UP?
>
> Unfortunately no, i was lucky to remember to even be running profile=n on
> this box.

That is sufficient to get an instruction-level profile.  Just use the
hacked readprofile by tridge (it's available somewhere on his samba.org
webpage).  I suspect that inlining fget() will help; not sure whether
that is clean code-wise.

Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
On Fri, Sep 27, 2002 at 03:54:55PM -0700, William Lee Irwin III wrote:
> On Fri, Sep 27, 2002 at 10:44:24PM +0530, Dipankar Sarma wrote:
> > Not sure why it shows up more in -mm, but likely because -mm has
> > lot less contention on other locks like dcache_lock.
>
> Well, the profile I posted was an interactive UP workload, and it's
> fairly high there. Trimming cycles off this is good for everyone.

Oh, I was commenting on possible files_lock contention on mbligh's
NUMA-Q.

> Small SMP boxen (dual?) used similarly will probably see additional
> gains as the number of locked operations in fget() will be reduced.
> There's clearly no contention or cacheline bouncing in my workloads as
> none of them have tasks sharing file tables, nor is anything else
> messing with the cachelines.

I remember seeing fget() high up in specweb profiles.  I suspect the
fget() profile count is high because it just happens to get called very
often for most workloads (all file syscalls), and the atomic operations
(SMP) and the function call overhead just add to the cost.  If
possible, we should try inlining it too.

Thanks
--
Dipankar Sarma <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.