* kernel lock contention and scalability @ 2001-02-15 18:46 Jonathan Lahr 2001-02-25 9:52 ` Manfred Spraul 2001-03-05 0:38 ` Anton Blanchard 0 siblings, 2 replies; 14+ messages in thread From: Jonathan Lahr @ 2001-02-15 18:46 UTC (permalink / raw) To: linux-kernel [-- Attachment #1: Type: text/plain, Size: 905 bytes --] To discover possible locking limitations to scalability, I have collected locking statistics on a 2-way, 4-way, and 8-way performing as networked database servers. I patched the [48]-way kernels with Kravetz's multiqueue patch in the hope that mitigating runqueue_lock contention might better reveal other lock contention. In the attached document, I describe my test environment and excerpt lockstat output to show the more contentious locks for a typical run on each of my server configurations. I'm interested in comparing these data to other lock contention data, so information regarding previous or ongoing lock contention work would be appreciated. I'm aware of timer scalability work ongoing at people.redhat.com/mingo/scalable-timers, but is anyone working on reducing sem_ids contention? 
--
Jonathan Lahr
IBM Linux Technology Center
Beaverton, Oregon
lahr@us.ibm.com
503-578-3385

[-- Attachment #2: note.att --]
[-- Type: text/plain, Size: 10426 bytes --]

server configuration:

  hardware:
    memory:
      2-way: .5 Gb
      4-way: 1 Gb
      8-way: 1 Gb
    cpus:
      2-way: Pentium II, 300 MHz
      [48]-way: Pentium III, 700 MHz
    NICs: 100 Mbps ethernet (2)

  software:
    distribution: Redhat 7.0
    kernel:
      2-way: 2.4.0-test10 patched with lockmeter1.4.5-2.4.0
      [48]-way: 2.4.0 patched with lockmeter1.4.5-2.4.0, 2.4.0.MQ1-sched.rt
    database: postgresql-7.0.2-17
    client: pgbench (distributed with postgresql)

lockstat excerpts:

2way:

SPINLOCKS         HOLD            WAIT
  UTIL    CON   MEAN(  MAX )   MEAN(  MAX )    TOTAL   NOWAIT     SPIN REJECT  NAME
 4.04%  1.22%   50us(3344us)  5.2us(2014us)    36515    36068      447      0  kernel_flag
 0.01%  3.47%   46us( 427us)   17us(2014us)      144      139        5      0    do_coredump+0x24
 0.00%  0.00%  960us( 960us)    0us                1        1        0      0    do_exit+0x94
 0.00%  4.00%  2.0us( 4.2us)   75us(1876us)       25       24        1      0    ext2_discard_prealloc+0x24
 0.03%  0.70%   11us(1048us)  1.3us( 682us)     1144     1136        8      0    ext2_get_block+0x50
 1.78%  0.79%  455us(3344us)  0.8us( 759us)     1766     1752       14      0    ext2_sync_file+0x28
 0.62%  0.84%   12us(1289us)  2.5us(1717us)    23353    23157      196      0    real_lookup+0x68
 1.46%  1.29%  186us(2980us)  5.4us(1824us)     3553     3507       46      0    schedule+0x490
 0.01%  0.00%  456us( 596us)    0us                9        9        0      0    sync_old_buffers+0x20
 0.01%  1.83%  9.4us(  84us)  0.7us(  92us)      328      322        6      0    sys_fcntl64+0x44
 0.00%  3.87%  8.0us( 329us)  6.7us(1011us)      155      149        6      0    sys_ioctl+0x48
 0.02%  2.79%  1.9us( 805us)   19us(1986us)     5483     5330      153      0    sys_lseek+0x70
 0.00%  0.00%   22us(  22us)    0us                1        1        0      0    sys_sysctl+0x50
 0.01%  3.23%   17us(  84us)  0.5us(  25us)      155      150        5      0    tty_read+0xbc
 0.02%  2.35%   39us( 110us)  0.2us(  11us)      213      208        5      0    tty_write+0x1dc
 0.07%  1.09%  168us(1442us)  0.7us( 116us)      184      182        2      0    vfs_readdir+0x70
 0.00%  0.00%   31us(  31us)    0us                1        1        0      0    vfs_statfs+0x54
24.38% 23.93%   15us( 218us)  4.3us( 111us)   744475   566289   178186      0  runqueue_lock
 0.06% 15.97%  4.5us(  26us)  2.6us(  67us)     5592     4699      893      0    __wake_up+0xdc
 0.00% 10.27%  0.4us( 1.3us)  1.5us(  60us)      146      131       15      0    deliver_signal+0x58
 1.16%  8.59%  1.5us(  27us)  2.3us( 111us)   360313   329373    30940      0    process_timeout+0x14
 0.00%  0.00%  0.6us( 0.6us)    0us                1        1        0      0    release+0x28
23.15% 38.78%   28us( 218us)  6.2us( 108us)   376292   230381   145911      0    schedule+0xe0
 0.01% 45.34%  3.7us(  24us)   16us(  82us)      686      375      311      0    schedule+0x458
 0.00%  0.00%  2.8us(  70us)    0us               89       89        0      0    schedule+0x504
 0.01%  8.55%  3.0us(  18us)  1.9us(  68us)     1356     1240      116      0    wake_up_process+0x14
 0.11%  4.97%   12us(1113us)  1.0us(1540us)     4041     3840      201      0  sem_ids+0x24
 0.00%  1.32%  7.1us(  88us)  0.1us(  11us)      303      299        4      0    semctl_main+0x4c
 0.06%  3.85%   11us( 281us)  0.5us(  81us)     2392     2300       92      0    sys_semop+0xe8
 0.04%  7.80%   15us(1113us)  2.2us(1540us)     1346     1241      105      0    sys_semop+0x3c8
 2.31%  6.86%  0.9us(  15us)  0.2us(  13us)  1102822  1027206    75616      0  timerlist_lock
 1.07%  4.75%  1.3us(  11us)  0.1us( 8.8us)   365451   348102    17349      0    add_timer+0x14
 0.00%  1.91%  0.3us( 4.2us)  0.1us( 4.5us)     3935     3860       75      0    del_timer+0x14
 0.32%  5.71%  0.4us( 7.2us)  0.2us(  13us)   362967   342246    20721      0    del_timer_sync+0x2c
 0.02%  1.47%  1.8us( 9.1us)  0.0us( 6.2us)     3942     3884       58      0    mod_timer+0x18
 0.02%  0.09%  2.3us(  15us)  0.0us( 2.8us)     4514     4510        4      0    timer_bh+0xd0
 0.89% 10.33%  1.1us( 7.6us)  0.3us( 8.2us)   362013   324604    37409      0    timer_bh+0x26c

4way:

SPINLOCKS         HOLD            WAIT
  UTIL    CON   MEAN(  MAX )   MEAN(  MAX )    TOTAL   NOWAIT     SPIN REJECT  NAME
 0.18% 33.57%  6.0us(  89us)  3.2us( 114us)    97322    64653    32669      0  sem_ids+0x24
 0.01% 15.07%  2.0us(  69us)  0.9us(  44us)    10551     8961     1590      0    semctl_main+0x50
 0.00%  0.00%  1.3us( 3.7us)    0us              248      248        0      0    sys_semget+0xd0
 0.07% 23.57%  4.1us(  86us)  3.1us( 105us)    54350    41537    12813      0    sys_semop+0xf0
 0.10% 56.77%   10us(  89us)  4.2us( 114us)    32173    13907    18266      0    sys_semop+0x35c
 0.13% 10.71%  0.4us( 3.6us)  0.2us(  18us)  1147726  1024826   122900      0  timerlist_lock
 0.06%  9.85%  0.6us( 3.0us)  0.2us(  14us)   361475   325856    35619      0    add_timer+0x10
 0.00%  0.19%  0.1us( 1.2us)  0.0us( 6.2us)    45152    45068       84      0    del_timer+0x14
 0.03% 11.15%  0.3us( 2.4us)  0.2us(  18us)   341333   303277    38056      0    del_timer_sync+0x1c
 0.01%  0.44%  0.5us( 3.6us)  0.0us( 7.2us)    46186    45981      205      0    mod_timer+0x18
 0.01%  0.01%  0.5us( 2.9us)  0.0us( 1.2us)    32429    32425        4      0    timer_bh+0xcc
 0.03% 15.24%  0.3us( 2.0us)  0.3us(  10us)   321151   272219    48932      0    timer_bh+0x254

 0.00%  7.03%  0.2us( 2.3us)  0.2us(  20us)     6882     6398      484      0  add_wait_queue_exclusive+0x10
 0.00% 50.00%  0.1us( 0.1us)  3.2us( 6.4us)        2        1        1      0  inet_wait_for_connect+0x104
 0.07%  7.48%  0.8us(  13us)  0.2us(  13us)   294222   272202    22020      0  process_timeout+0x24
 0.02% 10.56%  0.5us( 4.6us)    0us           114853   102721        0  12132  reschedule_idle+0x3a4
 0.04% 15.82%  1.2us( 5.7us)    0us           101053    85069        0  15984  schedule+0x5a8
 0.00% 12.64%  2.3us(  12us)  0.3us(  11us)     2461     2150      311      0  schedule+0xb44
 0.00% 16.00%  1.6us( 3.0us)  0.5us( 3.5us)       50       42        8      0  schedule+0xb80
 0.00% 10.36%  0.4us(  13us)  1.4us(  20us)      251      225       26      0  tcp_close+0x30
 0.00%  5.24%  0.1us( 1.4us)  1.1us(  39us)      248      235       13      0  tcp_setsockopt+0x98

8way:

SPINLOCKS         HOLD            WAIT
  UTIL    CON   MEAN(  MAX )   MEAN(  MAX )    TOTAL   NOWAIT     SPIN REJECT  NAME
 1.15%  9.78%  7.6us( 363us)  2.1us( 862us)  1560956  1408297   152659      0  io_request_lock
 0.00% 45.45%  0.5us( 5.3us)   13us( 250us)    58066    31677    26389      0    __get_request_wait+0x70
 0.72% 12.06%   28us( 363us)  2.5us( 696us)   266880   234706    32174      0    __make_request+0xfc
 0.01% 21.74%  0.2us( 7.0us)  3.9us( 862us)   241846   189278    52568      0    blk_get_queue+0x14
 0.25%  4.19%  8.2us(  87us)  0.8us( 303us)   310337   297332    13005      0    do_aic7xxx_isr+0x20
 0.02%  3.39%  2.7us(  35us)  0.3us( 233us)    63155    61015     2140      0    generic_unplug_device+0x10
 0.04%  6.39%  2.4us(  34us)  2.2us( 314us)   155168   145259     9909      0    scsi_dispatch_cmd+0x12c
 0.03%  6.09%  1.9us(  12us)  1.2us( 285us)   155168   145725     9443      0    scsi_old_done+0x614
 0.07%  0.06%  4.9us(  65us)  0.0us( 177us)   155168   155080       88      0    scsi_queue_next_request+0x18
 0.02%  4.47%  1.2us(  11us)  0.9us( 349us)   155168   148225     6943      0    scsi_request_fn+0x338
 0.28% 36.38%  7.1us( 973us)  4.6us( 987us)   405038   257673   147365      0  sem_ids+0x24
 0.01% 19.36%  2.2us(  42us)  2.3us( 149us)    56047    45198    10849      0    semctl_main+0x50
 0.00%  0.00%  1.9us(  98us)    0us              992      992        0      0    sys_semget+0xd0
 0.12% 23.36%  5.6us( 232us)  3.4us( 987us)   214063   164056    50007      0    sys_semop+0xf0
 0.15% 64.59%   12us( 973us)  7.5us( 973us)   133936    47427    86509      0    sys_semop+0x35c
 0.54% 14.58%  0.6us(  12us)  0.5us(  78us)  8829923  7542237  1287686      0  timerlist_lock
 0.21% 11.40%  0.8us(  12us)  0.4us(  78us)  2856951  2531286   325665      0    add_timer+0x10
 0.00%  0.82%  0.1us( 8.1us)  0.0us(  28us)   158853   157543     1310      0    del_timer+0x14
 0.12% 15.81%  0.4us( 7.8us)  0.5us(  76us)  2787756  2347125   440631      0    del_timer_sync+0x1c
 0.02%  1.19%  0.9us( 9.6us)  0.0us(  28us)   183111   180931     2180      0    mod_timer+0x18
 0.02%  0.08%  1.6us( 9.1us)  0.0us( 5.5us)   102552   102466       86      0    timer_bh+0xcc
 0.18% 18.89%  0.7us( 8.0us)  0.5us(  42us)  2740700  2222886   517814      0    timer_bh+0x254

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: kernel lock contention and scalability 2001-02-15 18:46 kernel lock contention and scalability Jonathan Lahr @ 2001-02-25 9:52 ` Manfred Spraul 2001-03-05 18:41 ` Jonathan Lahr 2001-03-05 0:38 ` Anton Blanchard 1 sibling, 1 reply; 14+ messages in thread From: Manfred Spraul @ 2001-02-25 9:52 UTC (permalink / raw) To: Jonathan Lahr; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 2023 bytes --] Jonathan Lahr wrote: > > To discover possible locking limitations to scalability, I have collected > locking statistics on a 2-way, 4-way, and 8-way performing as networked > database servers. I patched the [48]-way kernels with Kravetz's multiqueue > patch in the hope that mitigating runqueue_lock contention might better > reveal other lock contention. > The dual cpu numbers are really odd. Extremely high count of add_timer(), del_timer_sync(), schedule() and process_timeout(). That could be a kernel bug: perhaps someone uses for(;;) { set_current_state(TASK_INTERRUPTIBLE); schedule_timeout(100); } without checking signal_pending()? > In the attached document, I describe my test environment and excerpt > lockstat output to show the more contentious locks for a typical run on > each of my server configurations. I'm interested in comparing these data > to other lock contention data, so information regarding previous or ongoing > lock contention work would be appreciated. I'm aware of timer scalability > work ongoing at people.redhat.com/mingo/scalable-timers, but is anyone > working on reducing sem_ids contention? > Is that really a problem? The contention is high, but the actual lost time is quite small. The 8-way test ran for ~ 129 seconds wall clock time (total cpu time 1030 seconds), and around 0.7 seconds were lost due to spinning. The high contention is caused by the wakeups: cpu0 scans the list of waiting processes and if it finds one it is woken up. If that thread runs before cpu0 can release the spinlock, the second cpu will spin. 
I've attached 2 changes that might reduce the contention, but it's just an idea, completely untested.

* slightly more efficient try_atomic_semop().
* don't acquire the spinlock if q->alter was 0. It could slightly improve performance, but I assume that q->alter will be always 1.

Btw, I found a small bug in try_atomic_semop(): If a semaphore operation with sem_op==0 blocks, then the pid is corrupted. The bug also exists in 2.2.

--
	Manfred

[-- Attachment #2: patch-sem --]
[-- Type: text/plain, Size: 1648 bytes --]

--- sem.c.old	Sun Feb 25 10:50:55 2001
+++ sem.c	Sun Feb 25 10:51:19 2001
@@ -250,23 +250,23 @@
 		curr = sma->sem_base + sop->sem_num;
 		sem_op = sop->sem_op;
-		if (!sem_op && curr->semval)
+		result = curr->semval;
+		if (!sem_op && result)
 			goto would_block;
+		result += sem_op;
+		if (result < 0)
+			goto would_block;
+		if (result > SEMVMX)
+			goto out_of_range;
 		curr->sempid = (curr->sempid << 16) | pid;
-		curr->semval += sem_op;
+		curr->semval = result;
 		if (sop->sem_flg & SEM_UNDO)
 			un->semadj[sop->sem_num] -= sem_op;
-
-		if (curr->semval < 0)
-			goto would_block;
-		if (curr->semval > SEMVMX)
-			goto out_of_range;
 	}
 	if (do_undo) {
-		sop--;
 		result = 0;
 		goto undo;
 	}
@@ -285,6 +285,7 @@
 	result = 1;
 undo:
+	sop--;
 	while (sop >= sops) {
 		curr = sma->sem_base + sop->sem_num;
 		curr->semval -= sop->sem_op;
@@ -305,7 +306,9 @@
 {
 	int error;
 	struct sem_queue * q;
+	int do_retry = 0;
+retry:
 	for (q = sma->sem_pending; q; q = q->next) {
 		if (q->status == 1)
@@ -323,10 +326,17 @@
 				q->status = 1;
 				return;
 			}
-			q->status = error;
 			remove_from_queue(sma,q);
+			wmb();
+			q->status = error;
+			/* FIXME: retry only required if an increase was
+			 * executed
+			 */
+			do_retry = 1;
 		}
 	}
+	if (do_retry)
+		goto retry;
 }
 /* The following counts are associated to each semaphore:
@@ -919,7 +929,13 @@
 		sem_unlock(semid);
 		schedule();
-
+		if (queue.status == 0) {
+			error = 0;
+			if (queue.prev)
+				BUG();
+			current->semsleeping = NULL;
+			goto out_free;
+		}
 		tmp = sem_lock(semid);
 		if(tmp==NULL) {
 			if(queue.prev != NULL)

^ permalink raw reply [flat|nested] 14+ messages in thread
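The "around 0.7 seconds" estimate in the message above can be reproduced from the 8-way lockstat row for sem_ids+0x24, under the assumption (not stated in the thread) that WAIT MEAN is averaged over the requests that actually spun. The same row also suggests that the CON column is (SPIN + REJECT) / TOTAL. A small sketch of that arithmetic follows; the struct and function names are illustrative only and are not part of lockmeter.

```c
#include <assert.h>
#include <stdio.h>

/* One lockstat row, reduced to the fields used here (illustrative names). */
struct lock_row {
    const char *name;
    double wait_mean_us;    /* WAIT MEAN column, in microseconds */
    unsigned long total;    /* TOTAL requests */
    unsigned long spin;     /* SPIN: requests that had to spin */
    unsigned long reject;   /* REJECT: requests that gave up instead of spinning */
};

/* Values copied from the 8-way excerpt, sem_ids+0x24 row. */
const struct lock_row sem_ids_8way = {
    "sem_ids+0x24", 4.6, 405038UL, 147365UL, 0UL
};

/* Contended requests as a percentage of all requests. */
double contention_pct(const struct lock_row *r)
{
    assert(r->total > 0);
    return 100.0 * (double)(r->spin + r->reject) / (double)r->total;
}

/* Seconds lost spinning, ASSUMING the mean wait applies to spinning requests only. */
double spin_seconds(const struct lock_row *r)
{
    return (double)r->spin * r->wait_mean_us * 1e-6;
}

/* Print both derived figures for one row. */
void print_row(const struct lock_row *r)
{
    printf("%s: CON %.2f%%, about %.2f s spent spinning\n",
           r->name, contention_pct(r), spin_seconds(r));
}
```

Calling print_row(&sem_ids_8way) gives a contention figure that agrees with the 36.38% in the CON column and a spin time of roughly 0.68 seconds, which is small next to the ~1030 seconds of total cpu time quoted above.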
* Re: kernel lock contention and scalability 2001-02-25 9:52 ` Manfred Spraul @ 2001-03-05 18:41 ` Jonathan Lahr 0 siblings, 0 replies; 14+ messages in thread From: Jonathan Lahr @ 2001-03-05 18:41 UTC (permalink / raw) To: Manfred Spraul; +Cc: Jonathan Lahr, linux-kernel Manfred Spraul [manfred@colorfullife.com] wrote: > > > lock contention work would be appreciated. I'm aware of timer scalability > > work ongoing at people.redhat.com/mingo/scalable-timers, but is anyone > > working on reducing sem_ids contention? > > Is that really a problem? > The contention is high, but the actual lost time is quite small. I agree it isn't a major performance problem under that workload. But I thought that, since the contention was high, other workloads that use it more heavily might show it to be a significant problem. > I've attached 2 changes that might reduce the contention, but it's just > an idea, completely untested. Thanks for the insight into the semaphore subsystem and the suggested fixes. -- Jonathan Lahr IBM Linux Technology Center Beaverton, Oregon lahr@us.ibm.com 503-578-3385 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: kernel lock contention and scalability 2001-02-15 18:46 kernel lock contention and scalability Jonathan Lahr 2001-02-25 9:52 ` Manfred Spraul @ 2001-03-05 0:38 ` Anton Blanchard 2001-03-06 22:45 ` Jonathan Lahr 1 sibling, 1 reply; 14+ messages in thread From: Anton Blanchard @ 2001-03-05 0:38 UTC (permalink / raw) To: Jonathan Lahr; +Cc: linux-kernel Hi, > To discover possible locking limitations to scalability, I have collected > locking statistics on a 2-way, 4-way, and 8-way performing as networked > database servers. I patched the [48]-way kernels with Kravetz's multiqueue > patch in the hope that mitigating runqueue_lock contention might better > reveal other lock contention. ... > 24.38% 23.93% 15us( 218us) 4.3us( 111us) 744475 566289 178186 0 runqueue_lock > 23.15% 38.78% 28us( 218us) 6.2us( 108us) 376292 230381 145911 0 schedule+0xe0 Tridge and I tried out the postgresql benchmark you used here and this contention is due to a bug in postgres. From a quick strace, we found the threads do a load of select(0, NULL, NULL, NULL, {0,0}). Basically all threads are pounding on schedule(). Our guess is that the app has some form of userspace synchronisation (semaphores/spinlocks). I'd argue that the app needs to be fixed not the kernel, or a more valid test case is put forwards. :) PS: I just looked at the postgresql source and the spinlocks (s_lock() etc) are in a tight loop doing select(0, NULL, NULL, NULL, {0,0}). In samba we have userspace spinlocks, but they cover small amounts of code and offer an advantage over ipc semaphores. When you have to synchronise large sections of code ipc semaphores are reasonably fast on linux and would be a better fit. Cheers, Anton ^ permalink raw reply [flat|nested] 14+ messages in thread
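For the "large sections of code" case mentioned above, a SysV IPC semaphore used as a sleeping lock might look roughly like the sketch below. It is not taken from samba or postgresql; the ipc_lock_* names are hypothetical, error handling is minimal, and a real caller would probably want to retry semop() on EINTR.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* On Linux the caller has to define union semun itself (see semctl(2)). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private set holding one semaphore, initialised to 1 (unlocked).
 * Returns the semaphore id, or -1 on failure. */
int ipc_lock_create(void)
{
    union semun arg;
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (semid < 0)
        return -1;
    arg.val = 1;
    if (semctl(semid, 0, SETVAL, arg) < 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    return semid;
}

/* Apply one operation to semaphore 0; SEM_UNDO releases the lock if the holder dies. */
static int ipc_lock_op(int semid, short delta)
{
    struct sembuf op;

    assert(semid >= 0);
    op.sem_num = 0;
    op.sem_op = delta;
    op.sem_flg = SEM_UNDO;
    return semop(semid, &op, 1);
}

/* Sleeps in the kernel until the semaphore can be decremented. */
int ipc_lock_acquire(int semid) { return ipc_lock_op(semid, -1); }

/* Increments the semaphore, waking one sleeper if there is one. */
int ipc_lock_release(int semid) { return ipc_lock_op(semid, 1); }

/* Current semaphore value: 1 when free, 0 when held. */
int ipc_lock_value(int semid) { return semctl(semid, 0, GETVAL); }

/* Remove the semaphore set from the system. */
int ipc_lock_destroy(int semid) { return semctl(semid, 0, IPC_RMID); }
```

A waiter here blocks inside semop() rather than bouncing through schedule(), which is the trade being argued for above: one system call per acquire and release, but no spinning while a long critical section runs.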
* Re: kernel lock contention and scalability 2001-03-05 0:38 ` Anton Blanchard @ 2001-03-06 22:45 ` Jonathan Lahr 2001-03-06 23:39 ` Matthew Kirkwood 2001-03-11 6:26 ` Anton Blanchard 0 siblings, 2 replies; 14+ messages in thread From: Jonathan Lahr @ 2001-03-06 22:45 UTC (permalink / raw) To: Anton Blanchard; +Cc: Jonathan Lahr, linux-kernel > Tridge and I tried out the postgresql benchmark you used here and this > contention is due to a bug in postgres. From a quick strace, we found > the threads do a load of select(0, NULL, NULL, NULL, {0,0}). Basically all > threads are pounding on schedule(). ... > Our guess is that the app has some form of userspace synchronisation > (semaphores/spinlocks). I'd argue that the app needs to be fixed not the > kernel, or a more valid test case is put forwards. :) ... > PS: I just looked at the postgresql source and the spinlocks (s_lock() etc) > are in a tight loop doing select(0, NULL, NULL, NULL, {0,0}). Anton, Thanks for looking into postgresql/pgbench related locking. Yes, apparently postgresql uses a synchronization scheme that uses select() to effect delays for backing off while attempting to acquire a lock. However, it seems to me that runqueue lock contention was not entirely due to postgresql code, since it was largely alleviated by the multiqueue scheduler patch. In using postgresql/pgbench to measure lock contention, I was attempting to apply a typical server workload to measure scalability using only open software. My goal is to load and measure the kernel for server performance, so I need to ensure that the software I use represents likely real world server configurations. I did not use mysql, because it cannot perform transactions which I considered important. Any pointers to other open database software or benchmarks that might be suitable for this effort would be appreciated. Jonathan ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: kernel lock contention and scalability 2001-03-06 22:45 ` Jonathan Lahr @ 2001-03-06 23:39 ` Matthew Kirkwood 2001-03-07 0:28 ` Tim Wright 2001-03-11 6:50 ` Anton Blanchard 2001-03-11 6:26 ` Anton Blanchard 1 sibling, 2 replies; 14+ messages in thread From: Matthew Kirkwood @ 2001-03-06 23:39 UTC (permalink / raw) To: Jonathan Lahr; +Cc: Anton Blanchard, linux-kernel On Tue, 6 Mar 2001, Jonathan Lahr wrote: [ sorry to reply over another reply, but I don't have the original of this ] > > Tridge and I tried out the postgresql benchmark you used here and this > > contention is due to a bug in postgres. From a quick strace, we found > > the threads do a load of select(0, NULL, NULL, NULL, {0,0}). I can shed some light on this (though I'm far from a PG hacker). Postgres can use either of two locking methods -- SysV semaphores (which it tries to avoid, assuming that they'll be too heavy) or userspace spinlocks (via inline assembler on platforms which support it). In the slow path of a spinlock_acquire they busy wait for a few cycles, and then call schedule with a zero timeout assuming that it'll basically do the same as a sched_yield() but more portably. Matthew. ^ permalink raw reply [flat|nested] 14+ messages in thread
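The slow path described above can be sketched roughly as follows. This is not PostgreSQL's s_lock() code: the names uspinlock, uspin_lock and SPINS_BEFORE_BACKOFF are made up for illustration, and C11 atomics stand in for the inline assembler. The backoff is passed in as a function so that the zero-timeout select() seen in the strace and a plain sched_yield() can be compared side by side.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/select.h>

#define SPINS_BEFORE_BACKOFF 100    /* arbitrary value, for illustration only */

typedef struct { atomic_flag held; } uspinlock;   /* hypothetical lock type */
typedef void (*backoff_fn)(void);

/* Backoff via a zero-timeout select(), as observed in the strace above. */
void backoff_select(void)
{
    struct timeval tv = { 0, 0 };
    select(0, NULL, NULL, NULL, &tv);
}

/* Backoff via an explicit yield. */
void backoff_yield(void)
{
    sched_yield();
}

void uspin_init(uspinlock *l)
{
    atomic_flag_clear(&l->held);
}

/* Returns 1 if the lock was taken, 0 if somebody else holds it. */
int uspin_trylock(uspinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Busy-wait for a bounded number of attempts, then back off into the kernel.
 * Returns how many times the backoff was invoked before the lock was taken. */
unsigned long uspin_lock(uspinlock *l, backoff_fn backoff)
{
    unsigned long backoffs = 0;

    assert(backoff != NULL);
    for (;;) {
        for (int i = 0; i < SPINS_BEFORE_BACKOFF; i++)
            if (uspin_trylock(l))
                return backoffs;
        backoff();      /* every call here is a trip through the scheduler */
        backoffs++;
    }
}

void uspin_unlock(uspinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

With many processes contending, each pass through the outer loop costs a system call that returns almost immediately, which would be consistent with the very high schedule() and runqueue_lock counts in the 2-way numbers.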
* Re: kernel lock contention and scalability 2001-03-06 23:39 ` Matthew Kirkwood @ 2001-03-07 0:28 ` Tim Wright 2001-03-07 3:12 ` Jeff Dike 2001-03-11 6:50 ` Anton Blanchard 1 sibling, 1 reply; 14+ messages in thread From: Tim Wright @ 2001-03-07 0:28 UTC (permalink / raw) To: Matthew Kirkwood; +Cc: Jonathan Lahr, Anton Blanchard, linux-kernel On Tue, Mar 06, 2001 at 11:39:17PM +0000, Matthew Kirkwood wrote: > On Tue, 6 Mar 2001, Jonathan Lahr wrote: > > [ sorry to reply over another reply, but I don't have > the original of this ] > > > > Tridge and I tried out the postgresql benchmark you used here and this > > > contention is due to a bug in postgres. From a quick strace, we found > > > the threads do a load of select(0, NULL, NULL, NULL, {0,0}). > > I can shed some light on this (though I'm far from a PG hacker). > > Postgres can use either of two locking methods -- SysV semaphores > (which it tries to avoid, assuming that they'll be too heavy) or > userspace spinlocks (via inline assembler on platforms which support > it). > > In the slow path of a spinlock_acquire they busy wait for a few > cycles, and then call schedule with a zero timeout assuming that > it'll basically do the same as a sched_yield() but more portably. > Ugh! I had a nasty feeling that might be what they were up to. The reason for the "ugh" is as follows. If you're on a UP system, it never makes sense to spin in userland, since you'll just burn up a timeslice and prevent the lock holder from running. I haven't looked, but assume that their code only uses spinlocks on SMP. If you're on an SMP system, then you shouldn't be using a spinlock unless the critical section is "short", in which case the waiters should simply spin in userland rather than making system calls, which are simply overhead.
If the argument is that the "spinners" take too much useful time away from other processes, then it sounds like the contention is too high, or that the critical section is sufficiently long that semaphores would be a better choice. Actually, what's really needed here is an efficient form of dynamically marking a process as non-preemptible so that when acquiring a spinlock the process can ensure that it exits the critical section as fast as possible, when it would relinquish its non-preemptible privilege. Another synchronization method popular with database peeps is "post/wait" for which SGI have a patch available for Linux. I understand that this is relatively "light weight" and might be a better choice for PG. Tim -- Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com IBM Linux Technology Center, Beaverton, Oregon Interested in Linux scalability ? Look at http://lse.sourceforge.net/ "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: kernel lock contention and scalability 2001-03-07 0:28 ` Tim Wright @ 2001-03-07 3:12 ` Jeff Dike 2001-03-07 22:13 ` Tim Wright 0 siblings, 1 reply; 14+ messages in thread From: Jeff Dike @ 2001-03-07 3:12 UTC (permalink / raw) To: timw; +Cc: Jonathan Lahr, Anton Blanchard, linux-kernel timw@splhi.com said: > If you're a UP system, it never makes sense to spin in userland, since > you'll just burn up a timeslice and prevent the lock holder from > running. I haven't looked, but assume that their code only uses > spinlocks on SMP. If you're an SMP system, then you shouldn't be using > a spinlock unless the critical section is "short", in which case the > waiters should simply spin in userland rather than making system calls > which is simply overhead. This is a problem that UML is going to have when I turn SMP back on. Emulating a multiprocessor box on a UP host with the existing locking primitives is going to result in exactly this problem. > Actually, what's really needed here is an efficient form of > dynamically marking a process as non-preemptible so that when > acquiring a spinlock the process can ensure that it exits the critical > section as fast as possible, when it would relinquish its > non-preemptible privilege. That sounds like a pretty fundamental (and abusable) mechanism. I had a suggestion from an IBM guy at ALS last year to make UML "spin"-locks actually sleep in the host (this doesn't make them sleep locks in userspace because they don't call schedule), which sounds reasonable. This gives the lock-holder an opportunity to run immediately. It's unclear to me what the wake-up mechanism would be, though. Another thought I had was to raise the priority of a thread holding a spinlock. This would reduce the chance that it would be preempted by a thread that will waste a timeslice spinning on that lock. I don't know whether this is a good idea either. 
> Another synchronization method popular with database peeps is "post/ > wait" for which SGI have a patch available for Linux. I understand > that this is relatively "light weight" and might be a better choice > for PG. URL? Jeff ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: kernel lock contention and scalability 2001-03-07 3:12 ` Jeff Dike @ 2001-03-07 22:13 ` Tim Wright 2001-03-08 23:26 ` Jeff Dike 0 siblings, 1 reply; 14+ messages in thread From: Tim Wright @ 2001-03-07 22:13 UTC (permalink / raw) To: Jeff Dike; +Cc: timw, Jonathan Lahr, Anton Blanchard, linux-kernel On Tue, Mar 06, 2001 at 10:12:17PM -0500, Jeff Dike wrote: > timw@splhi.com said: > > If you're a UP system, it never makes sense to spin in userland, since > > you'll just burn up a timeslice and prevent the lock holder from > > running. I haven't looked, but assume that their code only uses > > spinlocks on SMP. If you're an SMP system, then you shouldn't be using > > a spinlock unless the critical section is "short", in which case the > > waiters should simply spin in userland rather than making system calls > > which is simply overhead. > > This is a problem that UML is going to have when I turn SMP back on. > Emulating a multiprocessor box on a UP host with the existing locking > primitives is going to result in exactly this problem. > Yes. On a uniprocessor system, a simple fallback is to just use a semaphore instead of a spinlock, since you can guarantee that there's no point in scheduling the current task until the holder of the "lock" releases it. Otherwise, the spin calling sched_yield() each iteration isn't too horrible. > > Actually, what's really needed here is an efficient form of > > dynamically marking a process as non-preemptible so that when > > acquiring a spinlock the process can ensure that it exits the critical > > section as fast as possible, when it would relinquish its > > non-preemptible privilege. > > That sounds like a pretty fundamental (and abusable) mechanism. > It would be if it were generally available. The implementation on DYNIX/ptx requires a privilege (PRIV_SCHED IIRC), to be able to use it. It was added for a database to prevent preemption during critical sections. 
> I had a suggestion from an IBM guy at ALS last year to make UML "spin"-locks > actually sleep in the host (this doesn't make them sleep locks in userspace > because they don't call schedule), which sounds reasonable. This gives the > lock-holder an opportunity to run immediately. It's unclear to me what the > wake-up mechanism would be, though. > Hmmm... it depends what you mean by sleep, i.e. sleep(3) vs. making a system call that sleeps. I would have thought the latter, and use semaphores again. > Another thought I had was to raise the priority of a thread holding a > spinlock. This would reduce the chance that it would be preempted by a thread > that will waste a timeslice spinning on that lock. I don't know whether this > is a good idea either. > That's basically a weaker version of the no-preempt. Not a bad idea, but less than optimal :-) Regards, Tim -- Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com IBM Linux Technology Center, Beaverton, Oregon Interested in Linux scalability ? Look at http://lse.sourceforge.net/ "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: kernel lock contention and scalability 2001-03-07 22:13 ` Tim Wright @ 2001-03-08 23:26 ` Jeff Dike 0 siblings, 0 replies; 14+ messages in thread From: Jeff Dike @ 2001-03-08 23:26 UTC (permalink / raw) To: timw; +Cc: Jonathan Lahr, Anton Blanchard, linux-kernel timw@splhi.com said: > On a uniprocessor system, a simple fallback is to just use a semaphore > instead of a spinlock, since you can guarantee that there's no point > in scheduling the current task until the holder of the "lock" releases > it. Yeah, that works. But I'm not all that interested in compiling UML differently for UP and SMP hosts. > Otherwise, the spin calling sched_yield() each iteration isn't too > horrible. This looks a lot better. For UML, if there's a thread spinning on a lock, there has to be a runnable thread holding it, and that thread will get a timeslice before the spinning one (assuming that the thread holding the lock hasn't called a blocking system call, which is something that I intend to make sure can't happen). > > That sounds like a pretty fundamental (and abusable) mechanism. > > It would be if it were generally available. The implementation on > DYNIX/ptx requires a privilege (PRIV_SCHED IIRC), to be able to use > it. OK, that makes sense. Jeff ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: kernel lock contention and scalability 2001-03-06 23:39 ` Matthew Kirkwood 2001-03-07 0:28 ` Tim Wright @ 2001-03-11 6:50 ` Anton Blanchard 1 sibling, 0 replies; 14+ messages in thread From: Anton Blanchard @ 2001-03-11 6:50 UTC (permalink / raw) To: Matthew Kirkwood; +Cc: Jonathan Lahr, linux-kernel

Hi,

> In the slow path of a spinlock_acquire they busy wait for a few
> cycles, and then call schedule with a zero timeout assuming that
> it'll basically do the same as a sched_yield() but more portably.

The obvious problem with this is that we bounce in and out of schedule() a few times before moving on to the next task. I see this also with sched_yield(). I had this patch lying around which I think came about when I was playing with pthreads (which for spinlocks does sched_yield() for a while before sleeping)

--- linux/kernel/sched.c	Fri Mar 9 10:26:56 2001
+++ linux_intel/kernel/sched.c	Fri Mar 9 08:42:39 2001
@@ -505,6 +505,9 @@
 			goto out_unlock;
 	}
 #else
+	if (prev->policy & SCHED_YIELD)
+		prev->counter = (prev->counter >> 4);
+
 	prev->policy &= ~SCHED_YIELD;
 #endif /* CONFIG_SMP */
 }

Anton

/* test sched_yield */

#include <stdio.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#undef USE_SELECT

void waste_time()
{
	int i;
	for(i = 0; i < 10000; i++)
		;
}

void do_stuff(int i)
{
#ifdef USE_SELECT
	struct timeval tv;
#endif

	while(1) {
		fprintf(stderr, "%d\n", i);
		waste_time();
#ifdef USE_SELECT
		tv.tv_sec = 0;
		tv.tv_usec = 0;
		select(0, NULL, NULL, NULL, &tv);
#else
		sched_yield();
#endif
	}
}

int main()
{
	int i, pid;

	for(i = 0; i < 10; i++) {
		pid = fork();

		if (!pid)
			do_stuff(i);
	}

	do_stuff(i+1);

	return 0;
}

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: kernel lock contention and scalability 2001-03-06 22:45 ` Jonathan Lahr 2001-03-06 23:39 ` Matthew Kirkwood @ 2001-03-11 6:26 ` Anton Blanchard 1 sibling, 0 replies; 14+ messages in thread From: Anton Blanchard @ 2001-03-11 6:26 UTC (permalink / raw) To: Jonathan Lahr; +Cc: linux-kernel Hi, > Thanks for looking into postgresql/pgbench related locking. Yes, > apparently postgresql uses a synchronization scheme that uses select() > to effect delays for backing off while attempting to acquire a lock. > However, it seems to me that runqueue lock contention was not entirely due > to postgresql code, since it was largely alleviated by the multiqueue > scheduler patch. I'm not saying that the multiqueue scheduler patch isn't needed, just that this test case is caused by a bug in postgres. We shouldn't run around fixing symptoms - dropping the contention in the runqueue lock might not change the overall performance of the benchmark, on the other hand fixing the spinlocks in postgres probably will. On the other hand, if postgres still pounds on the runqueue lock after the bug has been fixed then we need to look at the multiqueue patch. Cheers, Anton ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <98454d$19p9h$1@fido.engr.sgi.com>]
* Re: kernel lock contention and scalability [not found] <98454d$19p9h$1@fido.engr.sgi.com> @ 2001-03-07 2:55 ` Rajagopal Ananthanarayanan 2001-03-07 5:48 ` Jeff Dike 0 siblings, 1 reply; 14+ messages in thread From: Rajagopal Ananthanarayanan @ 2001-03-07 2:55 UTC (permalink / raw) To: Jeff Dike, linux-kernel; +Cc: sfoehner Jeff Dike wrote: [ ... ] > > > Another synchronization method popular with database peeps is "post/ > > wait" for which SGI have a patch available for Linux. I understand > > that this is relatively "light weight" and might be a better choice > > for PG. > > URL? > > Jeff Here it is: http://oss.sgi.com/projects/postwait/ Check out the download section for a 2.4.0 patch. cheers, ananth. -------------------------------------------------------------------------- Rajagopal Ananthanarayanan ("ananth") Member Technical Staff, SGI. -------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: kernel lock contention and scalability 2001-03-07 2:55 ` Rajagopal Ananthanarayanan @ 2001-03-07 5:48 ` Jeff Dike 0 siblings, 0 replies; 14+ messages in thread From: Jeff Dike @ 2001-03-07 5:48 UTC (permalink / raw) To: Rajagopal Ananthanarayanan; +Cc: linux-kernel

ananth@sgi.com said:
> Here it is:
> http://oss.sgi.com/projects/postwait/
> Check out the download section for a 2.4.0 patch.

After having thought about this a bit more, I don't see why pw_post and pw_wait can't be implemented in userspace as:

	int pw_post(uid_t uid)
	{
		return(kill(uid, SIGHUP)); /* Or signal of the waiter's choice */
	}

	int pw_wait(struct timespec *t)
	{
		return(nanosleep(t, t));
	}

In the case of UML, there would be a uid field in its lock structure and the spin code would look like:

	lock->uid = getpid();
	pw_wait(NULL);

and the lock release code would be:

	pw_post(lock->uid);

Obviously, sending signals to processes from the outside could massively confuse matters, but I don't see that being a big problem, since I think you can do that now, and no one is complaining about it.

Is there anything that I'm missing?

				Jeff

^ permalink raw reply [flat|nested] 14+ messages in thread
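A literal reading of the sketch above seems to run into a few practical snags: the default action for SIGHUP terminates the waiter unless a handler is installed, nanosleep() with a NULL request fails on Linux rather than sleeping indefinitely, and a signal that arrives just before the sleep starts is lost. One possible variant, offered only as an illustration and not as what the SGI post/wait patch does, keeps the signal blocked and collects it with sigtimedwait(), so a post that arrives early stays pending. The pw_init() helper and the choice of SIGUSR1 are assumptions of this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define PW_SIGNAL SIGUSR1   /* arbitrary choice for this sketch */

/* Build the one-signal set used by both pw_init() and pw_wait(). */
static void pw_sigset(sigset_t *set)
{
    sigemptyset(set);
    sigaddset(set, PW_SIGNAL);
}

/* Block PW_SIGNAL in the caller so that a post arriving before the wait
 * stays pending instead of triggering the default action. */
int pw_init(void)
{
    sigset_t set;

    pw_sigset(&set);
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Wake the process with the given pid. */
int pw_post(pid_t pid)
{
    assert(pid > 0);
    return kill(pid, PW_SIGNAL);
}

/* Wait for a post. t == NULL waits indefinitely, otherwise *t is a relative
 * timeout. Returns 0 if a post was consumed, -1 on timeout or error. */
int pw_wait(const struct timespec *t)
{
    sigset_t set;
    int sig;

    pw_sigset(&set);
    sig = t ? sigtimedwait(&set, NULL, t) : sigwaitinfo(&set, NULL);
    return sig == PW_SIGNAL ? 0 : -1;
}
```

In the UML usage shown above, the waiter would call pw_init() once, store getpid() in the lock, and call pw_wait(NULL); the releaser calls pw_post() with the stored pid. Standard signals do not queue, so several posts sent before a single wait collapse into one wakeup, and a multithreaded waiter would need the signal blocked in every thread.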