Date: Mon, 26 Jul 2004 14:11:43 -0700
From: Andrew Morton
To: Ingo Molnar
Cc: rlrevell@joe-job.com, wli@holomorphy.com, lenar@vision.ee,
	linux-kernel@vger.kernel.org
Subject: Re: [patch] voluntary-preempt-2.6.8-rc2-J3
Message-Id: <20040726141143.6d8352b6.akpm@osdl.org>
In-Reply-To: <20040726203634.GA26096@elte.hu>

Ingo Molnar wrote:
>
> ...
> guaranteeing latencies on SMP is hard, because the latency of a CPU
> might depend on the latency of a task on another CPU - which CPU isn't
> notified of the rescheduling request.
>
> one alternative technique to yours would be to notify _all_ CPUs that a
> high-prio RT task is ready to run (via a broadcast need-resched). That
> way the UP latency-break techniques map to SMP in a 1:1 way.

I did exactly that in the 2.4 low-latency patch.  It works, but it doesn't
seem very nice.

> non-RT tasks don't get this benefit, which is a difference from the UP
> situation, but i don't think it would be appropriate to use the UP
> behavior, due to the overhead of broadcasting.
>
> a combination of the two techniques could be used too: a global 'break
> locks from now on' flag which gets set if a (RT?) task wants to
> reschedule. Normally this flag would be zero and the cacheline would be
> clean and shared between all CPUs, causing no overhead. Once a task
> wants to reschedule it would be increased by 1 until that task has been
> scheduled. Something like that.

The obscure livelock opportunities worry me.

> This would remove the ugliness factor too; your sample code would
> look like:
>
> > +			if (need_lock_break()) {
> > +				spin_unlock(&mm->page_table_lock);
> > +				cpu_relax();
> > +				spin_lock(&mm->page_table_lock);
> > +			}
> > +
>
> or, as a shortcut:
>
> > +			cond_lock_break(&mm->page_table_lock);

The usual problem is, of course, long-running loops under locks.  These
loops usually have some sort of counter (or you add one).  The patch below
changes cond_resched_lock() so that it takes that counter, plus a tuned
upper-loop-count limit.  It also does the right thing on UP.

It still needs enhancements for the situations where we already have a
counter in the loop.

I dunno if it's sufficient, but I do think that we need to settle on some
infrastructure of this sort before proceeding too far into this thing.

> there are two problems with the unconditional lock-break:
>
> - i'm not sure cpu_relax() guarantees that another CPU can grab this
>   lock.  It all depends on whether the lock cacheline can bounce over
>   to the other CPU faster than cpu_relax() finishes and this CPU
>   re-acquires the lock.

Precisely.
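To make the intended calling convention concrete, here is a minimal,
purely illustrative caller of the counter/limit form (the lock, the table
size and the per-slot helper below are invented for this sketch and do not
correspond to any of the call sites converted by the patch):

#include <linux/spinlock.h>
#include <linux/sched.h>

/*
 * Illustration only: table_lock, NR_SLOTS and examine_slot() are made up;
 * the point is the counter/limit cond_resched_lock() from the patch below.
 */
static spinlock_t table_lock = SPIN_LOCK_UNLOCKED;
#define NR_SLOTS	100000

static void examine_slot(int i)
{
	/* stand-in for the real per-item work done under the lock */
}

static void scan_table(void)
{
	int i;
	int resched_counter = 0;	/* loop counter handed to the macro */

	spin_lock(&table_lock);
	for (i = 0; i < NR_SLOTS; i++) {
		examine_slot(i);
		/*
		 * On SMP this drops and retakes table_lock every 64th pass
		 * so a CPU spinning on the lock can get in; on UP it ignores
		 * the counter and reschedules whenever need_resched() is set.
		 */
		cond_resched_lock(&table_lock, resched_counter, 64);
	}
	spin_unlock(&table_lock);
}

The limit is the tuning knob: a larger value means less lock traffic, a
smaller value means a shorter worst-case hold time.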
diff -puN include/linux/sched.h~cond_resched_lock-smp-fix include/linux/sched.h
--- 25/include/linux/sched.h~cond_resched_lock-smp-fix	2004-07-13 16:58:42.445155480 -0700
+++ 25-akpm/include/linux/sched.h	2004-07-13 17:02:56.194579704 -0700
@@ -1038,22 +1038,32 @@ static inline void cond_resched(void)
 }
 
 /*
- * cond_resched_lock() - if a reschedule is pending, drop the given lock,
- * call schedule, and on return reacquire the lock.
- *
  * This works OK both with and without CONFIG_PREEMPT. We do strange low-level
  * operations here to prevent schedule() from being called twice (once via
  * spin_unlock(), once by hand).
  */
-static inline void cond_resched_lock(spinlock_t * lock)
-{
-	if (need_resched()) {
-		_raw_spin_unlock(lock);
-		preempt_enable_no_resched();
-		__cond_resched();
-		spin_lock(lock);
-	}
-}
+
+#ifdef CONFIG_SMP
+#define cond_resched_lock(lock, counter, limit)		\
+	do {						\
+		if (++(counter) >= limit) {		\
+			(counter) = 0;			\
+			spin_unlock(lock);		\
+			cpu_relax();			\
+			spin_lock(lock);		\
+		}					\
+	} while (0)
+#else
+#define cond_resched_lock(lock, counter, limit)		\
+	do {						\
+		if (need_resched()) {			\
+			_raw_spin_unlock(lock);		\
+			preempt_enable_no_resched();	\
+			__cond_resched();		\
+			spin_lock(lock);		\
+		}					\
+	} while (0)
+#endif
 
 /* Reevaluate whether the task has signals pending delivery.
    This is required every time the blocked sigset_t changes.
diff -puN fs/reiserfs/journal.c~cond_resched_lock-smp-fix fs/reiserfs/journal.c
--- 25/fs/reiserfs/journal.c~cond_resched_lock-smp-fix	2004-07-13 16:58:42.462152896 -0700
+++ 25-akpm/fs/reiserfs/journal.c	2004-07-13 17:03:43.131444216 -0700
@@ -796,6 +796,7 @@ static int write_ordered_buffers(spinloc
 	struct buffer_chunk chunk;
 	struct list_head tmp;
 	INIT_LIST_HEAD(&tmp);
+	int resched_counter = 0;
 
 	chunk.nr = 0;
 	spin_lock(lock);
@@ -827,7 +828,7 @@ static int write_ordered_buffers(spinloc
 		}
 loop_next:
 		put_bh(bh);
-		cond_resched_lock(lock);
+		cond_resched_lock(lock, resched_counter, 64);
 	}
 	if (chunk.nr) {
 		spin_unlock(lock);
@@ -848,7 +849,7 @@ loop_next:
 		if (!buffer_uptodate(bh))
 			ret = -EIO;
 		put_bh(bh);
-		cond_resched_lock(lock);
+		cond_resched_lock(lock, resched_counter, 64);
 	}
 	spin_unlock(lock);
 	return ret;
diff -puN mm/shmem.c~cond_resched_lock-smp-fix mm/shmem.c
--- 25/mm/shmem.c~cond_resched_lock-smp-fix	2004-07-13 16:58:42.487149096 -0700
+++ 25-akpm/mm/shmem.c	2004-07-13 17:04:06.310920400 -0700
@@ -424,6 +424,7 @@ static void shmem_truncate(struct inode
 	swp_entry_t *ptr;
 	int offset;
 	int freed;
+	int resched_counter = 0;
 
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
 	idx = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
@@ -496,7 +497,7 @@ static void shmem_truncate(struct inode
 				shmem_free_block(inode);
 			}
 			empty = subdir;
-			cond_resched_lock(&info->lock);
+			cond_resched_lock(&info->lock, resched_counter, 32);
 			dir = shmem_dir_map(subdir);
 		}
 		subdir = *dir;
diff -puN mm/memory.c~cond_resched_lock-smp-fix mm/memory.c
--- 25/mm/memory.c~cond_resched_lock-smp-fix	2004-07-13 16:58:42.504146512 -0700
+++ 25-akpm/mm/memory.c	2004-07-13 17:05:43.943078056 -0700
@@ -224,6 +224,7 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long address = vma->vm_start;
 	unsigned long end = vma->vm_end;
 	unsigned long cow;
+	int resched_counter = 0;
 
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst, src, vma);
@@ -341,7 +342,8 @@ cont_copy_pte_range_noset:
 		pte_unmap_nested(src_pte-1);
 		pte_unmap(dst_pte-1);
 		spin_unlock(&src->page_table_lock);
-		cond_resched_lock(&dst->page_table_lock);
+		cond_resched_lock(&dst->page_table_lock,
+				resched_counter, 256);
 cont_copy_pmd_range:
 		src_pmd++;
 		dst_pmd++;
@@ -530,6 +532,7 @@ int unmap_vmas(struct mmu_gather **tlbp,
 	int tlb_start_valid = 0;
 	int ret = 0;
 	int atomic = details && details->atomic;
+	int resched_counter = 0;
 
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
 		unsigned long start;
@@ -567,10 +570,11 @@ int unmap_vmas(struct mmu_gather **tlbp,
 			zap_bytes -= block;
 			if ((long)zap_bytes > 0)
 				continue;
-			if (!atomic && need_resched()) {
+			if (!atomic) {
 				int fullmm = tlb_is_full_mm(*tlbp);
 				tlb_finish_mmu(*tlbp, tlb_start, start);
-				cond_resched_lock(&mm->page_table_lock);
+				cond_resched_lock(&mm->page_table_lock,
+						resched_counter, 1);
 				*tlbp = tlb_gather_mmu(mm, fullmm);
 				tlb_start_valid = 0;
 			}
diff -puN mm/rmap.c~cond_resched_lock-smp-fix mm/rmap.c
--- 25/mm/rmap.c~cond_resched_lock-smp-fix	2004-07-13 16:58:42.521143928 -0700
+++ 25-akpm/mm/rmap.c	2004-07-13 17:06:59.866535936 -0700
@@ -661,6 +661,7 @@ static inline int try_to_unmap_file(stru
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
+	int resched_counter = 0;
 
 	if (!spin_trylock(&mapping->i_mmap_lock))
 		return ret;
@@ -699,7 +700,7 @@ static inline int try_to_unmap_file(stru
 	 */
 	mapcount = page->mapcount;
 	page_map_unlock(page);
-	cond_resched_lock(&mapping->i_mmap_lock);
+	cond_resched_lock(&mapping->i_mmap_lock, resched_counter, 1);
 
 	max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
 	if (max_nl_cursor == 0)
@@ -728,7 +729,7 @@ static inline int try_to_unmap_file(stru
 					(void *) max_nl_cursor;
 			ret = SWAP_AGAIN;
 		}
-		cond_resched_lock(&mapping->i_mmap_lock);
+		cond_resched_lock(&mapping->i_mmap_lock, resched_counter, 8);
 		max_nl_cursor += CLUSTER_SIZE;
 	} while (max_nl_cursor <= max_nl_size);
 
_
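As an appendix, a very rough sketch of the global 'break locks from now on'
flag and the cond_lock_break() shortcut quoted earlier.  Nothing here is
from a real patch: need_lock_break() and cond_lock_break() are the names
used in the quoted mail, but their bodies and the lock_break_wanted counter
are only guesses at what was meant, and this is the construct the livelock
worry above refers to.

#include <linux/spinlock.h>
#include <asm/atomic.h>
#include <asm/processor.h>

/*
 * Sketch of the quoted proposal: a global count of tasks waiting to
 * preempt.  A waker would atomic_inc() it when its task becomes runnable
 * and atomic_dec() it once that task has been scheduled; while it is
 * non-zero, lock holders voluntarily break their locks.  When nobody has
 * asked, the cacheline stays clean and shared, so the common case is one
 * read.
 */
static atomic_t lock_break_wanted = ATOMIC_INIT(0);

static inline int need_lock_break(void)
{
	return atomic_read(&lock_break_wanted) != 0;
}

/* Drop and retake the lock only if somebody has asked for a lock break. */
#define cond_lock_break(lock)				\
	do {						\
		if (need_lock_break()) {		\
			spin_unlock(lock);		\
			cpu_relax();			\
			spin_lock(lock);		\
		}					\
	} while (0)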