Date: Fri, 2 Feb 2007 23:21:10 +0100
From: Ingo Molnar
To: Linus Torvalds
Cc: Zach Brown, linux-kernel@vger.kernel.org, linux-aio@kvack.org,
	Suparna Bhattacharya, Benjamin LaHaise
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
Message-ID: <20070202222110.GA1212@elte.hu>
References: <20070201083611.GC18233@elte.hu> <20070202104900.GA13941@elte.hu>

* Linus Torvalds wrote:

> On Fri, 2 Feb 2007, Ingo Molnar wrote:
> >
> > My only suggestion was to have a couple of transparent kernel threads
> > (not fibrils) attached to a user context that does asynchronous
> > syscalls! Those kernel threads would be 'switched in' if the current
> > user-space thread blocks - so instead of having to 'create' any of them
> > - the fast path would be to /switch/ them to under the current
> > user-space, so that user-space processing can continue under that other
> > thread!
>
> But in that case, you really do end up with "fibrils" anyway.
>
> Because those fibrils are what would be the state for the blocked
> system calls when they aren't scheduled.
>
> We may have a few hundred thousand system calls a second (maybe that's
> not actually reasonable, but it should be what we *aim* for), and 99%
> of them will hopefully hit the cache and never need any separate IO,
> but even if it's just 1%, we're talking about thousands of threads.
>
> I do _not_ think that it's reasonable to have thousands of threads
> state around just "in case". Especially if all those threadlets are
> then involved in signals etc - something that they are totally
> uninterested in.
>
> I think it's a lot more reasonable to have just the kernel stack page
> for "this was where I was when I blocked". IOW, a fibril-like thing.

ok, i think i noticed another misunderstanding. The kernel thread based
scheme i'm suggesting would /not/ 'switch' to another kernel thread in
the cached case, by default. It would just execute in the original
context (as if it were a synchronous syscall), and the switch to a
kernel thread from the pool would only occur /if/ the context is about
to block. (this 'switch' thing would be done by the scheduler)

User-space gets back an -EAIO error code immediately and transparently
- but already running under the new kernel thread. i.e. in the fully
cached case there would be no scheduling at all - in fact no thread
pool is needed at all.
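( just to make the slowpath concrete, here is a rough sketch of what
  the scheduler-side hand-off could look like. None of these helpers
  or task_struct fields exist - the names are made up for this mail,
  it's only meant to illustrate the flow: )

	/*
	 * purely illustrative sketch - pick_worker_thread(),
	 * move_user_context(), set_async_retval() and the ->async_*
	 * fields do not exist, they are invented for this mail.
	 *
	 * Something the scheduler could call right before a task that
	 * is executing an async syscall goes to sleep:
	 */
	static void async_hand_off(struct task_struct *tsk)
	{
		struct task_struct *worker;

		/* plain synchronous syscall, or already handed off: */
		if (!tsk->async_syscall || tsk->async_handed_off)
			return;

		worker = pick_worker_thread();	/* e.g. from a per-CPU pool */
		if (!worker)
			return;			/* pool empty: stay synchronous */

		/*
		 * the 'flip': the worker takes over the user-space side of
		 * the blocking task (->mm, ->fs, thread_info, ptregs) and
		 * returns to user-space in its place, with -EAIO as the
		 * syscall return value. The original task keeps its kernel
		 * stack and finishes the syscall whenever the block ends.
		 */
		move_user_context(worker, tsk);
		set_async_retval(worker, -EAIO);
		tsk->async_handed_off = 1;
		wake_up_process(worker);
	}

( in the fully cached case this function is never even reached - the
  syscall returns synchronously and nothing is switched. )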
regarding cost: the biggest memory resource cost of a kernel thread
(assuming it has no real user-space context) /is/ its kernel stack
page, which is 4K or 8K. The task struct takes ~1.5K. Once we have a
ready kernel thread around, it's quite cheap to 'flip' it to under any
arbitrary user-space context: change its thread_info->task pointer to
the user-space context's task struct, copy the mm pointer and the fs
pointer to the "worker thread", switch the thread_info, update ptregs -
done. Hm?

Note: such a 'flip' would only occur when the original context blocks,
/not/ on every async syscall.

regarding CPU resource costs, i don't think there should be significant
signal overhead, because the original task is still only one instance,
and the kernel thread that is now running with the blocked kernel stack
is not part of the signal set. (Although it might make sense to make
such async syscalls interruptible, just like any syscall.)

The 'pool' of kernel threads doesn't even have to be per-task, it can
be a natural per-CPU thing - and its size will grow/shrink [with a low
update frequency] depending on how much AIO parallelism there is in the
workload. (But it can also be strictly per-user-context - to make sure
that a proper ->mm, ->fs, etc. is set up and that when the async system
calls execute they have all the right context info.)

and note the immediate scheduling benefits: if an app (say like
OpenOffice) is single-threaded but has certain common ops coded as
async syscalls, then if any of those syscalls blocks it could utilize
/more than one/ CPU. I.e. we could 'spread' a single-threaded app's
processing to multiple cores/hardware-threads /without/ having to
multi-thread the app in an intrusive way. I.e. this would be a
fine-grained threading of syscalls, executed as coroutines in essence.
With fibrils all sorts of scheduling limitations occur and no such
parallelism is possible.

in fact an app could also /trigger/ the execution of a syscall in a
different context - to create parallelism artificially - without any
blocking event. So we could do:

	cookie1 = sys_async(sys_read, params);
	cookie2 = sys_async(sys_write, params);

	[ ... calculation loop ... ]

	wait_on_async_syscall(cookie1);
	wait_on_async_syscall(cookie2);

or something like that. Without user-space having to create threads
itself, etc. So basically, we'd make kernel threads more useful, and
we'd make threading safer - by only letting syscalls thread.

> What I like about fibrils is that they should be able to handle the
> cached case well: the case where no "real" scheduling (just the fibril
> stack switches) takes place.

the cached case (when a system call would not block at all) would not
necessitate any switch to another kernel thread at all - the task just
executes its system call as if it were synchronous! that's the nice
thing: we can do this switch-to-another-kernel-thread magic right in
the scheduler when we block - and the switched-to thread will magically
return to user-space (with a -EAIO return code) as if nothing happened
(while the original task blocks). I.e. under the scheme i'm suggesting
we have /zero/ setup cost in the cached case. The optimistic case just
falls through and switches to nothing else. Any switching cost only
occurs in the slowpath - and even that cost is very low.

once a kernel thread that ran off with the original stack finishes the
async syscall and wants to return the return code, this can be gathered
via a special return-code ringbuffer that notifies user-space of
finished syscalls. (A magic cookie is associated to every async
syscall.)
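( the user-space side could then look something like this - again just
  a mock-up, the prototypes, the ring layout and the helper names are
  all invented for this mail: )

	/* completion record as the kernel could publish it in the ring: */
	struct async_completion {
		unsigned long	cookie;		/* as returned by sys_async()       */
		long		retval;		/* return code of the finished call */
	};

	/* ring shared with the kernel; 'head' is advanced by the kernel
	   as async syscalls finish, 'tail' is consumed by user-space: */
	struct async_ring {
		unsigned int		head;
		unsigned int		tail;
		struct async_completion	slots[RING_SIZE];
	};

	void example(struct async_ring *ring, int in_fd, int out_fd,
		     char *buf, size_t len)
	{
		long cookie1, cookie2;

		cookie1 = sys_async(sys_read, in_fd, buf, len);
		cookie2 = sys_async(sys_write, out_fd, buf, len);

		/* [ ... calculation loop ... ] */

		/* reap whatever has completed from the ring, instead of
		   (or in addition to) blocking in wait_on_async_syscall(): */
		while (ring->tail != ring->head) {
			struct async_completion *c;

			c = &ring->slots[ring->tail++ % RING_SIZE];
			if (c->cookie == (unsigned long)cookie1 ||
			    c->cookie == (unsigned long)cookie2)
				handle_completion(c->cookie, c->retval);
		}
	}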
> So the state machine argument is totally bogus - it results in a
> programming model that simply doesn't match the *normal* setup. You
> want the kernel programming model to appear "linear" even when it
> isn't, because it's too damn hard to think nonlinearly.
>
> Yes, we could do pathname lookup with that kind of insane setup too.
> But it would be HORRID!

yeah, but i guess not nearly as horrid as writing a new OS from scratch
;-) seriously, i very much think and agree that programming state
machines is hard and not desired in most of the kernel. But it can be
done, and sometimes (definitely not in the common case) it's /cleaner/
than functional programming. I've programmed an HTTP and an FTP
in-kernel server via a state machine and it worked better than i
initially expected. It needs different thinking but there /are/ people
around with that kind of thinking, so we just cannot exclude the
possibility. [ It's just that such people usually dedicate their brain
to mental fantasies^H^H^Hexercises called 'Higher Mathematics' :-) ]

> [...] The fact that networking (and TCP in particular) has state
> machines is because it is a packetized environment.

rough ballpark figures: for things like webserving or fileserving (or
mailserving), networking sockets are the reason for context-blocking
events in 90% of the cases (mostly due to networking latency). 9% of
the blocking happens due to plain block IO, and 1% happens due to VFS
metadata (inode, directory, etc.) blocking.

( in Tux i had to handle /all/ of these sources of blocking because
  even 1% kills your performance if you do a hundred thousand requests
  per second - but in terms of design weight, networking is pretty damn
  important. )

and interestingly, modern IO frameworks tend to gravitate towards a
packetized environment as well. I.e. i don't think state machines are
/that/ unimportant.

	Ingo