From: Linus Torvalds <torvalds@linux-foundation.org>
To: Zach Brown <zach.brown@oracle.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-aio@kvack.org, Suparna Bhattacharya <suparna@in.ibm.com>,
	Benjamin LaHaise <bcrl@kvack.org>, Ingo Molnar <mingo@elte.hu>
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
Date: Fri, 9 Feb 2007 14:33:01 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0702091419470.8424@woody.linux-foundation.org>
In-Reply-To: <patchbomb.1170193181@tetsuo.zabbo.net>


Ok, here's another entry in this discussion.

This is a *really* small patch. Yes, it adds 174 lines, and yes it's 
actually x86 (32-bit) only, but about half of it is totally generic, and 
*all* of it is almost ludicrously simple.

There's no new assembly language. The one-liner addition to 
"syscall_table.S" is just adding the system call entry stub. It's all in 
C, and none of it is even hard to understand.

It's architecture-specific, because different architectures do the whole 
"fork()" entrypath differently, and this is basically a "delayed fork()", 
not really an AIO thing at all.

So what this does, very simply, is:

 - on system call entry just save away the pt_regs pointer needed to do a 
   fork (on some architectures, this means that you need to use a longer 
   system call entry that saves off all registers - on x86 even that isn't 
   an issue)

 - save that away as a magic cookie in the task structure

 - do the system call

 - IF the system call blocks, we call the architecture-specific 
   "schedule_async()" function before we even get any scheduler locks, and 
   it can just do a fork() at that time, and let the *child* return to the 
   original user space. The process that already started doing the system 
   call will just continue to do the system call.

 - when the system call is done, we check to see if it was done totally 
   synchronously or not. If we ended up doing the clone(), we are now the 
   async worker thread (the child already went back to user space), so we 
   just exit.
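
The steps above can be sketched as a small user-space model. This is only 
an illustration of the control flow, not kernel code: every toy_* name is 
made up for the example, nothing actually forks or schedules, and the 
"blocks" and "clone_fails" flags stand in for what the scheduler and 
do_fork() would decide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model of the control flow only. Every toy_* name is
 * hypothetical; nothing here forks or schedules for real. */
#define TOY_FAILED_CLONE ((const void *)1)  /* mirrors the patch's "do not retry" marker */

struct toy_task {
	const void *async_cookie;  /* non-NULL while the call can still be split off */
	int threads_created;       /* counts simulated clone() calls */
};

enum toy_outcome {
	TOY_SYNC_RETURN,   /* never blocked (or clone failed): same thread returns */
	TOY_WORKER_EXIT    /* blocked: the child went back to user space, we exit */
};

/* Called only at the point where the system call would sleep. */
static void toy_schedule_async(struct toy_task *t, bool clone_fails)
{
	if (t->async_cookie == NULL || t->async_cookie == TOY_FAILED_CLONE)
		return;
	if (clone_fails) {
		t->async_cookie = TOY_FAILED_CLONE;  /* fall back to synchronous */
		return;
	}
	t->async_cookie = NULL;   /* the child now owns the return to user space */
	t->threads_created++;
}

static enum toy_outcome toy_do_async(struct toy_task *t, const void *regs,
				     bool blocks, bool clone_fails)
{
	assert(t != NULL && regs != NULL);
	t->async_cookie = regs;            /* save the cookie on entry */
	if (blocks)
		toy_schedule_async(t, clone_fails);
	/* ... the system call itself would finish here ... */
	if (t->async_cookie == NULL)
		return TOY_WORKER_EXIT;    /* the patch calls do_exit(0) here */
	t->async_cookie = NULL;
	return TOY_SYNC_RETURN;
}
```

Calling toy_do_async() with blocks set to false models the cached case: 
the cookie is stored and cleared again and no thread is created. With 
blocks set to true, the cookie is consumed at the blocking point and the 
caller ends up on the worker-exit path.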

Now, I agree that this is a bit ugly in some of the details: in 
particular, it means that if the system call blocks, we will literally 
return as a *different* thread to user space. If you care, you shouldn't 
use this interface, or come up with some way to make it work nicely (doing 
it this way meant that I could just re-use all the clone/fork code as-is).

Also, it actually does take the hit of creating a full new thread. We 
could optimize that a bit. But at least the cached case has basically 
*zero* overhead: we literally end up doing just a few extra CPU 
instructions to copy the arguments around etc, but no locked cycles, no 
memory allocations, no *nothing*.

So I actually like this, because it means that while we slow down real IO, 
we don't slow down the cached cases at all.

Final warning: I didn't do any cancel/wait crud. It doesn't even return 
the thread ID as it is now. And I only hooked up "stat64()" as an example. 
So this really is just a total toy. But it's kind of funny how simple it 
was, once I started thinking about how I could do this in some clever way.
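
For anyone who wants to poke at the toy from user space, here is a rough 
sketch of how a request might be packed. It follows the register layout 
that sys_async() in the patch reads (call index, flags, status pointer, 
argument array), but the struct and the toy_* names are assumptions made 
up for this example and are not part of the patch. The block only fills 
in the request and does not issue the system call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space view of the toy interface. The layout follows
 * sys_async() in the patch: call index, flags, status pointer, args array. */
#define TOY_ASYNC_STAT64 0UL   /* index of sys_stat64 in call_descriptor[] */
#define TOY_ASYNC_MAX_ARGS 6

struct toy_async_req {
	unsigned long call;                      /* goes in %ebx */
	unsigned long flags;                     /* goes in %ecx, not used by do_async() */
	int *status;                             /* goes in %edx, receives the result */
	unsigned long args[TOY_ASYNC_MAX_ARGS];  /* array pointed to by %esi */
};

/* Fill in a request for an asynchronous stat64(path, statbuf). */
static void toy_prepare_stat64(struct toy_async_req *req, const char *path,
			       void *statbuf, int *status)
{
	assert(req != NULL && status != NULL);
	memset(req, 0, sizeof(*req));
	req->call = TOY_ASYNC_STAT64;
	req->status = status;
	req->args[0] = (unsigned long)path;
	req->args[1] = (unsigned long)statbuf;
}

/*
 * On an i386 kernel carrying this patch, the request would then be
 * submitted roughly as:
 *
 *	syscall(320, req.call, req.flags, req.status, req.args);
 *
 * The number 320 is only meaningful with this patch applied, so do not
 * run that call on any other kernel.
 */
```

Since the toy has no wait or cancel support, the caller has no way yet 
to learn when the status word becomes valid after a call that blocked.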

I even added comments, so a lot of the few new added lines aren't even 
code!

		Linus

---

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index c641056..0909724 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -698,6 +698,71 @@ struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct tas
 	return prev_p;
 }
 
+/*
+ * This gets called when an async event causes a schedule.
+ * We should try to
+ *
+ *  (a) create a new thread
+ *  (b) within that new thread, return to the original
+ *      user mode call-site.
+ *  (c) clear the async event flag, since it is now no
+ *      longer relevant.
+ *
+ * If anything fails (a resource issue etc), we just do
+ * the async system call as a normal synchronous event!
+ */
+#define CLONE_ALL (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_PARENT | CLONE_THREAD)
+#define FAILED_CLONE ((struct pt_regs *)1)
+void schedule_async(void)
+{
+	struct pt_regs *regs = current->async_cookie;
+	int retval;
+
+	if (regs == FAILED_CLONE)
+		return;
+
+	current->async_cookie = NULL;
+	/*
+	 * This is magic. The child will return through "ret_from_fork()" to
+	 * where the original thread started it all. It's not the same thread
+	 * any more, and we don't much care. The "real thread" has now become
+	 * the async worker thread, and will exit once the async work is done.
+	 */
+	retval = do_fork(CLONE_ALL, regs->esp, regs, 0, NULL, NULL);
+
+	/*
+	 * If it failed, we could just restore the async_cookie and try again
+	 * on the next scheduling event. 
+	 *
+	 * But it's just better to set it to some magic value to indicate
+	 * "do not try this again". If it failed once, we shouldn't waste 
+	 * time trying it over and over again.
+	 *
+	 * Any non-NULL value will tell "do_async()" at the end that it was
+	 * done "synchronously".
+	 */
+	if (retval < 0)
+		current->async_cookie = FAILED_CLONE;
+}
+
+asmlinkage int sys_async(struct pt_regs regs)
+{
+	void *async_cookie;
+	unsigned long syscall, flags;
+	int __user *status;
+	unsigned long __user *user_args;
+
+	/* Pick out the do_async() arguments.. */
+	async_cookie = &regs;
+	syscall = regs.ebx;
+	flags = regs.ecx;
+	status = (int __user *) regs.edx;
+	user_args = (unsigned long __user *) regs.esi;
+
+	/* ..and call the generic helper routine */
+	return do_async(async_cookie, syscall, flags, status, user_args);
+}
+
 asmlinkage int sys_fork(struct pt_regs regs)
 {
 	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 2697e92..647193c 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_async			/* 320 */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4463735..e14b11b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -844,6 +844,13 @@ struct task_struct {
 
 	struct mm_struct *mm, *active_mm;
 
+	/*
+	 * The scheduler uses this to determine if the current call is a
+	 * standalone thread or just an async system call that hasn't
+	 * had its real thread created yet.
+	 */
+	void *async_cookie;
+
 /* task state */
 	struct linux_binfmt *binfmt;
 	long exit_state;
@@ -1649,6 +1656,12 @@ extern int sched_create_sysfs_power_savings_entries(struct sysdev_class *cls);
 
 extern void normalize_rt_tasks(void);
 
+/* Async system call support */
+extern long do_async(void *async_cookie, unsigned int syscall, unsigned long flags,
+	 int __user *status, unsigned long __user *user_args);
+extern void schedule_async(void);
+                                        
+
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 14f4d45..13bda9f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -8,7 +8,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
 	    signal.o sys.o kmod.o workqueue.o pid.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
-	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o
+	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o async.o
 
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
diff --git a/kernel/async.c b/kernel/async.c
new file mode 100644
index 0000000..29b14f3
--- /dev/null
+++ b/kernel/async.c
@@ -0,0 +1,71 @@
+/*
+ * kernel/async.c
+ *
+ * Create a light-weight kernel-level thread.
+ */
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+
+#include <asm/uaccess.h>
+
+/* Fake "generic" system call pointer type */
+typedef asmlinkage long (*master_syscall_t)(unsigned long arg, ...);
+
+#define ASYNC_SYSCALL(syscall, param) \
+	{ (master_syscall_t) (syscall), (param) }
+
+static struct async_call {
+	master_syscall_t fn;
+	int args;
+} call_descriptor[] = {
+	ASYNC_SYSCALL(sys_stat64, 2),
+};
+
+long do_async(
+	void *async_cookie,
+	unsigned int syscall,
+	unsigned long flags,
+	int __user *status,
+	unsigned long __user *user_args)
+{
+	int ret, size;
+	struct async_call *desc;
+	unsigned long args[6];
+
+	if (syscall >= ARRAY_SIZE(call_descriptor))
+		return -EINVAL;
+
+	desc = call_descriptor + syscall;
+	if (!desc->fn)
+		return -EINVAL;
+
+	if (desc->args > ARRAY_SIZE(args))
+		return -EINVAL;
+
+	size = sizeof(unsigned long)*desc->args;
+	if (copy_from_user(args, user_args, size))
+		return -EFAULT;
+
+	/* We don't nest async calls! */
+	if (current->async_cookie)
+		return -EINVAL;
+	current->async_cookie = async_cookie;
+
+	ret = desc->fn(args[0], args[1], args[2], args[3], args[4], args[5]);
+	put_user(ret, status);
+
+	/*
+	 * Did we end up doing part of the work in a separate thread?
+	 *
+	 * If so, the async thread-creation already returned in the
+	 * original parent, and cleared out the async_cookie. We're
+	 * now just in the worker thread, and should just exit. Our
+	 * job here is done.
+	 */
+	if (!current->async_cookie)
+		do_exit(0);
+
+	/* We did it synchronously - return 0 */
+	current->async_cookie = 0;
+	return 0;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index d57118d..6f38c46 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1413,6 +1413,18 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+/*
+ * Architectures that don't have async support get this
+ * dummy async thread scheduler callback.
+ *
+ * They had better not set task->async_cookie in the
+ * first place, so this should never get called!
+ */
+void __attribute__ ((weak)) schedule_async(void)
+{
+	BUG();
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
diff --git a/kernel/sched.c b/kernel/sched.c
index cca93cc..cc73dee 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3436,6 +3436,17 @@ asmlinkage void __sched schedule(void)
 	}
 	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
 
+	/* Are we running within an async system call? */
+	if (unlikely(current->async_cookie)) {
+		/*
+		 * If so, we now try to start a new thread for it, but
+		 * not for a preemption event or a scheduler timeout
+		 * triggering!
+		 */
+		if (!(preempt_count() & PREEMPT_ACTIVE) && current->state != TASK_RUNNING)
+			schedule_async();
+	}
+
 need_resched:
 	preempt_disable();
 	prev = current;
