* [PATCH RFC] x86_64: per-cpu memory for user-space
@ 2014-09-13 14:35 Konstantin Khlebnikov
  2014-09-13 18:10 ` Dmitry Vyukov
  2014-09-14 14:06 ` Andi Kleen
  0 siblings, 2 replies; 4+ messages in thread
From: Konstantin Khlebnikov @ 2014-09-13 14:35 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Thomas Gleixner, Andi Kleen, Ingo Molnar, Dmitry Vyukov, H. Peter Anvin

This patch implements user-space per-cpu memory in the same manner as in
kernel-space: each cpu has its own %gs base address. On x86_64, %fs is used
for thread-local storage, so %gs is usually free.

A user-space application cannot prevent preemption, but x86 read-modify-write
operations are atomic with respect to interrupts and context switches. Thus
per-cpu counters, ring-buffer cursors, per-cpu locks and other cool things
can be implemented very efficiently.
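
For example, bumping a per-cpu counter takes a single instruction and no LOCK
prefix (illustrative sketch, not part of this patch; counter_off is the
offset of the counter inside the per-cpu area):

static inline void percpu_counter_inc(unsigned long counter_off)
{
        /* a single read-modify-write: it cannot be torn by an interrupt
           or a context switch, so no LOCK prefix is needed */
        asm volatile("incq %%gs:(%0)" : : "r"(counter_off) : "memory", "cc");
}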

After this patch the kernel recalculates the %gs base at each context switch.
This is implemented only via MSR_KERNEL_GS_BASE. Loading the base via a GDT
selector might be faster, but it is much more complicated.

By the way, newer Intel CPUs have even faster instructions for changing
%fs/%gs (FSGSBASE), but they are not yet supported by the kernel.

The additional overhead is near zero: this patch adds one extra multiplication
in __switch_to() (only if %gs is set by user space and its base is above 4GB):

        if (next->gs)
-               wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+               wrmsrl(MSR_KERNEL_GS_BASE, next->gs +
+                               cpu * next->gs_cpu_stride);

A child inherits the setup from its parent at clone() because it gets a copy
of task_struct. Changing %gs via any other interface (selector, ARCH_SET_GS)
disables the per-cpu striding.

Interface:

int arch_prctl(ARCH_GET_GS_PERCPU, unsigned long arg[2]);
int arch_prctl(ARCH_SET_GS_PERCPU, unsigned long arg[2]);

arg[0] - base address for cpu0
arg[1] - stride between consecutive cpus

Error codes:
-EINVAL     - not implemented (or ia32 compat)
-ENOENT     - not configured (only for get)
-EFAULT     - arg isn't addressable
-EPERM      - base above addressable space (only for set)
-EOVERFLOW  - stride too big for this base and the number of possible cpus (only for set)
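
Example of the intended usage from user-space (illustrative sketch, not part
of this patch; the mmap sizing is an assumption, since the area has to cover
all possible cpus and sysconf() only approximates that):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_SET_GS_PERCPU 0x1005       /* from the patched asm/prctl.h */

int setup_percpu(size_t stride)
{
        long ncpus = sysconf(_SC_NPROCESSORS_CONF);
        unsigned long arg[2];
        void *base;

        base = mmap(NULL, stride * ncpus, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
                return -1;
        arg[0] = (unsigned long)base;   /* base address for cpu0 */
        arg[1] = stride;                /* distance to the next cpu's area */
        return syscall(SYS_arch_prctl, ARCH_SET_GS_PERCPU, arg);
}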

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
---
 arch/x86/include/asm/processor.h  |    1 +
 arch/x86/include/uapi/asm/prctl.h |    2 ++
 arch/x86/kernel/process_64.c      |   39 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb71ec7..102c1f9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -484,6 +484,7 @@ struct thread_struct {
 #endif
 #ifdef CONFIG_X86_64
 	unsigned long		fs;
+	unsigned long		gs_cpu_stride;
 #endif
 	unsigned long		gs;
 	/* Save middle states of ptrace breakpoints */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 3ac5032..026bd39 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -5,5 +5,7 @@
 #define ARCH_SET_FS 0x1002
 #define ARCH_GET_FS 0x1003
 #define ARCH_GET_GS 0x1004
+#define ARCH_SET_GS_PERCPU 0x1005
+#define ARCH_GET_GS_PERCPU 0x1006
 
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ca5b02d..5e7af75 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -351,7 +351,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 			prev->gs = 0;
 	}
 	if (next->gs)
-		wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+		wrmsrl(MSR_KERNEL_GS_BASE, next->gs +
+				cpu * next->gs_cpu_stride);
 	prev->gsindex = gsindex;
 
 	switch_fpu_finish(next_p, fpu);
@@ -469,6 +470,7 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 		if (addr >= TASK_SIZE_OF(task))
 			return -EPERM;
 		cpu = get_cpu();
+		task->thread.gs_cpu_stride = 0;
 		/* handle small bases via the GDT because that's faster to
 		   switch. */
 		if (addr <= 0xffffffff) {
@@ -544,6 +546,41 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 		ret = put_user(base, (unsigned long __user *)addr);
 		break;
 	}
+	case ARCH_GET_GS_PERCPU:
+		if (test_tsk_thread_flag(task, TIF_ADDR32))
+			return -EINVAL;
+		if (!task->thread.gs || !task->thread.gs_cpu_stride)
+			return -ENOENT;
+		ret = put_user(task->thread.gs,
+				(unsigned long __user *)addr);
+		if (!ret)
+			ret = put_user(task->thread.gs_cpu_stride,
+					((unsigned long __user *)addr) + 1);
+		break;
+	case ARCH_SET_GS_PERCPU: {
+		unsigned long arg[2];
+
+		if (test_tsk_thread_flag(task, TIF_ADDR32))
+			return -EINVAL;
+		if (copy_from_user(arg, (void __user *)addr, sizeof(arg)))
+			return -EFAULT;
+		if (arg[0] >= TASK_SIZE_MAX)
+			return -EPERM;
+		if (arg[1] > (TASK_SIZE_MAX - arg[0]) / num_possible_cpus())
+			return -EOVERFLOW;
+
+		task->thread.gsindex = 0;
+		task->thread.gs = arg[0];
+		task->thread.gs_cpu_stride = arg[1];
+		if (doit) {
+			cpu = get_cpu();
+			load_gs_index(0);
+			ret = wrmsrl_safe(MSR_KERNEL_GS_BASE,
+					  arg[0] + cpu * arg[1]);
+			put_cpu();
+		}
+		break;
+	}
 
 	default:
 		ret = -EINVAL;



* Re: [PATCH RFC] x86_64: per-cpu memory for user-space
  2014-09-13 14:35 [PATCH RFC] x86_64: per-cpu memory for user-space Konstantin Khlebnikov
@ 2014-09-13 18:10 ` Dmitry Vyukov
  2014-09-14 14:06 ` Andi Kleen
  1 sibling, 0 replies; 4+ messages in thread
From: Dmitry Vyukov @ 2014-09-13 18:10 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: x86, LKML, Thomas Gleixner, Andi Kleen, Ingo Molnar, H. Peter Anvin

On Sat, Sep 13, 2014 at 7:35 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> This patch implements user-space per-cpu memory in the same manner as in
> kernel-space: each cpu has its own %gs base address. On x86_64, %fs is used
> for thread-local storage, so %gs is usually free.
[...]


Nice!

Per-cpu non-lossy stats counters are trivial with this support.

For more complex data structures the trick is to put the cpu number in the
per-cpu region of memory and use a CAS loop to modify the data (the CAS does
not need a LOCK prefix in this case). For example, here is how a lock-free
per-cpu freelist can be implemented (GS[OFF] below stands for the per-cpu
memory at offset OFF from the %gs base):

struct freelist_t
{
        void*   head;
        uint16  cpu;
        uint16  len;
        uint32  aba;
};      /* 16 bytes total, so it fits a single CMPXCHG16B */

bool freelist_push(void *p)
{
        freelist_t *fl, old, new;

        for (;;) {
                fl = (freelist_t*)&GS[OFF];
                old = atomic_load(fl);
                if (old.len == UINT16_MAX)
                        return false;
                *(void**)p = old.head;
                new = old;
                new.aba++;
                new.len++;
                new.head = p;
                if (CMPXCHG16B(fl, old, new))
                        return true;
        }
}

void *freelist_pop(void)
{
        freelist_t *fl, old, new;
        void *p;

        for (;;) {
                fl = (freelist_t*)&GS[OFF];
                old = atomic_load(fl);
                if (old.len == 0)
                        return NULL;
                p = old.head;
                new = old;
                new.len--;
                new.head = *(void**)p;
                if (CMPXCHG16B(fl, old, new))
                        return p;
        }
}


* Re: [PATCH RFC] x86_64: per-cpu memory for user-space
  2014-09-13 14:35 [PATCH RFC] x86_64: per-cpu memory for user-space Konstantin Khlebnikov
  2014-09-13 18:10 ` Dmitry Vyukov
@ 2014-09-14 14:06 ` Andi Kleen
  2014-09-14 18:35   ` Dmitry Vyukov
  1 sibling, 1 reply; 4+ messages in thread
From: Andi Kleen @ 2014-09-14 14:06 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dmitry Vyukov,
	H. Peter Anvin

On Sat, Sep 13, 2014 at 06:35:34PM +0400, Konstantin Khlebnikov wrote:
> This patch implements user-space per-cpu memory in the same manner as in
> kernel-space: each cpu has its own %gs base address. On x86_64, %fs is used
> for thread-local storage, so %gs is usually free.
> 
> A user-space application cannot prevent preemption, but x86 read-modify-write
> operations are atomic with respect to interrupts and context switches. Thus
> per-cpu counters, ring-buffer cursors, per-cpu locks and other cool things
> can be implemented very efficiently.

Do you have some concrete examples for the more complex operations?

It seems to me that the restriction to a single instruction will be very
limiting for anything more complicated than a counter. Also, it's not even
clear how someone would implement retry (short of something like kuchannel).

Of course it wouldn't be a problem with TSX transactions, but it's not
clear they need it.

The other problem with the approach is, how would cpu hotplug
be handled?

> By the way, newer Intel CPUs have even faster instructions for changing
> %fs/%gs (FSGSBASE), but they are not yet supported by the kernel.

Patch kits are pending.

-Andi


* Re: [PATCH RFC] x86_64: per-cpu memory for user-space
  2014-09-14 14:06 ` Andi Kleen
@ 2014-09-14 18:35   ` Dmitry Vyukov
  0 siblings, 0 replies; 4+ messages in thread
From: Dmitry Vyukov @ 2014-09-14 18:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Konstantin Khlebnikov, x86, LKML, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Paul Turner, Andrew Hunter

On Sun, Sep 14, 2014 at 7:06 AM, Andi Kleen <ak@linux.intel.com> wrote:
> On Sat, Sep 13, 2014 at 06:35:34PM +0400, Konstantin Khlebnikov wrote:
>> This patch implements user-space per-cpu memory in the same manner as in
>> kernel-space: each cpu has its own %gs base address. On x86_64, %fs is used
>> for thread-local storage, so %gs is usually free.
>>
>> A user-space application cannot prevent preemption, but x86 read-modify-write
>> operations are atomic with respect to interrupts and context switches. Thus
>> per-cpu counters, ring-buffer cursors, per-cpu locks and other cool things
>> can be implemented very efficiently.
>
> Do you have some concrete examples for the more complex operations?


Hi Andi,

I've shown how to implement a per-cpu freelist in a previous email.

Here is how you can do a per-cpu logging system where multiple threads
enqueue messages and a single thread polls all per-cpu queues (also
untested):


struct elem_t
{
        uint32  lap;
        T       data;
};

struct percpu_queue_t
{
        uint64  enqueue_pos_cpu;  /* enqueue position in the high 32 bits,
                                     cpu number in the low 32 bits */
        uint32  dequeue_pos;
        elem_t  buf[N];
};

bool percpu_queue_enqueue(T *data)
{
        percpu_queue_t *q;
        uint32 enqueue_pos;
        elem_t *e;
        uint64 old;

        for (;;) {
                q = (percpu_queue_t*)&GS[OFF];
                old = atomic_load(&q->enqueue_pos_cpu, memory_order_relaxed);
                enqueue_pos = (uint32)(old >> 32);
                if (enqueue_pos - atomic32_load(&q->dequeue_pos,
                                memory_order_acquire) >= N)
                        return false;  // full
                /* bump enqueue_pos in the high half; the cpu number in the
                   low half is compared as part of the CAS, so the CAS fails
                   if the thread has migrated to another cpu */
                if (!CMPXCHG16B(&GS[OFF], old, old + (1ull << 32)))  // w/o LOCK prefix
                        continue;  // raced or migrated, retry
                e = &q->buf[enqueue_pos%N];
                e->data = *data;
                atomic_store(&e->lap, enqueue_pos/N + 1, memory_order_release);
                return true;
        }
}

bool percpu_queue_dequeue(percpu_queue_t *q, T *data)
{
        elem_t *e;
        uint32 dequeue_pos;

        dequeue_pos = q->dequeue_pos;
        e = &q->buf[dequeue_pos%N];
        if (atomic_load(&e->lap, memory_order_acquire) != dequeue_pos/N+1)
                return false;  // empty
        *data = e->data;
        atomic_store(&q->dequeue_pos, dequeue_pos+1, memory_order_release);
        return true;
}


This is a simple version. It can be modified to handle variable-size
messages, batch updates of dequeue_pos to reduce write sharing, or split
both enqueue and dequeue into prepare and commit phases to do zero-copy
communication.

So, generally, if you can reduce the per-cpu state modification to a single
CAS, you pair the data with the cpu number and include it in the CAS. This
allows you to do a CAS loop on per-cpu state w/o a LOCK prefix.
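
Something like this with gcc inline asm (an untested sketch of how the
no-LOCK 16-byte CAS could be spelled; the target must be 16-byte aligned,
and dropping the LOCK prefix is safe only because the memory is private to
the current cpu):

static inline int cmpxchg16b_nolock(void *ptr,
                unsigned long long old_lo, unsigned long long old_hi,
                unsigned long long new_lo, unsigned long long new_hi)
{
        unsigned char ok;

        /* atomic against interrupts and preemption on this cpu,
           but NOT against other cpus, which is exactly the per-cpu case */
        asm volatile("cmpxchg16b %1; sete %0"
                     : "=q"(ok), "+m"(*(volatile __int128 *)ptr),
                       "+a"(old_lo), "+d"(old_hi)
                     : "b"(new_lo), "c"(new_hi)
                     : "cc", "memory");
        return ok;
}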

And in the freelist code, the line:
                if (CMPXCHG16B(fl, old, new))
needs to be replaced with (we want to ensure that we are still on the same cpu):
               if (CMPXCHG16B(&GS[OFF], old, new))


For more complex data structures (e.g. a doubly-linked list), I *suspect*
that it's possible to do something like a transaction-journalling approach.
Namely, there is a single per-cpu slot for the current transaction. If it's
empty, a thread installs its own transaction descriptor with a CAS w/o LOCK
prefix, executes it and uninstalls the trx descriptor. If it's not empty,
a thread first helps to execute the pending transaction and uninstalls the
trx descriptor. All mutations in transactions are done with CAS w/o LOCK
prefix (both when executed by the issuing thread and during helping).
Care must be taken to ensure that threads do not operate on stale data or
on the wrong cpu; probably all data slots must be accompanied by an ABA
counter and/or cpu number.
I've done this in the context of robust shared-memory data structures.
I have not worked out all the details for per-cpu data structures, but
I think it can work.



> It seems to me that the restriction to a single instruction will be very
> limiting for anything more complicated than a counter. Also, it's not even
> clear how someone would implement retry (short of something like kuchannel).

What is retry/kuchannel?


> Of course it wouldn't be a problem with TSX transactions, but it's not
> clear they need it.

As far as I know (I have not had a chance to try it), the commit of a TSX
transaction includes a trailing store-load style fence, and the whole point
of this change is to eliminate the cost of that store-load fence. Otherwise,
you could just query any approximation of the current cpu and do either TSX
or a LOCK-prefixed RMW on that cpu's state.


> The other problem with the approach is, how would cpu hotplug
> be handled?
>
>> By the way, newer Intel CPUs have even faster instructions for changing
>> %fs/%gs (FSGSBASE), but they are not yet supported by the kernel.
>
> Patch kits are pending.
>
> -Andi

