linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* mm: NULL ptr deref handling mmaping of special mappings
@ 2014-05-14 15:55 Sasha Levin
  2014-05-14 20:23 ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Sasha Levin @ 2014-05-14 15:55 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Dave Jones, LKML

Hi all,

While fuzzing with trinity inside a KVM tools guest running the latest -next
kernel I've stumbled on the following spew:

[ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961)
[ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0
[ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1634.972913] Dumping ftrace buffer:
[ 1634.975493]    (ftrace buffer empty)
[ 1634.977470] Modules linked in:
[ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461
[ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000
[ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961)
[ 1634.977513] RSP: 0018:ffff88033351faf8  EFLAGS: 00010202
[ 1634.977513] RAX: 0000000000000000 RBX: ffff88033351fbb0 RCX: 0000000000000028
[ 1634.977513] RDX: 0000000000000001 RSI: ffff88033351fb38 RDI: ffff88006d28b600
[ 1634.977513] RBP: ffff88033351fb18 R08: ffff88033351fbb0 R09: 0000000000000000
[ 1634.977513] R10: ffff88001bcca690 R11: 0000000000000000 R12: ffff88001bcca690
[ 1634.977513] R13: ffff880000000748 R14: 0000000000000028 R15: ffff88006d28b600
[ 1634.977513] FS:  00007f79da4b8700(0000) GS:ffff8801b4c00000(0000) knlGS:0000000000000000
[ 1634.977513] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1634.977513] CR2: 0000000000000000 CR3: 00000003334fb000 CR4: 00000000000006a0
[ 1634.977513] Stack:
[ 1634.977513]  ffff88033351fb08 00000000000005d1 00000000000005d1 ffff88033351fbb0
[ 1634.977513]  ffff88033351fb78 ffffffffac2b83a9 ffff88033351fc28 ffffffffac1cb679
[ 1634.977513]  0000000000000028 00000007f79da4e9 00007f79da4e9000 0000000000000000
[ 1634.977513] Call Trace:
[ 1634.977513] __do_fault (mm/memory.c:2703)
[ 1634.977513] ? __lock_acquire (kernel/locking/lockdep.c:3189)
[ 1634.977513] do_read_fault.isra.40 (mm/memory.c:2883)
[ 1634.977513] ? get_parent_ip (kernel/sched/core.c:2519)
[ 1634.977513] ? get_parent_ip (kernel/sched/core.c:2519)
[ 1634.977513] ? __mem_cgroup_count_vm_event (include/linux/rcupdate.h:391 include/linux/rcupdate.h:893 mm/memcontrol.c:1302)
[ 1634.977513] ? get_parent_ip (kernel/sched/core.c:2519)
[ 1634.977513] __handle_mm_fault (mm/memory.c:3021 mm/memory.c:3182 mm/memory.c:3306)
[ 1634.977513] ? __const_udelay (arch/x86/lib/delay.c:126)
[ 1634.977513] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[ 1634.977513] handle_mm_fault (mm/memory.c:3329)
[ 1634.977513] __do_page_fault (arch/x86/mm/fault.c:1224)
[ 1634.977513] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[ 1634.977513] ? sched_clock (arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:305)
[ 1634.977513] ? sched_clock_local (kernel/sched/clock.c:214)
[ 1634.977513] ? get_parent_ip (kernel/sched/core.c:2519)
[ 1634.977513] ? context_tracking_user_exit (kernel/context_tracking.c:182)
[ 1634.977513] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1634.977513] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
[ 1634.977513] do_page_fault (arch/x86/mm/fault.c:1277 include/linux/jump_label.h:105 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1278)
[ 1634.977513] do_async_page_fault (arch/x86/kernel/kvm.c:264)
[ 1634.977513] async_page_fault (arch/x86/kernel/entry_64.S:1552)
[ 1634.977513] ? copy_user_generic_unrolled (arch/x86/lib/copy_user_64.S:153)
[ 1634.977513] ? SyS_add_key (security/keys/keyctl.c:109 security/keys/keyctl.c:62)
[ 1634.977513] tracesys (arch/x86/kernel/entry_64.S:746)
[ 1634.977513] Code: 1f 40 00 66 66 66 66 90 55 48 89 e5 53 48 83 ec 18 48 8b 56 08 48 8b 87 a8 00 00 00 48 2b 97 98 00 00 00 74 1e 66 0f 1f 44 00 00 <48> 83 38 00 74 62 48 83 c0 08 48 83 ea 01 75 f0 0f 1f 84 00 00
[ 1634.977513] RIP special_mapping_fault (mm/mmap.c:2961)
[ 1634.977513]  RSP <ffff88033351faf8>
[ 1634.977513] CR2: 0000000000000000


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 15:55 mm: NULL ptr deref handling mmaping of special mappings Sasha Levin
@ 2014-05-14 20:23 ` Andrew Morton
  2014-05-14 20:41   ` Sasha Levin
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2014-05-14 20:23 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-mm, Dave Jones, LKML

On Wed, 14 May 2014 11:55:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:

> Hi all,
> 
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel I've stumbled on the following spew:
> 
> [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961)
> [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0
> [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 1634.972913] Dumping ftrace buffer:
> [ 1634.975493]    (ftrace buffer empty)
> [ 1634.977470] Modules linked in:
> [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461
> [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000
> [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961)

Somebody's gone and broken the x86 oops output.  It used to say
"special_mapping_fault+0x30/0x120" but the offset info has now
disappeared.  That was useful for guesstimating whereabouts in the
function it died.

The line number isn't very useful as it's not possible (or at least,
not convenient) for others to reliably reproduce your kernel.

<scrabbles with git for a while>

: static int special_mapping_fault(struct vm_area_struct *vma,
: 				struct vm_fault *vmf)
: {
: 	pgoff_t pgoff;
: 	struct page **pages;
: 
: 	/*
: 	 * special mappings have no vm_file, and in that case, the mm
: 	 * uses vm_pgoff internally. So we have to subtract it from here.
: 	 * We are allowed to do this because we are the mm; do not copy
: 	 * this code into drivers!
: 	 */
: 	pgoff = vmf->pgoff - vma->vm_pgoff;
: 
: 	for (pages = vma->vm_private_data; pgoff && *pages; ++pages)
: 		pgoff--;
: 
: 	if (*pages) {
: 		struct page *page = *pages;
: 		get_page(page);
: 		vmf->page = page;
: 		return 0;
: 	}
: 
: 	return VM_FAULT_SIGBUS;
: }

OK so it might be the "if (*pages)".  So vma->vm_private_data was NULL
and pgoff was zero.  As usual, I can't imagine what race would cause
that :(


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 20:23 ` Andrew Morton
@ 2014-05-14 20:41   ` Sasha Levin
  2014-05-14 21:03     ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Sasha Levin @ 2014-05-14 20:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Dave Jones, LKML

On 05/14/2014 04:23 PM, Andrew Morton wrote:
> On Wed, 14 May 2014 11:55:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
> 
>> Hi all,
>>
>> While fuzzing with trinity inside a KVM tools guest running the latest -next
>> kernel I've stumbled on the following spew:
>>
>> [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at           (null)
>> [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961)
>> [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0
>> [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> [ 1634.972913] Dumping ftrace buffer:
>> [ 1634.975493]    (ftrace buffer empty)
>> [ 1634.977470] Modules linked in:
>> [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461
>> [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000
>> [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961)
> 
> Somebody's gone and broken the x86 oops output.  It used to say
> "special_mapping_fault+0x30/0x120" but the offset info has now
> disappeared.  That was useful for guesstimating whereabouts in the
> function it died.

I'm the one who "broke" the oops output, but I thought I'm helping people
read that output instead of making it harder...

What happened before is that due to my rather complex .config, the offsets
didn't make sense to anyone who didn't build the kernel with my .config,
so I had to repeatedly send it out to folks who attempted to get basic
things like line numbers.

> The line number isn't very useful as it's not possible (or at least,
> not convenient) for others to reliably reproduce your kernel.

I don't understand that part. I'm usually stating in the beginning of my
mails that I run my testing on the latest -next kernel. And indeed if
you look at today's -next, that line number would point to:

        for (pages = vma->vm_private_data; pgoff && *pages; ++pages) <=== HERE
                pgoff--;

So I'm not sure how replacing the offset with line numbers is making things
worse? previously offsets were useless for people who tried to debug these
spews so that's why I switched it to line numbers in the first place.

> <scrabbles with git for a while>
> 
> : static int special_mapping_fault(struct vm_area_struct *vma,
> : 				struct vm_fault *vmf)
> : {
> : 	pgoff_t pgoff;
> : 	struct page **pages;
> : 
> : 	/*
> : 	 * special mappings have no vm_file, and in that case, the mm
> : 	 * uses vm_pgoff internally. So we have to subtract it from here.
> : 	 * We are allowed to do this because we are the mm; do not copy
> : 	 * this code into drivers!
> : 	 */
> : 	pgoff = vmf->pgoff - vma->vm_pgoff;
> : 
> : 	for (pages = vma->vm_private_data; pgoff && *pages; ++pages)
> : 		pgoff--;
> : 
> : 	if (*pages) {
> : 		struct page *page = *pages;
> : 		get_page(page);
> : 		vmf->page = page;
> : 		return 0;
> : 	}
> : 
> : 	return VM_FAULT_SIGBUS;
> : }
> 
> OK so it might be the "if (*pages)".  So vma->vm_private_data was NULL
> and pgoff was zero.  As usual, I can't imagine what race would cause
> that :(

Yup, it's the *pages part in the 'for' loop above that. I did find the
following in the vdso code:

        vma = _install_special_mapping(mm,
                                       addr + image->size,
                                       image->sym_end_mapping - image->size,
                                       VM_READ,
                                       NULL);

Which installs a mapping with a NULL ptr for pages (if I understand that
correctly), but that code has been there for a while now.


Thanks,
Sasha
been there

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 20:41   ` Sasha Levin
@ 2014-05-14 21:03     ` Andrew Morton
  2014-05-14 21:11       ` Sasha Levin
  2014-05-14 21:26       ` Andy Lutomirski
  0 siblings, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2014-05-14 21:03 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-mm, Dave Jones, LKML, Andy Lutomirski

On Wed, 14 May 2014 16:41:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:

> On 05/14/2014 04:23 PM, Andrew Morton wrote:
> > On Wed, 14 May 2014 11:55:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
> > 
> >> Hi all,
> >>
> >> While fuzzing with trinity inside a KVM tools guest running the latest -next
> >> kernel I've stumbled on the following spew:
> >>
> >> [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at           (null)
> >> [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961)
> >> [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0
> >> [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> >> [ 1634.972913] Dumping ftrace buffer:
> >> [ 1634.975493]    (ftrace buffer empty)
> >> [ 1634.977470] Modules linked in:
> >> [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461
> >> [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000
> >> [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961)
> > 
> > Somebody's gone and broken the x86 oops output.  It used to say
> > "special_mapping_fault+0x30/0x120" but the offset info has now
> > disappeared.  That was useful for guesstimating whereabouts in the
> > function it died.
> 
> I'm the one who "broke" the oops output, but I thought I'm helping people
> read that output instead of making it harder...
> 
> What happened before is that due to my rather complex .config, the offsets
> didn't make sense to anyone who didn't build the kernel with my .config,
> so I had to repeatedly send it out to folks who attempted to get basic
> things like line numbers.
> 
> > The line number isn't very useful as it's not possible (or at least,
> > not convenient) for others to reliably reproduce your kernel.
> 
> I don't understand that part. I'm usually stating in the beginning of my
> mails that I run my testing on the latest -next kernel.

Your "latest next kernel" apparently differes from mine ;( It would be
useful if you could just quote the +/-5 lines, perhaps?


> And indeed if
> you look at today's -next, that line number would point to:
> 
>         for (pages = vma->vm_private_data; pgoff && *pages; ++pages) <=== HERE
>                 pgoff--;
> 
> So I'm not sure how replacing the offset with line numbers is making things
> worse? previously offsets were useless for people who tried to debug these
> spews so that's why I switched it to line numbers in the first place.
> 
> > <scrabbles with git for a while>
> > 
> > : static int special_mapping_fault(struct vm_area_struct *vma,
> > : 				struct vm_fault *vmf)
> > : {
> > : 	pgoff_t pgoff;
> > : 	struct page **pages;
> > : 
> > : 	/*
> > : 	 * special mappings have no vm_file, and in that case, the mm
> > : 	 * uses vm_pgoff internally. So we have to subtract it from here.
> > : 	 * We are allowed to do this because we are the mm; do not copy
> > : 	 * this code into drivers!
> > : 	 */
> > : 	pgoff = vmf->pgoff - vma->vm_pgoff;
> > : 
> > : 	for (pages = vma->vm_private_data; pgoff && *pages; ++pages)
> > : 		pgoff--;
> > : 
> > : 	if (*pages) {
> > : 		struct page *page = *pages;
> > : 		get_page(page);
> > : 		vmf->page = page;
> > : 		return 0;
> > : 	}
> > : 
> > : 	return VM_FAULT_SIGBUS;
> > : }
> > 
> > OK so it might be the "if (*pages)".  So vma->vm_private_data was NULL
> > and pgoff was zero.  As usual, I can't imagine what race would cause
> > that :(
> 
> Yup, it's the *pages part in the 'for' loop above that. I did find the
> following in the vdso code:
> 
>         vma = _install_special_mapping(mm,
>                                        addr + image->size,
>                                        image->sym_end_mapping - image->size,
>                                        VM_READ,
>                                        NULL);
> 
> Which installs a mapping with a NULL ptr for pages (if I understand that
> correctly), but that code has been there for a while now.

Well that's weird.  I don't see anything which permits that.  Maybe
nobody faulted against that address before?

It's unclear what that code's actually doing and nobody bothered
commenting it of course.  Maybe it's installing a guard page?

In my linux-next all that code got deleted by Andy's "x86, vdso:
Reimplement vdso.so preparation in build-time C" anyway.  What kernel
were you looking at?

Andy, are you able to shed some light on why
arch_setup_additional_pages() is (or was) passing a NULL into
_install_special_mapping()?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 21:03     ` Andrew Morton
@ 2014-05-14 21:11       ` Sasha Levin
  2014-05-14 21:31         ` Andrew Morton
  2014-05-14 21:26       ` Andy Lutomirski
  1 sibling, 1 reply; 27+ messages in thread
From: Sasha Levin @ 2014-05-14 21:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, Dave Jones, LKML, Andy Lutomirski

On 05/14/2014 05:03 PM, Andrew Morton wrote:
> On Wed, 14 May 2014 16:41:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
> 
>> On 05/14/2014 04:23 PM, Andrew Morton wrote:
>>> On Wed, 14 May 2014 11:55:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> While fuzzing with trinity inside a KVM tools guest running the latest -next
>>>> kernel I've stumbled on the following spew:
>>>>
>>>> [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at           (null)
>>>> [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961)
>>>> [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0
>>>> [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>>> [ 1634.972913] Dumping ftrace buffer:
>>>> [ 1634.975493]    (ftrace buffer empty)
>>>> [ 1634.977470] Modules linked in:
>>>> [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461
>>>> [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000
>>>> [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961)
>>>
>>> Somebody's gone and broken the x86 oops output.  It used to say
>>> "special_mapping_fault+0x30/0x120" but the offset info has now
>>> disappeared.  That was useful for guesstimating whereabouts in the
>>> function it died.
>>
>> I'm the one who "broke" the oops output, but I thought I'm helping people
>> read that output instead of making it harder...
>>
>> What happened before is that due to my rather complex .config, the offsets
>> didn't make sense to anyone who didn't build the kernel with my .config,
>> so I had to repeatedly send it out to folks who attempted to get basic
>> things like line numbers.
>>
>>> The line number isn't very useful as it's not possible (or at least,
>>> not convenient) for others to reliably reproduce your kernel.
>>
>> I don't understand that part. I'm usually stating in the beginning of my
>> mails that I run my testing on the latest -next kernel.
> 
> Your "latest next kernel" apparently differes from mine ;( It would be
> useful if you could just quote the +/-5 lines, perhaps?

Oh, I see what happened. I have the remap_file_pages() get_file/fput fix
merged in which modified line count.

Yup, I'll start quoting the line themselves as well.

>> And indeed if
>> you look at today's -next, that line number would point to:
>>
>>         for (pages = vma->vm_private_data; pgoff && *pages; ++pages) <=== HERE
>>                 pgoff--;
>>
>> So I'm not sure how replacing the offset with line numbers is making things
>> worse? previously offsets were useless for people who tried to debug these
>> spews so that's why I switched it to line numbers in the first place.
>>
>>> <scrabbles with git for a while>
>>>
>>> : static int special_mapping_fault(struct vm_area_struct *vma,
>>> : 				struct vm_fault *vmf)
>>> : {
>>> : 	pgoff_t pgoff;
>>> : 	struct page **pages;
>>> : 
>>> : 	/*
>>> : 	 * special mappings have no vm_file, and in that case, the mm
>>> : 	 * uses vm_pgoff internally. So we have to subtract it from here.
>>> : 	 * We are allowed to do this because we are the mm; do not copy
>>> : 	 * this code into drivers!
>>> : 	 */
>>> : 	pgoff = vmf->pgoff - vma->vm_pgoff;
>>> : 
>>> : 	for (pages = vma->vm_private_data; pgoff && *pages; ++pages)
>>> : 		pgoff--;
>>> : 
>>> : 	if (*pages) {
>>> : 		struct page *page = *pages;
>>> : 		get_page(page);
>>> : 		vmf->page = page;
>>> : 		return 0;
>>> : 	}
>>> : 
>>> : 	return VM_FAULT_SIGBUS;
>>> : }
>>>
>>> OK so it might be the "if (*pages)".  So vma->vm_private_data was NULL
>>> and pgoff was zero.  As usual, I can't imagine what race would cause
>>> that :(
>>
>> Yup, it's the *pages part in the 'for' loop above that. I did find the
>> following in the vdso code:
>>
>>         vma = _install_special_mapping(mm,
>>                                        addr + image->size,
>>                                        image->sym_end_mapping - image->size,
>>                                        VM_READ,
>>                                        NULL);
>>
>> Which installs a mapping with a NULL ptr for pages (if I understand that
>> correctly), but that code has been there for a while now.
> 
> Well that's weird.  I don't see anything which permits that.  Maybe
> nobody faulted against that address before?
> 
> It's unclear what that code's actually doing and nobody bothered
> commenting it of course.  Maybe it's installing a guard page?
> 
> In my linux-next all that code got deleted by Andy's "x86, vdso:
> Reimplement vdso.so preparation in build-time C" anyway.  What kernel
> were you looking at?

Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 .

I don't see Andy's patch removing that code either.

I'm running next-20140514...


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 21:03     ` Andrew Morton
  2014-05-14 21:11       ` Sasha Levin
@ 2014-05-14 21:26       ` Andy Lutomirski
  1 sibling, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2014-05-14 21:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Sasha Levin, linux-mm, Dave Jones, LKML, Stefani Seibold

On Wed, May 14, 2014 at 2:03 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 14 May 2014 16:41:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
>
>> On 05/14/2014 04:23 PM, Andrew Morton wrote:
>> > On Wed, 14 May 2014 11:55:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
>> >
>> >> Hi all,
>> >>
>> >> While fuzzing with trinity inside a KVM tools guest running the latest -next
>> >> kernel I've stumbled on the following spew:
>> >>
>> >> [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at           (null)
>> >> [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961)
>> >> [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0
>> >> [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> >> [ 1634.972913] Dumping ftrace buffer:
>> >> [ 1634.975493]    (ftrace buffer empty)
>> >> [ 1634.977470] Modules linked in:
>> >> [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461
>> >> [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000
>> >> [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961)
>> >
>> > Somebody's gone and broken the x86 oops output.  It used to say
>> > "special_mapping_fault+0x30/0x120" but the offset info has now
>> > disappeared.  That was useful for guesstimating whereabouts in the
>> > function it died.
>>
>> I'm the one who "broke" the oops output, but I thought I'm helping people
>> read that output instead of making it harder...
>>
>> What happened before is that due to my rather complex .config, the offsets
>> didn't make sense to anyone who didn't build the kernel with my .config,
>> so I had to repeatedly send it out to folks who attempted to get basic
>> things like line numbers.
>>
>> > The line number isn't very useful as it's not possible (or at least,
>> > not convenient) for others to reliably reproduce your kernel.
>>
>> I don't understand that part. I'm usually stating in the beginning of my
>> mails that I run my testing on the latest -next kernel.
>
> Your "latest next kernel" apparently differes from mine ;( It would be
> useful if you could just quote the +/-5 lines, perhaps?
>
>
>> And indeed if
>> you look at today's -next, that line number would point to:
>>
>>         for (pages = vma->vm_private_data; pgoff && *pages; ++pages) <=== HERE
>>                 pgoff--;
>>
>> So I'm not sure how replacing the offset with line numbers is making things
>> worse? previously offsets were useless for people who tried to debug these
>> spews so that's why I switched it to line numbers in the first place.
>>
>> > <scrabbles with git for a while>
>> >
>> > : static int special_mapping_fault(struct vm_area_struct *vma,
>> > :                           struct vm_fault *vmf)
>> > : {
>> > :   pgoff_t pgoff;
>> > :   struct page **pages;
>> > :
>> > :   /*
>> > :    * special mappings have no vm_file, and in that case, the mm
>> > :    * uses vm_pgoff internally. So we have to subtract it from here.
>> > :    * We are allowed to do this because we are the mm; do not copy
>> > :    * this code into drivers!
>> > :    */
>> > :   pgoff = vmf->pgoff - vma->vm_pgoff;
>> > :
>> > :   for (pages = vma->vm_private_data; pgoff && *pages; ++pages)
>> > :           pgoff--;
>> > :
>> > :   if (*pages) {
>> > :           struct page *page = *pages;
>> > :           get_page(page);
>> > :           vmf->page = page;
>> > :           return 0;
>> > :   }
>> > :
>> > :   return VM_FAULT_SIGBUS;
>> > : }
>> >
>> > OK so it might be the "if (*pages)".  So vma->vm_private_data was NULL
>> > and pgoff was zero.  As usual, I can't imagine what race would cause
>> > that :(
>>
>> Yup, it's the *pages part in the 'for' loop above that. I did find the
>> following in the vdso code:
>>
>>         vma = _install_special_mapping(mm,
>>                                        addr + image->size,
>>                                        image->sym_end_mapping - image->size,
>>                                        VM_READ,
>>                                        NULL);
>>
>> Which installs a mapping with a NULL ptr for pages (if I understand that
>> correctly), but that code has been there for a while now.
>
> Well that's weird.  I don't see anything which permits that.  Maybe
> nobody faulted against that address before?
>
> It's unclear what that code's actually doing and nobody bothered
> commenting it of course.  Maybe it's installing a guard page?
>
> In my linux-next all that code got deleted by Andy's "x86, vdso:
> Reimplement vdso.so preparation in build-time C" anyway.  What kernel
> were you looking at?
>
> Andy, are you able to shed some light on why
> arch_setup_additional_pages() is (or was) passing a NULL into
> _install_special_mapping()?
>

Ugh.  I just moved that code; I didn't delete it.  In fact, I likely
made it worse by causing x86_64 to use it, too.

Oddly, this code worked when I tested it.  Do you know what Trinity
did to break it?  Did the PTEs get nuked due to memory pressure?

The intent is to use remap_pfn_range (a couple of lines below) to map
some kernel data into userspace read-only.  I think the right fix
would be to pass in the pages array directly but to modify
_install_special_mapping to clear VM_MAYWRITE.  The idea is that
ptrace and mprotect should *not* be usable to COW these pages:
userspace will shoot itself in the foot quite thoroughly if the pages
get COWed.  Additionally, one of those pages actually represents some
hardware registers, which almost certainly needs to be mapped UC.

IIRC on x86_32 there are two pages in there.  One is regular kernel
memory that has a struct page associated with it.  The other is the
HPET, which is a page of hardware registers that may not have a struct
page.

Can anyone who's less clueless than I am about how vmas work help?
Does this need to use its own vmops instead of the special mapping
stuff?

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 21:11       ` Sasha Levin
@ 2014-05-14 21:31         ` Andrew Morton
  2014-05-14 21:33           ` Andy Lutomirski
  2014-05-14 22:51           ` Andy Lutomirski
  0 siblings, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2014-05-14 21:31 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-mm, Dave Jones, LKML, Andy Lutomirski

On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:

> > In my linux-next all that code got deleted by Andy's "x86, vdso:
> > Reimplement vdso.so preparation in build-time C" anyway.  What kernel
> > were you looking at?
> 
> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 .
> 
> I don't see Andy's patch removing that code either.

ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into
arch/x86/vdso/vma.c.

Maybe you managed to take a fault against the symbol area between the
_install_special_mapping() and the remap_pfn_range() call, but mmap_sem
should prevent that.

Or the remap_pfn_range() call never happened.  Should map_vdso() be
running _install_special_mapping() at all if
image->sym_vvar_page==NULL?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 21:31         ` Andrew Morton
@ 2014-05-14 21:33           ` Andy Lutomirski
  2014-05-14 22:11             ` Cyrill Gorcunov
  2014-05-14 22:51           ` Andy Lutomirski
  1 sibling, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2014-05-14 21:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov,
	Cyrill Gorcunov

On Wed, May 14, 2014 at 2:31 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
>
>> > In my linux-next all that code got deleted by Andy's "x86, vdso:
>> > Reimplement vdso.so preparation in build-time C" anyway.  What kernel
>> > were you looking at?
>>
>> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 .
>>
>> I don't see Andy's patch removing that code either.
>
> ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into
> arch/x86/vdso/vma.c.
>
> Maybe you managed to take a fault against the symbol area between the
> _install_special_mapping() and the remap_pfn_range() call, but mmap_sem
> should prevent that.
>
> Or the remap_pfn_range() call never happened.  Should map_vdso() be
> running _install_special_mapping() at all if
> image->sym_vvar_page==NULL?

I'm confused: are we talking about 3.15-rcsomething or linux-next?
That code changed.

Would this all make more sense if there were just a single vma in
here?  cc: Pavel and Cyrill, who might have to deal with this stuff in
CRIU

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 21:33           ` Andy Lutomirski
@ 2014-05-14 22:11             ` Cyrill Gorcunov
  2014-05-14 22:23               ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Cyrill Gorcunov @ 2014-05-14 22:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote:
> On Wed, May 14, 2014 at 2:31 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
> >
> >> > In my linux-next all that code got deleted by Andy's "x86, vdso:
> >> > Reimplement vdso.so preparation in build-time C" anyway.  What kernel
> >> > were you looking at?
> >>
> >> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 .
> >>
> >> I don't see Andy's patch removing that code either.
> >
> > ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into
> > arch/x86/vdso/vma.c.
> >
> > Maybe you managed to take a fault against the symbol area between the
> > _install_special_mapping() and the remap_pfn_range() call, but mmap_sem
> > should prevent that.
> >
> > Or the remap_pfn_range() call never happened.  Should map_vdso() be
> > running _install_special_mapping() at all if
> > image->sym_vvar_page==NULL?
> 
> I'm confused: are we talking about 3.15-rcsomething or linux-next?
> That code changed.
> 
> Would this all make more sense if there were just a single vma in
> here?  cc: Pavel and Cyrill, who might have to deal with this stuff in
> CRIU

Well, for criu we've not modified any vdso kernel's code (except
setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping).
And never experienced problems Sasha points. Looks like indeed in
-next code is pretty different from mainline one. To figure out
why I need to fetch -next branch and get some research. I would
try to do that tomorrow (still hoping someone more experienced
in mm system would beat me on that).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 22:11             ` Cyrill Gorcunov
@ 2014-05-14 22:23               ` Andy Lutomirski
  2014-05-15  2:36                 ` Pavel Emelyanov
  2014-05-15  8:45                 ` Cyrill Gorcunov
  0 siblings, 2 replies; 27+ messages in thread
From: Andy Lutomirski @ 2014-05-14 22:23 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Wed, May 14, 2014 at 3:11 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote:
>> On Wed, May 14, 2014 at 2:31 PM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>> > On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
>> >
>> >> > In my linux-next all that code got deleted by Andy's "x86, vdso:
>> >> > Reimplement vdso.so preparation in build-time C" anyway.  What kernel
>> >> > were you looking at?
>> >>
>> >> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 .
>> >>
>> >> I don't see Andy's patch removing that code either.
>> >
>> > ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into
>> > arch/x86/vdso/vma.c.
>> >
>> > Maybe you managed to take a fault against the symbol area between the
>> > _install_special_mapping() and the remap_pfn_range() call, but mmap_sem
>> > should prevent that.
>> >
>> > Or the remap_pfn_range() call never happened.  Should map_vdso() be
>> > running _install_special_mapping() at all if
>> > image->sym_vvar_page==NULL?
>>
>> I'm confused: are we talking about 3.15-rcsomething or linux-next?
>> That code changed.
>>
>> Would this all make more sense if there were just a single vma in
>> here?  cc: Pavel and Cyrill, who might have to deal with this stuff in
>> CRIU
>
> Well, for criu we've not modified any vdso kernel's code (except
> setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping).
> And never experienced problems Sasha points. Looks like indeed in
> -next code is pretty different from mainline one. To figure out
> why I need to fetch -next branch and get some research. I would
> try to do that tomorrow (still hoping someone more experienced
> in mm system would beat me on that).

I can summarize:

On 3.14 and before, the vdso is just a bunch of ELF headers and
executable data.  When executed by 64-bit binaries, it reads from the
fixmap to do its thing.  That is, it reads from kernel addresses that
don't have vmas.  When executed by 32-bit binaries, it doesn't read
anything, since there was no 32-bit timing code.

On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
a separate vma containing two pages worth of time-varying read-only
data.  The vdso reads those pages using PIC references.

On linux-next, all vdsos work the same way.  There are two vmas.  The
first vma is executable text, which can be poked at by ptrace, etc
normally.  The second vma contains time-varying state, should not
allow poking, and is accessed by PIC references.

What does CRIU do to restore the vdso?  Will 3.15 and/or linux-next
need to make some concession for CRIU?

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 21:31         ` Andrew Morton
  2014-05-14 21:33           ` Andy Lutomirski
@ 2014-05-14 22:51           ` Andy Lutomirski
  1 sibling, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2014-05-14 22:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Sasha Levin, linux-mm, Dave Jones, LKML

On Wed, May 14, 2014 at 2:31 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
>
>> > In my linux-next all that code got deleted by Andy's "x86, vdso:
>> > Reimplement vdso.so preparation in build-time C" anyway.  What kernel
>> > were you looking at?
>>
>> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 .
>>
>> I don't see Andy's patch removing that code either.
>
> ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into
> arch/x86/vdso/vma.c.
>
> Maybe you managed to take a fault against the symbol area between the
> _install_special_mapping() and the remap_pfn_range() call, but mmap_sem
> should prevent that.
>
> Or the remap_pfn_range() call never happened.  Should map_vdso() be
> running _install_special_mapping() at all if
> image->sym_vvar_page==NULL?

You're almost right, but that was enough to point me in the right direction :)

The mapping is still needed, since there are two pages.  qemu -no-hpet
will trigger this, but the nohpet kernel option will not.  The latter
is arguably a bug in the nohpet option.

Fix coming.

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 22:23               ` Andy Lutomirski
@ 2014-05-15  2:36                 ` Pavel Emelyanov
  2014-05-15 19:42                   ` Andy Lutomirski
  2014-05-15  8:45                 ` Cyrill Gorcunov
  1 sibling, 1 reply; 27+ messages in thread
From: Pavel Emelyanov @ 2014-05-15  2:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Cyrill Gorcunov, Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML

On 05/15/2014 02:23 AM, Andy Lutomirski wrote:
> On Wed, May 14, 2014 at 3:11 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>> On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote:
>>> On Wed, May 14, 2014 at 2:31 PM, Andrew Morton
>>> <akpm@linux-foundation.org> wrote:
>>>> On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
>>>>
>>>>>> In my linux-next all that code got deleted by Andy's "x86, vdso:
>>>>>> Reimplement vdso.so preparation in build-time C" anyway.  What kernel
>>>>>> were you looking at?
>>>>>
>>>>> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 .
>>>>>
>>>>> I don't see Andy's patch removing that code either.
>>>>
>>>> ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into
>>>> arch/x86/vdso/vma.c.
>>>>
>>>> Maybe you managed to take a fault against the symbol area between the
>>>> _install_special_mapping() and the remap_pfn_range() call, but mmap_sem
>>>> should prevent that.
>>>>
>>>> Or the remap_pfn_range() call never happened.  Should map_vdso() be
>>>> running _install_special_mapping() at all if
>>>> image->sym_vvar_page==NULL?
>>>
>>> I'm confused: are we talking about 3.15-rcsomething or linux-next?
>>> That code changed.
>>>
>>> Would this all make more sense if there were just a single vma in
>>> here?  cc: Pavel and Cyrill, who might have to deal with this stuff in
>>> CRIU
>>
>> Well, for criu we've not modified any vdso kernel's code (except
>> setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping).
>> And never experienced problems Sasha points. Looks like indeed in
>> -next code is pretty different from mainline one. To figure out
>> why I need to fetch -next branch and get some research. I would
>> try to do that tomorrow (still hoping someone more experienced
>> in mm system would beat me on that).
> 
> I can summarize:
> 
> On 3.14 and before, the vdso is just a bunch of ELF headers and
> executable data.  When executed by 64-bit binaries, it reads from the
> fixmap to do its thing.  That is, it reads from kernel addresses that
> don't have vmas.  When executed by 32-bit binaries, it doesn't read
> anything, since there was no 32-bit timing code.
> 
> On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
> a separate vma containing two pages worth of time-varying read-only
> data.  The vdso reads those pages using PIC references.
> 
> On linux-next, all vdsos work the same way.  There are two vmas.  The
> first vma is executable text, which can be poked at by ptrace, etc
> normally.  The second vma contains time-varying state, should not
> allow poking, and is accessed by PIC references.

Is this 2nd vma seen in /proc/pid/maps? And if so, is it marked somehow?

> What does CRIU do to restore the vdso?  Will 3.15 and/or linux-next
> need to make some concession for CRIU?

We detect the vdso by "[vdso]" mark in proc at dump time and mark it in
the images. At restore time we check that vdso symbols layout hasn't changed
and just remap it in proper location.

If this remains the same in -next, then we're fine :)

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-14 22:23               ` Andy Lutomirski
  2014-05-15  2:36                 ` Pavel Emelyanov
@ 2014-05-15  8:45                 ` Cyrill Gorcunov
  2014-05-15 19:46                   ` Andy Lutomirski
  1 sibling, 1 reply; 27+ messages in thread
From: Cyrill Gorcunov @ 2014-05-15  8:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote:
> I can summarize:
> 
> On 3.14 and before, the vdso is just a bunch of ELF headers and
> executable data.  When executed by 64-bit binaries, it reads from the
> fixmap to do its thing.  That is, it reads from kernel addresses that
> don't have vmas.  When executed by 32-bit binaries, it doesn't read
> anything, since there was no 32-bit timing code.
> 
> On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
> a separate vma containing two pages worth of time-varying read-only
> data.  The vdso reads those pages using PIC references.

Andy, could you please point me where is the code which creates a second vma?
latest 3.15 master branch

[root@fc ~]# cat /proc/self/maps
...
7fff57b6e000-7fff57b8f000 rw-p 00000000 00:00 0                          [stack]
7fff57bff000-7fff57c00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
[root@fc ~]#

Or you mean vsyscall area? If yes, then in criu we don't dump vsyscall zone.
On restore we don't touch  vsyscall either but for vdso there are two cases

 - if there were no kernel change on vdso contents we simply use vdso provided
   by the kernel at the moment of criu startup

 - if vdso has been changed and looks different from one saved in image during
   checkpoint, we map it from image but then patch (push jmp instruction) so
   when application calls for some of vdso function it jumps into vdso code
   saved in image and then jumps into vdso mapped by the kernel (ie kind of
   proxy calls) This force us to do own Elf parsing inside criu to calculate
   proper offsets.

We don't support (and have no plans to support) x86-32 kernels but there
left a case with compatible mode (32bit app on 64bit kernel) which has
not yet been implemented though.

> On linux-next, all vdsos work the same way.  There are two vmas.  The
> first vma is executable text, which can be poked at by ptrace, etc
> normally.  The second vma contains time-varying state, should not
> allow poking, and is accessed by PIC references.
> 
> What does CRIU do to restore the vdso?  Will 3.15 and/or linux-next
> need to make some concession for CRIU?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15  2:36                 ` Pavel Emelyanov
@ 2014-05-15 19:42                   ` Andy Lutomirski
  2014-05-19  8:27                     ` Pavel Emelyanov
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2014-05-15 19:42 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: linux-mm, Cyrill Gorcunov, LKML, Sasha Levin, Andrew Morton, Dave Jones

On May 14, 2014 8:36 PM, "Pavel Emelyanov" <xemul@parallels.com> wrote:
>
> On 05/15/2014 02:23 AM, Andy Lutomirski wrote:
> > On Wed, May 14, 2014 at 3:11 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> >> On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote:
> >>> On Wed, May 14, 2014 at 2:31 PM, Andrew Morton
> >>> <akpm@linux-foundation.org> wrote:
> >>>> On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
> >>>>
> >>>>>> In my linux-next all that code got deleted by Andy's "x86, vdso:
> >>>>>> Reimplement vdso.so preparation in build-time C" anyway.  What kernel
> >>>>>> were you looking at?
> >>>>>
> >>>>> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 .
> >>>>>
> >>>>> I don't see Andy's patch removing that code either.
> >>>>
> >>>> ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into
> >>>> arch/x86/vdso/vma.c.
> >>>>
> >>>> Maybe you managed to take a fault against the symbol area between the
> >>>> _install_special_mapping() and the remap_pfn_range() call, but mmap_sem
> >>>> should prevent that.
> >>>>
> >>>> Or the remap_pfn_range() call never happened.  Should map_vdso() be
> >>>> running _install_special_mapping() at all if
> >>>> image->sym_vvar_page==NULL?
> >>>
> >>> I'm confused: are we talking about 3.15-rcsomething or linux-next?
> >>> That code changed.
> >>>
> >>> Would this all make more sense if there were just a single vma in
> >>> here?  cc: Pavel and Cyrill, who might have to deal with this stuff in
> >>> CRIU
> >>
> >> Well, for criu we've not modified any vdso kernel's code (except
> >> setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping).
> >> And never experienced problems Sasha points. Looks like indeed in
> >> -next code is pretty different from mainline one. To figure out
> >> why I need to fetch -next branch and get some research. I would
> >> try to do that tomorrow (still hoping someone more experienced
> >> in mm system would beat me on that).
> >
> > I can summarize:
> >
> > On 3.14 and before, the vdso is just a bunch of ELF headers and
> > executable data.  When executed by 64-bit binaries, it reads from the
> > fixmap to do its thing.  That is, it reads from kernel addresses that
> > don't have vmas.  When executed by 32-bit binaries, it doesn't read
> > anything, since there was no 32-bit timing code.
> >
> > On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
> > a separate vma containing two pages worth of time-varying read-only
> > data.  The vdso reads those pages using PIC references.
> >
> > On linux-next, all vdsos work the same way.  There are two vmas.  The
> > first vma is executable text, which can be poked at by ptrace, etc
> > normally.  The second vma contains time-varying state, should not
> > allow poking, and is accessed by PIC references.
>
> Is this 2nd vma seen in /proc/pid/maps? And if so, is it marked somehow?

It is in maps, and it's not marked.  I can write a patch to change
that.  I imagine it shouldn't be called [vdso], though.

>
> > What does CRIU do to restore the vdso?  Will 3.15 and/or linux-next
> > need to make some concession for CRIU?
>
> We detect the vdso by "[vdso]" mark in proc at dump time and mark it in
> the images. At restore time we check that vdso symbols layout hasn't changed
> and just remap it in proper location.
>
> If this remains the same in -next, then we're fine :)

If you just remap the vdso, you'll crash.

This is the case in 3.15, too, for 32-bit apps, anyway.

What happens if you try to checkpoint a program that's in the vdso or,
worse, in a signal frame with the vdso on the stack?

--Andy

>
> Thanks,
> Pavel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15  8:45                 ` Cyrill Gorcunov
@ 2014-05-15 19:46                   ` Andy Lutomirski
  2014-05-15 19:53                     ` Cyrill Gorcunov
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2014-05-15 19:46 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Thu, May 15, 2014 at 1:45 AM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote:
>> I can summarize:
>>
>> On 3.14 and before, the vdso is just a bunch of ELF headers and
>> executable data.  When executed by 64-bit binaries, it reads from the
>> fixmap to do its thing.  That is, it reads from kernel addresses that
>> don't have vmas.  When executed by 32-bit binaries, it doesn't read
>> anything, since there was no 32-bit timing code.
>>
>> On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
>> a separate vma containing two pages worth of time-varying read-only
>> data.  The vdso reads those pages using PIC references.
>
> Andy, could you please point me where is the code which creates a second vma?
> latest 3.15 master branch

Search for _install_special_mapping in arch/x86/vdso.  It's in a
different place in 3.15-rc and -next.

>
> [root@fc ~]# cat /proc/self/maps
> ...
> 7fff57b6e000-7fff57b8f000 rw-p 00000000 00:00 0                          [stack]
> 7fff57bff000-7fff57c00000 r-xp 00000000 00:00 0                          [vdso]
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
> [root@fc ~]#
>

What version and bitness is this?

> Or you mean vsyscall area? If yes, then in criu we don't dump vsyscall zone.
> On restore we don't touch  vsyscall either but for vdso there are two cases

vsyscalls are almost gone now :)

>
>  - if there were no kernel change on vdso contents we simply use vdso provided
>    by the kernel at the moment of criu startup
>
>  - if vdso has been changed and looks different from one saved in image during
>    checkpoint, we map it from image but then patch (push jmp instruction) so
>    when application calls for some of vdso function it jumps into vdso code
>    saved in image and then jumps into vdso mapped by the kernel (ie kind of
>    proxy calls) This force us to do own Elf parsing inside criu to calculate
>    proper offsets.

Yuck :)

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15 19:46                   ` Andy Lutomirski
@ 2014-05-15 19:53                     ` Cyrill Gorcunov
  2014-05-15 19:59                       ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Cyrill Gorcunov @ 2014-05-15 19:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Thu, May 15, 2014 at 12:46:34PM -0700, Andy Lutomirski wrote:
> On Thu, May 15, 2014 at 1:45 AM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> > On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote:
> >> I can summarize:
> >>
> >> On 3.14 and before, the vdso is just a bunch of ELF headers and
> >> executable data.  When executed by 64-bit binaries, it reads from the
> >> fixmap to do its thing.  That is, it reads from kernel addresses that
> >> don't have vmas.  When executed by 32-bit binaries, it doesn't read
> >> anything, since there was no 32-bit timing code.
> >>
> >> On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
> >> a separate vma containing two pages worth of time-varying read-only
> >> data.  The vdso reads those pages using PIC references.
> >
> > Andy, could you please point me where is the code which creates a second vma?
> > latest 3.15 master branch
> 
> Search for _install_special_mapping in arch/x86/vdso.  It's in a
> different place in 3.15-rc and -next.

As far as I see _install_special_mapping allocates one vma from cache and

	vma->vm_start = addr;
	vma->vm_end = addr + len;

so where is the second one?

> 
> >
> > [root@fc ~]# cat /proc/self/maps
> > ...
> > 7fff57b6e000-7fff57b8f000 rw-p 00000000 00:00 0                          [stack]
> > 7fff57bff000-7fff57c00000 r-xp 00000000 00:00 0                          [vdso]
> > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
> > [root@fc ~]#
> >
> 
> What version and bitness is this?

x86-64, 3.15-rc5

> 
> > Or you mean vsyscall area? If yes, then in criu we don't dump vsyscall zone.
> > On restore we don't touch  vsyscall either but for vdso there are two cases
> 
> vsyscalls are almost gone now :)

Good to know ;)

> 
> >
> >  - if there were no kernel change on vdso contents we simply use vdso provided
> >    by the kernel at the moment of criu startup
> >
> >  - if vdso has been changed and looks different from one saved in image during
> >    checkpoint, we map it from image but then patch (push jmp instruction) so
> >    when application calls for some of vdso function it jumps into vdso code
> >    saved in image and then jumps into vdso mapped by the kernel (ie kind of
> >    proxy calls) This force us to do own Elf parsing inside criu to calculate
> >    proper offsets.
> 
> Yuck :)

Yeah, I know, we simply had no choise.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15 19:53                     ` Cyrill Gorcunov
@ 2014-05-15 19:59                       ` Andy Lutomirski
  2014-05-15 20:19                         ` Cyrill Gorcunov
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2014-05-15 19:59 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Thu, May 15, 2014 at 12:53 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Thu, May 15, 2014 at 12:46:34PM -0700, Andy Lutomirski wrote:
>> On Thu, May 15, 2014 at 1:45 AM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>> > On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote:
>> >> I can summarize:
>> >>
>> >> On 3.14 and before, the vdso is just a bunch of ELF headers and
>> >> executable data.  When executed by 64-bit binaries, it reads from the
>> >> fixmap to do its thing.  That is, it reads from kernel addresses that
>> >> don't have vmas.  When executed by 32-bit binaries, it doesn't read
>> >> anything, since there was no 32-bit timing code.
>> >>
>> >> On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
>> >> a separate vma containing two pages worth of time-varying read-only
>> >> data.  The vdso reads those pages using PIC references.
>> >
>> > Andy, could you please point me where is the code which creates a second vma?
>> > latest 3.15 master branch
>>
>> Search for _install_special_mapping in arch/x86/vdso.  It's in a
>> different place in 3.15-rc and -next.
>
> As far as I see _install_special_mapping allocates one vma from cache and
>
>         vma->vm_start = addr;
>         vma->vm_end = addr + len;
>
> so where is the second one?

Look at its callers in vdso32-setup.c and/or vma.c, depending on version.

>
>>
>> >
>> > [root@fc ~]# cat /proc/self/maps
>> > ...
>> > 7fff57b6e000-7fff57b8f000 rw-p 00000000 00:00 0                          [stack]
>> > 7fff57bff000-7fff57c00000 r-xp 00000000 00:00 0                          [vdso]
>> > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
>> > [root@fc ~]#
>> >
>>
>> What version and bitness is this?
>
> x86-64, 3.15-rc5

Aha.  Give tip/x86/vdso or -next a try or boot a 32-bit 3.15-rc kernel
and you'll see it.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15 19:59                       ` Andy Lutomirski
@ 2014-05-15 20:19                         ` Cyrill Gorcunov
  2014-05-15 21:31                           ` Cyrill Gorcunov
  0 siblings, 1 reply; 27+ messages in thread
From: Cyrill Gorcunov @ 2014-05-15 20:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Thu, May 15, 2014 at 12:59:04PM -0700, Andy Lutomirski wrote:
> >>
> >> What version and bitness is this?
> >
> > x86-64, 3.15-rc5
> 
> Aha.  Give tip/x86/vdso or -next a try or boot a 32-bit 3.15-rc kernel
> and you'll see it.

I see what you mean. We're rather targeting on bare x86-64 at the moment
but compat mode is needed as well (not yet implemented though). I'll take
a precise look into this area. Thanks!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15 20:19                         ` Cyrill Gorcunov
@ 2014-05-15 21:31                           ` Cyrill Gorcunov
  2014-05-15 21:42                             ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Cyrill Gorcunov @ 2014-05-15 21:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Fri, May 16, 2014 at 12:19:14AM +0400, Cyrill Gorcunov wrote:
> 
> I see what you mean. We're rather targeting on bare x86-64 at the moment
> but compat mode is needed as well (not yet implemented though). I'll take
> a precise look into this area. Thanks!

Indeed, because we were not running 32bit tasks vdso32-setup.c::arch_setup_additional_pages
has never been called. That's the mode we will have to implement one day.

Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables
be kind of immutable constants? If yes, we could calculate where the additional
vma lives without requiring any kind of [vdso] mark in proc/pid/maps output.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15 21:31                           ` Cyrill Gorcunov
@ 2014-05-15 21:42                             ` Andy Lutomirski
  2014-05-15 21:57                               ` Cyrill Gorcunov
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2014-05-15 21:42 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Thu, May 15, 2014 at 2:31 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Fri, May 16, 2014 at 12:19:14AM +0400, Cyrill Gorcunov wrote:
>>
>> I see what you mean. We're rather targeting on bare x86-64 at the moment
>> but compat mode is needed as well (not yet implemented though). I'll take
>> a precise look into this area. Thanks!
>
> Indeed, because we were not running 32bit tasks vdso32-setup.c::arch_setup_additional_pages
> has never been called. That's the mode we will have to implement one day.
>
> Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables
> be kind of immutable constants? If yes, we could calculate where the additional
> vma lives without requiring any kind of [vdso] mark in proc/pid/maps output.

Please don't!

These might, in principle, even vary between tasks on the same system.
 Certainly the relative positions of the vmas will be different
between 3.15 and 3.16, since we need almost my entire cleanup series
to reliably put them into their 3.16 location.  And I intend to change
the number of pages in 3.16 or 3.17.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15 21:42                             ` Andy Lutomirski
@ 2014-05-15 21:57                               ` Cyrill Gorcunov
  2014-05-15 22:15                                 ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Cyrill Gorcunov @ 2014-05-15 21:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Thu, May 15, 2014 at 02:42:48PM -0700, Andy Lutomirski wrote:
> >
> > Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables
> > be kind of immutable constants? If yes, we could calculate where the additional
> > vma lives without requiring any kind of [vdso] mark in proc/pid/maps output.
> 
> Please don't!
> 
> These might, in principle, even vary between tasks on the same system.
>  Certainly the relative positions of the vmas will be different
> between 3.15 and 3.16, since we need almost my entire cleanup series
> to reliably put them into their 3.16 location.  And I intend to change
> the number of pages in 3.16 or 3.17.

There are other ways how to find where additional pages are laying but it
would be great if there a straightforward interface for that (ie some mark
in /proc/pid/maps output).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15 21:57                               ` Cyrill Gorcunov
@ 2014-05-15 22:15                                 ` Andy Lutomirski
  2014-05-16 22:40                                   ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2014-05-15 22:15 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On Thu, May 15, 2014 at 2:57 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Thu, May 15, 2014 at 02:42:48PM -0700, Andy Lutomirski wrote:
>> >
>> > Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables
>> > be kind of immutable constants? If yes, we could calculate where the additional
>> > vma lives without requiring any kind of [vdso] mark in proc/pid/maps output.
>>
>> Please don't!
>>
>> These might, in principle, even vary between tasks on the same system.
>>  Certainly the relative positions of the vmas will be different
>> between 3.15 and 3.16, since we need almost my entire cleanup series
>> to reliably put them into their 3.16 location.  And I intend to change
>> the number of pages in 3.16 or 3.17.
>
> There are other ways how to find where additional pages are laying but it
> would be great if there a straightforward interface for that (ie some mark
> in /proc/pid/maps output).

I'll try to write a patch in time for 3.15.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15 22:15                                 ` Andy Lutomirski
@ 2014-05-16 22:40                                   ` Andy Lutomirski
  2014-05-16 22:56                                     ` H. Peter Anvin
  2014-05-17  6:15                                     ` Cyrill Gorcunov
  0 siblings, 2 replies; 27+ messages in thread
From: Andy Lutomirski @ 2014-05-16 22:40 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML,
	Pavel Emelyanov, H. Peter Anvin

On Thu, May 15, 2014 at 3:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, May 15, 2014 at 2:57 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>> On Thu, May 15, 2014 at 02:42:48PM -0700, Andy Lutomirski wrote:
>>> >
>>> > Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables
>>> > be kind of immutable constants? If yes, we could calculate where the additional
>>> > vma lives without requiring any kind of [vdso] mark in proc/pid/maps output.
>>>
>>> Please don't!
>>>
>>> These might, in principle, even vary between tasks on the same system.
>>>  Certainly the relative positions of the vmas will be different
>>> between 3.15 and 3.16, since we need almost my entire cleanup series
>>> to reliably put them into their 3.16 location.  And I intend to change
>>> the number of pages in 3.16 or 3.17.
>>
>> There are other ways how to find where additional pages are laying but it
>> would be great if there a straightforward interface for that (ie some mark
>> in /proc/pid/maps output).
>
> I'll try to write a patch in time for 3.15.
>

My current draft is here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=vdso/cleanups

On 64-bit userspace, it results in:

7fffa1dfd000-7fffa1dfe000 r-xp 00000000 00:00 0                          [vdso]
7fffa1dfe000-7fffa1e00000 r--p 00000000 00:00 0                          [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
  [vsyscall]

On 32-bit userspace, it results in:

f7748000-f7749000 r-xp 00000000 00:00 0                                  [vdso]
f7749000-f774b000 r--p 00000000 00:00 0                                  [vvar]
ffd94000-ffdb5000 rw-p 00000000 00:00 0                                  [stack]

Is this good for CRIU?  Another approach would be to name both of
these things "vdso", since they are sort of both the vdso, but that
might be a bit confusing -- [vvar] is not static text the way that
[vdso] is.

If I backport this for 3.15 (which might be nasty -- I would argue
that the code change is actually a cleanup, but it's fairly
intrusive), then [vvar] will be *before* [vdso], not after it.  I'd be
very hesitant to name both of them "[vdso]" in that case, since there
is probably code that assumes that the beginning of "[vdso]" is a DSO.

Note that it is *not* safe to blindly read from "[vvar]".  On some
configurations you *will* get SIGBUS if you try to read from some of
the vvar pages.  (That's what started this whole thread.)  Some pages
in "[vvar]" may have strange caching modes, so SIGBUS might not be the
only surprising thing about poking at it.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-16 22:40                                   ` Andy Lutomirski
@ 2014-05-16 22:56                                     ` H. Peter Anvin
  2014-05-17  6:15                                     ` Cyrill Gorcunov
  1 sibling, 0 replies; 27+ messages in thread
From: H. Peter Anvin @ 2014-05-16 22:56 UTC (permalink / raw)
  To: Andy Lutomirski, Cyrill Gorcunov
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov

On 05/16/2014 03:40 PM, Andy Lutomirski wrote:
> 
> My current draft is here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=vdso/cleanups
> 
> On 64-bit userspace, it results in:
> 
> 7fffa1dfd000-7fffa1dfe000 r-xp 00000000 00:00 0                          [vdso]
> 7fffa1dfe000-7fffa1e00000 r--p 00000000 00:00 0                          [vvar]
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
>   [vsyscall]
> 
> On 32-bit userspace, it results in:
> 
> f7748000-f7749000 r-xp 00000000 00:00 0                                  [vdso]
> f7749000-f774b000 r--p 00000000 00:00 0                                  [vvar]
> ffd94000-ffdb5000 rw-p 00000000 00:00 0                                  [stack]
> 
> Is this good for CRIU?  Another approach would be to name both of
> these things "vdso", since they are sort of both the vdso, but that
> might be a bit confusing -- [vvar] is not static text the way that
> [vdso] is.
> 
> If I backport this for 3.15 (which might be nasty -- I would argue
> that the code change is actually a cleanup, but it's fairly
> intrusive), then [vvar] will be *before* [vdso], not after it.  I'd be
> very hesitant to name both of them "[vdso]" in that case, since there
> is probably code that assumes that the beginning of "[vdso]" is a DSO.
> 
> Note that it is *not* safe to blindly read from "[vvar]".  On some
> configurations you *will* get SIGBUS if you try to read from some of
> the vvar pages.  (That's what started this whole thread.)  Some pages
> in "[vvar]" may have strange caching modes, so SIGBUS might not be the
> only surprising thing about poking at it.
> 

mremap() should work on these pages, right?

	-hpa



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-16 22:40                                   ` Andy Lutomirski
  2014-05-16 22:56                                     ` H. Peter Anvin
@ 2014-05-17  6:15                                     ` Cyrill Gorcunov
  1 sibling, 0 replies; 27+ messages in thread
From: Cyrill Gorcunov @ 2014-05-17  6:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML,
	Pavel Emelyanov, H. Peter Anvin

On Fri, May 16, 2014 at 03:40:48PM -0700, Andy Lutomirski wrote:
> >>
> >> There are other ways how to find where additional pages are laying but it
> >> would be great if there a straightforward interface for that (ie some mark
> >> in /proc/pid/maps output).
> >
> > I'll try to write a patch in time for 3.15.
> >
> 
> My current draft is here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=vdso/cleanups
> 
> On 64-bit userspace, it results in:
> 
> 7fffa1dfd000-7fffa1dfe000 r-xp 00000000 00:00 0                          [vdso]
> 7fffa1dfe000-7fffa1e00000 r--p 00000000 00:00 0                          [vvar]
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
> 
> On 32-bit userspace, it results in:
> 
> f7748000-f7749000 r-xp 00000000 00:00 0                                  [vdso]
> f7749000-f774b000 r--p 00000000 00:00 0                                  [vvar]
> ffd94000-ffdb5000 rw-p 00000000 00:00 0                                  [stack]
> 
> Is this good for CRIU?  Another approach would be to name both of
> these things "vdso", since they are sort of both the vdso, but that
> might be a bit confusing -- [vvar] is not static text the way that
> [vdso] is.

Yeah, thanks a lot, Andy, this is more than enough.

> If I backport this for 3.15 (which might be nasty -- I would argue
> that the code change is actually a cleanup, but it's fairly
> intrusive), then [vvar] will be *before* [vdso], not after it.  I'd be
> very hesitant to name both of them "[vdso]" in that case, since there
> is probably code that assumes that the beginning of "[vdso]" is a DSO.
> 
> Note that it is *not* safe to blindly read from "[vvar]".  On some
> configurations you *will* get SIGBUS if you try to read from some of
> the vvar pages.  (That's what started this whole thread.)  Some pages
> in "[vvar]" may have strange caching modes, so SIGBUS might not be the
> only surprising thing about poking at it.

Ouch. Thanks for the note, I'll read new code with more attention and
report the effect it did over criu (prob. on next week).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-15 19:42                   ` Andy Lutomirski
@ 2014-05-19  8:27                     ` Pavel Emelyanov
  2014-05-19  8:40                       ` Cyrill Gorcunov
  0 siblings, 1 reply; 27+ messages in thread
From: Pavel Emelyanov @ 2014-05-19  8:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-mm, Cyrill Gorcunov, LKML, Sasha Levin, Andrew Morton, Dave Jones

On 05/15/2014 11:42 PM, Andy Lutomirski wrote:
> On May 14, 2014 8:36 PM, "Pavel Emelyanov" <xemul@parallels.com> wrote:
>>
>> On 05/15/2014 02:23 AM, Andy Lutomirski wrote:
>>> On Wed, May 14, 2014 at 3:11 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>>>> On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote:
>>>>> On Wed, May 14, 2014 at 2:31 PM, Andrew Morton
>>>>> <akpm@linux-foundation.org> wrote:
>>>>>> On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote:
>>>>>>
>>>>>>>> In my linux-next all that code got deleted by Andy's "x86, vdso:
>>>>>>>> Reimplement vdso.so preparation in build-time C" anyway.  What kernel
>>>>>>>> were you looking at?
>>>>>>>
>>>>>>> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 .
>>>>>>>
>>>>>>> I don't see Andy's patch removing that code either.
>>>>>>
>>>>>> ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into
>>>>>> arch/x86/vdso/vma.c.
>>>>>>
>>>>>> Maybe you managed to take a fault against the symbol area between the
>>>>>> _install_special_mapping() and the remap_pfn_range() call, but mmap_sem
>>>>>> should prevent that.
>>>>>>
>>>>>> Or the remap_pfn_range() call never happened.  Should map_vdso() be
>>>>>> running _install_special_mapping() at all if
>>>>>> image->sym_vvar_page==NULL?
>>>>>
>>>>> I'm confused: are we talking about 3.15-rcsomething or linux-next?
>>>>> That code changed.
>>>>>
>>>>> Would this all make more sense if there were just a single vma in
>>>>> here?  cc: Pavel and Cyrill, who might have to deal with this stuff in
>>>>> CRIU
>>>>
>>>> Well, for criu we've not modified any vdso kernel's code (except
>>>> setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping).
>>>> And never experienced problems Sasha points. Looks like indeed in
>>>> -next code is pretty different from mainline one. To figure out
>>>> why I need to fetch -next branch and get some research. I would
>>>> try to do that tomorrow (still hoping someone more experienced
>>>> in mm system would beat me on that).
>>>
>>> I can summarize:
>>>
>>> On 3.14 and before, the vdso is just a bunch of ELF headers and
>>> executable data.  When executed by 64-bit binaries, it reads from the
>>> fixmap to do its thing.  That is, it reads from kernel addresses that
>>> don't have vmas.  When executed by 32-bit binaries, it doesn't read
>>> anything, since there was no 32-bit timing code.
>>>
>>> On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
>>> a separate vma containing two pages worth of time-varying read-only
>>> data.  The vdso reads those pages using PIC references.
>>>
>>> On linux-next, all vdsos work the same way.  There are two vmas.  The
>>> first vma is executable text, which can be poked at by ptrace, etc
>>> normally.  The second vma contains time-varying state, should not
>>> allow poking, and is accessed by PIC references.
>>
>> Is this 2nd vma seen in /proc/pid/maps? And if so, is it marked somehow?
> 
> It is in maps, and it's not marked.  I can write a patch to change
> that.  I imagine it shouldn't be called [vdso], though.

That would be great.

>>
>>> What does CRIU do to restore the vdso?  Will 3.15 and/or linux-next
>>> need to make some concession for CRIU?
>>
>> We detect the vdso by "[vdso]" mark in proc at dump time and mark it in
>> the images. At restore time we check that vdso symbols layout hasn't changed
>> and just remap it in proper location.
>>
>> If this remains the same in -next, then we're fine :)
> 
> If you just remap the vdso, you'll crash.
> 
> This is the case in 3.15, too, for 32-bit apps, anyway.
> 
> What happens if you try to checkpoint a program that's in the vdso or,
> worse, in a signal frame with the vdso on the stack?

Nothing good, unfortunately :( And this is one of the things we're investigating.
Cyrill can shed more light on it, as he's the one in charge.

> --Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: mm: NULL ptr deref handling mmaping of special mappings
  2014-05-19  8:27                     ` Pavel Emelyanov
@ 2014-05-19  8:40                       ` Cyrill Gorcunov
  0 siblings, 0 replies; 27+ messages in thread
From: Cyrill Gorcunov @ 2014-05-19  8:40 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andy Lutomirski, linux-mm, LKML, Sasha Levin, Andrew Morton, Dave Jones

On Mon, May 19, 2014 at 12:27:29PM +0400, Pavel Emelyanov wrote:
> > 
> > What happens if you try to checkpoint a program that's in the vdso or,
> > worse, in a signal frame with the vdso on the stack?
> 
> Nothing good, unfortunately :( And this is one of the things we're investigating.
> Cyrill can shed more light on it, as he's the one in charge.

vdso on stack should not be a big deal, because we keep original vdso proxy.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2014-05-19  8:40 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-14 15:55 mm: NULL ptr deref handling mmaping of special mappings Sasha Levin
2014-05-14 20:23 ` Andrew Morton
2014-05-14 20:41   ` Sasha Levin
2014-05-14 21:03     ` Andrew Morton
2014-05-14 21:11       ` Sasha Levin
2014-05-14 21:31         ` Andrew Morton
2014-05-14 21:33           ` Andy Lutomirski
2014-05-14 22:11             ` Cyrill Gorcunov
2014-05-14 22:23               ` Andy Lutomirski
2014-05-15  2:36                 ` Pavel Emelyanov
2014-05-15 19:42                   ` Andy Lutomirski
2014-05-19  8:27                     ` Pavel Emelyanov
2014-05-19  8:40                       ` Cyrill Gorcunov
2014-05-15  8:45                 ` Cyrill Gorcunov
2014-05-15 19:46                   ` Andy Lutomirski
2014-05-15 19:53                     ` Cyrill Gorcunov
2014-05-15 19:59                       ` Andy Lutomirski
2014-05-15 20:19                         ` Cyrill Gorcunov
2014-05-15 21:31                           ` Cyrill Gorcunov
2014-05-15 21:42                             ` Andy Lutomirski
2014-05-15 21:57                               ` Cyrill Gorcunov
2014-05-15 22:15                                 ` Andy Lutomirski
2014-05-16 22:40                                   ` Andy Lutomirski
2014-05-16 22:56                                     ` H. Peter Anvin
2014-05-17  6:15                                     ` Cyrill Gorcunov
2014-05-14 22:51           ` Andy Lutomirski
2014-05-14 21:26       ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).