* mm: NULL ptr deref handling mmaping of special mappings @ 2014-05-14 15:55 Sasha Levin 2014-05-14 20:23 ` Andrew Morton 0 siblings, 1 reply; 27+ messages in thread From: Sasha Levin @ 2014-05-14 15:55 UTC (permalink / raw) To: linux-mm; +Cc: Andrew Morton, Dave Jones, LKML Hi all, While fuzzing with trinity inside a KVM tools guest running the latest -next kernel I've stumbled on the following spew: [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at (null) [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961) [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0 [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 1634.972913] Dumping ftrace buffer: [ 1634.975493] (ftrace buffer empty) [ 1634.977470] Modules linked in: [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461 [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000 [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961) [ 1634.977513] RSP: 0018:ffff88033351faf8 EFLAGS: 00010202 [ 1634.977513] RAX: 0000000000000000 RBX: ffff88033351fbb0 RCX: 0000000000000028 [ 1634.977513] RDX: 0000000000000001 RSI: ffff88033351fb38 RDI: ffff88006d28b600 [ 1634.977513] RBP: ffff88033351fb18 R08: ffff88033351fbb0 R09: 0000000000000000 [ 1634.977513] R10: ffff88001bcca690 R11: 0000000000000000 R12: ffff88001bcca690 [ 1634.977513] R13: ffff880000000748 R14: 0000000000000028 R15: ffff88006d28b600 [ 1634.977513] FS: 00007f79da4b8700(0000) GS:ffff8801b4c00000(0000) knlGS:0000000000000000 [ 1634.977513] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1634.977513] CR2: 0000000000000000 CR3: 00000003334fb000 CR4: 00000000000006a0 [ 1634.977513] Stack: [ 1634.977513] ffff88033351fb08 00000000000005d1 00000000000005d1 ffff88033351fbb0 [ 1634.977513] ffff88033351fb78 ffffffffac2b83a9 ffff88033351fc28 ffffffffac1cb679 [ 1634.977513] 0000000000000028 00000007f79da4e9 00007f79da4e9000 0000000000000000 [ 1634.977513] Call Trace: [ 1634.977513] __do_fault (mm/memory.c:2703) [ 1634.977513] ? __lock_acquire (kernel/locking/lockdep.c:3189) [ 1634.977513] do_read_fault.isra.40 (mm/memory.c:2883) [ 1634.977513] ? get_parent_ip (kernel/sched/core.c:2519) [ 1634.977513] ? get_parent_ip (kernel/sched/core.c:2519) [ 1634.977513] ? __mem_cgroup_count_vm_event (include/linux/rcupdate.h:391 include/linux/rcupdate.h:893 mm/memcontrol.c:1302) [ 1634.977513] ? get_parent_ip (kernel/sched/core.c:2519) [ 1634.977513] __handle_mm_fault (mm/memory.c:3021 mm/memory.c:3182 mm/memory.c:3306) [ 1634.977513] ? __const_udelay (arch/x86/lib/delay.c:126) [ 1634.977513] ? __rcu_read_unlock (kernel/rcu/update.c:97) [ 1634.977513] handle_mm_fault (mm/memory.c:3329) [ 1634.977513] __do_page_fault (arch/x86/mm/fault.c:1224) [ 1634.977513] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86) [ 1634.977513] ? sched_clock (arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:305) [ 1634.977513] ? sched_clock_local (kernel/sched/clock.c:214) [ 1634.977513] ? get_parent_ip (kernel/sched/core.c:2519) [ 1634.977513] ? context_tracking_user_exit (kernel/context_tracking.c:182) [ 1634.977513] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) [ 1634.977513] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2)) [ 1634.977513] do_page_fault (arch/x86/mm/fault.c:1277 include/linux/jump_label.h:105 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1278) [ 1634.977513] do_async_page_fault (arch/x86/kernel/kvm.c:264) [ 1634.977513] async_page_fault (arch/x86/kernel/entry_64.S:1552) [ 1634.977513] ? copy_user_generic_unrolled (arch/x86/lib/copy_user_64.S:153) [ 1634.977513] ? SyS_add_key (security/keys/keyctl.c:109 security/keys/keyctl.c:62) [ 1634.977513] tracesys (arch/x86/kernel/entry_64.S:746) [ 1634.977513] Code: 1f 40 00 66 66 66 66 90 55 48 89 e5 53 48 83 ec 18 48 8b 56 08 48 8b 87 a8 00 00 00 48 2b 97 98 00 00 00 74 1e 66 0f 1f 44 00 00 <48> 83 38 00 74 62 48 83 c0 08 48 83 ea 01 75 f0 0f 1f 84 00 00 [ 1634.977513] RIP special_mapping_fault (mm/mmap.c:2961) [ 1634.977513] RSP <ffff88033351faf8> [ 1634.977513] CR2: 0000000000000000 Thanks, Sasha ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 15:55 mm: NULL ptr deref handling mmaping of special mappings Sasha Levin @ 2014-05-14 20:23 ` Andrew Morton 2014-05-14 20:41 ` Sasha Levin 0 siblings, 1 reply; 27+ messages in thread From: Andrew Morton @ 2014-05-14 20:23 UTC (permalink / raw) To: Sasha Levin; +Cc: linux-mm, Dave Jones, LKML On Wed, 14 May 2014 11:55:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > Hi all, > > While fuzzing with trinity inside a KVM tools guest running the latest -next > kernel I've stumbled on the following spew: > > [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at (null) > [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961) > [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0 > [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 1634.972913] Dumping ftrace buffer: > [ 1634.975493] (ftrace buffer empty) > [ 1634.977470] Modules linked in: > [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461 > [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000 > [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961) Somebody's gone and broken the x86 oops output. It used to say "special_mapping_fault+0x30/0x120" but the offset info has now disappeared. That was useful for guesstimating whereabouts in the function it died. The line number isn't very useful as it's not possible (or at least, not convenient) for others to reliably reproduce your kernel. <scrabbles with git for a while> : static int special_mapping_fault(struct vm_area_struct *vma, : struct vm_fault *vmf) : { : pgoff_t pgoff; : struct page **pages; : : /* : * special mappings have no vm_file, and in that case, the mm : * uses vm_pgoff internally. So we have to subtract it from here. : * We are allowed to do this because we are the mm; do not copy : * this code into drivers! : */ : pgoff = vmf->pgoff - vma->vm_pgoff; : : for (pages = vma->vm_private_data; pgoff && *pages; ++pages) : pgoff--; : : if (*pages) { : struct page *page = *pages; : get_page(page); : vmf->page = page; : return 0; : } : : return VM_FAULT_SIGBUS; : } OK so it might be the "if (*pages)". So vma->vm_private_data was NULL and pgoff was zero. As usual, I can't imagine what race would cause that :( ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 20:23 ` Andrew Morton @ 2014-05-14 20:41 ` Sasha Levin 2014-05-14 21:03 ` Andrew Morton 0 siblings, 1 reply; 27+ messages in thread From: Sasha Levin @ 2014-05-14 20:41 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-mm, Dave Jones, LKML On 05/14/2014 04:23 PM, Andrew Morton wrote: > On Wed, 14 May 2014 11:55:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > >> Hi all, >> >> While fuzzing with trinity inside a KVM tools guest running the latest -next >> kernel I've stumbled on the following spew: >> >> [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at (null) >> [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961) >> [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0 >> [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> [ 1634.972913] Dumping ftrace buffer: >> [ 1634.975493] (ftrace buffer empty) >> [ 1634.977470] Modules linked in: >> [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461 >> [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000 >> [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961) > > Somebody's gone and broken the x86 oops output. It used to say > "special_mapping_fault+0x30/0x120" but the offset info has now > disappeared. That was useful for guesstimating whereabouts in the > function it died. I'm the one who "broke" the oops output, but I thought I'm helping people read that output instead of making it harder... What happened before is that due to my rather complex .config, the offsets didn't make sense to anyone who didn't build the kernel with my .config, so I had to repeatedly send it out to folks who attempted to get basic things like line numbers. > The line number isn't very useful as it's not possible (or at least, > not convenient) for others to reliably reproduce your kernel. I don't understand that part. I'm usually stating in the beginning of my mails that I run my testing on the latest -next kernel. And indeed if you look at today's -next, that line number would point to: for (pages = vma->vm_private_data; pgoff && *pages; ++pages) <=== HERE pgoff--; So I'm not sure how replacing the offset with line numbers is making things worse? previously offsets were useless for people who tried to debug these spews so that's why I switched it to line numbers in the first place. > <scrabbles with git for a while> > > : static int special_mapping_fault(struct vm_area_struct *vma, > : struct vm_fault *vmf) > : { > : pgoff_t pgoff; > : struct page **pages; > : > : /* > : * special mappings have no vm_file, and in that case, the mm > : * uses vm_pgoff internally. So we have to subtract it from here. > : * We are allowed to do this because we are the mm; do not copy > : * this code into drivers! > : */ > : pgoff = vmf->pgoff - vma->vm_pgoff; > : > : for (pages = vma->vm_private_data; pgoff && *pages; ++pages) > : pgoff--; > : > : if (*pages) { > : struct page *page = *pages; > : get_page(page); > : vmf->page = page; > : return 0; > : } > : > : return VM_FAULT_SIGBUS; > : } > > OK so it might be the "if (*pages)". So vma->vm_private_data was NULL > and pgoff was zero. As usual, I can't imagine what race would cause > that :( Yup, it's the *pages part in the 'for' loop above that. I did find the following in the vdso code: vma = _install_special_mapping(mm, addr + image->size, image->sym_end_mapping - image->size, VM_READ, NULL); Which installs a mapping with a NULL ptr for pages (if I understand that correctly), but that code has been there for a while now. Thanks, Sasha been there ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 20:41 ` Sasha Levin @ 2014-05-14 21:03 ` Andrew Morton 2014-05-14 21:11 ` Sasha Levin 2014-05-14 21:26 ` Andy Lutomirski 0 siblings, 2 replies; 27+ messages in thread From: Andrew Morton @ 2014-05-14 21:03 UTC (permalink / raw) To: Sasha Levin; +Cc: linux-mm, Dave Jones, LKML, Andy Lutomirski On Wed, 14 May 2014 16:41:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > On 05/14/2014 04:23 PM, Andrew Morton wrote: > > On Wed, 14 May 2014 11:55:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > > > >> Hi all, > >> > >> While fuzzing with trinity inside a KVM tools guest running the latest -next > >> kernel I've stumbled on the following spew: > >> > >> [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at (null) > >> [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961) > >> [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0 > >> [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > >> [ 1634.972913] Dumping ftrace buffer: > >> [ 1634.975493] (ftrace buffer empty) > >> [ 1634.977470] Modules linked in: > >> [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461 > >> [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000 > >> [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961) > > > > Somebody's gone and broken the x86 oops output. It used to say > > "special_mapping_fault+0x30/0x120" but the offset info has now > > disappeared. That was useful for guesstimating whereabouts in the > > function it died. > > I'm the one who "broke" the oops output, but I thought I'm helping people > read that output instead of making it harder... > > What happened before is that due to my rather complex .config, the offsets > didn't make sense to anyone who didn't build the kernel with my .config, > so I had to repeatedly send it out to folks who attempted to get basic > things like line numbers. > > > The line number isn't very useful as it's not possible (or at least, > > not convenient) for others to reliably reproduce your kernel. > > I don't understand that part. I'm usually stating in the beginning of my > mails that I run my testing on the latest -next kernel. Your "latest next kernel" apparently differes from mine ;( It would be useful if you could just quote the +/-5 lines, perhaps? > And indeed if > you look at today's -next, that line number would point to: > > for (pages = vma->vm_private_data; pgoff && *pages; ++pages) <=== HERE > pgoff--; > > So I'm not sure how replacing the offset with line numbers is making things > worse? previously offsets were useless for people who tried to debug these > spews so that's why I switched it to line numbers in the first place. > > > <scrabbles with git for a while> > > > > : static int special_mapping_fault(struct vm_area_struct *vma, > > : struct vm_fault *vmf) > > : { > > : pgoff_t pgoff; > > : struct page **pages; > > : > > : /* > > : * special mappings have no vm_file, and in that case, the mm > > : * uses vm_pgoff internally. So we have to subtract it from here. > > : * We are allowed to do this because we are the mm; do not copy > > : * this code into drivers! > > : */ > > : pgoff = vmf->pgoff - vma->vm_pgoff; > > : > > : for (pages = vma->vm_private_data; pgoff && *pages; ++pages) > > : pgoff--; > > : > > : if (*pages) { > > : struct page *page = *pages; > > : get_page(page); > > : vmf->page = page; > > : return 0; > > : } > > : > > : return VM_FAULT_SIGBUS; > > : } > > > > OK so it might be the "if (*pages)". So vma->vm_private_data was NULL > > and pgoff was zero. As usual, I can't imagine what race would cause > > that :( > > Yup, it's the *pages part in the 'for' loop above that. I did find the > following in the vdso code: > > vma = _install_special_mapping(mm, > addr + image->size, > image->sym_end_mapping - image->size, > VM_READ, > NULL); > > Which installs a mapping with a NULL ptr for pages (if I understand that > correctly), but that code has been there for a while now. Well that's weird. I don't see anything which permits that. Maybe nobody faulted against that address before? It's unclear what that code's actually doing and nobody bothered commenting it of course. Maybe it's installing a guard page? In my linux-next all that code got deleted by Andy's "x86, vdso: Reimplement vdso.so preparation in build-time C" anyway. What kernel were you looking at? Andy, are you able to shed some light on why arch_setup_additional_pages() is (or was) passing a NULL into _install_special_mapping()? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 21:03 ` Andrew Morton @ 2014-05-14 21:11 ` Sasha Levin 2014-05-14 21:31 ` Andrew Morton 2014-05-14 21:26 ` Andy Lutomirski 1 sibling, 1 reply; 27+ messages in thread From: Sasha Levin @ 2014-05-14 21:11 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-mm, Dave Jones, LKML, Andy Lutomirski On 05/14/2014 05:03 PM, Andrew Morton wrote: > On Wed, 14 May 2014 16:41:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > >> On 05/14/2014 04:23 PM, Andrew Morton wrote: >>> On Wed, 14 May 2014 11:55:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: >>> >>>> Hi all, >>>> >>>> While fuzzing with trinity inside a KVM tools guest running the latest -next >>>> kernel I've stumbled on the following spew: >>>> >>>> [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at (null) >>>> [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961) >>>> [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0 >>>> [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >>>> [ 1634.972913] Dumping ftrace buffer: >>>> [ 1634.975493] (ftrace buffer empty) >>>> [ 1634.977470] Modules linked in: >>>> [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461 >>>> [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000 >>>> [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961) >>> >>> Somebody's gone and broken the x86 oops output. It used to say >>> "special_mapping_fault+0x30/0x120" but the offset info has now >>> disappeared. That was useful for guesstimating whereabouts in the >>> function it died. >> >> I'm the one who "broke" the oops output, but I thought I'm helping people >> read that output instead of making it harder... >> >> What happened before is that due to my rather complex .config, the offsets >> didn't make sense to anyone who didn't build the kernel with my .config, >> so I had to repeatedly send it out to folks who attempted to get basic >> things like line numbers. >> >>> The line number isn't very useful as it's not possible (or at least, >>> not convenient) for others to reliably reproduce your kernel. >> >> I don't understand that part. I'm usually stating in the beginning of my >> mails that I run my testing on the latest -next kernel. > > Your "latest next kernel" apparently differes from mine ;( It would be > useful if you could just quote the +/-5 lines, perhaps? Oh, I see what happened. I have the remap_file_pages() get_file/fput fix merged in which modified line count. Yup, I'll start quoting the line themselves as well. >> And indeed if >> you look at today's -next, that line number would point to: >> >> for (pages = vma->vm_private_data; pgoff && *pages; ++pages) <=== HERE >> pgoff--; >> >> So I'm not sure how replacing the offset with line numbers is making things >> worse? previously offsets were useless for people who tried to debug these >> spews so that's why I switched it to line numbers in the first place. >> >>> <scrabbles with git for a while> >>> >>> : static int special_mapping_fault(struct vm_area_struct *vma, >>> : struct vm_fault *vmf) >>> : { >>> : pgoff_t pgoff; >>> : struct page **pages; >>> : >>> : /* >>> : * special mappings have no vm_file, and in that case, the mm >>> : * uses vm_pgoff internally. So we have to subtract it from here. >>> : * We are allowed to do this because we are the mm; do not copy >>> : * this code into drivers! >>> : */ >>> : pgoff = vmf->pgoff - vma->vm_pgoff; >>> : >>> : for (pages = vma->vm_private_data; pgoff && *pages; ++pages) >>> : pgoff--; >>> : >>> : if (*pages) { >>> : struct page *page = *pages; >>> : get_page(page); >>> : vmf->page = page; >>> : return 0; >>> : } >>> : >>> : return VM_FAULT_SIGBUS; >>> : } >>> >>> OK so it might be the "if (*pages)". So vma->vm_private_data was NULL >>> and pgoff was zero. As usual, I can't imagine what race would cause >>> that :( >> >> Yup, it's the *pages part in the 'for' loop above that. I did find the >> following in the vdso code: >> >> vma = _install_special_mapping(mm, >> addr + image->size, >> image->sym_end_mapping - image->size, >> VM_READ, >> NULL); >> >> Which installs a mapping with a NULL ptr for pages (if I understand that >> correctly), but that code has been there for a while now. > > Well that's weird. I don't see anything which permits that. Maybe > nobody faulted against that address before? > > It's unclear what that code's actually doing and nobody bothered > commenting it of course. Maybe it's installing a guard page? > > In my linux-next all that code got deleted by Andy's "x86, vdso: > Reimplement vdso.so preparation in build-time C" anyway. What kernel > were you looking at? Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 . I don't see Andy's patch removing that code either. I'm running next-20140514... Thanks, Sasha ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 21:11 ` Sasha Levin @ 2014-05-14 21:31 ` Andrew Morton 2014-05-14 21:33 ` Andy Lutomirski 2014-05-14 22:51 ` Andy Lutomirski 0 siblings, 2 replies; 27+ messages in thread From: Andrew Morton @ 2014-05-14 21:31 UTC (permalink / raw) To: Sasha Levin; +Cc: linux-mm, Dave Jones, LKML, Andy Lutomirski On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > > In my linux-next all that code got deleted by Andy's "x86, vdso: > > Reimplement vdso.so preparation in build-time C" anyway. What kernel > > were you looking at? > > Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 . > > I don't see Andy's patch removing that code either. ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into arch/x86/vdso/vma.c. Maybe you managed to take a fault against the symbol area between the _install_special_mapping() and the remap_pfn_range() call, but mmap_sem should prevent that. Or the remap_pfn_range() call never happened. Should map_vdso() be running _install_special_mapping() at all if image->sym_vvar_page==NULL? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 21:31 ` Andrew Morton @ 2014-05-14 21:33 ` Andy Lutomirski 2014-05-14 22:11 ` Cyrill Gorcunov 2014-05-14 22:51 ` Andy Lutomirski 1 sibling, 1 reply; 27+ messages in thread From: Andy Lutomirski @ 2014-05-14 21:33 UTC (permalink / raw) To: Andrew Morton Cc: Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov, Cyrill Gorcunov On Wed, May 14, 2014 at 2:31 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > >> > In my linux-next all that code got deleted by Andy's "x86, vdso: >> > Reimplement vdso.so preparation in build-time C" anyway. What kernel >> > were you looking at? >> >> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 . >> >> I don't see Andy's patch removing that code either. > > ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into > arch/x86/vdso/vma.c. > > Maybe you managed to take a fault against the symbol area between the > _install_special_mapping() and the remap_pfn_range() call, but mmap_sem > should prevent that. > > Or the remap_pfn_range() call never happened. Should map_vdso() be > running _install_special_mapping() at all if > image->sym_vvar_page==NULL? I'm confused: are we talking about 3.15-rcsomething or linux-next? That code changed. Would this all make more sense if there were just a single vma in here? cc: Pavel and Cyrill, who might have to deal with this stuff in CRIU --Andy -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 21:33 ` Andy Lutomirski @ 2014-05-14 22:11 ` Cyrill Gorcunov 2014-05-14 22:23 ` Andy Lutomirski 0 siblings, 1 reply; 27+ messages in thread From: Cyrill Gorcunov @ 2014-05-14 22:11 UTC (permalink / raw) To: Andy Lutomirski Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote: > On Wed, May 14, 2014 at 2:31 PM, Andrew Morton > <akpm@linux-foundation.org> wrote: > > On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > > > >> > In my linux-next all that code got deleted by Andy's "x86, vdso: > >> > Reimplement vdso.so preparation in build-time C" anyway. What kernel > >> > were you looking at? > >> > >> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 . > >> > >> I don't see Andy's patch removing that code either. > > > > ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into > > arch/x86/vdso/vma.c. > > > > Maybe you managed to take a fault against the symbol area between the > > _install_special_mapping() and the remap_pfn_range() call, but mmap_sem > > should prevent that. > > > > Or the remap_pfn_range() call never happened. Should map_vdso() be > > running _install_special_mapping() at all if > > image->sym_vvar_page==NULL? > > I'm confused: are we talking about 3.15-rcsomething or linux-next? > That code changed. > > Would this all make more sense if there were just a single vma in > here? cc: Pavel and Cyrill, who might have to deal with this stuff in > CRIU Well, for criu we've not modified any vdso kernel's code (except setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping). And never experienced problems Sasha points. Looks like indeed in -next code is pretty different from mainline one. To figure out why I need to fetch -next branch and get some research. I would try to do that tomorrow (still hoping someone more experienced in mm system would beat me on that). ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 22:11 ` Cyrill Gorcunov @ 2014-05-14 22:23 ` Andy Lutomirski 2014-05-15 2:36 ` Pavel Emelyanov 2014-05-15 8:45 ` Cyrill Gorcunov 0 siblings, 2 replies; 27+ messages in thread From: Andy Lutomirski @ 2014-05-14 22:23 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Wed, May 14, 2014 at 3:11 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: > On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote: >> On Wed, May 14, 2014 at 2:31 PM, Andrew Morton >> <akpm@linux-foundation.org> wrote: >> > On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: >> > >> >> > In my linux-next all that code got deleted by Andy's "x86, vdso: >> >> > Reimplement vdso.so preparation in build-time C" anyway. What kernel >> >> > were you looking at? >> >> >> >> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 . >> >> >> >> I don't see Andy's patch removing that code either. >> > >> > ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into >> > arch/x86/vdso/vma.c. >> > >> > Maybe you managed to take a fault against the symbol area between the >> > _install_special_mapping() and the remap_pfn_range() call, but mmap_sem >> > should prevent that. >> > >> > Or the remap_pfn_range() call never happened. Should map_vdso() be >> > running _install_special_mapping() at all if >> > image->sym_vvar_page==NULL? >> >> I'm confused: are we talking about 3.15-rcsomething or linux-next? >> That code changed. >> >> Would this all make more sense if there were just a single vma in >> here? cc: Pavel and Cyrill, who might have to deal with this stuff in >> CRIU > > Well, for criu we've not modified any vdso kernel's code (except > setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping). > And never experienced problems Sasha points. Looks like indeed in > -next code is pretty different from mainline one. To figure out > why I need to fetch -next branch and get some research. I would > try to do that tomorrow (still hoping someone more experienced > in mm system would beat me on that). I can summarize: On 3.14 and before, the vdso is just a bunch of ELF headers and executable data. When executed by 64-bit binaries, it reads from the fixmap to do its thing. That is, it reads from kernel addresses that don't have vmas. When executed by 32-bit binaries, it doesn't read anything, since there was no 32-bit timing code. On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by a separate vma containing two pages worth of time-varying read-only data. The vdso reads those pages using PIC references. On linux-next, all vdsos work the same way. There are two vmas. The first vma is executable text, which can be poked at by ptrace, etc normally. The second vma contains time-varying state, should not allow poking, and is accessed by PIC references. What does CRIU do to restore the vdso? Will 3.15 and/or linux-next need to make some concession for CRIU? --Andy ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 22:23 ` Andy Lutomirski @ 2014-05-15 2:36 ` Pavel Emelyanov 2014-05-15 19:42 ` Andy Lutomirski 2014-05-15 8:45 ` Cyrill Gorcunov 1 sibling, 1 reply; 27+ messages in thread From: Pavel Emelyanov @ 2014-05-15 2:36 UTC (permalink / raw) To: Andy Lutomirski Cc: Cyrill Gorcunov, Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML On 05/15/2014 02:23 AM, Andy Lutomirski wrote: > On Wed, May 14, 2014 at 3:11 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: >> On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote: >>> On Wed, May 14, 2014 at 2:31 PM, Andrew Morton >>> <akpm@linux-foundation.org> wrote: >>>> On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: >>>> >>>>>> In my linux-next all that code got deleted by Andy's "x86, vdso: >>>>>> Reimplement vdso.so preparation in build-time C" anyway. What kernel >>>>>> were you looking at? >>>>> >>>>> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 . >>>>> >>>>> I don't see Andy's patch removing that code either. >>>> >>>> ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into >>>> arch/x86/vdso/vma.c. >>>> >>>> Maybe you managed to take a fault against the symbol area between the >>>> _install_special_mapping() and the remap_pfn_range() call, but mmap_sem >>>> should prevent that. >>>> >>>> Or the remap_pfn_range() call never happened. Should map_vdso() be >>>> running _install_special_mapping() at all if >>>> image->sym_vvar_page==NULL? >>> >>> I'm confused: are we talking about 3.15-rcsomething or linux-next? >>> That code changed. >>> >>> Would this all make more sense if there were just a single vma in >>> here? cc: Pavel and Cyrill, who might have to deal with this stuff in >>> CRIU >> >> Well, for criu we've not modified any vdso kernel's code (except >> setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping). >> And never experienced problems Sasha points. Looks like indeed in >> -next code is pretty different from mainline one. To figure out >> why I need to fetch -next branch and get some research. I would >> try to do that tomorrow (still hoping someone more experienced >> in mm system would beat me on that). > > I can summarize: > > On 3.14 and before, the vdso is just a bunch of ELF headers and > executable data. When executed by 64-bit binaries, it reads from the > fixmap to do its thing. That is, it reads from kernel addresses that > don't have vmas. When executed by 32-bit binaries, it doesn't read > anything, since there was no 32-bit timing code. > > On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by > a separate vma containing two pages worth of time-varying read-only > data. The vdso reads those pages using PIC references. > > On linux-next, all vdsos work the same way. There are two vmas. The > first vma is executable text, which can be poked at by ptrace, etc > normally. The second vma contains time-varying state, should not > allow poking, and is accessed by PIC references. Is this 2nd vma seen in /proc/pid/maps? And if so, is it marked somehow? > What does CRIU do to restore the vdso? Will 3.15 and/or linux-next > need to make some concession for CRIU? We detect the vdso by "[vdso]" mark in proc at dump time and mark it in the images. At restore time we check that vdso symbols layout hasn't changed and just remap it in proper location. If this remains the same in -next, then we're fine :) Thanks, Pavel ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 2:36 ` Pavel Emelyanov @ 2014-05-15 19:42 ` Andy Lutomirski 2014-05-19 8:27 ` Pavel Emelyanov 0 siblings, 1 reply; 27+ messages in thread From: Andy Lutomirski @ 2014-05-15 19:42 UTC (permalink / raw) To: Pavel Emelyanov Cc: linux-mm, Cyrill Gorcunov, LKML, Sasha Levin, Andrew Morton, Dave Jones On May 14, 2014 8:36 PM, "Pavel Emelyanov" <xemul@parallels.com> wrote: > > On 05/15/2014 02:23 AM, Andy Lutomirski wrote: > > On Wed, May 14, 2014 at 3:11 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: > >> On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote: > >>> On Wed, May 14, 2014 at 2:31 PM, Andrew Morton > >>> <akpm@linux-foundation.org> wrote: > >>>> On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > >>>> > >>>>>> In my linux-next all that code got deleted by Andy's "x86, vdso: > >>>>>> Reimplement vdso.so preparation in build-time C" anyway. What kernel > >>>>>> were you looking at? > >>>>> > >>>>> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 . > >>>>> > >>>>> I don't see Andy's patch removing that code either. > >>>> > >>>> ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into > >>>> arch/x86/vdso/vma.c. > >>>> > >>>> Maybe you managed to take a fault against the symbol area between the > >>>> _install_special_mapping() and the remap_pfn_range() call, but mmap_sem > >>>> should prevent that. > >>>> > >>>> Or the remap_pfn_range() call never happened. Should map_vdso() be > >>>> running _install_special_mapping() at all if > >>>> image->sym_vvar_page==NULL? > >>> > >>> I'm confused: are we talking about 3.15-rcsomething or linux-next? > >>> That code changed. > >>> > >>> Would this all make more sense if there were just a single vma in > >>> here? cc: Pavel and Cyrill, who might have to deal with this stuff in > >>> CRIU > >> > >> Well, for criu we've not modified any vdso kernel's code (except > >> setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping). > >> And never experienced problems Sasha points. Looks like indeed in > >> -next code is pretty different from mainline one. To figure out > >> why I need to fetch -next branch and get some research. I would > >> try to do that tomorrow (still hoping someone more experienced > >> in mm system would beat me on that). > > > > I can summarize: > > > > On 3.14 and before, the vdso is just a bunch of ELF headers and > > executable data. When executed by 64-bit binaries, it reads from the > > fixmap to do its thing. That is, it reads from kernel addresses that > > don't have vmas. When executed by 32-bit binaries, it doesn't read > > anything, since there was no 32-bit timing code. > > > > On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by > > a separate vma containing two pages worth of time-varying read-only > > data. The vdso reads those pages using PIC references. > > > > On linux-next, all vdsos work the same way. There are two vmas. The > > first vma is executable text, which can be poked at by ptrace, etc > > normally. The second vma contains time-varying state, should not > > allow poking, and is accessed by PIC references. > > Is this 2nd vma seen in /proc/pid/maps? And if so, is it marked somehow? It is in maps, and it's not marked. I can write a patch to change that. I imagine it shouldn't be called [vdso], though. > > > What does CRIU do to restore the vdso? Will 3.15 and/or linux-next > > need to make some concession for CRIU? > > We detect the vdso by "[vdso]" mark in proc at dump time and mark it in > the images. At restore time we check that vdso symbols layout hasn't changed > and just remap it in proper location. > > If this remains the same in -next, then we're fine :) If you just remap the vdso, you'll crash. This is the case in 3.15, too, for 32-bit apps, anyway. What happens if you try to checkpoint a program that's in the vdso or, worse, in a signal frame with the vdso on the stack? --Andy > > Thanks, > Pavel ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 19:42 ` Andy Lutomirski @ 2014-05-19 8:27 ` Pavel Emelyanov 2014-05-19 8:40 ` Cyrill Gorcunov 0 siblings, 1 reply; 27+ messages in thread From: Pavel Emelyanov @ 2014-05-19 8:27 UTC (permalink / raw) To: Andy Lutomirski Cc: linux-mm, Cyrill Gorcunov, LKML, Sasha Levin, Andrew Morton, Dave Jones On 05/15/2014 11:42 PM, Andy Lutomirski wrote: > On May 14, 2014 8:36 PM, "Pavel Emelyanov" <xemul@parallels.com> wrote: >> >> On 05/15/2014 02:23 AM, Andy Lutomirski wrote: >>> On Wed, May 14, 2014 at 3:11 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: >>>> On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote: >>>>> On Wed, May 14, 2014 at 2:31 PM, Andrew Morton >>>>> <akpm@linux-foundation.org> wrote: >>>>>> On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: >>>>>> >>>>>>>> In my linux-next all that code got deleted by Andy's "x86, vdso: >>>>>>>> Reimplement vdso.so preparation in build-time C" anyway. What kernel >>>>>>>> were you looking at? >>>>>>> >>>>>>> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 . >>>>>>> >>>>>>> I don't see Andy's patch removing that code either. >>>>>> >>>>>> ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into >>>>>> arch/x86/vdso/vma.c. >>>>>> >>>>>> Maybe you managed to take a fault against the symbol area between the >>>>>> _install_special_mapping() and the remap_pfn_range() call, but mmap_sem >>>>>> should prevent that. >>>>>> >>>>>> Or the remap_pfn_range() call never happened. Should map_vdso() be >>>>>> running _install_special_mapping() at all if >>>>>> image->sym_vvar_page==NULL? >>>>> >>>>> I'm confused: are we talking about 3.15-rcsomething or linux-next? >>>>> That code changed. >>>>> >>>>> Would this all make more sense if there were just a single vma in >>>>> here? cc: Pavel and Cyrill, who might have to deal with this stuff in >>>>> CRIU >>>> >>>> Well, for criu we've not modified any vdso kernel's code (except >>>> setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping). >>>> And never experienced problems Sasha points. Looks like indeed in >>>> -next code is pretty different from mainline one. To figure out >>>> why I need to fetch -next branch and get some research. I would >>>> try to do that tomorrow (still hoping someone more experienced >>>> in mm system would beat me on that). >>> >>> I can summarize: >>> >>> On 3.14 and before, the vdso is just a bunch of ELF headers and >>> executable data. When executed by 64-bit binaries, it reads from the >>> fixmap to do its thing. That is, it reads from kernel addresses that >>> don't have vmas. When executed by 32-bit binaries, it doesn't read >>> anything, since there was no 32-bit timing code. >>> >>> On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by >>> a separate vma containing two pages worth of time-varying read-only >>> data. The vdso reads those pages using PIC references. >>> >>> On linux-next, all vdsos work the same way. There are two vmas. The >>> first vma is executable text, which can be poked at by ptrace, etc >>> normally. The second vma contains time-varying state, should not >>> allow poking, and is accessed by PIC references. >> >> Is this 2nd vma seen in /proc/pid/maps? And if so, is it marked somehow? > > It is in maps, and it's not marked. I can write a patch to change > that. I imagine it shouldn't be called [vdso], though. That would be great. >> >>> What does CRIU do to restore the vdso? Will 3.15 and/or linux-next >>> need to make some concession for CRIU? >> >> We detect the vdso by "[vdso]" mark in proc at dump time and mark it in >> the images. At restore time we check that vdso symbols layout hasn't changed >> and just remap it in proper location. >> >> If this remains the same in -next, then we're fine :) > > If you just remap the vdso, you'll crash. > > This is the case in 3.15, too, for 32-bit apps, anyway. > > What happens if you try to checkpoint a program that's in the vdso or, > worse, in a signal frame with the vdso on the stack? Nothing good, unfortunately :( And this is one of the things we're investigating. Cyrill can shed more light on it, as he's the one in charge. > --Andy ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-19 8:27 ` Pavel Emelyanov @ 2014-05-19 8:40 ` Cyrill Gorcunov 0 siblings, 0 replies; 27+ messages in thread From: Cyrill Gorcunov @ 2014-05-19 8:40 UTC (permalink / raw) To: Pavel Emelyanov Cc: Andy Lutomirski, linux-mm, LKML, Sasha Levin, Andrew Morton, Dave Jones On Mon, May 19, 2014 at 12:27:29PM +0400, Pavel Emelyanov wrote: > > > > What happens if you try to checkpoint a program that's in the vdso or, > > worse, in a signal frame with the vdso on the stack? > > Nothing good, unfortunately :( And this is one of the things we're investigating. > Cyrill can shed more light on it, as he's the one in charge. vdso on stack should not be a big deal, because we keep original vdso proxy. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 22:23 ` Andy Lutomirski 2014-05-15 2:36 ` Pavel Emelyanov @ 2014-05-15 8:45 ` Cyrill Gorcunov 2014-05-15 19:46 ` Andy Lutomirski 1 sibling, 1 reply; 27+ messages in thread From: Cyrill Gorcunov @ 2014-05-15 8:45 UTC (permalink / raw) To: Andy Lutomirski Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote: > I can summarize: > > On 3.14 and before, the vdso is just a bunch of ELF headers and > executable data. When executed by 64-bit binaries, it reads from the > fixmap to do its thing. That is, it reads from kernel addresses that > don't have vmas. When executed by 32-bit binaries, it doesn't read > anything, since there was no 32-bit timing code. > > On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by > a separate vma containing two pages worth of time-varying read-only > data. The vdso reads those pages using PIC references. Andy, could you please point me where is the code which creates a second vma? latest 3.15 master branch [root@fc ~]# cat /proc/self/maps ... 7fff57b6e000-7fff57b8f000 rw-p 00000000 00:00 0 [stack] 7fff57bff000-7fff57c00000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] [root@fc ~]# Or you mean vsyscall area? If yes, then in criu we don't dump vsyscall zone. On restore we don't touch vsyscall either but for vdso there are two cases - if there were no kernel change on vdso contents we simply use vdso provided by the kernel at the moment of criu startup - if vdso has been changed and looks different from one saved in image during checkpoint, we map it from image but then patch (push jmp instruction) so when application calls for some of vdso function it jumps into vdso code saved in image and then jumps into vdso mapped by the kernel (ie kind of proxy calls) This force us to do own Elf parsing inside criu to calculate proper offsets. We don't support (and have no plans to support) x86-32 kernels but there left a case with compatible mode (32bit app on 64bit kernel) which has not yet been implemented though. > On linux-next, all vdsos work the same way. There are two vmas. The > first vma is executable text, which can be poked at by ptrace, etc > normally. The second vma contains time-varying state, should not > allow poking, and is accessed by PIC references. > > What does CRIU do to restore the vdso? Will 3.15 and/or linux-next > need to make some concession for CRIU? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 8:45 ` Cyrill Gorcunov @ 2014-05-15 19:46 ` Andy Lutomirski 2014-05-15 19:53 ` Cyrill Gorcunov 0 siblings, 1 reply; 27+ messages in thread From: Andy Lutomirski @ 2014-05-15 19:46 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Thu, May 15, 2014 at 1:45 AM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: > On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote: >> I can summarize: >> >> On 3.14 and before, the vdso is just a bunch of ELF headers and >> executable data. When executed by 64-bit binaries, it reads from the >> fixmap to do its thing. That is, it reads from kernel addresses that >> don't have vmas. When executed by 32-bit binaries, it doesn't read >> anything, since there was no 32-bit timing code. >> >> On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by >> a separate vma containing two pages worth of time-varying read-only >> data. The vdso reads those pages using PIC references. > > Andy, could you please point me where is the code which creates a second vma? > latest 3.15 master branch Search for _install_special_mapping in arch/x86/vdso. It's in a different place in 3.15-rc and -next. > > [root@fc ~]# cat /proc/self/maps > ... > 7fff57b6e000-7fff57b8f000 rw-p 00000000 00:00 0 [stack] > 7fff57bff000-7fff57c00000 r-xp 00000000 00:00 0 [vdso] > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] > [root@fc ~]# > What version and bitness is this? > Or you mean vsyscall area? If yes, then in criu we don't dump vsyscall zone. > On restore we don't touch vsyscall either but for vdso there are two cases vsyscalls are almost gone now :) > > - if there were no kernel change on vdso contents we simply use vdso provided > by the kernel at the moment of criu startup > > - if vdso has been changed and looks different from one saved in image during > checkpoint, we map it from image but then patch (push jmp instruction) so > when application calls for some of vdso function it jumps into vdso code > saved in image and then jumps into vdso mapped by the kernel (ie kind of > proxy calls) This force us to do own Elf parsing inside criu to calculate > proper offsets. Yuck :) --Andy ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 19:46 ` Andy Lutomirski @ 2014-05-15 19:53 ` Cyrill Gorcunov 2014-05-15 19:59 ` Andy Lutomirski 0 siblings, 1 reply; 27+ messages in thread From: Cyrill Gorcunov @ 2014-05-15 19:53 UTC (permalink / raw) To: Andy Lutomirski Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Thu, May 15, 2014 at 12:46:34PM -0700, Andy Lutomirski wrote: > On Thu, May 15, 2014 at 1:45 AM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: > > On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote: > >> I can summarize: > >> > >> On 3.14 and before, the vdso is just a bunch of ELF headers and > >> executable data. When executed by 64-bit binaries, it reads from the > >> fixmap to do its thing. That is, it reads from kernel addresses that > >> don't have vmas. When executed by 32-bit binaries, it doesn't read > >> anything, since there was no 32-bit timing code. > >> > >> On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by > >> a separate vma containing two pages worth of time-varying read-only > >> data. The vdso reads those pages using PIC references. > > > > Andy, could you please point me where is the code which creates a second vma? > > latest 3.15 master branch > > Search for _install_special_mapping in arch/x86/vdso. It's in a > different place in 3.15-rc and -next. As far as I see _install_special_mapping allocates one vma from cache and vma->vm_start = addr; vma->vm_end = addr + len; so where is the second one? > > > > > [root@fc ~]# cat /proc/self/maps > > ... > > 7fff57b6e000-7fff57b8f000 rw-p 00000000 00:00 0 [stack] > > 7fff57bff000-7fff57c00000 r-xp 00000000 00:00 0 [vdso] > > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] > > [root@fc ~]# > > > > What version and bitness is this? x86-64, 3.15-rc5 > > > Or you mean vsyscall area? If yes, then in criu we don't dump vsyscall zone. > > On restore we don't touch vsyscall either but for vdso there are two cases > > vsyscalls are almost gone now :) Good to know ;) > > > > > - if there were no kernel change on vdso contents we simply use vdso provided > > by the kernel at the moment of criu startup > > > > - if vdso has been changed and looks different from one saved in image during > > checkpoint, we map it from image but then patch (push jmp instruction) so > > when application calls for some of vdso function it jumps into vdso code > > saved in image and then jumps into vdso mapped by the kernel (ie kind of > > proxy calls) This force us to do own Elf parsing inside criu to calculate > > proper offsets. > > Yuck :) Yeah, I know, we simply had no choise. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 19:53 ` Cyrill Gorcunov @ 2014-05-15 19:59 ` Andy Lutomirski 2014-05-15 20:19 ` Cyrill Gorcunov 0 siblings, 1 reply; 27+ messages in thread From: Andy Lutomirski @ 2014-05-15 19:59 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Thu, May 15, 2014 at 12:53 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: > On Thu, May 15, 2014 at 12:46:34PM -0700, Andy Lutomirski wrote: >> On Thu, May 15, 2014 at 1:45 AM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: >> > On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote: >> >> I can summarize: >> >> >> >> On 3.14 and before, the vdso is just a bunch of ELF headers and >> >> executable data. When executed by 64-bit binaries, it reads from the >> >> fixmap to do its thing. That is, it reads from kernel addresses that >> >> don't have vmas. When executed by 32-bit binaries, it doesn't read >> >> anything, since there was no 32-bit timing code. >> >> >> >> On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by >> >> a separate vma containing two pages worth of time-varying read-only >> >> data. The vdso reads those pages using PIC references. >> > >> > Andy, could you please point me where is the code which creates a second vma? >> > latest 3.15 master branch >> >> Search for _install_special_mapping in arch/x86/vdso. It's in a >> different place in 3.15-rc and -next. > > As far as I see _install_special_mapping allocates one vma from cache and > > vma->vm_start = addr; > vma->vm_end = addr + len; > > so where is the second one? Look at its callers in vdso32-setup.c and/or vma.c, depending on version. > >> >> > >> > [root@fc ~]# cat /proc/self/maps >> > ... >> > 7fff57b6e000-7fff57b8f000 rw-p 00000000 00:00 0 [stack] >> > 7fff57bff000-7fff57c00000 r-xp 00000000 00:00 0 [vdso] >> > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] >> > [root@fc ~]# >> > >> >> What version and bitness is this? > > x86-64, 3.15-rc5 Aha. Give tip/x86/vdso or -next a try or boot a 32-bit 3.15-rc kernel and you'll see it. --Andy ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 19:59 ` Andy Lutomirski @ 2014-05-15 20:19 ` Cyrill Gorcunov 2014-05-15 21:31 ` Cyrill Gorcunov 0 siblings, 1 reply; 27+ messages in thread From: Cyrill Gorcunov @ 2014-05-15 20:19 UTC (permalink / raw) To: Andy Lutomirski Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Thu, May 15, 2014 at 12:59:04PM -0700, Andy Lutomirski wrote: > >> > >> What version and bitness is this? > > > > x86-64, 3.15-rc5 > > Aha. Give tip/x86/vdso or -next a try or boot a 32-bit 3.15-rc kernel > and you'll see it. I see what you mean. We're rather targeting on bare x86-64 at the moment but compat mode is needed as well (not yet implemented though). I'll take a precise look into this area. Thanks! ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 20:19 ` Cyrill Gorcunov @ 2014-05-15 21:31 ` Cyrill Gorcunov 2014-05-15 21:42 ` Andy Lutomirski 0 siblings, 1 reply; 27+ messages in thread From: Cyrill Gorcunov @ 2014-05-15 21:31 UTC (permalink / raw) To: Andy Lutomirski Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Fri, May 16, 2014 at 12:19:14AM +0400, Cyrill Gorcunov wrote: > > I see what you mean. We're rather targeting on bare x86-64 at the moment > but compat mode is needed as well (not yet implemented though). I'll take > a precise look into this area. Thanks! Indeed, because we were not running 32bit tasks vdso32-setup.c::arch_setup_additional_pages has never been called. That's the mode we will have to implement one day. Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables be kind of immutable constants? If yes, we could calculate where the additional vma lives without requiring any kind of [vdso] mark in proc/pid/maps output. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 21:31 ` Cyrill Gorcunov @ 2014-05-15 21:42 ` Andy Lutomirski 2014-05-15 21:57 ` Cyrill Gorcunov 0 siblings, 1 reply; 27+ messages in thread From: Andy Lutomirski @ 2014-05-15 21:42 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Thu, May 15, 2014 at 2:31 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: > On Fri, May 16, 2014 at 12:19:14AM +0400, Cyrill Gorcunov wrote: >> >> I see what you mean. We're rather targeting on bare x86-64 at the moment >> but compat mode is needed as well (not yet implemented though). I'll take >> a precise look into this area. Thanks! > > Indeed, because we were not running 32bit tasks vdso32-setup.c::arch_setup_additional_pages > has never been called. That's the mode we will have to implement one day. > > Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables > be kind of immutable constants? If yes, we could calculate where the additional > vma lives without requiring any kind of [vdso] mark in proc/pid/maps output. Please don't! These might, in principle, even vary between tasks on the same system. Certainly the relative positions of the vmas will be different between 3.15 and 3.16, since we need almost my entire cleanup series to reliably put them into their 3.16 location. And I intend to change the number of pages in 3.16 or 3.17. --Andy ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 21:42 ` Andy Lutomirski @ 2014-05-15 21:57 ` Cyrill Gorcunov 2014-05-15 22:15 ` Andy Lutomirski 0 siblings, 1 reply; 27+ messages in thread From: Cyrill Gorcunov @ 2014-05-15 21:57 UTC (permalink / raw) To: Andy Lutomirski Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Thu, May 15, 2014 at 02:42:48PM -0700, Andy Lutomirski wrote: > > > > Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables > > be kind of immutable constants? If yes, we could calculate where the additional > > vma lives without requiring any kind of [vdso] mark in proc/pid/maps output. > > Please don't! > > These might, in principle, even vary between tasks on the same system. > Certainly the relative positions of the vmas will be different > between 3.15 and 3.16, since we need almost my entire cleanup series > to reliably put them into their 3.16 location. And I intend to change > the number of pages in 3.16 or 3.17. There are other ways how to find where additional pages are laying but it would be great if there a straightforward interface for that (ie some mark in /proc/pid/maps output). ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 21:57 ` Cyrill Gorcunov @ 2014-05-15 22:15 ` Andy Lutomirski 2014-05-16 22:40 ` Andy Lutomirski 0 siblings, 1 reply; 27+ messages in thread From: Andy Lutomirski @ 2014-05-15 22:15 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On Thu, May 15, 2014 at 2:57 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: > On Thu, May 15, 2014 at 02:42:48PM -0700, Andy Lutomirski wrote: >> > >> > Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables >> > be kind of immutable constants? If yes, we could calculate where the additional >> > vma lives without requiring any kind of [vdso] mark in proc/pid/maps output. >> >> Please don't! >> >> These might, in principle, even vary between tasks on the same system. >> Certainly the relative positions of the vmas will be different >> between 3.15 and 3.16, since we need almost my entire cleanup series >> to reliably put them into their 3.16 location. And I intend to change >> the number of pages in 3.16 or 3.17. > > There are other ways how to find where additional pages are laying but it > would be great if there a straightforward interface for that (ie some mark > in /proc/pid/maps output). I'll try to write a patch in time for 3.15. --Andy ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-15 22:15 ` Andy Lutomirski @ 2014-05-16 22:40 ` Andy Lutomirski 2014-05-16 22:56 ` H. Peter Anvin 2014-05-17 6:15 ` Cyrill Gorcunov 0 siblings, 2 replies; 27+ messages in thread From: Andy Lutomirski @ 2014-05-16 22:40 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov, H. Peter Anvin On Thu, May 15, 2014 at 3:15 PM, Andy Lutomirski <luto@amacapital.net> wrote: > On Thu, May 15, 2014 at 2:57 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote: >> On Thu, May 15, 2014 at 02:42:48PM -0700, Andy Lutomirski wrote: >>> > >>> > Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables >>> > be kind of immutable constants? If yes, we could calculate where the additional >>> > vma lives without requiring any kind of [vdso] mark in proc/pid/maps output. >>> >>> Please don't! >>> >>> These might, in principle, even vary between tasks on the same system. >>> Certainly the relative positions of the vmas will be different >>> between 3.15 and 3.16, since we need almost my entire cleanup series >>> to reliably put them into their 3.16 location. And I intend to change >>> the number of pages in 3.16 or 3.17. >> >> There are other ways how to find where additional pages are laying but it >> would be great if there a straightforward interface for that (ie some mark >> in /proc/pid/maps output). > > I'll try to write a patch in time for 3.15. > My current draft is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=vdso/cleanups On 64-bit userspace, it results in: 7fffa1dfd000-7fffa1dfe000 r-xp 00000000 00:00 0 [vdso] 7fffa1dfe000-7fffa1e00000 r--p 00000000 00:00 0 [vvar] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] On 32-bit userspace, it results in: f7748000-f7749000 r-xp 00000000 00:00 0 [vdso] f7749000-f774b000 r--p 00000000 00:00 0 [vvar] ffd94000-ffdb5000 rw-p 00000000 00:00 0 [stack] Is this good for CRIU? Another approach would be to name both of these things "vdso", since they are sort of both the vdso, but that might be a bit confusing -- [vvar] is not static text the way that [vdso] is. If I backport this for 3.15 (which might be nasty -- I would argue that the code change is actually a cleanup, but it's fairly intrusive), then [vvar] will be *before* [vdso], not after it. I'd be very hesitant to name both of them "[vdso]" in that case, since there is probably code that assumes that the beginning of "[vdso]" is a DSO. Note that it is *not* safe to blindly read from "[vvar]". On some configurations you *will* get SIGBUS if you try to read from some of the vvar pages. (That's what started this whole thread.) Some pages in "[vvar]" may have strange caching modes, so SIGBUS might not be the only surprising thing about poking at it. --Andy ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-16 22:40 ` Andy Lutomirski @ 2014-05-16 22:56 ` H. Peter Anvin 2014-05-17 6:15 ` Cyrill Gorcunov 1 sibling, 0 replies; 27+ messages in thread From: H. Peter Anvin @ 2014-05-16 22:56 UTC (permalink / raw) To: Andy Lutomirski, Cyrill Gorcunov Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov On 05/16/2014 03:40 PM, Andy Lutomirski wrote: > > My current draft is here: > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=vdso/cleanups > > On 64-bit userspace, it results in: > > 7fffa1dfd000-7fffa1dfe000 r-xp 00000000 00:00 0 [vdso] > 7fffa1dfe000-7fffa1e00000 r--p 00000000 00:00 0 [vvar] > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 > [vsyscall] > > On 32-bit userspace, it results in: > > f7748000-f7749000 r-xp 00000000 00:00 0 [vdso] > f7749000-f774b000 r--p 00000000 00:00 0 [vvar] > ffd94000-ffdb5000 rw-p 00000000 00:00 0 [stack] > > Is this good for CRIU? Another approach would be to name both of > these things "vdso", since they are sort of both the vdso, but that > might be a bit confusing -- [vvar] is not static text the way that > [vdso] is. > > If I backport this for 3.15 (which might be nasty -- I would argue > that the code change is actually a cleanup, but it's fairly > intrusive), then [vvar] will be *before* [vdso], not after it. I'd be > very hesitant to name both of them "[vdso]" in that case, since there > is probably code that assumes that the beginning of "[vdso]" is a DSO. > > Note that it is *not* safe to blindly read from "[vvar]". On some > configurations you *will* get SIGBUS if you try to read from some of > the vvar pages. (That's what started this whole thread.) Some pages > in "[vvar]" may have strange caching modes, so SIGBUS might not be the > only surprising thing about poking at it. > mremap() should work on these pages, right? -hpa ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-16 22:40 ` Andy Lutomirski 2014-05-16 22:56 ` H. Peter Anvin @ 2014-05-17 6:15 ` Cyrill Gorcunov 1 sibling, 0 replies; 27+ messages in thread From: Cyrill Gorcunov @ 2014-05-17 6:15 UTC (permalink / raw) To: Andy Lutomirski Cc: Andrew Morton, Sasha Levin, linux-mm, Dave Jones, LKML, Pavel Emelyanov, H. Peter Anvin On Fri, May 16, 2014 at 03:40:48PM -0700, Andy Lutomirski wrote: > >> > >> There are other ways how to find where additional pages are laying but it > >> would be great if there a straightforward interface for that (ie some mark > >> in /proc/pid/maps output). > > > > I'll try to write a patch in time for 3.15. > > > > My current draft is here: > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=vdso/cleanups > > On 64-bit userspace, it results in: > > 7fffa1dfd000-7fffa1dfe000 r-xp 00000000 00:00 0 [vdso] > 7fffa1dfe000-7fffa1e00000 r--p 00000000 00:00 0 [vvar] > ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] > > On 32-bit userspace, it results in: > > f7748000-f7749000 r-xp 00000000 00:00 0 [vdso] > f7749000-f774b000 r--p 00000000 00:00 0 [vvar] > ffd94000-ffdb5000 rw-p 00000000 00:00 0 [stack] > > Is this good for CRIU? Another approach would be to name both of > these things "vdso", since they are sort of both the vdso, but that > might be a bit confusing -- [vvar] is not static text the way that > [vdso] is. Yeah, thanks a lot, Andy, this is more than enough. > If I backport this for 3.15 (which might be nasty -- I would argue > that the code change is actually a cleanup, but it's fairly > intrusive), then [vvar] will be *before* [vdso], not after it. I'd be > very hesitant to name both of them "[vdso]" in that case, since there > is probably code that assumes that the beginning of "[vdso]" is a DSO. > > Note that it is *not* safe to blindly read from "[vvar]". On some > configurations you *will* get SIGBUS if you try to read from some of > the vvar pages. (That's what started this whole thread.) Some pages > in "[vvar]" may have strange caching modes, so SIGBUS might not be the > only surprising thing about poking at it. Ouch. Thanks for the note, I'll read new code with more attention and report the effect it did over criu (prob. on next week). ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 21:31 ` Andrew Morton 2014-05-14 21:33 ` Andy Lutomirski @ 2014-05-14 22:51 ` Andy Lutomirski 1 sibling, 0 replies; 27+ messages in thread From: Andy Lutomirski @ 2014-05-14 22:51 UTC (permalink / raw) To: Andrew Morton; +Cc: Sasha Levin, linux-mm, Dave Jones, LKML On Wed, May 14, 2014 at 2:31 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > >> > In my linux-next all that code got deleted by Andy's "x86, vdso: >> > Reimplement vdso.so preparation in build-time C" anyway. What kernel >> > were you looking at? >> >> Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 . >> >> I don't see Andy's patch removing that code either. > > ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into > arch/x86/vdso/vma.c. > > Maybe you managed to take a fault against the symbol area between the > _install_special_mapping() and the remap_pfn_range() call, but mmap_sem > should prevent that. > > Or the remap_pfn_range() call never happened. Should map_vdso() be > running _install_special_mapping() at all if > image->sym_vvar_page==NULL? You're almost right, but that was enough to point me in the right direction :) The mapping is still needed, since there are two pages. qemu -no-hpet will trigger this, but the nohpet kernel option will not. The latter is arguably a bug in the nohpet option. Fix coming. -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: mm: NULL ptr deref handling mmaping of special mappings 2014-05-14 21:03 ` Andrew Morton 2014-05-14 21:11 ` Sasha Levin @ 2014-05-14 21:26 ` Andy Lutomirski 1 sibling, 0 replies; 27+ messages in thread From: Andy Lutomirski @ 2014-05-14 21:26 UTC (permalink / raw) To: Andrew Morton; +Cc: Sasha Levin, linux-mm, Dave Jones, LKML, Stefani Seibold On Wed, May 14, 2014 at 2:03 PM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 14 May 2014 16:41:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: > >> On 05/14/2014 04:23 PM, Andrew Morton wrote: >> > On Wed, 14 May 2014 11:55:45 -0400 Sasha Levin <sasha.levin@oracle.com> wrote: >> > >> >> Hi all, >> >> >> >> While fuzzing with trinity inside a KVM tools guest running the latest -next >> >> kernel I've stumbled on the following spew: >> >> >> >> [ 1634.969408] BUG: unable to handle kernel NULL pointer dereference at (null) >> >> [ 1634.970538] IP: special_mapping_fault (mm/mmap.c:2961) >> >> [ 1634.971420] PGD 3334fc067 PUD 3334cf067 PMD 0 >> >> [ 1634.972081] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> >> [ 1634.972913] Dumping ftrace buffer: >> >> [ 1634.975493] (ftrace buffer empty) >> >> [ 1634.977470] Modules linked in: >> >> [ 1634.977513] CPU: 6 PID: 29578 Comm: trinity-c269 Not tainted 3.15.0-rc5-next-20140513-sasha-00020-gebce144-dirty #461 >> >> [ 1634.977513] task: ffff880333158000 ti: ffff88033351e000 task.ti: ffff88033351e000 >> >> [ 1634.977513] RIP: special_mapping_fault (mm/mmap.c:2961) >> > >> > Somebody's gone and broken the x86 oops output. It used to say >> > "special_mapping_fault+0x30/0x120" but the offset info has now >> > disappeared. That was useful for guesstimating whereabouts in the >> > function it died. >> >> I'm the one who "broke" the oops output, but I thought I'm helping people >> read that output instead of making it harder... >> >> What happened before is that due to my rather complex .config, the offsets >> didn't make sense to anyone who didn't build the kernel with my .config, >> so I had to repeatedly send it out to folks who attempted to get basic >> things like line numbers. >> >> > The line number isn't very useful as it's not possible (or at least, >> > not convenient) for others to reliably reproduce your kernel. >> >> I don't understand that part. I'm usually stating in the beginning of my >> mails that I run my testing on the latest -next kernel. > > Your "latest next kernel" apparently differes from mine ;( It would be > useful if you could just quote the +/-5 lines, perhaps? > > >> And indeed if >> you look at today's -next, that line number would point to: >> >> for (pages = vma->vm_private_data; pgoff && *pages; ++pages) <=== HERE >> pgoff--; >> >> So I'm not sure how replacing the offset with line numbers is making things >> worse? previously offsets were useless for people who tried to debug these >> spews so that's why I switched it to line numbers in the first place. >> >> > <scrabbles with git for a while> >> > >> > : static int special_mapping_fault(struct vm_area_struct *vma, >> > : struct vm_fault *vmf) >> > : { >> > : pgoff_t pgoff; >> > : struct page **pages; >> > : >> > : /* >> > : * special mappings have no vm_file, and in that case, the mm >> > : * uses vm_pgoff internally. So we have to subtract it from here. >> > : * We are allowed to do this because we are the mm; do not copy >> > : * this code into drivers! >> > : */ >> > : pgoff = vmf->pgoff - vma->vm_pgoff; >> > : >> > : for (pages = vma->vm_private_data; pgoff && *pages; ++pages) >> > : pgoff--; >> > : >> > : if (*pages) { >> > : struct page *page = *pages; >> > : get_page(page); >> > : vmf->page = page; >> > : return 0; >> > : } >> > : >> > : return VM_FAULT_SIGBUS; >> > : } >> > >> > OK so it might be the "if (*pages)". So vma->vm_private_data was NULL >> > and pgoff was zero. As usual, I can't imagine what race would cause >> > that :( >> >> Yup, it's the *pages part in the 'for' loop above that. I did find the >> following in the vdso code: >> >> vma = _install_special_mapping(mm, >> addr + image->size, >> image->sym_end_mapping - image->size, >> VM_READ, >> NULL); >> >> Which installs a mapping with a NULL ptr for pages (if I understand that >> correctly), but that code has been there for a while now. > > Well that's weird. I don't see anything which permits that. Maybe > nobody faulted against that address before? > > It's unclear what that code's actually doing and nobody bothered > commenting it of course. Maybe it's installing a guard page? > > In my linux-next all that code got deleted by Andy's "x86, vdso: > Reimplement vdso.so preparation in build-time C" anyway. What kernel > were you looking at? > > Andy, are you able to shed some light on why > arch_setup_additional_pages() is (or was) passing a NULL into > _install_special_mapping()? > Ugh. I just moved that code; I didn't delete it. In fact, I likely made it worse by causing x86_64 to use it, too. Oddly, this code worked when I tested it. Do you know what Trinity did to break it? Did the PTEs get nuked due to memory pressure? The intent is to use remap_pfn_range (a couple of lines below) to map some kernel data into userspace read-only. I think the right fix would be to pass in the pages array directly but to modify _install_special_mapping to clear VM_MAYWRITE. The idea is that ptrace and mprotect should *not* be usable to COW these pages: userspace will shoot itself in the foot quite thoroughly if the pages get COWed. Additionally, one of those pages actually represents some hardware registers, which almost certainly needs to be mapped UC. IIRC on x86_32 there are two pages in there. One is regular kernel memory that has a struct page associated with it. The other is the HPET, which is a page of hardware registers that may not have a struct page. Can anyone who's less clueless than I am about how vmas work help? Does this need to use its own vmops instead of the special mapping stuff? --Andy ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2014-05-19 8:40 UTC | newest] Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-05-14 15:55 mm: NULL ptr deref handling mmaping of special mappings Sasha Levin 2014-05-14 20:23 ` Andrew Morton 2014-05-14 20:41 ` Sasha Levin 2014-05-14 21:03 ` Andrew Morton 2014-05-14 21:11 ` Sasha Levin 2014-05-14 21:31 ` Andrew Morton 2014-05-14 21:33 ` Andy Lutomirski 2014-05-14 22:11 ` Cyrill Gorcunov 2014-05-14 22:23 ` Andy Lutomirski 2014-05-15 2:36 ` Pavel Emelyanov 2014-05-15 19:42 ` Andy Lutomirski 2014-05-19 8:27 ` Pavel Emelyanov 2014-05-19 8:40 ` Cyrill Gorcunov 2014-05-15 8:45 ` Cyrill Gorcunov 2014-05-15 19:46 ` Andy Lutomirski 2014-05-15 19:53 ` Cyrill Gorcunov 2014-05-15 19:59 ` Andy Lutomirski 2014-05-15 20:19 ` Cyrill Gorcunov 2014-05-15 21:31 ` Cyrill Gorcunov 2014-05-15 21:42 ` Andy Lutomirski 2014-05-15 21:57 ` Cyrill Gorcunov 2014-05-15 22:15 ` Andy Lutomirski 2014-05-16 22:40 ` Andy Lutomirski 2014-05-16 22:56 ` H. Peter Anvin 2014-05-17 6:15 ` Cyrill Gorcunov 2014-05-14 22:51 ` Andy Lutomirski 2014-05-14 21:26 ` Andy Lutomirski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).