* vvar, gup && coredump [not found] ` <20150311200052.GA22654@redhat.com> @ 2015-03-12 14:34 ` Oleg Nesterov 2015-03-12 16:29 ` Andy Lutomirski 0 siblings, 1 reply; 31+ messages in thread From: Oleg Nesterov @ 2015-03-12 14:34 UTC (permalink / raw) To: Jan Kratochvil, Andy Lutomirski Cc: Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel Add cc's, change subject. On 03/11, Oleg Nesterov wrote: > > On 03/05, Jan Kratochvil wrote: > > > > On Thu, 05 Mar 2015 21:52:56 +0100, Sergio Durigan Junior wrote: > > > On Thursday, March 05 2015, Jan Kratochvil wrote: > > > > On Thu, 05 Mar 2015 04:48:09 +0100, Sergio Durigan Junior wrote: > > > > Currently it also tries to dump [vvar] (by default rules) but that is > > > > unreadable for some reason, causing: > > > > warning: Memory read failed for corefile section, 8192 bytes at 0x7ffff6ceb000. > > > > ^^^^^^^^^^^^^^ > > > It would be good to get a reply from a kernel aware person what does it mean > > before such patch gets accepted. It can be also just a Linux kernel bug. > > _So far_ this doesn't look like a kernel bug to me. > > But! I need to recheck. In fact, it seems to me that I should discuss > this on lkml. I have some concerns, but most probably this is only my > misunderstanding, I need to read this (new to me) code more carefully. Hi Andy, we need your help ;) So, the problem is that gdb can't access the "vvar" mapping which looks like the "normal" vma from user-space pov. Technically this is clear. vvar_mapping->pages is the "dummy" no_pages[] array, get_user_pages() can't succeed. In fact even follow_page() can't work because of VM_PFNMAP/_PAGE_SPECIAL set by remap_pfn_range(). What is not clear: do we really want gup() to fail? Or it is not trivial to turn __vvar_page into the "normal" page? (to simplify the discussion, lets ignore hpet mapping for now). Because this doesn't look consistent. gdb tries to "coredump" the live process like the kernel does, but fails to dump the "r--p ... [vvar]" region. OK, gdb can look at VM_DONTDUMP bit in "VmFlags:" field in /proc/pid/smaps and skip this vma. But, why (afaics) the kernel dumps this vma then? Lets look at vma_dump_size(), /* always dump the vdso and vsyscall sections */ if (always_dump_vma(vma)) goto whole; if (vma->vm_flags & VM_DONTDUMP) return 0; so the kernel ignores VM_DONTDUMP in this case, always_dump_vma() returns true because of special_mapping_name(). Perhaps we should check VM_DONTDUMP before always_dump_vma() ? Or. We can teach gdb to read and dump its own "vvar" mapping to mimic the kernel behaviour, this is the same read-only memory. But this hack doesn't look nice, gdb should not know "too much" about the kernel internals. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 14:34 ` vvar, gup && coredump Oleg Nesterov @ 2015-03-12 16:29 ` Andy Lutomirski 2015-03-12 16:54 ` Oleg Nesterov 0 siblings, 1 reply; 31+ messages in thread From: Andy Lutomirski @ 2015-03-12 16:29 UTC (permalink / raw) To: Oleg Nesterov Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel On Thu, Mar 12, 2015 at 7:34 AM, Oleg Nesterov <oleg@redhat.com> wrote: > Add cc's, change subject. > > On 03/11, Oleg Nesterov wrote: >> >> On 03/05, Jan Kratochvil wrote: >> > >> > On Thu, 05 Mar 2015 21:52:56 +0100, Sergio Durigan Junior wrote: >> > > On Thursday, March 05 2015, Jan Kratochvil wrote: >> > > > On Thu, 05 Mar 2015 04:48:09 +0100, Sergio Durigan Junior wrote: >> > > > Currently it also tries to dump [vvar] (by default rules) but that is >> > > > unreadable for some reason, causing: >> > > > warning: Memory read failed for corefile section, 8192 bytes at 0x7ffff6ceb000. >> > > > ^^^^^^^^^^^^^^ >> >> > It would be good to get a reply from a kernel aware person what does it mean >> > before such patch gets accepted. It can be also just a Linux kernel bug. >> >> _So far_ this doesn't look like a kernel bug to me. >> >> But! I need to recheck. In fact, it seems to me that I should discuss >> this on lkml. I have some concerns, but most probably this is only my >> misunderstanding, I need to read this (new to me) code more carefully. > > Hi Andy, we need your help ;) > > So, the problem is that gdb can't access the "vvar" mapping which looks > like the "normal" vma from user-space pov. > > Technically this is clear. vvar_mapping->pages is the "dummy" no_pages[] > array, get_user_pages() can't succeed. In fact even follow_page() can't > work because of VM_PFNMAP/_PAGE_SPECIAL set by remap_pfn_range(). > > What is not clear: do we really want gup() to fail? Or it is not trivial > to turn __vvar_page into the "normal" page? (to simplify the discussion, > lets ignore hpet mapping for now). We could presumably fiddle with the vma to allow get_user_pages to work on at least the first vvar page. There are some decently large caveats, though: - We don't want to COW it. If someone pokes at that page with ptrace, for example, and it gets COWed, everything will stop working because the offending process will no longer see updates. That way lies infinite loops. - The implementation could be odd. The vma is either VM_MIXEDMAP or VM_PFNMAP, and I don't see any practical way to change that. - The HPET and perhaps pvclock stuff. The HPET probably doesn't have a struct page at all, so you can't possibly get_user_pages it. > > Because this doesn't look consistent. gdb tries to "coredump" the live > process like the kernel does, but fails to dump the "r--p ... [vvar]" > region. > > > OK, gdb can look at VM_DONTDUMP bit in "VmFlags:" field in /proc/pid/smaps > and skip this vma. But, why (afaics) the kernel dumps this vma then? Lets > look at vma_dump_size(), > > /* always dump the vdso and vsyscall sections */ > if (always_dump_vma(vma)) > goto whole; > > if (vma->vm_flags & VM_DONTDUMP) > return 0; > > so the kernel ignores VM_DONTDUMP in this case, always_dump_vma() returns > true because of special_mapping_name(). Perhaps we should check VM_DONTDUMP > before always_dump_vma() ? > That sounds reasonable to me. I'll write the patch later today. gdb will still need changes, though, right? --Andy > > Or. We can teach gdb to read and dump its own "vvar" mapping to mimic the > kernel behaviour, this is the same read-only memory. But this hack doesn't > look nice, gdb should not know "too much" about the kernel internals. > > Oleg. > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 16:29 ` Andy Lutomirski @ 2015-03-12 16:54 ` Oleg Nesterov 2015-03-12 17:17 ` Andy Lutomirski 2015-03-12 17:46 ` Oleg Nesterov 0 siblings, 2 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-12 16:54 UTC (permalink / raw) To: Andy Lutomirski Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel On 03/12, Andy Lutomirski wrote: > > On Thu, Mar 12, 2015 at 7:34 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > What is not clear: do we really want gup() to fail? Or it is not trivial > > to turn __vvar_page into the "normal" page? (to simplify the discussion, > > lets ignore hpet mapping for now). > > We could presumably fiddle with the vma to allow get_user_pages to > work on at least the first vvar page. There are some decently large > caveats, though: > > - We don't want to COW it. If someone pokes at that page with > ptrace, for example, and it gets COWed, everything will stop working > because the offending process will no longer see updates. That way > lies infinite loops. Of course, but this looks simple... is_cow_mapping() == F so FOLL_FORCE won't work anyway? > - The implementation could be odd. The vma is either VM_MIXEDMAP or > VM_PFNMAP, and I don't see any practical way to change that. > > - The HPET and perhaps pvclock stuff. The HPET probably doesn't have > a struct page at all, so you can't possibly get_user_pages it. Yes, this is true. OK, lets not dump it. I'll probably send a patch which changes vma_dump_size() to check VM_DONTDUMP first... But this leads to another question: why do we want to expose this "vvar" vma at all? For the moment, forget about compat 32-bit applications running under 64-bit kernel. Can't we simply add FIX_VVAR_PAGE into fixed_addresses{}, map it into init_mm via set_fixmap(FIX_VVAR_PAGE, __PAGE_USER) and change __vdso.* functions to use fix_to_virt() address? I don't really understand the low-level details, I'd like to understand if this can work or not. And if it can work, why this is undesirable. As for 32-bit applications. Yes, this can't work because 32-bit simply can't access this "high" memory. But you know, it would be very nice to have the fixmap-like "global" area in init_mm which is also visible to compat applications. If we had it, uprobes could work without xol vma's. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 16:54 ` Oleg Nesterov @ 2015-03-12 17:17 ` Andy Lutomirski 2015-03-12 17:39 ` Oleg Nesterov 2015-03-12 17:46 ` Oleg Nesterov 1 sibling, 1 reply; 31+ messages in thread From: Andy Lutomirski @ 2015-03-12 17:17 UTC (permalink / raw) To: Oleg Nesterov Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel On Thu, Mar 12, 2015 at 9:54 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 03/12, Andy Lutomirski wrote: >> >> On Thu, Mar 12, 2015 at 7:34 AM, Oleg Nesterov <oleg@redhat.com> wrote: >> > >> > What is not clear: do we really want gup() to fail? Or it is not trivial >> > to turn __vvar_page into the "normal" page? (to simplify the discussion, >> > lets ignore hpet mapping for now). >> >> We could presumably fiddle with the vma to allow get_user_pages to >> work on at least the first vvar page. There are some decently large >> caveats, though: >> >> - We don't want to COW it. If someone pokes at that page with >> ptrace, for example, and it gets COWed, everything will stop working >> because the offending process will no longer see updates. That way >> lies infinite loops. > > Of course, but this looks simple... is_cow_mapping() == F so FOLL_FORCE > won't work anyway? > >> - The implementation could be odd. The vma is either VM_MIXEDMAP or >> VM_PFNMAP, and I don't see any practical way to change that. >> >> - The HPET and perhaps pvclock stuff. The HPET probably doesn't have >> a struct page at all, so you can't possibly get_user_pages it. > > Yes, this is true. OK, lets not dump it. I'll probably send a patch which > changes vma_dump_size() to check VM_DONTDUMP first... > > But this leads to another question: why do we want to expose this > "vvar" vma at all? > > For the moment, forget about compat 32-bit applications running under > 64-bit kernel. > > Can't we simply add FIX_VVAR_PAGE into fixed_addresses{}, map it into > init_mm via set_fixmap(FIX_VVAR_PAGE, __PAGE_USER) and change __vdso.* > functions to use fix_to_virt() address? > > I don't really understand the low-level details, I'd like to understand > if this can work or not. And if it can work, why this is undesirable. > > As for 32-bit applications. Yes, this can't work because 32-bit simply > can't access this "high" memory. But you know, it would be very nice to > have the fixmap-like "global" area in init_mm which is also visible to > compat applications. If we had it, uprobes could work without xol vma's. > It could work for 32-bit native, but not for 32-bit compat. Also, I have grand plans to add per-task vvar overrides for seccomp and such. And RIP-relative addressing is a bit nicer than absolute :) It used to work that way, but we changed it in 3.15 IIRC. On a related note, I'm hoping to rework the mm part pretty heavily: http://lkml.kernel.org/r/cover.1414629045.git.luto@amacapital.net --Andy > Oleg. > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 17:17 ` Andy Lutomirski @ 2015-03-12 17:39 ` Oleg Nesterov 2015-03-12 17:45 ` Sergio Durigan Junior 2015-03-12 17:55 ` Andy Lutomirski 0 siblings, 2 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-12 17:39 UTC (permalink / raw) To: Andy Lutomirski Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel On 03/12, Andy Lutomirski wrote: > > On Thu, Mar 12, 2015 at 9:54 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > On 03/12, Andy Lutomirski wrote: > >> > > As for 32-bit applications. Yes, this can't work because 32-bit simply > > can't access this "high" memory. But you know, it would be very nice to > > have the fixmap-like "global" area in init_mm which is also visible to > > compat applications. If we had it, uprobes could work without xol vma's. > > > It could work for 32-bit native, but not for 32-bit compat. Yes, yes, I meant 32-bit compat apps. Once again, it would be nice if we had the "low" fixmaps in init_mm. But unlikely this is possible... > On a related note, I'm hoping to rework the mm part pretty heavily: > > http://lkml.kernel.org/r/cover.1414629045.git.luto@amacapital.net OK... not that I really understand this email. Well. Speaking of vdso. I understand that unlikely we can do this, but for uprobes it would be nice to have a anon-inode file behind this mapping, so that vma_interval_tree_foreach() could work, etc. OK, this is completely off-topic, please forget. And I noticed that I didn't read your previous email carefully enough... > That sounds reasonable to me. I'll write the patch later today. Sure, please send a patch if you want to do this. > gdb will still need changes, though, right? This is up to gdb developers. To me, it should simply skip this VM_DONTDUMP vma. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 17:39 ` Oleg Nesterov @ 2015-03-12 17:45 ` Sergio Durigan Junior 2015-03-12 18:02 ` Oleg Nesterov 2015-03-12 17:55 ` Andy Lutomirski 1 sibling, 1 reply; 31+ messages in thread From: Sergio Durigan Junior @ 2015-03-12 17:45 UTC (permalink / raw) To: Oleg Nesterov Cc: Andy Lutomirski, Jan Kratochvil, GDB Patches, Pedro Alves, linux-kernel On Thursday, March 12 2015, Oleg Nesterov wrote: >> gdb will still need changes, though, right? > > This is up to gdb developers. To me, it should simply skip this > VM_DONTDUMP vma. If I understood this discussion correctly (and thanks Andy and Oleg for, *ahem*, dumping all this useful information for us!), GDB will not need modifications in the Linux kernel in this area. In fact, my patch already implements the "ignore VM_DONTDUMP mappings" part, so we're pretty much covered. Thanks, -- Sergio GPG key ID: 0x65FC5E36 Please send encrypted e-mail if possible http://sergiodj.net/ ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 17:45 ` Sergio Durigan Junior @ 2015-03-12 18:02 ` Oleg Nesterov 2015-03-13 4:50 ` Sergio Durigan Junior 0 siblings, 1 reply; 31+ messages in thread From: Oleg Nesterov @ 2015-03-12 18:02 UTC (permalink / raw) To: Sergio Durigan Junior Cc: Andy Lutomirski, Jan Kratochvil, GDB Patches, Pedro Alves, linux-kernel On 03/12, Sergio Durigan Junior wrote: > > On Thursday, March 12 2015, Oleg Nesterov wrote: > > >> gdb will still need changes, though, right? > > > > This is up to gdb developers. To me, it should simply skip this > > VM_DONTDUMP vma. > > If I understood this discussion correctly (and thanks Andy and Oleg for, > *ahem*, dumping all this useful information for us!), GDB will not need > modifications in the Linux kernel in this area. In fact, my patch > already implements the "ignore VM_DONTDUMP mappings" part, so we're > pretty much covered. OK, thanks. And it seems that we all agree that the kernel should not dump this vma too. Could you confirm that this is fine from gdb pov just in case? However. Even if we do not want it in the coredump, this can confuse gdb users which might want to read this memory during debugging. So perhaps we still can add ->access() to "fix" PTRACE_PEEK/access_remote_vm later. But I see another email from Andy, so lets forget about this for now. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 18:02 ` Oleg Nesterov @ 2015-03-13 4:50 ` Sergio Durigan Junior 2015-03-13 15:04 ` Oleg Nesterov 0 siblings, 1 reply; 31+ messages in thread From: Sergio Durigan Junior @ 2015-03-13 4:50 UTC (permalink / raw) To: Oleg Nesterov Cc: Andy Lutomirski, Jan Kratochvil, GDB Patches, Pedro Alves, linux-kernel On Thursday, March 12 2015, Oleg Nesterov wrote: >> If I understood this discussion correctly (and thanks Andy and Oleg for, >> *ahem*, dumping all this useful information for us!), GDB will not need >> modifications in the Linux kernel in this area. In fact, my patch >> already implements the "ignore VM_DONTDUMP mappings" part, so we're >> pretty much covered. > > OK, thanks. > > And it seems that we all agree that the kernel should not dump this vma > too. Could you confirm that this is fine from gdb pov just in case? Yes, this is what we expect from the GDB side. This mapping is marked as "dd", so it does not make sense to dump it. While I have you guys, would it be possible for the Linux kernel to include a new flag on VmFlags to uniquely identify an anonymous mapping? Currently, there is no easy way to do that from userspace. My patch implements the following heuristic on GDB: if (pathname == "/dev/zero (deleted)" || pathname == "/SYSV%08x (deleted)" || pathname == "<file> (deleted)" || "Anonymous:" field is > 0 kB || "AnonHugePages:" field is > 0 kB) mapping is anonymous; However, this can be fragile. The Linux kernel checks for i_nlink == 0, but there is no easy way for GDB to check this on userspace (as Jan mentioned, one could look at /proc/PID/map_files/, but they are only accessible by root). That's why I think it would be good to provide this info right away in /proc/PID/smaps... Thanks, -- Sergio GPG key ID: 0x65FC5E36 Please send encrypted e-mail if possible http://sergiodj.net/ ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-13 4:50 ` Sergio Durigan Junior @ 2015-03-13 15:04 ` Oleg Nesterov 0 siblings, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-13 15:04 UTC (permalink / raw) To: Sergio Durigan Junior Cc: Andy Lutomirski, Jan Kratochvil, GDB Patches, Pedro Alves, linux-kernel On 03/13, Sergio Durigan Junior wrote: > > On Thursday, March 12 2015, Oleg Nesterov wrote: > > > And it seems that we all agree that the kernel should not dump this vma > > too. Could you confirm that this is fine from gdb pov just in case? > > Yes, this is what we expect from the GDB side. This mapping is marked > as "dd", so it does not make sense to dump it. OK. > While I have you guys, would it be possible for the Linux kernel to > include a new flag on VmFlags to uniquely identify an anonymous mapping? Note that "anonymous" is not the right term here... I mean it is a bit confusing. Lets discuss this again on debug-list, then we will see if gdb needs more info from kernel. > Currently, there is no easy way to do that from userspace. My patch > implements the following heuristic on GDB: > > if (pathname == "/dev/zero (deleted)" > || pathname == "/SYSV%08x (deleted)" > || pathname == "<file> (deleted)" And for example, this is not anonymous mapping. But, > mapping is anonymous; I agree, gdb should treat it as anonymous. > However, this can be fragile. The Linux kernel checks for i_nlink == 0, Yes, as we already disccussed, I think the kernel should be changed. It should do something like shmem_mapping() || d_unlinked(), I think. But this needs another discussion on lkml, and in another thread. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 17:39 ` Oleg Nesterov 2015-03-12 17:45 ` Sergio Durigan Junior @ 2015-03-12 17:55 ` Andy Lutomirski 2015-03-12 18:08 ` Oleg Nesterov 1 sibling, 1 reply; 31+ messages in thread From: Andy Lutomirski @ 2015-03-12 17:55 UTC (permalink / raw) To: Oleg Nesterov Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel On Thu, Mar 12, 2015 at 10:39 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 03/12, Andy Lutomirski wrote: >> >> On Thu, Mar 12, 2015 at 9:54 AM, Oleg Nesterov <oleg@redhat.com> wrote: >> > On 03/12, Andy Lutomirski wrote: >> >> >> > As for 32-bit applications. Yes, this can't work because 32-bit simply >> > can't access this "high" memory. But you know, it would be very nice to >> > have the fixmap-like "global" area in init_mm which is also visible to >> > compat applications. If we had it, uprobes could work without xol vma's. >> > >> It could work for 32-bit native, but not for 32-bit compat. > > Yes, yes, I meant 32-bit compat apps. Once again, it would be nice if we > had the "low" fixmaps in init_mm. But unlikely this is possible... > >> On a related note, I'm hoping to rework the mm part pretty heavily: >> >> http://lkml.kernel.org/r/cover.1414629045.git.luto@amacapital.net > > OK... not that I really understand this email. > > Well. Speaking of vdso. I understand that unlikely we can do this, but > for uprobes it would be nice to have a anon-inode file behind this mapping, > so that vma_interval_tree_foreach() could work, etc. OK, this is completely > off-topic, please forget. Couldn't you do that directly in the uprobes code? That is, create an anon_inode file and just map it the old-fashioned way? --Andy > > > > And I noticed that I didn't read your previous email carefully enough... > >> That sounds reasonable to me. I'll write the patch later today. > > Sure, please send a patch if you want to do this. > >> gdb will still need changes, though, right? > > This is up to gdb developers. To me, it should simply skip this > VM_DONTDUMP vma. > > Oleg. > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 17:55 ` Andy Lutomirski @ 2015-03-12 18:08 ` Oleg Nesterov 2015-03-12 18:33 ` Andy Lutomirski 0 siblings, 1 reply; 31+ messages in thread From: Oleg Nesterov @ 2015-03-12 18:08 UTC (permalink / raw) To: Andy Lutomirski Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel On 03/12, Andy Lutomirski wrote: > > On Thu, Mar 12, 2015 at 10:39 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > Well. Speaking of vdso. I understand that unlikely we can do this, but > > for uprobes it would be nice to have a anon-inode file behind this mapping, > > so that vma_interval_tree_foreach() could work, etc. OK, this is completely > > off-topic, please forget. > > Couldn't you do that directly in the uprobes code? That is, create an > anon_inode file and just map it the old-fashioned way? This won't help. Uprobes wants this file mmaped by all applications, so that build_map_info() can find mm's, vma's, etc to install the system-wide breakpoint. But again, this is off-topic and unlikely possible. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 18:08 ` Oleg Nesterov @ 2015-03-12 18:33 ` Andy Lutomirski 2015-03-13 15:22 ` Oleg Nesterov 0 siblings, 1 reply; 31+ messages in thread From: Andy Lutomirski @ 2015-03-12 18:33 UTC (permalink / raw) To: Oleg Nesterov; +Cc: linux-kernel [removed people who will be bored] On Thu, Mar 12, 2015 at 11:08 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 03/12, Andy Lutomirski wrote: >> >> On Thu, Mar 12, 2015 at 10:39 AM, Oleg Nesterov <oleg@redhat.com> wrote: >> > Well. Speaking of vdso. I understand that unlikely we can do this, but >> > for uprobes it would be nice to have a anon-inode file behind this mapping, >> > so that vma_interval_tree_foreach() could work, etc. OK, this is completely >> > off-topic, please forget. >> >> Couldn't you do that directly in the uprobes code? That is, create an >> anon_inode file and just map it the old-fashioned way? > > This won't help. Uprobes wants this file mmaped by all applications, so > that build_map_info() can find mm's, vma's, etc to install the system-wide > breakpoint. But again, this is off-topic and unlikely possible. > What's wrong with off-topic? :) As vvar demonstrates, it's possible to add new per-process vmas that show up in all processes automatically. The trickiest part is making it work with CRIU. --Andy > Oleg. > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 18:33 ` Andy Lutomirski @ 2015-03-13 15:22 ` Oleg Nesterov 0 siblings, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-13 15:22 UTC (permalink / raw) To: Andy Lutomirski; +Cc: linux-kernel On 03/12, Andy Lutomirski wrote: > > What's wrong with off-topic? :) OK ;) > As vvar demonstrates, it's possible to add new per-process vmas that > show up in all processes automatically. The trickiest part is making > it work with CRIU. Ah... I seem to finaly understand why exactly you added this "vvar" vma... To make the life of c/r developers more interesting, right? Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 16:54 ` Oleg Nesterov 2015-03-12 17:17 ` Andy Lutomirski @ 2015-03-12 17:46 ` Oleg Nesterov 2015-03-12 17:54 ` Andy Lutomirski ` (2 more replies) 1 sibling, 3 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-12 17:46 UTC (permalink / raw) To: Andy Lutomirski Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel On 03/12, Oleg Nesterov wrote: > > Yes, this is true. OK, lets not dump it. OTOH. We can probably add ->access() into special_mapping_vmops, this way __access_remote_vm() could work even if gup() fails ? Jan, Sergio. How much do we want do dump this area ? The change above should be justified. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 17:46 ` Oleg Nesterov @ 2015-03-12 17:54 ` Andy Lutomirski 2015-03-12 18:05 ` Oleg Nesterov 2015-03-12 18:19 ` Pedro Alves 2015-03-16 19:01 ` install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) Oleg Nesterov 2 siblings, 1 reply; 31+ messages in thread From: Andy Lutomirski @ 2015-03-12 17:54 UTC (permalink / raw) To: Oleg Nesterov Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel On Thu, Mar 12, 2015 at 10:46 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 03/12, Oleg Nesterov wrote: >> >> Yes, this is true. OK, lets not dump it. > > OTOH. We can probably add ->access() into special_mapping_vmops, this > way __access_remote_vm() could work even if gup() fails ? Let's wait until my special_mapping vmops rework lands to do that. I'll dust it off and resubmit it. > > Jan, Sergio. How much do we want do dump this area ? The change above > should be justified. > > Oleg. > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 17:54 ` Andy Lutomirski @ 2015-03-12 18:05 ` Oleg Nesterov 2015-03-12 18:23 ` Sergio Durigan Junior 0 siblings, 1 reply; 31+ messages in thread From: Oleg Nesterov @ 2015-03-12 18:05 UTC (permalink / raw) To: Andy Lutomirski Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel On 03/12, Andy Lutomirski wrote: > > On Thu, Mar 12, 2015 at 10:46 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > On 03/12, Oleg Nesterov wrote: > >> > >> Yes, this is true. OK, lets not dump it. > > > > OTOH. We can probably add ->access() into special_mapping_vmops, this > > way __access_remote_vm() could work even if gup() fails ? > > Let's wait until my special_mapping vmops rework lands to do that. > I'll dust it off and resubmit it. OK. Please CC me. Not that I think I can help, just I want to understand what you are going to do. Although currently I do not even read most of emails I get ;) Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 18:05 ` Oleg Nesterov @ 2015-03-12 18:23 ` Sergio Durigan Junior 0 siblings, 0 replies; 31+ messages in thread From: Sergio Durigan Junior @ 2015-03-12 18:23 UTC (permalink / raw) To: Oleg Nesterov Cc: Andy Lutomirski, Jan Kratochvil, GDB Patches, Pedro Alves, linux-kernel On Thursday, March 12 2015, Oleg Nesterov wrote: >> Let's wait until my special_mapping vmops rework lands to do that. >> I'll dust it off and resubmit it. > > OK. Please CC me. Not that I think I can help, just I want to understand > what you are going to do. If you think your work will impact the core dumping part, please include me in the Cc as well. -- Sergio GPG key ID: 0x65FC5E36 Please send encrypted e-mail if possible http://sergiodj.net/ ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 17:46 ` Oleg Nesterov 2015-03-12 17:54 ` Andy Lutomirski @ 2015-03-12 18:19 ` Pedro Alves 2015-03-12 18:25 ` Andy Lutomirski 2015-03-16 19:01 ` install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) Oleg Nesterov 2 siblings, 1 reply; 31+ messages in thread From: Pedro Alves @ 2015-03-12 18:19 UTC (permalink / raw) To: Oleg Nesterov, Andy Lutomirski Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, linux-kernel On 03/12/2015 05:46 PM, Oleg Nesterov wrote: > On 03/12, Oleg Nesterov wrote: >> >> Yes, this is true. OK, lets not dump it. > > OTOH. We can probably add ->access() into special_mapping_vmops, this > way __access_remote_vm() could work even if gup() fails ? > > Jan, Sergio. How much do we want do dump this area ? The change above > should be justified. Memory mappings that weren't touched since they were initially mapped can be retrieved from the program binary and the shared libraries, even if the core dump is moved to another machine. However, in vvar case, sounds like there's nowhere to read it from offline? In that case, it could be justified to dump it. Thanks, Pedro Alves ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: vvar, gup && coredump 2015-03-12 18:19 ` Pedro Alves @ 2015-03-12 18:25 ` Andy Lutomirski 0 siblings, 0 replies; 31+ messages in thread From: Andy Lutomirski @ 2015-03-12 18:25 UTC (permalink / raw) To: Pedro Alves Cc: Oleg Nesterov, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, linux-kernel On Thu, Mar 12, 2015 at 11:19 AM, Pedro Alves <palves@redhat.com> wrote: > On 03/12/2015 05:46 PM, Oleg Nesterov wrote: >> On 03/12, Oleg Nesterov wrote: >>> >>> Yes, this is true. OK, lets not dump it. >> >> OTOH. We can probably add ->access() into special_mapping_vmops, this >> way __access_remote_vm() could work even if gup() fails ? >> >> Jan, Sergio. How much do we want do dump this area ? The change above >> should be justified. > > Memory mappings that weren't touched since they were initially mapped can > be retrieved from the program binary and the shared libraries, even if > the core dump is moved to another machine. However, in vvar case, > sounds like there's nowhere to read it from offline? In that case, > it could be justified to dump it. This is why we currently dump the vdso text. On arm64 (the only other architecture that uses a real vma for vvar data IIRC), we use a more normal vma and we dump it. x86 is the odd one out here. We could just leave the kernel alone. The data that gets dumped is of dubious value, but it could be slightly helpful when debugging vdso crashes*, but, of course, dumping it is inherently racy. * The vdso never crashes :) --Andy > > Thanks, > Pedro Alves > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 31+ messages in thread
* install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) 2015-03-12 17:46 ` Oleg Nesterov 2015-03-12 17:54 ` Andy Lutomirski 2015-03-12 18:19 ` Pedro Alves @ 2015-03-16 19:01 ` Oleg Nesterov 2015-03-16 19:20 ` Andy Lutomirski 2015-03-16 19:40 ` Pedro Alves 2 siblings, 2 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-16 19:01 UTC (permalink / raw) To: Andy Lutomirski, Hugh Dickins, Linus Torvalds Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel On 03/12, Oleg Nesterov wrote: > > OTOH. We can probably add ->access() into special_mapping_vmops, this > way __access_remote_vm() could work even if gup() fails ? So I tried to think how special_mapping_vmops->access() can work, it needs to rely on ->vm_pgoff. But afaics this logic is just broken. Lets even forget about vvar vma which uses remap_file_pages(). Lets look at "[vdso]" which uses the "normal" pages. The comment in special_mapping_fault() says * special mappings have no vm_file, and in that case, the mm * uses vm_pgoff internally. Yes. But afaics mm/ doesn't do this correctly. So * do not copy this code into drivers! looks like a good recommendation ;) I think that this logic is wrong even if ARRAY_SIZE(pages) == 1, but I am not sure. But since vdso use 2 pages, it is trivial to show that this logic is wrong. To verify, I changed show_map_vma() to expose pgoff even if !file, but this test-case can show the problem too: #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <string.h> #include <sys/mman.h> #include <assert.h> void *find_vdso_vaddr(void) { FILE *perl; char buf[32] = {}; perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;" "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r"); fread(buf, sizeof(buf), 1, perl); fclose(perl); return (void *)atol(buf); } #define PAGE_SIZE 4096 int main(void) { void *vdso = find_vdso_vaddr(); assert(vdso); // of course they should differ, and they do so far printf("vdso pages differ: %d\n", !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE)); // split into 2 vma's assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0); // force another fault on the next check assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0); // now they no longer differ, the 2nd vm_pgoff is wrong printf("vdso pages differ: %d\n", !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE)); return 0; } output: vdso pages differ: 1 vdso pages differ: 0 And not only "split_vma" is wrong, I think that "move_vma" is not right too. Note this check in copy_vma(), /* * If anonymous vma has not yet been faulted, update new pgoff * to match new location, to increase its chance of merging. */ if (unlikely(!vma->vm_file && !vma->anon_vma)) { pgoff = addr >> PAGE_SHIFT; faulted_in_anon_vma = false; } I can easily misread this code. But it doesn't look right too. If vdso was cow'ed (breakpoint installed by gdb) and sys_nremap()'ed, then the new pgoff will be wrong too after, say, MADV_DONTNEED. Or I am totally confused? Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) 2015-03-16 19:01 ` install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) Oleg Nesterov @ 2015-03-16 19:20 ` Andy Lutomirski 2015-03-16 19:40 ` Pedro Alves 1 sibling, 0 replies; 31+ messages in thread From: Andy Lutomirski @ 2015-03-16 19:20 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Linus Torvalds, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel, linux-mm [cc: linux-mm] On Mon, Mar 16, 2015 at 12:01 PM, Oleg Nesterov <oleg@redhat.com> wrote: > On 03/12, Oleg Nesterov wrote: >> >> OTOH. We can probably add ->access() into special_mapping_vmops, this >> way __access_remote_vm() could work even if gup() fails ? > > So I tried to think how special_mapping_vmops->access() can work, it > needs to rely on ->vm_pgoff. > > But afaics this logic is just broken. Lets even forget about vvar vma > which uses remap_file_pages(). Lets look at "[vdso]" which uses the > "normal" pages. > > The comment in special_mapping_fault() says > > * special mappings have no vm_file, and in that case, the mm > * uses vm_pgoff internally. > > Yes. But afaics mm/ doesn't do this correctly. So > > * do not copy this code into drivers! > > looks like a good recommendation ;) > > I think that this logic is wrong even if ARRAY_SIZE(pages) == 1, but I am > not sure. But since vdso use 2 pages, it is trivial to show that this logic > is wrong. To verify, I changed show_map_vma() to expose pgoff even if !file, > but this test-case can show the problem too: > > #include <stdio.h> > #include <unistd.h> > #include <stdlib.h> > #include <string.h> > #include <sys/mman.h> > #include <assert.h> > > void *find_vdso_vaddr(void) > { > FILE *perl; > char buf[32] = {}; > > perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;" > "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r"); > fread(buf, sizeof(buf), 1, perl); > fclose(perl); > > return (void *)atol(buf); > } > > #define PAGE_SIZE 4096 > > int main(void) > { > void *vdso = find_vdso_vaddr(); > assert(vdso); > > // of course they should differ, and they do so far > printf("vdso pages differ: %d\n", > !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE)); > > // split into 2 vma's > assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0); > > // force another fault on the next check > assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0); I really hope this doesn't do anything (or fails) on the vvar page, which is a pfnmap. > > // now they no longer differ, the 2nd vm_pgoff is wrong > printf("vdso pages differ: %d\n", > !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE)); > > return 0; > } > > output: > > vdso pages differ: 1 > vdso pages differ: 0 > > And not only "split_vma" is wrong, I think that "move_vma" is not right too. > Note this check in copy_vma(), > > /* > * If anonymous vma has not yet been faulted, update new pgoff > * to match new location, to increase its chance of merging. > */ > if (unlikely(!vma->vm_file && !vma->anon_vma)) { > pgoff = addr >> PAGE_SHIFT; > faulted_in_anon_vma = false; > } > > I can easily misread this code. But it doesn't look right too. If vdso was cow'ed > (breakpoint installed by gdb) and sys_nremap()'ed, then the new pgoff will be wrong > too after, say, MADV_DONTNEED. > > Or I am totally confused? Ick, you're probably right. For what it's worth, the vdso *seems* to be okay (on 64-bit only, and only if you don't poke at it too hard) if you mremap it in one piece. CRIU does that. What does the mm code do with vm_pgoff for vmas with no vm_file? I'm mystified. There's this comment: * The way we recognize COWed pages within VM_PFNMAP mappings is through the * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit * set, and the vm_pgoff will point to the first PFN mapped: thus every special * mapping will always honor the rule * * pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT) Is that referring to special mappings in the install_special_mapping sense or to something else. FWIW, the vdso ins't a VM_PFNMAP at all. --Andy ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) @ 2015-03-16 19:20 ` Andy Lutomirski 0 siblings, 0 replies; 31+ messages in thread From: Andy Lutomirski @ 2015-03-16 19:20 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Linus Torvalds, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel, linux-mm [cc: linux-mm] On Mon, Mar 16, 2015 at 12:01 PM, Oleg Nesterov <oleg@redhat.com> wrote: > On 03/12, Oleg Nesterov wrote: >> >> OTOH. We can probably add ->access() into special_mapping_vmops, this >> way __access_remote_vm() could work even if gup() fails ? > > So I tried to think how special_mapping_vmops->access() can work, it > needs to rely on ->vm_pgoff. > > But afaics this logic is just broken. Lets even forget about vvar vma > which uses remap_file_pages(). Lets look at "[vdso]" which uses the > "normal" pages. > > The comment in special_mapping_fault() says > > * special mappings have no vm_file, and in that case, the mm > * uses vm_pgoff internally. > > Yes. But afaics mm/ doesn't do this correctly. So > > * do not copy this code into drivers! > > looks like a good recommendation ;) > > I think that this logic is wrong even if ARRAY_SIZE(pages) == 1, but I am > not sure. But since vdso use 2 pages, it is trivial to show that this logic > is wrong. To verify, I changed show_map_vma() to expose pgoff even if !file, > but this test-case can show the problem too: > > #include <stdio.h> > #include <unistd.h> > #include <stdlib.h> > #include <string.h> > #include <sys/mman.h> > #include <assert.h> > > void *find_vdso_vaddr(void) > { > FILE *perl; > char buf[32] = {}; > > perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;" > "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r"); > fread(buf, sizeof(buf), 1, perl); > fclose(perl); > > return (void *)atol(buf); > } > > #define PAGE_SIZE 4096 > > int main(void) > { > void *vdso = find_vdso_vaddr(); > assert(vdso); > > // of course they should differ, and they do so far > printf("vdso pages differ: %d\n", > !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE)); > > // split into 2 vma's > assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0); > > // force another fault on the next check > assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0); I really hope this doesn't do anything (or fails) on the vvar page, which is a pfnmap. > > // now they no longer differ, the 2nd vm_pgoff is wrong > printf("vdso pages differ: %d\n", > !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE)); > > return 0; > } > > output: > > vdso pages differ: 1 > vdso pages differ: 0 > > And not only "split_vma" is wrong, I think that "move_vma" is not right too. > Note this check in copy_vma(), > > /* > * If anonymous vma has not yet been faulted, update new pgoff > * to match new location, to increase its chance of merging. > */ > if (unlikely(!vma->vm_file && !vma->anon_vma)) { > pgoff = addr >> PAGE_SHIFT; > faulted_in_anon_vma = false; > } > > I can easily misread this code. But it doesn't look right too. If vdso was cow'ed > (breakpoint installed by gdb) and sys_nremap()'ed, then the new pgoff will be wrong > too after, say, MADV_DONTNEED. > > Or I am totally confused? Ick, you're probably right. For what it's worth, the vdso *seems* to be okay (on 64-bit only, and only if you don't poke at it too hard) if you mremap it in one piece. CRIU does that. What does the mm code do with vm_pgoff for vmas with no vm_file? I'm mystified. There's this comment: * The way we recognize COWed pages within VM_PFNMAP mappings is through the * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit * set, and the vm_pgoff will point to the first PFN mapped: thus every special * mapping will always honor the rule * * pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT) Is that referring to special mappings in the install_special_mapping sense or to something else. FWIW, the vdso ins't a VM_PFNMAP at all. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) 2015-03-16 19:20 ` Andy Lutomirski @ 2015-03-16 19:44 ` Oleg Nesterov -1 siblings, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-16 19:44 UTC (permalink / raw) To: Andy Lutomirski Cc: Hugh Dickins, Linus Torvalds, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel, linux-mm On 03/16, Andy Lutomirski wrote: > > Ick, you're probably right. For what it's worth, the vdso *seems* to > be okay (on 64-bit only, and only if you don't poke at it too hard) if > you mremap it in one piece. CRIU does that. I need to run away till tomorrow, but looking at this code even if "one piece" case doesn't look right if it was cow'ed. I'll verify tomorrow. Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) @ 2015-03-16 19:44 ` Oleg Nesterov 0 siblings, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-16 19:44 UTC (permalink / raw) To: Andy Lutomirski Cc: Hugh Dickins, Linus Torvalds, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel, linux-mm On 03/16, Andy Lutomirski wrote: > > Ick, you're probably right. For what it's worth, the vdso *seems* to > be okay (on 64-bit only, and only if you don't poke at it too hard) if > you mremap it in one piece. CRIU does that. I need to run away till tomorrow, but looking at this code even if "one piece" case doesn't look right if it was cow'ed. I'll verify tomorrow. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) 2015-03-16 19:44 ` Oleg Nesterov @ 2015-03-17 13:43 ` Oleg Nesterov -1 siblings, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-17 13:43 UTC (permalink / raw) To: Andy Lutomirski Cc: Hugh Dickins, Linus Torvalds, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel, linux-mm On 03/16, Oleg Nesterov wrote: > > On 03/16, Andy Lutomirski wrote: > > > > Ick, you're probably right. For what it's worth, the vdso *seems* to > > be okay (on 64-bit only, and only if you don't poke at it too hard) if > > you mremap it in one piece. CRIU does that. > > I need to run away till tomorrow, but looking at this code even if "one piece" > case doesn't look right if it was cow'ed. I'll verify tomorrow. And I am still not sure this all is 100% correct, but I got lost in this code. Probably this is fine... But at least the bug exposed by the test-case looks clear: do_linear_fault: vmf->pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; ... special_mapping_fault: pgoff = vmf->pgoff - vma->vm_pgoff; So special_mapping_fault() can only work if this mapping starts from the first page in ->pages[]. So perhaps we need _something like_ the (wrong/incomplete) patch below... Or, really, perhaps we can create vdso_mapping ? So that map_vdso() could simply mmap the anon_inode file... Oleg. --- x/mm/mmap.c +++ x/mm/mmap.c @@ -2832,6 +2832,8 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) return 0; } +bool is_special_vma(struct vm_area_struct *vma); + /* * Copy the vma structure to a new location in the same mm, * prior to moving page table entries, to effect an mremap move. @@ -2851,7 +2853,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, * If anonymous vma has not yet been faulted, update new pgoff * to match new location, to increase its chance of merging. */ - if (unlikely(!vma->vm_file && !vma->anon_vma)) { + if (unlikely(!vma->vm_file && !is_special_vma(vma) && !vma->anon_vma)) { pgoff = addr >> PAGE_SHIFT; faulted_in_anon_vma = false; } @@ -2953,6 +2955,11 @@ static const struct vm_operations_struct legacy_special_mapping_vmops = { .fault = special_mapping_fault, }; +bool is_special_vma(struct vm_area_struct *vma) +{ + return vma->vm_ops == &special_mapping_vmops; +} + static int special_mapping_fault(struct vm_area_struct *vma, struct vm_fault *vmf) { @@ -2965,7 +2972,7 @@ static int special_mapping_fault(struct vm_area_struct *vma, * We are allowed to do this because we are the mm; do not copy * this code into drivers! */ - pgoff = vmf->pgoff - vma->vm_pgoff; + pgoff = vmf->pgoff; if (vma->vm_ops == &legacy_special_mapping_vmops) pages = vma->vm_private_data; @@ -3014,6 +3021,7 @@ static struct vm_area_struct *__install_special_mapping( if (ret) goto out; + vma->vm_pgoff = 0; mm->total_vm += len >> PAGE_SHIFT; perf_event_mmap(vma); ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) @ 2015-03-17 13:43 ` Oleg Nesterov 0 siblings, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-17 13:43 UTC (permalink / raw) To: Andy Lutomirski Cc: Hugh Dickins, Linus Torvalds, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel, linux-mm On 03/16, Oleg Nesterov wrote: > > On 03/16, Andy Lutomirski wrote: > > > > Ick, you're probably right. For what it's worth, the vdso *seems* to > > be okay (on 64-bit only, and only if you don't poke at it too hard) if > > you mremap it in one piece. CRIU does that. > > I need to run away till tomorrow, but looking at this code even if "one piece" > case doesn't look right if it was cow'ed. I'll verify tomorrow. And I am still not sure this all is 100% correct, but I got lost in this code. Probably this is fine... But at least the bug exposed by the test-case looks clear: do_linear_fault: vmf->pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; ... special_mapping_fault: pgoff = vmf->pgoff - vma->vm_pgoff; So special_mapping_fault() can only work if this mapping starts from the first page in ->pages[]. So perhaps we need _something like_ the (wrong/incomplete) patch below... Or, really, perhaps we can create vdso_mapping ? So that map_vdso() could simply mmap the anon_inode file... Oleg. --- x/mm/mmap.c +++ x/mm/mmap.c @@ -2832,6 +2832,8 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) return 0; } +bool is_special_vma(struct vm_area_struct *vma); + /* * Copy the vma structure to a new location in the same mm, * prior to moving page table entries, to effect an mremap move. @@ -2851,7 +2853,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, * If anonymous vma has not yet been faulted, update new pgoff * to match new location, to increase its chance of merging. */ - if (unlikely(!vma->vm_file && !vma->anon_vma)) { + if (unlikely(!vma->vm_file && !is_special_vma(vma) && !vma->anon_vma)) { pgoff = addr >> PAGE_SHIFT; faulted_in_anon_vma = false; } @@ -2953,6 +2955,11 @@ static const struct vm_operations_struct legacy_special_mapping_vmops = { .fault = special_mapping_fault, }; +bool is_special_vma(struct vm_area_struct *vma) +{ + return vma->vm_ops == &special_mapping_vmops; +} + static int special_mapping_fault(struct vm_area_struct *vma, struct vm_fault *vmf) { @@ -2965,7 +2972,7 @@ static int special_mapping_fault(struct vm_area_struct *vma, * We are allowed to do this because we are the mm; do not copy * this code into drivers! */ - pgoff = vmf->pgoff - vma->vm_pgoff; + pgoff = vmf->pgoff; if (vma->vm_ops == &legacy_special_mapping_vmops) pages = vma->vm_private_data; @@ -3014,6 +3021,7 @@ static struct vm_area_struct *__install_special_mapping( if (ret) goto out; + vma->vm_pgoff = 0; mm->total_vm += len >> PAGE_SHIFT; perf_event_mmap(vma); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) 2015-03-17 13:43 ` Oleg Nesterov @ 2015-03-18 1:44 ` Andy Lutomirski -1 siblings, 0 replies; 31+ messages in thread From: Andy Lutomirski @ 2015-03-18 1:44 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Linus Torvalds, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel, linux-mm On Tue, Mar 17, 2015 at 6:43 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 03/16, Oleg Nesterov wrote: >> >> On 03/16, Andy Lutomirski wrote: >> > >> > Ick, you're probably right. For what it's worth, the vdso *seems* to >> > be okay (on 64-bit only, and only if you don't poke at it too hard) if >> > you mremap it in one piece. CRIU does that. >> >> I need to run away till tomorrow, but looking at this code even if "one piece" >> case doesn't look right if it was cow'ed. I'll verify tomorrow. > > And I am still not sure this all is 100% correct, but I got lost in this code. > Probably this is fine... > > But at least the bug exposed by the test-case looks clear: > > do_linear_fault: > > vmf->pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) > + vma->vm_pgoff; > ... > > special_mapping_fault: > > pgoff = vmf->pgoff - vma->vm_pgoff; > > > So special_mapping_fault() can only work if this mapping starts from the > first page in ->pages[]. > > So perhaps we need _something like_ the (wrong/incomplete) patch below... > > Or, really, perhaps we can create vdso_mapping ? So that map_vdso() could > simply mmap the anon_inode file... That's slightly tricky, I think, because it could start showing up in /proc/PID/map_files or whatever it's called, and I don't think we want that. I also don't want to commit to all special mappings everywhere being semantically identical (there are already two kinds on both x86 and arm64, and I'd eventually like to have them vary per-process as well). None of that precludes using non-null vm_file, but it's a complication. Your patch does look like a considerable improvement, though. Let me see if I can find some time to fold it in with the rest of my special mapping rework over the next few days. --Andy > > Oleg. > > --- x/mm/mmap.c > +++ x/mm/mmap.c > @@ -2832,6 +2832,8 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) > return 0; > } > > +bool is_special_vma(struct vm_area_struct *vma); > + > /* > * Copy the vma structure to a new location in the same mm, > * prior to moving page table entries, to effect an mremap move. > @@ -2851,7 +2853,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, > * If anonymous vma has not yet been faulted, update new pgoff > * to match new location, to increase its chance of merging. > */ > - if (unlikely(!vma->vm_file && !vma->anon_vma)) { > + if (unlikely(!vma->vm_file && !is_special_vma(vma) && !vma->anon_vma)) { > pgoff = addr >> PAGE_SHIFT; > faulted_in_anon_vma = false; > } > @@ -2953,6 +2955,11 @@ static const struct vm_operations_struct legacy_special_mapping_vmops = { > .fault = special_mapping_fault, > }; > > +bool is_special_vma(struct vm_area_struct *vma) > +{ > + return vma->vm_ops == &special_mapping_vmops; > +} > + > static int special_mapping_fault(struct vm_area_struct *vma, > struct vm_fault *vmf) > { > @@ -2965,7 +2972,7 @@ static int special_mapping_fault(struct vm_area_struct *vma, > * We are allowed to do this because we are the mm; do not copy > * this code into drivers! > */ > - pgoff = vmf->pgoff - vma->vm_pgoff; > + pgoff = vmf->pgoff; > > if (vma->vm_ops == &legacy_special_mapping_vmops) > pages = vma->vm_private_data; > @@ -3014,6 +3021,7 @@ static struct vm_area_struct *__install_special_mapping( > if (ret) > goto out; > > + vma->vm_pgoff = 0; > mm->total_vm += len >> PAGE_SHIFT; > > perf_event_mmap(vma); > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) @ 2015-03-18 1:44 ` Andy Lutomirski 0 siblings, 0 replies; 31+ messages in thread From: Andy Lutomirski @ 2015-03-18 1:44 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Linus Torvalds, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel, linux-mm On Tue, Mar 17, 2015 at 6:43 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 03/16, Oleg Nesterov wrote: >> >> On 03/16, Andy Lutomirski wrote: >> > >> > Ick, you're probably right. For what it's worth, the vdso *seems* to >> > be okay (on 64-bit only, and only if you don't poke at it too hard) if >> > you mremap it in one piece. CRIU does that. >> >> I need to run away till tomorrow, but looking at this code even if "one piece" >> case doesn't look right if it was cow'ed. I'll verify tomorrow. > > And I am still not sure this all is 100% correct, but I got lost in this code. > Probably this is fine... > > But at least the bug exposed by the test-case looks clear: > > do_linear_fault: > > vmf->pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) > + vma->vm_pgoff; > ... > > special_mapping_fault: > > pgoff = vmf->pgoff - vma->vm_pgoff; > > > So special_mapping_fault() can only work if this mapping starts from the > first page in ->pages[]. > > So perhaps we need _something like_ the (wrong/incomplete) patch below... > > Or, really, perhaps we can create vdso_mapping ? So that map_vdso() could > simply mmap the anon_inode file... That's slightly tricky, I think, because it could start showing up in /proc/PID/map_files or whatever it's called, and I don't think we want that. I also don't want to commit to all special mappings everywhere being semantically identical (there are already two kinds on both x86 and arm64, and I'd eventually like to have them vary per-process as well). None of that precludes using non-null vm_file, but it's a complication. Your patch does look like a considerable improvement, though. Let me see if I can find some time to fold it in with the rest of my special mapping rework over the next few days. --Andy > > Oleg. > > --- x/mm/mmap.c > +++ x/mm/mmap.c > @@ -2832,6 +2832,8 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) > return 0; > } > > +bool is_special_vma(struct vm_area_struct *vma); > + > /* > * Copy the vma structure to a new location in the same mm, > * prior to moving page table entries, to effect an mremap move. > @@ -2851,7 +2853,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, > * If anonymous vma has not yet been faulted, update new pgoff > * to match new location, to increase its chance of merging. > */ > - if (unlikely(!vma->vm_file && !vma->anon_vma)) { > + if (unlikely(!vma->vm_file && !is_special_vma(vma) && !vma->anon_vma)) { > pgoff = addr >> PAGE_SHIFT; > faulted_in_anon_vma = false; > } > @@ -2953,6 +2955,11 @@ static const struct vm_operations_struct legacy_special_mapping_vmops = { > .fault = special_mapping_fault, > }; > > +bool is_special_vma(struct vm_area_struct *vma) > +{ > + return vma->vm_ops == &special_mapping_vmops; > +} > + > static int special_mapping_fault(struct vm_area_struct *vma, > struct vm_fault *vmf) > { > @@ -2965,7 +2972,7 @@ static int special_mapping_fault(struct vm_area_struct *vma, > * We are allowed to do this because we are the mm; do not copy > * this code into drivers! > */ > - pgoff = vmf->pgoff - vma->vm_pgoff; > + pgoff = vmf->pgoff; > > if (vma->vm_ops == &legacy_special_mapping_vmops) > pages = vma->vm_private_data; > @@ -3014,6 +3021,7 @@ static struct vm_area_struct *__install_special_mapping( > if (ret) > goto out; > > + vma->vm_pgoff = 0; > mm->total_vm += len >> PAGE_SHIFT; > > perf_event_mmap(vma); > -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) 2015-03-18 1:44 ` Andy Lutomirski @ 2015-03-18 18:06 ` Oleg Nesterov -1 siblings, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-18 18:06 UTC (permalink / raw) To: Andy Lutomirski Cc: Hugh Dickins, Linus Torvalds, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel, linux-mm On 03/17, Andy Lutomirski wrote: > > On Tue, Mar 17, 2015 at 6:43 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > But at least the bug exposed by the test-case looks clear: > > > > do_linear_fault: > > > > vmf->pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) > > + vma->vm_pgoff; > > ... > > > > special_mapping_fault: > > > > pgoff = vmf->pgoff - vma->vm_pgoff; > > > > > > So special_mapping_fault() can only work if this mapping starts from the > > first page in ->pages[]. > > > > So perhaps we need _something like_ the (wrong/incomplete) patch below... > > > > Or, really, perhaps we can create vdso_mapping ? So that map_vdso() could > > simply mmap the anon_inode file... > > That's slightly tricky, I think, because it could start showing up in > /proc/PID/map_files or whatever it's called, and I don't think we want > that. Hmm. To me this looke liks improvement. And again, with this change uprobe-in-vdso can work. OK, this is off-topic right now, lets forget this for the moment. > Your patch does look like a considerable improvement, though. Let me > see if I can find some time to fold it in with the rest of my special > mapping rework over the next few days. I'll try to recheck... Perhaps I'll send this (changed) patch for review. This is a bugfix, even if the bug is minor. And note that with this change vvar->access() becomes trivial. I think it makes sense to fix "gup() fails in vvar" too. Gdb developers have enough other problems with the poor kernel interfaces ;) Oleg. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) @ 2015-03-18 18:06 ` Oleg Nesterov 0 siblings, 0 replies; 31+ messages in thread From: Oleg Nesterov @ 2015-03-18 18:06 UTC (permalink / raw) To: Andy Lutomirski Cc: Hugh Dickins, Linus Torvalds, Jan Kratochvil, Sergio Durigan Junior, GDB Patches, Pedro Alves, linux-kernel, linux-mm On 03/17, Andy Lutomirski wrote: > > On Tue, Mar 17, 2015 at 6:43 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > > > But at least the bug exposed by the test-case looks clear: > > > > do_linear_fault: > > > > vmf->pgoff = (((address & PAGE_MASK) - vma->vm_start) >> PAGE_SHIFT) > > + vma->vm_pgoff; > > ... > > > > special_mapping_fault: > > > > pgoff = vmf->pgoff - vma->vm_pgoff; > > > > > > So special_mapping_fault() can only work if this mapping starts from the > > first page in ->pages[]. > > > > So perhaps we need _something like_ the (wrong/incomplete) patch below... > > > > Or, really, perhaps we can create vdso_mapping ? So that map_vdso() could > > simply mmap the anon_inode file... > > That's slightly tricky, I think, because it could start showing up in > /proc/PID/map_files or whatever it's called, and I don't think we want > that. Hmm. To me this looke liks improvement. And again, with this change uprobe-in-vdso can work. OK, this is off-topic right now, lets forget this for the moment. > Your patch does look like a considerable improvement, though. Let me > see if I can find some time to fold it in with the rest of my special > mapping rework over the next few days. I'll try to recheck... Perhaps I'll send this (changed) patch for review. This is a bugfix, even if the bug is minor. And note that with this change vvar->access() becomes trivial. I think it makes sense to fix "gup() fails in vvar" too. Gdb developers have enough other problems with the poor kernel interfaces ;) Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) 2015-03-16 19:01 ` install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) Oleg Nesterov 2015-03-16 19:20 ` Andy Lutomirski @ 2015-03-16 19:40 ` Pedro Alves 1 sibling, 0 replies; 31+ messages in thread From: Pedro Alves @ 2015-03-16 19:40 UTC (permalink / raw) To: Oleg Nesterov, Andy Lutomirski, Hugh Dickins, Linus Torvalds Cc: Jan Kratochvil, Sergio Durigan Junior, GDB Patches, linux-kernel Thanks for looking over all this, guys. Really appreciated. On 03/16/2015 07:01 PM, Oleg Nesterov wrote: > is wrong. To verify, I changed show_map_vma() to expose pgoff even if !file, > but this test-case can show the problem too: Might be good to add tests like this to selftests/ once all this is sorted. Pedro Alves ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2015-03-18 18:51 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <878ufc9kau.fsf@redhat.com> [not found] ` <20150305154827.GA9441@host1.jankratochvil.net> [not found] ` <87zj7r5fpz.fsf@redhat.com> [not found] ` <20150305205744.GA13165@host1.jankratochvil.net> [not found] ` <20150311200052.GA22654@redhat.com> 2015-03-12 14:34 ` vvar, gup && coredump Oleg Nesterov 2015-03-12 16:29 ` Andy Lutomirski 2015-03-12 16:54 ` Oleg Nesterov 2015-03-12 17:17 ` Andy Lutomirski 2015-03-12 17:39 ` Oleg Nesterov 2015-03-12 17:45 ` Sergio Durigan Junior 2015-03-12 18:02 ` Oleg Nesterov 2015-03-13 4:50 ` Sergio Durigan Junior 2015-03-13 15:04 ` Oleg Nesterov 2015-03-12 17:55 ` Andy Lutomirski 2015-03-12 18:08 ` Oleg Nesterov 2015-03-12 18:33 ` Andy Lutomirski 2015-03-13 15:22 ` Oleg Nesterov 2015-03-12 17:46 ` Oleg Nesterov 2015-03-12 17:54 ` Andy Lutomirski 2015-03-12 18:05 ` Oleg Nesterov 2015-03-12 18:23 ` Sergio Durigan Junior 2015-03-12 18:19 ` Pedro Alves 2015-03-12 18:25 ` Andy Lutomirski 2015-03-16 19:01 ` install_special_mapping && vm_pgoff (Was: vvar, gup && coredump) Oleg Nesterov 2015-03-16 19:20 ` Andy Lutomirski 2015-03-16 19:20 ` Andy Lutomirski 2015-03-16 19:44 ` Oleg Nesterov 2015-03-16 19:44 ` Oleg Nesterov 2015-03-17 13:43 ` Oleg Nesterov 2015-03-17 13:43 ` Oleg Nesterov 2015-03-18 1:44 ` Andy Lutomirski 2015-03-18 1:44 ` Andy Lutomirski 2015-03-18 18:06 ` Oleg Nesterov 2015-03-18 18:06 ` Oleg Nesterov 2015-03-16 19:40 ` Pedro Alves
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.