* [Virtio-fs] Problem using gdb/lldb on a binary residing on a DAX-enabled volume
@ 2021-08-23  9:31 Sergio Lopez
  2021-08-24  5:04 ` Chirantan Ekbote
  0 siblings, 1 reply; 6+ messages in thread
From: Sergio Lopez @ 2021-08-23  9:31 UTC (permalink / raw)
  To: virtio-fs-list


Hi,

I've noticed that trying to use gdb/lldb on any binary residing on a
DAX-enabled virtio-fs volume leads to a SIGSEGV in userspace...

[root@fedora ~]# mount -t virtiofs -o dax,context=system_u:object_r:root_t:s0 /dev/root /mnt/virtio-fs
[root@fedora ~]# gdb /mnt/virtio-fs/true 
GNU gdb (GDB) Fedora 10.2-3.fc34
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /mnt/virtio-fs/true...
Missing separate debuginfo for /mnt/virtio-fs/true
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/25/47d5e7307ccc0f5a75e5990afdc5e15293ccc1.debug
Reading symbols from .gnu_debugdata for /mnt/virtio-fs/true...
(No debugging symbols found in .gnu_debugdata for /mnt/virtio-fs/true)
(gdb) break main
Breakpoint 1 at 0x14c0
(gdb) r
Starting program: /mnt/virtio-fs/true 
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.33-5.fc34.x86_64

Program received signal SIGSEGV, Segmentation fault.
0x0000555555555960 in _start ()


... while the kernel drops a warning:


[   39.962629] ------------[ cut here ]------------
[   39.963854] WARNING: CPU: 1 PID: 718 at mm/memory.c:2666 wp_page_copy+0x565/0x5d0
[   39.965736] Modules linked in: virtio_net net_failover i2c_i801 failover i2c_smbus joydev drm ip_tablew
[   39.969040] CPU: 1 PID: 718 Comm: gdb Not tainted 5.11.12-300.fc34.x86_64 #1
[   39.970800] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebui4
[   39.973535] RIP: 0010:wp_page_copy+0x565/0x5d0
[   39.974675] Code: 89 7b 48 48 8b 43 30 49 39 07 75 76 48 8b 7c 24 08 48 8b 74 24 10 ba 00 10 00 00 e8 5
[   39.979165] RSP: 0018:ffffa3aa40cb7bc0 EFLAGS: 00010206
[   39.980483] RAX: 0000000000001000 RBX: ffffa3aa40cb7c60 RCX: 0000000000001000
[   39.982244] RDX: 0000000000001000 RSI: 0000555555555000 RDI: ffff898117a9f000
[   39.984015] RBP: ffffe934045ea7c0 R08: ffffe934041c7f68 R09: 0000000000000001
[   39.985780] R10: 0000000000117a9f R11: ffff89810825dd20 R12: ffff898105829770
[   39.987552] R13: 0000000000000000 R14: ffff898106c11dc0 R15: 0000000000000001
[   39.989324] FS:  00007f9691a91080(0000) GS:ffff898763a40000(0000) knlGS:0000000000000000
[   39.991319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   39.992760] CR2: 0000555555555000 CR3: 000000010422a000 CR4: 00000000000006e0
[   39.994535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   39.996312] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   39.998086] Call Trace:
[   39.998766]  handle_mm_fault+0x113f/0x1970
[   39.999828]  __get_user_pages+0x28c/0x7f0
[   40.000868]  __get_user_pages_remote+0xd4/0x320
[   40.002028]  __access_remote_vm+0x106/0x370
[   40.003099]  ptrace_access_vm+0x9d/0xd0
[   40.004100]  ptrace_request+0x560/0x620
[   40.005097]  arch_ptrace+0x13d/0x3b0
[   40.006039]  __x64_sys_ptrace+0x82/0x120
[   40.007050]  do_syscall_64+0x33/0x40
[   40.008002]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   40.009283] RIP: 0033:0x7f9694ce2aae
[   40.010220] Code: 70 41 83 f8 03 c7 44 24 10 08 00 00 00 48 89 44 24 18 48 8d 44 24 30 8b 70 08 4c 0f 4
[   40.014714] RSP: 002b:00007ffe0cffb4e0 EFLAGS: 00000202 ORIG_RAX: 0000000000000065
[   40.016577] RAX: ffffffffffffffda RBX: 00005555555554c0 RCX: 00007f9694ce2aae
[   40.018331] RDX: 00005555555554c0 RSI: 00000000000002ef RDI: 0000000000000005
[   40.020092] RBP: 0000000000000001 R08: 0000000000000004 R09: 00005555555554c0
[   40.021855] R10: 55415641fa1e0fcc R11: 0000000000000202 R12: 0000000000000001
[   40.023612] R13: 00007f9691a90848 R14: 00000000000002ef R15: 0000000000000000
[   40.025364] ---[ end trace 2273d62ee1978eb4 ]---


It seems like DAX breaks something in the ptrace_access_vm path. On a
volume without DAX it works fine.

The guest is a Fedora with a 5.11.12-300.fc34.x86_64 kernel.

Sergio.


* Re: [Virtio-fs] Problem using gdb/lldb on a binary residing on a DAX-enabled volume
  2021-08-23  9:31 [Virtio-fs] Problem using gdb/lldb on a binary residing on a DAX-enabled volume Sergio Lopez
@ 2021-08-24  5:04 ` Chirantan Ekbote
  2021-08-24 13:22   ` Sergio Lopez
  2021-08-24 21:11   ` Vivek Goyal
  0 siblings, 2 replies; 6+ messages in thread
From: Chirantan Ekbote @ 2021-08-24  5:04 UTC (permalink / raw)
  To: Sergio Lopez; +Cc: virtio-fs-list, David Riley, Sergey Senozhatsky, Yu Zhao

Hi Sergio,

On Mon, Aug 23, 2021 at 6:31 PM Sergio Lopez <slp@redhat.com> wrote:
>
> Hi,
>
> I've noticed that trying to use gdb/lldb on any binary residing on a
> DAX-enabled virtio-fs volume leads to a SIGSEGV in userspace...
>
> It seems like DAX breaks something in the ptrace_access_vm path. On a
> volume without DAX it works fine.
>

We've seen this as well and unfortunately it doesn't appear to be
limited to virtio-fs.  Using DAX on an ext4-formatted virtio-pmem disk
image has the same problem.  We've actually disabled DAX everywhere
because of this.

Unfortunately most of the details are in an internal bug report but
I'll try to extract the relevant bits here.  This is well outside my
depth so I've CC'd some of the people who have looked at this.  The
initial bug report was for virtio-pmem+ext4 so some of the details are
specific to pmem but I suspect something similar is happening for
virtio-fs as well.

The issue is that process_vm_readv() corrupts the memory of files that
are mmap()'d when DAX is enabled.
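
For concreteness, here is a rough, standalone sketch of the kind of
sequence described in the steps below. It's hedged: the /mnt/virtio-fs
path, the file name, the mapping flags and the exact trigger conditions
are assumptions on my part rather than details from the internal
report, so treat it as illustrative, not as a known-good reproducer.

/* cc -o vm_readv_dax vm_readv_dax.c ; ./vm_readv_dax /mnt/virtio-fs/true */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

#define SZ 4096

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/mnt/virtio-fs/true";
        int pfd[2];
        char buf[SZ];

        if (pipe(pfd)) { perror("pipe"); return 1; }

        pid_t child = fork();
        if (child == 0) {
                /*
                 * Child: map the file privately (only in this process, so
                 * the address is not valid in the parent), fault the page
                 * in (see step 3 below), then let the parent read it.
                 */
                char saved[SZ];
                int fd = open(path, O_RDONLY);
                char *map = mmap(NULL, SZ, PROT_READ, MAP_PRIVATE, fd, 0);
                if (fd < 0 || map == MAP_FAILED)
                        _exit(1);

                memcpy(saved, map, SZ);
                write(pfd[1], &map, sizeof(map));   /* tell parent where */
                sleep(1);                           /* wait for remote read */

                if (memcmp(saved, map, SZ))
                        fprintf(stderr, "child: mapping was corrupted\n");
                _exit(0);
        }

        /* Parent: read the child's mapping, as gdb/strace/ptrace would. */
        char *remote_addr;
        read(pfd[0], &remote_addr, sizeof(remote_addr));

        struct iovec local  = { .iov_base = buf,         .iov_len = SZ };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = SZ };
        if (process_vm_readv(child, &local, 1, &remote, 1, 0) < 0)
                perror("process_vm_readv");

        waitpid(child, NULL, 0);
        return 0;
}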

1. A filesystem is mounted with DAX enabled.  pmem_attach_disk() sets
pfn_flags to PFN_DEV|PFN_MAP.  In the fuse case, this appears to
happen here [1].
2. When the (strace/gdb/etc) process does its initial read of the
mmap()'d region, the pfn flags for the page are inherited from the
pmem structure set to PFN_DEV|PFN_MAP in step 1.  During a call to
insert_pfn(), pte_mkdevmap is called to mark the pte as devmap.
3. If you follow the ftrace of the process_vm_readv(), it eventually
reaches do_wp_page(). If the target process had not previously read
the page in, this would not call do_wp_page() and instead just fault
in the page normally through the ext4/dax logic.
4. do_wp_page() calls vm_normal_page() which returns NULL due to the
remote pte being marked special and devmap (from above).  If we just
ignore the devmap check and return the page that has been found and
allow the normal copy to occur, then no problem occurs.  However, that
can't be safely done in normal dax cases.  Due to vm_normal_page()
returning NULL, wp_page_copy() is called (first call site) with a null
vmf->address.  If the mmap()d file is originally from a non-dax
filesystem (eg tmpfs), the second wp_page_copy() ends up being called
with a valid vmf->address.
5. cow_user_page() ends up in this unexpected case since
src==vmf->address is NULL, delimited with the following comment:

        /*
         * If the source page was a PFN mapping, we don't have
         * a "struct page" for it. We do a best-effort copy by
         * just copying from the original user address. If that
         * fails, we just zero-fill it. Live with it.
         */

The end effect is a __copy_from_user_inatomic() call with an invalid
uaddr, because the uaddr comes from the remote address space.  This
results in another page fault, because that remote address isn't valid
in the process calling process_vm_readv().  It seems there are a few
issues here: a) it's trying to read from the remote address space as
if it were local, and b) the failure here corrupts the remote
process's memory rather than just returning an empty page, which
would be less broken.

In the good case of the mmap()d file being from tmpfs,
src==vmf->address is non-NULL and copy_user_highpage can properly do
the copy of the page.  At that point, the caller is able to copy data
from that page to its own local buffer and return data successfully,
as well as avoid corrupting the remote process.
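
For reference, here is a condensed, paraphrased sketch of the two
places in mm/memory.c involved in steps 4 and 5 above (around v5.11;
heavily abbreviated, not the verbatim kernel source), showing the
devmap check that hides the page and the best-effort copy / zero-fill
fallback:

/* vm_normal_page(): a devmap (DAX) pte has no "normal" struct page. */
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
                            pte_t pte)
{
        if (pte_special(pte)) {
                ...
                if (pte_devmap(pte))
                        return NULL;    /* step 4: do_wp_page() gets no page */
                ...
        }
        ...
        return pfn_to_page(pte_pfn(pte));
}

/* cow_user_page(): reached from wp_page_copy() with src == NULL here. */
static inline bool cow_user_page(struct page *dst, struct page *src,
                                 struct vm_fault *vmf)
{
        void *kaddr;
        void __user *uaddr;
        unsigned long addr = vmf->address;

        if (likely(src)) {
                /* Normal case (e.g. tmpfs): there is a struct page to copy. */
                copy_user_highpage(dst, src, addr, vmf->vma);
                return true;
        }

        /*
         * PFN-mapping case (DAX): no struct page, so fall back to a
         * best-effort copy from the faulting user address.  For a remote
         * access (ptrace/process_vm_readv) that address belongs to the
         * target mm, not to current, so the copy below fails.
         */
        kaddr = kmap_atomic(dst);
        uaddr = (void __user *)(addr & PAGE_MASK);

        if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
                /* retries under the PTE lock elided */
                WARN_ON_ONCE(1);     /* likely the WARNING in the trace above */
                clear_page(kaddr);   /* zero-fill: this is what the target keeps */
        }

        kunmap_atomic(kaddr);
        flush_dcache_page(dst);
        return true;
}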

We've also found that reverting "17839856fd58: gup: document and work
around "COW can break either way" issue" seems to make the problem go
away.

Chirantan

[1]: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/d5ae8d7f85b7f6f6e60f1af8ff4be52b0926fde1/fs/fuse/virtio_fs.c#741



* Re: [Virtio-fs] Problem using gdb/lldb on a binary residing on a DAX-enabled volume
  2021-08-24  5:04 ` Chirantan Ekbote
@ 2021-08-24 13:22   ` Sergio Lopez
  2021-08-24 18:09     ` Yu Zhao
  2021-08-24 21:11   ` Vivek Goyal
  1 sibling, 1 reply; 6+ messages in thread
From: Sergio Lopez @ 2021-08-24 13:22 UTC (permalink / raw)
  To: Chirantan Ekbote; +Cc: virtio-fs-list, David Riley, Sergey Senozhatsky, Yu Zhao


On Tue, Aug 24, 2021 at 02:04:45PM +0900, Chirantan Ekbote wrote:
> Hi Sergio,
> 
> On Mon, Aug 23, 2021 at 6:31 PM Sergio Lopez <slp@redhat.com> wrote:
> >
> > Hi,
> >
> > I've noticed that trying to use gdb/lldb on any binary residing on a
> > DAX-enabled virtio-fs volume leads to a SIGSEGV in userspace...
> >
> > It seems like DAX breaks something in the ptrace_access_vm path. On a
> > volume without DAX it works fine.
> >
> 
> We've seen this as well and unfortunately it doesn't appear to be
> limited to virtio-fs.  Using DAX on an ext4-formatted virtio-pmem disk
> image has the same problem.  We've actually disabled DAX everywhere
> because of this.
> 
> Unfortunately most of the details are in an internal bug report but
> I'll try to extract the relevant bits here.  This is well outside my
> depth so I've CC'd some of the people who have looked at this.  The
> initial bug report was for virtio-pmem+ext4 so some of the details are
> specific to pmem but I suspect something similar is happening for
> virtio-fs as well.
> 
> The issue is that process_vm_readv() corrupts the memory of files that
> are mmap()'d when DAX is enabled.
> 
> 1. A filesystem is mounted with DAX enabled.  pmem_attach_disk() sets
> pfn_flags to PFN_DEV|PFN_MAP.  In the fuse case, this appears to
> happen here [1].
> 2. When the (strace/gdb/etc) process does its initial read of the
> mmap()'d region, the pfn flags for the page are inherited from the
> pmem structure set to PFN_DEV|PFN_MAP in step 1.  During a call to
> insert_pfn(), pte_mkdevmap is called to mark the pte as devmap.
> 3. If you follow the ftrace of the process_vm_readv(), it eventually
> reaches do_wp_page(). If the target process had not previously read
> the page in, this would not call do_wp_page() and instead just fault
> in the page normally through the ext4/dax logic.
> 4. do_wp_page() calls vm_normal_page() which returns NULL due to the
> remote pte being marked special and devmap (from above).  If we just
> ignore the devmap check and return the page that has been found and
> allow the normal copy to occur, then no problem occurs.  However, that
> can't be safely done in normal dax cases.  Due to vm_normal_page()
> returning NULL, wp_page_copy() is called (first call site) with a null
> vmf->address.  If the mmap()d file is originally from a non-dax
> filesystem (eg tmpfs), the second wp_page_copy() ends up being called
> with a valid vmf->address.
> 5. cow_user_page() ends up in this unexpected case since
> src==vmf->address is NULL, delimited with the following comment:
> 
>         /*
>          * If the source page was a PFN mapping, we don't have
>          * a "struct page" for it. We do a best-effort copy by
>          * just copying from the original user address. If that
>          * fails, we just zero-fill it. Live with it.
>          */
> 
> The end effect is a __copy_from_user_inatomic() call with an invalid
> uaddr, because the uaddr comes from the remote address space.  This
> results in another page fault, because that remote address isn't valid
> in the process calling process_vm_readv().  It seems there are a few
> issues here: a) it's trying to read from the remote address space as
> if it were local, and b) the failure here corrupts the remote
> process's memory rather than just returning an empty page, which
> would be less broken.
> 
> In the good case of the mmap()d file being from tmpfs,
> src==vmf->address is non-NULL and copy_user_highpage can properly do
> the copy of the page.  At that point, the caller is able to copy data
> from that page to its own local buffer and return data successfully,
> as well as avoid corrupting the remote process.
> 
> We've also found that reverting "17839856fd58: gup: document and work
> around "COW can break either way" issue" seems to make the problem go
> away.

Thanks a lot for the super-detailed write-up. Do you know if someone
is already working on a fix that can be upstreamed?

Sergio.


* Re: [Virtio-fs] Problem using gdb/lldb on a binary residing on a DAX-enabled volume
  2021-08-24 13:22   ` Sergio Lopez
@ 2021-08-24 18:09     ` Yu Zhao
  2021-08-24 23:27       ` Andrea Arcangeli
  0 siblings, 1 reply; 6+ messages in thread
From: Yu Zhao @ 2021-08-24 18:09 UTC (permalink / raw)
  To: Sergio Lopez
  Cc: virtio-fs-list, David Riley, Sergey Senozhatsky, Andrea Arcangeli

On Tue, Aug 24, 2021 at 7:22 AM Sergio Lopez <slp@redhat.com> wrote:
>
> On Tue, Aug 24, 2021 at 02:04:45PM +0900, Chirantan Ekbote wrote:
> > Hi Sergio,
> >
> > On Mon, Aug 23, 2021 at 6:31 PM Sergio Lopez <slp@redhat.com> wrote:
> > >
> > > Hi,
> > >
> > > I've noticed that trying to use gdb/lldb on any binary residing on a
> > > DAX-enabled virtio-fs volume leads to a SIGSEGV in userspace...
> > >
> > > It seems like DAX breaks something in the ptrace_access_vm path. On a
> > > volume without DAX it works fine.
> > >
> >
> > We've seen this as well and unfortunately it doesn't appear to be
> > limited to virtio-fs.  Using DAX on an ext4-formatted virtio-pmem disk
> > image has the same problem.  We've actually disabled DAX everywhere
> > because of this.
> >
> > Unfortunately most of the details are in an internal bug report but
> > I'll try to extract the relevant bits here.  This is well outside my
> > depth so I've CC'd some of the people who have looked at this.  The
> > initial bug report was for virtio-pmem+ext4 so some of the details are
> > specific to pmem but I suspect something similar is happening for
> > virtio-fs as well.
> >
> > The issue is that process_vm_readv() corrupts the memory of files that
> > are mmap()'d when DAX is enabled.
> >
> > 1. A filesystem is mounted with DAX enabled.  pmem_attach_disk() sets
> > pfn_flags to PFN_DEV|PFN_MAP.  In the fuse case, this appears to
> > happen here [1].
> > 2. When the (strace/gdb/etc) process does its initial read of the
> > mmap()'d region, the pfn flags for the page are inherited from the
> > pmem structure set to PFN_DEV|PFN_MAP in step 1.  During a call to
> > insert_pfn(), pte_mkdevmap is called to mark the pte as devmap.
> > 3. If you follow the ftrace of the process_vm_readv(), it eventually
> > reaches do_wp_page(). If the target process had not previously read
> > the page in, this would not call do_wp_page() and instead just fault
> > in the page normally through the ext4/dax logic.
> > 4. do_wp_page() calls vm_normal_page() which returns NULL due to the
> > remote pte being marked special and devmap (from above).  If we just
> > ignore the devmap check and return the page that has been found and
> > allow the normal copy to occur, then no problem occurs.  However, that
> > can't be safely done in normal dax cases.  Due to vm_normal_page()
> > returning NULL, wp_page_copy() is called (first call site) with a null
> > vmf->address.  If the mmap()d file is originally from a non-dax
> > filesystem (eg tmpfs), the second wp_page_copy() ends up being called
> > with a valid vmf->address.
> > 5. cow_user_page() ends up in this unexpected case since
> > src==vmf->address is NULL, delimited with the following comment:
> >
> >         /*
> >          * If the source page was a PFN mapping, we don't have
> >          * a "struct page" for it. We do a best-effort copy by
> >          * just copying from the original user address. If that
> >          * fails, we just zero-fill it. Live with it.
> >          */
> >
> > The end effect is a __copy_from_user_inatomic() call with an invalid
> > uaddr, because the uaddr comes from the remote address space.  This
> > results in another page fault, because that remote address isn't valid
> > in the process calling process_vm_readv().  It seems there are a few
> > issues here: a) it's trying to read from the remote address space as
> > if it were local, and b) the failure here corrupts the remote
> > process's memory rather than just returning an empty page, which
> > would be less broken.
> >
> > In the good case of the mmap()d file being from tmpfs,
> > src==vmf->address is non-NULL and copy_user_highpage can properly do
> > the copy of the page.  At that point, the caller is able to copy data
> > from that page to its own local buffer and return data successfully,
> > as well as avoid corrupting the remote process.
> >
> > We've also found that reverting "17839856fd58: gup: document and work
> > around "COW can break either way" issue" seems to make the problem go
> > away.
>
> Thanks a lot for the super-detailed write-up. Do you know if someone
> is already working on a fix that can be upstreamed?

Hi Sergio,

I cc'ed Andrea from RH -- he might have a fix queued in his tree.

Thanks.



* Re: [Virtio-fs] Problem using gdb/lldb on a binary residing on a DAX-enabled volume
  2021-08-24  5:04 ` Chirantan Ekbote
  2021-08-24 13:22   ` Sergio Lopez
@ 2021-08-24 21:11   ` Vivek Goyal
  1 sibling, 0 replies; 6+ messages in thread
From: Vivek Goyal @ 2021-08-24 21:11 UTC (permalink / raw)
  To: Chirantan Ekbote
  Cc: Andrea Arcangeli, Yu Zhao, virtio-fs-list, David Riley,
	Sergey Senozhatsky, Dan Williams

On Tue, Aug 24, 2021 at 02:04:45PM +0900, Chirantan Ekbote wrote:
> Hi Sergio,
> 
> On Mon, Aug 23, 2021 at 6:31 PM Sergio Lopez <slp@redhat.com> wrote:
> >
> > Hi,
> >
> > I've noticed that trying to use gdb/lldb on any binary residing on a
> > DAX-enabled virtio-fs volume leads to a SIGSEGV in userspace...
> >
> > It seems like DAX breaks something in the ptrace_access_vm path. On a
> > volume without DAX it works fine.
> >
> 
> We've seen this as well and unfortunately it doesn't appear to be
> limited to virtio-fs.  Using DAX on an ext4-formatted virtio-pmem disk
> image has the same problem.  We've actually disabled DAX everywhere
> because of this.

Thanks, Chirantan, for confirming that this is probably a generic DAX
issue (and not limited to virtiofs) and for providing all the details.

Copying Dan Williams. He might have an idea.

Vivek

> 
> Unfortunately most of the details are in an internal bug report but
> I'll try to extract the relevant bits here.  This is well outside my
> depth so I've CC'd some of the people who have looked at this.  The
> initial bug report was for virtio-pmem+ext4 so some of the details are
> specific to pmem but I suspect something similar is happening for
> virtio-fs as well.
> 
> The issue is that process_vm_readv() corrupts the memory of files that
> are mmap()'d when DAX is enabled.
> 
> 1. A filesystem is mounted with DAX enabled.  pmem_attach_disk() sets
> pfn_flags to PFN_DEV|PFN_MAP.  In the fuse case, this appears to
> happen here [1].
> 2. When the (strace/gdb/etc) process does its initial read of the
> mmap()'d region, the pfn flags for the page are inherited from the
> pmem structure set to PFN_DEV|PFN_MAP in step 1.  During a call to
> insert_pfn(), pte_mkdevmap is called to mark the pte as devmap.
> 3. If you follow the ftrace of the process_vm_readv(), it eventually
> reaches do_wp_page(). If the target process had not previously read
> the page in, this would not call do_wp_page() and instead just fault
> in the page normally through the ext4/dax logic.
> 4. do_wp_page() calls vm_normal_page() which returns NULL due to the
> remote pte being marked special and devmap (from above).  If we just
> ignore the devmap check and return the page that has been found and
> allow the normal copy to occur, then no problem occurs.  However, that
> can't be safely done in normal dax cases.  Due to vm_normal_page()
> returning NULL, wp_page_copy() is called (first call site) with a null
> vmf->address.  If the mmap()d file is originally from a non-dax
> filesystem (eg tmpfs), the second wp_page_copy() ends up being called
> with a valid vmf->address.
> 5. cow_user_page() ends up in this unexpected case since
> src==vmf->address is NULL, delimited with the following comment:
> 
>         /*
>          * If the source page was a PFN mapping, we don't have
>          * a "struct page" for it. We do a best-effort copy by
>          * just copying from the original user address. If that
>          * fails, we just zero-fill it. Live with it.
>          */
> 
> The end effect is a __copy_from_user_inatomic() call with an invalid
> uaddr, because the uaddr comes from the remote address space.  This
> results in another page fault, because that remote address isn't valid
> in the process calling process_vm_readv().  It seems there are a few
> issues here: a) it's trying to read from the remote address space as
> if it were local, and b) the failure here corrupts the remote
> process's memory rather than just returning an empty page, which
> would be less broken.
> 
> In the good case of the mmap()d file being from tmpfs,
> src==vmf->address is non-NULL and copy_user_highpage can properly do
> the copy of the page.  At that point, the caller is able to copy data
> from that page to its own local buffer and return data successfully,
> as well as avoid corrupting the remote process.
> 
> We've also found that reverting "17839856fd58: gup: document and work
> around "COW can break either way" issue" seems to make the problem go
> away.
> 
> Chirantan
> 
> [1]: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/d5ae8d7f85b7f6f6e60f1af8ff4be52b0926fde1/fs/fuse/virtio_fs.c#741
> 



* Re: [Virtio-fs] Problem using gdb/lldb on a binary residing on a DAX-enabled volume
  2021-08-24 18:09     ` Yu Zhao
@ 2021-08-24 23:27       ` Andrea Arcangeli
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2021-08-24 23:27 UTC (permalink / raw)
  To: Yu Zhao; +Cc: virtio-fs-list, David Riley, Sergey Senozhatsky

Hello everyone,

On Tue, Aug 24, 2021 at 12:09:37PM -0600, Yu Zhao wrote:
> On Tue, Aug 24, 2021 at 7:22 AM Sergio Lopez <slp@redhat.com> wrote:
> >
> > On Tue, Aug 24, 2021 at 02:04:45PM +0900, Chirantan Ekbote wrote:
> > > Hi Sergio,
> > >
> > > On Mon, Aug 23, 2021 at 6:31 PM Sergio Lopez <slp@redhat.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I've noticed that trying to use gdb/lldb on any binary residing on a
> > > > DAX-enabled virtio-fs volume leads to a SIGSEGV in userspace...
> > > >
> > > > It seems like DAX breaks something in the ptrace_access_vm path. On a
> > > > volume without DAX it works fine.
> > > >
> > >
> > > We've seen this as well and unfortunately it doesn't appear to be
> > > limited to virtio-fs.  Using DAX on an ext4-formatted virtio-pmem disk
> > > image has the same problem.  We've actually disabled DAX everywhere
> > > because of this.
> > >
> > > Unfortunately most of the details are in an internal bug report but
> > > I'll try to extract the relevant bits here.  This is well outside my
> > > depth so I've CC'd some of the people who have looked at this.  The
> > > initial bug report was for virtio-pmem+ext4 so some of the details are
> > > specific to pmem but I suspect something similar is happening for
> > > virtio-fs as well.
> > >
> > > The issue is that process_vm_readv() corrupts the memory of files that
> > > are mmap()'d when DAX is enabled.
> > >
> > > 1. A filesystem is mounted with DAX enabled.  pmem_attach_disk() sets
> > > pfn_flags to PFN_DEV|PFN_MAP.  In the fuse case, this appears to
> > > happen here [1].
> > > 2. When the (strace/gdb/etc) process does its initial read of the
> > > mmap()'d region, the pfn flags for the page are inherited from the
> > > pmem structure set to PFN_DEV|PFN_MAP in step 1.  During a call to
> > > insert_pfn(), pte_mkdevmap is called to mark the pte as devmap.
> > > 3. If you follow the ftrace of the process_vm_readv(), it eventually
> > > reaches do_wp_page(). If the target process had not previously read
> > > the page in, this would not call do_wp_page() and instead just fault
> > > in the page normally through the ext4/dax logic.
> > > 4. do_wp_page() calls vm_normal_page() which returns NULL due to the
> > > remote pte being marked special and devmap (from above).  If we just
> > > ignore the devmap check and return the page that has been found and
> > > allow the normal copy to occur, then no problem occurs.  However, that
> > > can't be safely done in normal dax cases.  Due to vm_normal_page()
> > > returning NULL, wp_page_copy() is called (first call site) with a null
> > > vmf->address.  If the mmap()d file is originally from a non-dax
> > > filesystem (eg tmpfs), the second wp_page_copy() ends up being called
> > > with a valid vmf->address.
> > > 5. cow_user_page() ends up in this unexpected case since
> > > src==vmf->address is NULL, delimited with the following comment:
> > >
> > >         /*
> > >          * If the source page was a PFN mapping, we don't have
> > >          * a "struct page" for it. We do a best-effort copy by
> > >          * just copying from the original user address. If that
> > >          * fails, we just zero-fill it. Live with it.
> > >          */
> > >
> > > The end effect is a __copy_from_user_inatomic() call with an invalid
> > > uaddr, because the uaddr comes from the remote address space.  This
> > > results in another page fault, because that remote address isn't valid
> > > in the process calling process_vm_readv().  It seems there are a few
> > > issues here: a) it's trying to read from the remote address space as
> > > if it were local, and b) the failure here corrupts the remote
> > > process's memory rather than just returning an empty page, which
> > > would be less broken.
> > >
> > > In the good case of the mmap()d file being from tmpfs,
> > > src==vmf->address is non-NULL and copy_user_highpage can properly do
> > > the copy of the page.  At that point, the caller is able to copy data
> > > from that page to its own local buffer and return data successfully,
> > > as well as avoid corrupting the remote process.
> > >
> > > We've also found that reverting "17839856fd58: gup: document and work
> > > around "COW can break either way" issue" seems to make the problem go
> > > away.
> >
> > Thanks a lot for the super-detailed write-up. Do you know if someone
> > is already working on a fix that can be upstreamed?

Hmm, it appears you're not actually using the latest upstream; if you
were, you couldn't back out 17839856fd58 cleanly.

If you were using the latest upstream, I suppose your problem would
already be solved. However, upstream cannot be backported to
production, because upstream reopened CVE-2020-29374 with a slightly
modified reproducer:

https://github.com/aagit/kernel-testcases-for-v5.11/blob/main/vmsplice-v5.11.c

Upstream, if backported, would also cause all kinds of ABI breaks to
clear_refs (obvious ones like RDMA being disabled from clear_refs
tracking, see 9348b73c2e1bfea74ccd4a44fb4ccc7276ab9623), but also some
that don't require any RDMA device, like this ABI regression:

https://github.com/aagit/kernel-testcases-for-v5.11/blob/main/page_count_do_wp_page.c

Upstream also triggers false-positive COWs after a swapin followed by
a memory write, if there's still a swapcache pin on the anon memory,
which is an example of how inaccurate it has become.

So production has to stick with 17839856fd58, since breaking ptrace
on pmem is preferable to dealing with the above fallout.

Note that 17839856fd58 has other defects besides ptrace on pmem: it
breaks KSM density and it also broke uffd-wp (which I have to assume
isn't included in your current kernel version yet, so that shouldn't
be a concern).

> Hi Sergio,
> 
> I cc'ed Andrea from RH -- he might have a fix queued in his tree.

That's right thanks for the CC.

17839856fd58 was a step in the right direction because the only bug
there was in GUP, so it's correct to fix it in GUP itself.

To resolve the defects of 17839856fd58, whenever a GUP pin is taken
for reading on a shared page with an elevated mapcount, the mapcount
unshare logic triggers a new kind of fault, a COR (copy-on-read)
fault, which copies the page. Once done, it leaves the page tables as
if a read fault had happened. That resolves all the defects caused by
17839856fd58, which instead acted as a write fault even though it was
a GUP "read" access.

The patchset goes further and leverages the COR fault to add full MM
coherency, per the POSIX specs, to FOLL_LONGTERM GUP pins no matter
which vma type. It's now as coherent as if the long-term pin was under
the MMU notifier, which was impossible to achieve without the COR
fault.

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=mapcount_deshare&id=9f2f36029b5c8d4b3390c93e90a3b9254b7ec730

If you're curious to test it, this is the latest version:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=mapcount_deshare

I notice the branch is already quite old; I haven't rebased it since
mid June, sorry... There's also a v5.10-LTS -stable version. I'll
rebase it again as soon as time permits.

Thanks,
Andrea


