* Bug report: vfio over kernel 5.19 - mm area
@ 2022-06-15 10:43 Yishai Hadas
  2022-06-15 10:52 ` Yishai Hadas
  0 siblings, 1 reply; 8+ messages in thread
From: Yishai Hadas @ 2022-06-15 10:43 UTC (permalink / raw)
  To: linux-mm@kvack.org akpm, Alex Williamson
  Cc: jason Gunthorpe, maor Gottlieb, Yishai Hadas, kvm, idok

Hi All,

Any idea what could cause the breakage below in 5.19? We run QEMU and
the machine immediately gets stuck.

After running "echo l > /proc/sysrq-trigger", I can see the task below,
which seems to be stuck.

This basic flow worked fine in 5.18.
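
For reference, the sysrq write mentioned above can also be issued from a
small program. This is only a minimal sketch: the helper name
write_sysrq_command() is hypothetical, and the trigger path is passed as a
parameter purely so the helper can be tried on a scratch file. On a real
system the path would be /proc/sysrq-trigger, which needs root.

```c
#include <stdio.h>

/*
 * Write a single sysrq command character to a trigger file.
 * Hypothetical helper for illustration; on a live system the path would
 * be /proc/sysrq-trigger and the caller needs sufficient privileges.
 * Returns 0 on success, -1 on failure.
 */
int write_sysrq_command(const char *trigger_path, char command)
{
    FILE *f = fopen(trigger_path, "w");
    if (f == NULL)
        return -1;

    /* Remember whether the write itself succeeded. */
    int ok = (fputc(command, f) != EOF);

    /* fclose() flushes the buffer, so its result matters too. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling write_sysrq_command("/proc/sysrq-trigger", 'l') as root would be the
programmatic counterpart of the echo command, assuming sysrq is enabled on
the machine.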

[ 1162.056583] NMI backtrace for cpu 4
[ 1162.056585] CPU: 4 PID: 1979 Comm: qemu-system-x86 Not tainted 5.19.0-rc1 #747
[ 1162.056587] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1162.056588] RIP: 0010:pmd_huge+0x0/0x20
[ 1162.056592] Code: 49 89 44 24 28 48 8b 47 30 49 89 44 24 30 31 c0 41 5c c3 5b b8 01 00 00 00 5d 41 5c c3 cc cc cc cc cc cc cc cc cc cc cc cc cc <0f> 1f 44 00 00 31 c0 48 f7 c7 9f ff ff ff 74 0f 81 e7 81 00 00 00
[ 1162.056594] RSP: 0018:ffff888146253b38 EFLAGS: 00000202
[ 1162.056596] RAX: ffff888101461980 RBX: ffff888146253bc0 RCX: 000ffffffffff000
[ 1162.056597] RDX: ffff88814fa22000 RSI: 00007f9f68231000 RDI: 000000010a6b6067
[ 1162.056598] RBP: ffff888111b90dc0 R08: 000000000002f424 R09: 0000000000000001
[ 1162.056599] R10: ffffffff825c2a40 R11: 0000000000000a08 R12: ffff88814fa22a08
[ 1162.056600] R13: 000000010a6b6067 R14: 0000000000052202 R15: 00007f9f68231000
[ 1162.056602] FS:  00007f9f6c228c40(0000) GS:ffff88885f900000(0000) knlGS:0000000000000000
[ 1162.056605] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1162.056606] CR2: 00005643994fd0ed CR3: 00000001496da005 CR4: 0000000000372ea0
[ 1162.056607] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1162.056609] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1162.056610] Call Trace:
[ 1162.056611]  <TASK>
[ 1162.056611]  follow_page_mask+0x196/0x5e0
[ 1162.056615]  __get_user_pages+0x190/0x5d0
[ 1162.056617]  ? flush_workqueue_prep_pwqs+0x110/0x110
[ 1162.056620]  __gup_longterm_locked+0xaf/0x470
[ 1162.056624]  vaddr_get_pfns+0x8e/0x240 [vfio_iommu_type1]
[ 1162.056628]  ? qi_flush_iotlb+0x83/0xa0
[ 1162.056631]  vfio_pin_pages_remote+0x326/0x460 [vfio_iommu_type1]
[ 1162.056634]  vfio_iommu_type1_ioctl+0x421/0x14f0 [vfio_iommu_type1]
[ 1162.056638]  __x64_sys_ioctl+0x3e4/0x8e0
[ 1162.056641]  do_syscall_64+0x3d/0x90
[ 1162.056644]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 1162.056646] RIP: 0033:0x7f9f6d14317b
[ 1162.056648] Code: 0f 1e fa 48 8b 05 1d ad 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed ac 0c 00 f7 d8 64 89 01 48
[ 1162.056650] RSP: 002b:00007fff4fca15b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1162.056652] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f9f6d14317b
[ 1162.056653] RDX: 00007fff4fca1620 RSI: 0000000000003b71 RDI: 000000000000001c
[ 1162.056654] RBP: 00007fff4fca1650 R08: 0000000000000001 R09: 0000000000000000
[ 1162.056655] R10: 0000000100000000 R11: 0000000000000246 R12: 0000000000000000
[ 1162.056656] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1162.056657]  </TASK>
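
The user-space registers in the trace show ORIG_RAX = 0x10 (syscall 16,
which is ioctl on x86-64) with RSI = 0x3b71 as the request number. The
sketch below decodes that number using the generic Linux ioctl layout
(nr in bits 0-7, type in bits 8-15, size in bits 16-29, direction in
bits 30-31). The SKETCH_* constants restate what I believe <linux/vfio.h>
uses for VFIO_IOMMU_MAP_DMA (type ';', base 100, offset 13); treat them as
assumptions and check them against your own headers. The helper names are
hypothetical.

```c
#include <stdint.h>

/* Fields of a Linux ioctl request number (generic layout). */
struct ioctl_fields {
    unsigned nr;    /* bits 0-7: command number */
    unsigned type;  /* bits 8-15: subsystem "magic" character */
    unsigned size;  /* bits 16-29: argument size */
    unsigned dir;   /* bits 30-31: transfer direction */
};

/* Split a request number into its four fields. */
struct ioctl_fields decode_ioctl(uint32_t request)
{
    struct ioctl_fields f;
    f.nr   = request & 0xffu;
    f.type = (request >> 8) & 0xffu;
    f.size = (request >> 16) & 0x3fffu;
    f.dir  = (request >> 30) & 0x3u;
    return f;
}

/* Assumed values, believed to match <linux/vfio.h>; verify locally. */
#define SKETCH_VFIO_TYPE             ((unsigned)';')   /* 0x3b */
#define SKETCH_VFIO_BASE             100u
#define SKETCH_VFIO_IOMMU_MAP_DMA_NR (SKETCH_VFIO_BASE + 13u)

/* Returns 1 if the request matches the assumed VFIO_IOMMU_MAP_DMA encoding. */
int looks_like_vfio_map_dma(uint32_t request)
{
    struct ioctl_fields f = decode_ioctl(request);
    return f.type == SKETCH_VFIO_TYPE &&
           f.nr == SKETCH_VFIO_IOMMU_MAP_DMA_NR &&
           f.size == 0 && f.dir == 0;
}
```

Under those assumptions, decode_ioctl(0x3b71) yields type 0x3b and nr 113,
which is consistent with the vfio_iommu_type1_ioctl and
vfio_pin_pages_remote frames in the kernel half of the trace.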

Yishai



* Re: Bug report: vfio over kernel 5.19 - mm area
  2022-06-15 10:43 Bug report: vfio over kernel 5.19 - mm area Yishai Hadas
@ 2022-06-15 10:52 ` Yishai Hadas
  2022-06-15 13:59   ` Joao Martins
  2022-06-15 14:02   ` Alex Williamson
  0 siblings, 2 replies; 8+ messages in thread
From: Yishai Hadas @ 2022-06-15 10:52 UTC (permalink / raw)
  To: Alex Williamson, akpm; +Cc: jason Gunthorpe, maor Gottlieb, kvm, idok, linux-mm

Adding some extra relevant people from the MM area.

On 15/06/2022 13:43, Yishai Hadas wrote:
> Hi All,
>
> Any idea what could cause the breakage below in 5.19? We run QEMU and
> the machine immediately gets stuck.
>
> After running "echo l > /proc/sysrq-trigger", I can see the task below,
> which seems to be stuck.
>
> This basic flow worked fine in 5.18.
>
> Yishai
>



* Re: Bug report: vfio over kernel 5.19 - mm area
  2022-06-15 10:52 ` Yishai Hadas
@ 2022-06-15 13:59   ` Joao Martins
  2022-06-15 14:02   ` Alex Williamson
  1 sibling, 0 replies; 8+ messages in thread
From: Joao Martins @ 2022-06-15 13:59 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: jason Gunthorpe, maor Gottlieb, kvm, idok, linux-mm,
	Alex Williamson, akpm



On 6/15/22 11:52, Yishai Hadas wrote:
> Adding some extra relevant people from the MM area.
> 
> On 15/06/2022 13:43, Yishai Hadas wrote:
>> Hi All,
>>
>> Any idea what could cause the breakage below in 5.19? We run QEMU and
>> the machine immediately gets stuck.
>>
>> After running "echo l > /proc/sysrq-trigger", I can see the task below,
>> which seems to be stuck.
>>
>> This basic flow worked fine in 5.18.
>>

Maybe this one:

https://lore.kernel.org/all/165490039431.944052.12458624139225785964.stgit@omen/

... but I think it's not merged yet for v5.19:

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/log/?h=mm-hotfixes-unstable


* Re: Bug report: vfio over kernel 5.19 - mm area
  2022-06-15 10:52 ` Yishai Hadas
  2022-06-15 13:59   ` Joao Martins
@ 2022-06-15 14:02   ` Alex Williamson
  2022-06-15 14:14     ` Yi Liu
  2022-08-15 15:46     ` Yishai Hadas
  1 sibling, 2 replies; 8+ messages in thread
From: Alex Williamson @ 2022-06-15 14:02 UTC (permalink / raw)
  To: Yishai Hadas; +Cc: akpm, jason Gunthorpe, maor Gottlieb, kvm, idok, linux-mm

On Wed, 15 Jun 2022 13:52:10 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> Adding some extra relevant people from the MM area.
> 
> On 15/06/2022 13:43, Yishai Hadas wrote:
> > Hi All,
> >
> > Any idea what could cause the breakage below in 5.19? We run QEMU and
> > the machine immediately gets stuck.
> >
> > After running "echo l > /proc/sysrq-trigger", I can see the task below,
> > which seems to be stuck.
> >
> > This basic flow worked fine in 5.18.

Spent Friday bisecting this and posted this fix:

https://lore.kernel.org/all/165490039431.944052.12458624139225785964.stgit@omen/

I expect you're hitting the same.  Thanks,

Alex




* Re: Bug report: vfio over kernel 5.19 - mm area
  2022-06-15 14:02   ` Alex Williamson
@ 2022-06-15 14:14     ` Yi Liu
  2022-06-15 14:22       ` Yishai Hadas
  2022-08-15 15:46     ` Yishai Hadas
  1 sibling, 1 reply; 8+ messages in thread
From: Yi Liu @ 2022-06-15 14:14 UTC (permalink / raw)
  To: Alex Williamson, Yishai Hadas
  Cc: akpm, jason Gunthorpe, maor Gottlieb, kvm, idok, linux-mm

Hi Alex,

On 2022/6/15 22:02, Alex Williamson wrote:
> On Wed, 15 Jun 2022 13:52:10 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
>> Adding some extra relevant people from the MM area.
>>
>> On 15/06/2022 13:43, Yishai Hadas wrote:
>>> Hi All,
>>>
>>> Any idea what could cause the breakage below in 5.19? We run QEMU and
>>> the machine immediately gets stuck.
>>>
>>> After running "echo l > /proc/sysrq-trigger", I can see the task below,
>>> which seems to be stuck.
>>>
>>> This basic flow worked fine in 5.18.
> 
> Spent Friday bisecting this and posted this fix:
> 
> https://lore.kernel.org/all/165490039431.944052.12458624139225785964.stgit@omen/
> 
> I expect you're hitting the same.  Thanks,

I also hit a hang when calling pin_user_pages_remote() in
vaddr_get_pfns(). With the fix in the link, the issue is resolved.
You may add my Tested-by to your fix. :-)

> Alex
> 

-- 
Regards,
Yi Liu


* Re: Bug report: vfio over kernel 5.19 - mm area
  2022-06-15 14:14     ` Yi Liu
@ 2022-06-15 14:22       ` Yishai Hadas
  0 siblings, 0 replies; 8+ messages in thread
From: Yishai Hadas @ 2022-06-15 14:22 UTC (permalink / raw)
  To: Yi Liu, Alex Williamson
  Cc: akpm, jason Gunthorpe, maor Gottlieb, kvm, idok, linux-mm

On 15/06/2022 17:14, Yi Liu wrote:
> Hi Alex,
>
> On 2022/6/15 22:02, Alex Williamson wrote:
>> On Wed, 15 Jun 2022 13:52:10 +0300
>> Yishai Hadas <yishaih@nvidia.com> wrote:
>>
>>> Adding some extra relevant people from the MM area.
>>>
>>> On 15/06/2022 13:43, Yishai Hadas wrote:
>>>> Hi All,
>>>>
>>>> Any idea what could cause the breakage below in 5.19? We run QEMU and
>>>> the machine immediately gets stuck.
>>>>
>>>> After running "echo l > /proc/sysrq-trigger", I can see the task below,
>>>> which seems to be stuck.
>>>>
>>>> This basic flow worked fine in 5.18.
>>
>> Spent Friday bisecting this and posted this fix:
>>
>> https://lore.kernel.org/all/165490039431.944052.12458624139225785964.stgit@omen/ 
>>
>>
>> I expect you're hitting the same.  Thanks,
>
> I also hit a hang when calling pin_user_pages_remote() in
> vaddr_get_pfns(). With the fix in the link, the issue is resolved.
> You may add my Tested-by to your fix. :-)


Thanks Alex, it seems to be the same issue; with your fix I no longer hit
the problem.


>
>> Alex
>>



* Re: Bug report: vfio over kernel 5.19 - mm area
  2022-06-15 14:02   ` Alex Williamson
  2022-06-15 14:14     ` Yi Liu
@ 2022-08-15 15:46     ` Yishai Hadas
  2022-08-15 17:52       ` Alex Williamson
  1 sibling, 1 reply; 8+ messages in thread
From: Yishai Hadas @ 2022-08-15 15:46 UTC (permalink / raw)
  To: Alex Williamson, alex.sierra
  Cc: akpm, jason Gunthorpe, maor Gottlieb, kvm, idok, linux-mm

On 15/06/2022 17:02, Alex Williamson wrote:
> On Wed, 15 Jun 2022 13:52:10 +0300
> Yishai Hadas <yishaih@nvidia.com> wrote:
>
>> Adding some extra relevant people from the MM area.
>>
>> On 15/06/2022 13:43, Yishai Hadas wrote:
>>> Hi All,
>>>
>>> Any idea what could cause the breakage below in 5.19? We run QEMU and
>>> the machine immediately gets stuck.
>>>
>>> After running "echo l > /proc/sysrq-trigger", I can see the task below,
>>> which seems to be stuck.
>>>
>>> This basic flow worked fine in 5.18.
> Spent Friday bisecting this and posted this fix:
>
> https://lore.kernel.org/all/165490039431.944052.12458624139225785964.stgit@omen/
>
> I expect you're hitting the same.  Thanks,
>
> Alex

Alex,

It seems that we hit the same bug again in v6.0-rc1.

The code below [1], from commit [2], puts is_zero_pfn() back under the
!(...) negation, which looks buggy.

I would expect the fix below [3].

Alex Sierra,

Can you please review the suggested fix below for your patch and send a
patch for rc2 accordingly?

Yishai

[1]

See: https://elixir.bootlin.com/linux/v6.0-rc1/source/include/linux/mm.h#L1549

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a2d01e49253b..64393ed3330a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -28,6 +28,7 @@
  #include <linux/sched.h>
  #include <linux/pgtable.h>
  #include <linux/kasan.h>
+#include <linux/memremap.h>

  struct mempolicy;
  struct anon_vma;
@@ -1537,7 +1538,9 @@ static inline bool is_longterm_pinnable_page(struct page *page)
         if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
                 return false;
  #endif
-       return !is_zone_movable_page(page) || is_zero_pfn(page_to_pfn(page));
+       return !(is_device_coherent_page(page) ||
+                is_zone_movable_page(page) ||
+                is_zero_pfn(page_to_pfn(page)));
  }

[2] f25cbb7a95a24ff9a2a3bebd308e303942ae6b2c
Author: Alex Sierra <alex.sierra@amd.com>
Date:   Fri Jul 15 10:05:10 2022 -0500

     mm: add zone device coherent type memory support


[3] Expected fix

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3bedc449c14d..b25f9886bd4c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1544,9 +1544,9 @@ static inline bool is_longterm_pinnable_page(struct page *page)
         if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
                 return false;
  #endif
-       return !(is_device_coherent_page(page) ||
-                is_zone_movable_page(page) ||
-                is_zero_pfn(page_to_pfn(page)));
+       return !is_device_coherent_page(page) ||
+              !is_zone_movable_page(page) ||
+              is_zero_pfn(page_to_pfn(page));
  }
  #else
  static inline bool is_longterm_pinnable_page(struct page *page)
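
To compare the return expressions quoted in [1] and [3], the three kernel
predicates can be modelled as plain booleans. This is only a sketch for
reasoning about the logic, not kernel code: struct page_model and the
pinnable_*() names are hypothetical. The last function is the De Morgan
expansion of the negation from [1] with the zero-pfn test pulled out of
it, included purely for comparison; it is not taken from any patch in
this thread.

```c
#include <stdbool.h>

/* Boolean stand-ins for the kernel predicates; names are hypothetical. */
struct page_model {
    bool device_coherent;  /* models is_device_coherent_page(page) */
    bool zone_movable;     /* models is_zone_movable_page(page) */
    bool zero_pfn;         /* models is_zero_pfn(page_to_pfn(page)) */
};

/* Return expression on the removed ('-') line of the diff in [1]. */
bool pinnable_before(struct page_model p)
{
    return !p.zone_movable || p.zero_pfn;
}

/* Return expression on the added ('+') lines of the diff in [1]. */
bool pinnable_rc1(struct page_model p)
{
    return !(p.device_coherent || p.zone_movable || p.zero_pfn);
}

/* Return expression on the added ('+') lines of the diff in [3]. */
bool pinnable_proposed(struct page_model p)
{
    return !p.device_coherent || !p.zone_movable || p.zero_pfn;
}

/* De Morgan form: !(A || B) == (!A && !B), with zero_pfn kept outside. */
bool pinnable_demorgan(struct page_model p)
{
    return (!p.device_coherent && !p.zone_movable) || p.zero_pfn;
}
```

For a zero-pfn page (only zero_pfn set), pinnable_rc1() returns false while
the other three return true, which matches the regression described above.
For a movable-zone page that is neither device coherent nor the zero page,
pinnable_proposed() returns true whereas pinnable_before() and
pinnable_demorgan() return false, so the two forms are not equivalent and
the difference may be worth checking when reviewing [3].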






* Re: Bug report: vfio over kernel 5.19 - mm area
  2022-08-15 15:46     ` Yishai Hadas
@ 2022-08-15 17:52       ` Alex Williamson
  0 siblings, 0 replies; 8+ messages in thread
From: Alex Williamson @ 2022-08-15 17:52 UTC (permalink / raw)
  To: Yishai Hadas, idok
  Cc: alex.sierra, akpm, jason Gunthorpe, maor Gottlieb, kvm, linux-mm

On Mon, 15 Aug 2022 18:46:40 +0300
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 15/06/2022 17:02, Alex Williamson wrote:
> > On Wed, 15 Jun 2022 13:52:10 +0300
> > Yishai Hadas <yishaih@nvidia.com> wrote:
> >  
> >> Adding some extra relevant people from the MM area.
> >>
> >> On 15/06/2022 13:43, Yishai Hadas wrote:  
> >>> Hi All,
> >>>
> >>> Any idea what could cause the breakage below in 5.19? We run QEMU and
> >>> the machine immediately gets stuck.
> >>>
> >>> After running "echo l > /proc/sysrq-trigger", I can see the task below,
> >>> which seems to be stuck.
> >>>
> >>> This basic flow worked fine in 5.18.  
> > Spent Friday bisecting this and posted this fix:
> >
> > https://lore.kernel.org/all/165490039431.944052.12458624139225785964.stgit@omen/
> >
> > I expect you're hitting the same.  Thanks,
> >
> > Alex  
> 
> Alex,
> 
> It seems that we hit the same bug again in v6.0-rc1.
> 
> The code below [1], from commit [2], puts is_zero_pfn() back under the
> !(...) negation, which looks buggy.
> 
> I would expect the fix below [3].
> 
> Alex Sierra,
> 
> Can you please review the suggested fix below for your patch and send a
> patch for rc2 accordingly?
> 

https://lore.kernel.org/all/166015037385.760108.16881097713975517242.stgit@omen/

It's in the mm tree; hopefully it'll get pushed in an early rc.  Thanks,

Alex

