linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
@ 2016-03-17 18:38 Stefan Priebe
  2016-03-17 18:45 ` Greg KH
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Priebe @ 2016-03-17 18:38 UTC (permalink / raw)
  To: LKML, stable, linux-mm, linux-mm

Hi,

while running qemu 2.5 on a host running 4.4.6 the host system has 
crashed (load > 200) 3 times in the last 3 days.

Always with this stack trace: (copy left here: 
http://pastebin.com/raw/bCWTLKyt)

[69068.874268] divide error: 0000 [#1] SMP
[69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT 
nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter 
ip6_tables ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype 
xt_conntrack nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost 
macvtap macvlan kvm_intel nfnetlink_log kvm nfnetlink irqbypass 
netconsole dlm xt_multiport iptable_filter ip_tables x_tables iscsi_tcp 
libiscsi_tcp libiscsi scsi_transport_iscsi nfsd auth_rpcgss oid_registry 
bonding coretemp 8021q garp fuse i2c_i801 i7core_edac edac_core 
i5500_temp button btrfs xor raid6_pq dm_mod raid1 md_mod usb_storage 
ohci_hcd bcache sg usbhid sd_mod ata_generic uhci_hcd ehci_pci ehci_hcd 
usbcore ata_piix usb_common igb i2c_algo_bit mpt3sas raid_class ixgbe 
scsi_transport_sas i2c_core mdio ptp pps_core
[69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1
[69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 
03/17/2012
[69068.898578] task: ffff880fc7f28000 ti: ffff880fda2c4000 task.ti: 
ffff880fda2c4000
[69068.900377] RIP: 0010:[<ffffffff860b372c>]  [<ffffffff860b372c>] 
task_h_load+0xcc/0x100
[69068.961763] RSP: 0000:ffff880fda2c7b50  EFLAGS: 00010257
[69069.023910] RAX: 0000000000000000 RBX: ffff880fda2c7c10 RCX: 
0000000000000000
[69069.085953] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 
ffff880fc7f28000
[69069.151731] RBP: ffff880fda2c7bc8 R08: 00000001041955df R09: 
ffff880fffd153f8
[69069.213757] R10: 0000000000000009 R11: 0000000000000193 R12: 
ffff881f6832c780
[69069.274271] R13: ffff88203fc35380 R14: 0000000000000007 R15: 
00000000000000b6
[69069.334727] FS:  00007f578a3fb700(0000) GS:ffff880fffd00000(0000) 
knlGS:0000000000000000
[69069.396435] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[69069.458522] CR2: 00007f5784f18468 CR3: 0000001fe9738000 CR4: 
00000000000026e0
[69069.520799] Stack:
[69069.581430]  ffffffff860b6855 ffff880fda2c7b78 0000000000000000 
0000000000000005
[69069.642629]  ffff880fffd00000 0000000000000015 fffffffffffffe7d 
0000000000000015
[69069.702815]  0000000000015380 ffff880fda2c7bc8 ffff880fc7f28000 
00000000000001e9
[69069.761881] Call Trace:
[69069.819883]  [<ffffffff860b6855>] ? task_numa_find_cpu+0x225/0x670
[69069.878368]  [<ffffffff860b79f0>] task_numa_migrate+0x550/0x950
[69069.936059]  [<ffffffff863d9138>] ? find_next_bit+0x18/0x20
[69069.993262]  [<ffffffff860b7e6d>] numa_migrate_preferred+0x7d/0x90
[69070.050528]  [<ffffffff860b89a5>] task_numa_fault+0x7c5/0xaa0
[69070.106544]  [<ffffffff861a2c0b>] ? mpol_misplaced+0x16b/0x1b0
[69070.163705]  [<ffffffff8618104e>] __handle_mm_fault+0x9ae/0x11f0
[69070.220013]  [<ffffffff865e4c52>] ? inet_recvmsg+0x72/0x90
[69070.276558]  [<ffffffff8655240b>] ? SYSC_recvfrom+0x12b/0x170
[69070.332283]  [<ffffffff8618196f>] handle_mm_fault+0xdf/0x180
[69070.388515]  [<ffffffff8604f324>] __do_page_fault+0x164/0x380
[69070.443897]  [<ffffffff860b25c3>] ? account_user_time+0x73/0x80
[69070.498534]  [<ffffffff860b2b3e>] ? vtime_account_user+0x4e/0x70
[69070.552598]  [<ffffffff8604f5a7>] do_page_fault+0x37/0x90
[69070.605960]  [<ffffffff86002a23>] ? syscall_return_slowpath+0x83/0xf0
[69070.660705]  [<ffffffff866b32f8>] page_fault+0x28/0x30
[69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75 ca 
49 8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83 
c1 01 <48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86
[69070.835144] RIP  [<ffffffff860b372c>] task_h_load+0xcc/0x100
[69070.894095]  RSP <ffff880fda2c7b50>
[69070.953213] ---[ end trace 8d6f449a03dacfd4 ]---

Would be nice if we can fix this in 4.4?

Greets,
Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-03-17 18:38 divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6 Stefan Priebe
@ 2016-03-17 18:45 ` Greg KH
  2016-03-19 22:26   ` Vlastimil Babka
  0 siblings, 1 reply; 19+ messages in thread
From: Greg KH @ 2016-03-17 18:45 UTC (permalink / raw)
  To: Stefan Priebe; +Cc: LKML, stable, linux-mm, linux-mm

On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
> Hi,
> 
> while running qemu 2.5 on a host running 4.4.6 the host system has crashed
> (load > 200) 3 times in the last 3 days.
> 
> Always with this stack trace: (copy left here:
> http://pastebin.com/raw/bCWTLKyt)
> 
> [69068.874268] divide error: 0000 [#1] SMP
> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm xt_multiport
> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp 8021q
> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid sd_mod
> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core mdio ptp
> pps_core
> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1
> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 03/17/2012
> [69068.898578] task: ffff880fc7f28000 ti: ffff880fda2c4000 task.ti:
> ffff880fda2c4000
> [69068.900377] RIP: 0010:[<ffffffff860b372c>]  [<ffffffff860b372c>]
> task_h_load+0xcc/0x100
> [69068.961763] RSP: 0000:ffff880fda2c7b50  EFLAGS: 00010257
> [69069.023910] RAX: 0000000000000000 RBX: ffff880fda2c7c10 RCX:
> 0000000000000000
> [69069.085953] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
> ffff880fc7f28000
> [69069.151731] RBP: ffff880fda2c7bc8 R08: 00000001041955df R09:
> ffff880fffd153f8
> [69069.213757] R10: 0000000000000009 R11: 0000000000000193 R12:
> ffff881f6832c780
> [69069.274271] R13: ffff88203fc35380 R14: 0000000000000007 R15:
> 00000000000000b6
> [69069.334727] FS:  00007f578a3fb700(0000) GS:ffff880fffd00000(0000)
> knlGS:0000000000000000
> [69069.396435] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [69069.458522] CR2: 00007f5784f18468 CR3: 0000001fe9738000 CR4:
> 00000000000026e0
> [69069.520799] Stack:
> [69069.581430]  ffffffff860b6855 ffff880fda2c7b78 0000000000000000
> 0000000000000005
> [69069.642629]  ffff880fffd00000 0000000000000015 fffffffffffffe7d
> 0000000000000015
> [69069.702815]  0000000000015380 ffff880fda2c7bc8 ffff880fc7f28000
> 00000000000001e9
> [69069.761881] Call Trace:
> [69069.819883]  [<ffffffff860b6855>] ? task_numa_find_cpu+0x225/0x670
> [69069.878368]  [<ffffffff860b79f0>] task_numa_migrate+0x550/0x950
> [69069.936059]  [<ffffffff863d9138>] ? find_next_bit+0x18/0x20
> [69069.993262]  [<ffffffff860b7e6d>] numa_migrate_preferred+0x7d/0x90
> [69070.050528]  [<ffffffff860b89a5>] task_numa_fault+0x7c5/0xaa0
> [69070.106544]  [<ffffffff861a2c0b>] ? mpol_misplaced+0x16b/0x1b0
> [69070.163705]  [<ffffffff8618104e>] __handle_mm_fault+0x9ae/0x11f0
> [69070.220013]  [<ffffffff865e4c52>] ? inet_recvmsg+0x72/0x90
> [69070.276558]  [<ffffffff8655240b>] ? SYSC_recvfrom+0x12b/0x170
> [69070.332283]  [<ffffffff8618196f>] handle_mm_fault+0xdf/0x180
> [69070.388515]  [<ffffffff8604f324>] __do_page_fault+0x164/0x380
> [69070.443897]  [<ffffffff860b25c3>] ? account_user_time+0x73/0x80
> [69070.498534]  [<ffffffff860b2b3e>] ? vtime_account_user+0x4e/0x70
> [69070.552598]  [<ffffffff8604f5a7>] do_page_fault+0x37/0x90
> [69070.605960]  [<ffffffff86002a23>] ? syscall_return_slowpath+0x83/0xf0
> [69070.660705]  [<ffffffff866b32f8>] page_fault+0x28/0x30
> [69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75 ca 49
> 8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83 c1 01
> <48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86
> [69070.835144] RIP  [<ffffffff860b372c>] task_h_load+0xcc/0x100
> [69070.894095]  RSP <ffff880fda2c7b50>
> [69070.953213] ---[ end trace 8d6f449a03dacfd4 ]---
> 
> Would be nice if we can fix this in 4.4?

Does this also happen in 4.5?  Did this work on 4.4.5?  Some previous
release?  If you can find the offending patch using 'git bisect', that
would be great.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-03-17 18:45 ` Greg KH
@ 2016-03-19 22:26   ` Vlastimil Babka
  2016-03-20 21:27     ` Stefan Priebe
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2016-03-19 22:26 UTC (permalink / raw)
  To: Greg KH, Stefan Priebe
  Cc: LKML, stable, linux-mm, linux-mm, Peter Zijlstra, Ingo Molnar,
	Rik van Riel

On 03/17/2016 07:45 PM, Greg KH wrote:
> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
>> Hi,
>>
>> while running qemu 2.5 on a host running 4.4.6 the host system has crashed
>> (load > 200) 3 times in the last 3 days.
>>
>> Always with this stack trace: (copy left here:
>> http://pastebin.com/raw/bCWTLKyt)
>>
>> [69068.874268] divide error: 0000 [#1] SMP
>> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
>> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
>> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
>> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
>> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm xt_multiport
>> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
>> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp 8021q
>> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
>> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid sd_mod
>> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
>> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core mdio ptp
>> pps_core
>> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted 4.4.6+7-ph #1
>> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1 03/17/2012
>> [69068.898578] task: ffff880fc7f28000 ti: ffff880fda2c4000 task.ti:
>> ffff880fda2c4000
>> [69068.900377] RIP: 0010:[<ffffffff860b372c>]  [<ffffffff860b372c>]
>> task_h_load+0xcc/0x100

decodecode says:

   27:   48 83 c1 01             add    $0x1,%rcx
   2b:*  48 f7 f1                div    %rcx             <-- trapping instruction

This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:

         update_cfs_rq_h_load(cfs_rq);
         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
                         cfs_rq_load_avg(cfs_rq) + 1);

So the load avg is -1, thus after adding 1 we get division by 0, huh?

>> [69068.961763] RSP: 0000:ffff880fda2c7b50  EFLAGS: 00010257
>> [69069.023910] RAX: 0000000000000000 RBX: ffff880fda2c7c10 RCX:
>> 0000000000000000
>> [69069.085953] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
>> ffff880fc7f28000
>> [69069.151731] RBP: ffff880fda2c7bc8 R08: 00000001041955df R09:
>> ffff880fffd153f8
>> [69069.213757] R10: 0000000000000009 R11: 0000000000000193 R12:
>> ffff881f6832c780
>> [69069.274271] R13: ffff88203fc35380 R14: 0000000000000007 R15:
>> 00000000000000b6
>> [69069.334727] FS:  00007f578a3fb700(0000) GS:ffff880fffd00000(0000)
>> knlGS:0000000000000000
>> [69069.396435] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [69069.458522] CR2: 00007f5784f18468 CR3: 0000001fe9738000 CR4:
>> 00000000000026e0
>> [69069.520799] Stack:
>> [69069.581430]  ffffffff860b6855 ffff880fda2c7b78 0000000000000000
>> 0000000000000005
>> [69069.642629]  ffff880fffd00000 0000000000000015 fffffffffffffe7d
>> 0000000000000015
>> [69069.702815]  0000000000015380 ffff880fda2c7bc8 ffff880fc7f28000
>> 00000000000001e9
>> [69069.761881] Call Trace:
>> [69069.819883]  [<ffffffff860b6855>] ? task_numa_find_cpu+0x225/0x670
>> [69069.878368]  [<ffffffff860b79f0>] task_numa_migrate+0x550/0x950
>> [69069.936059]  [<ffffffff863d9138>] ? find_next_bit+0x18/0x20
>> [69069.993262]  [<ffffffff860b7e6d>] numa_migrate_preferred+0x7d/0x90
>> [69070.050528]  [<ffffffff860b89a5>] task_numa_fault+0x7c5/0xaa0
>> [69070.106544]  [<ffffffff861a2c0b>] ? mpol_misplaced+0x16b/0x1b0
>> [69070.163705]  [<ffffffff8618104e>] __handle_mm_fault+0x9ae/0x11f0
>> [69070.220013]  [<ffffffff865e4c52>] ? inet_recvmsg+0x72/0x90
>> [69070.276558]  [<ffffffff8655240b>] ? SYSC_recvfrom+0x12b/0x170
>> [69070.332283]  [<ffffffff8618196f>] handle_mm_fault+0xdf/0x180
>> [69070.388515]  [<ffffffff8604f324>] __do_page_fault+0x164/0x380
>> [69070.443897]  [<ffffffff860b25c3>] ? account_user_time+0x73/0x80
>> [69070.498534]  [<ffffffff860b2b3e>] ? vtime_account_user+0x4e/0x70
>> [69070.552598]  [<ffffffff8604f5a7>] do_page_fault+0x37/0x90
>> [69070.605960]  [<ffffffff86002a23>] ? syscall_return_slowpath+0x83/0xf0
>> [69070.660705]  [<ffffffff866b32f8>] page_fault+0x28/0x30
>> [69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75 ca 49
>> 8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83 c1 01
>> <48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86
>> [69070.835144] RIP  [<ffffffff860b372c>] task_h_load+0xcc/0x100
>> [69070.894095]  RSP <ffff880fda2c7b50>
>> [69070.953213] ---[ end trace 8d6f449a03dacfd4 ]---
>>
>> Would be nice if we can fix this in 4.4?
>
> Does this also happen in 4.5?  Did this work on 4.4.5?  Some previous
> release?  If you can find the offending patch using 'git bisect', that
> would be great.
>
> thanks,
>
> greg k-h
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-03-19 22:26   ` Vlastimil Babka
@ 2016-03-20 21:27     ` Stefan Priebe
  2016-03-20 21:41       ` Greg KH
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Priebe @ 2016-03-20 21:27 UTC (permalink / raw)
  To: Vlastimil Babka, Greg KH
  Cc: LKML, stable, linux-mm, linux-mm, Peter Zijlstra, Ingo Molnar,
	Rik van Riel


Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
> On 03/17/2016 07:45 PM, Greg KH wrote:
>> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
>>> Hi,
>>>
>>> while running qemu 2.5 on a host running 4.4.6 the host system has
>>> crashed
>>> (load > 200) 3 times in the last 3 days.
>>>
>>> Always with this stack trace: (copy left here:
>>> http://pastebin.com/raw/bCWTLKyt)
>>>
>>> [69068.874268] divide error: 0000 [#1] SMP
>>> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
>>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
>>> ip6_tables
>>> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
>>> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
>>> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
>>> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
>>> xt_multiport
>>> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
>>> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
>>> 8021q
>>> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
>>> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
>>> sd_mod
>>> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
>>> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
>>> mdio ptp
>>> pps_core
>>> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
>>> 4.4.6+7-ph #1
>>> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
>>> 03/17/2012
>>> [69068.898578] task: ffff880fc7f28000 ti: ffff880fda2c4000 task.ti:
>>> ffff880fda2c4000
>>> [69068.900377] RIP: 0010:[<ffffffff860b372c>]  [<ffffffff860b372c>]
>>> task_h_load+0xcc/0x100
>
> decodecode says:
>
>    27:   48 83 c1 01             add    $0x1,%rcx
>    2b:*  48 f7 f1                div    %rcx             <-- trapping
> instruction
>
> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>
>          update_cfs_rq_h_load(cfs_rq);
>          return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>                          cfs_rq_load_avg(cfs_rq) + 1);
>
> So the load avg is -1, thus after adding 1 we get division by 0, huh?

Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those 
commits up to 4.5 for fair.c:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5

It didn't happen again with v4.4.6 + 4.5 patches for fair.c

Greets,
Stefan

>
>>> [69068.961763] RSP: 0000:ffff880fda2c7b50  EFLAGS: 00010257
>>> [69069.023910] RAX: 0000000000000000 RBX: ffff880fda2c7c10 RCX:
>>> 0000000000000000
>>> [69069.085953] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
>>> ffff880fc7f28000
>>> [69069.151731] RBP: ffff880fda2c7bc8 R08: 00000001041955df R09:
>>> ffff880fffd153f8
>>> [69069.213757] R10: 0000000000000009 R11: 0000000000000193 R12:
>>> ffff881f6832c780
>>> [69069.274271] R13: ffff88203fc35380 R14: 0000000000000007 R15:
>>> 00000000000000b6
>>> [69069.334727] FS:  00007f578a3fb700(0000) GS:ffff880fffd00000(0000)
>>> knlGS:0000000000000000
>>> [69069.396435] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [69069.458522] CR2: 00007f5784f18468 CR3: 0000001fe9738000 CR4:
>>> 00000000000026e0
>>> [69069.520799] Stack:
>>> [69069.581430]  ffffffff860b6855 ffff880fda2c7b78 0000000000000000
>>> 0000000000000005
>>> [69069.642629]  ffff880fffd00000 0000000000000015 fffffffffffffe7d
>>> 0000000000000015
>>> [69069.702815]  0000000000015380 ffff880fda2c7bc8 ffff880fc7f28000
>>> 00000000000001e9
>>> [69069.761881] Call Trace:
>>> [69069.819883]  [<ffffffff860b6855>] ? task_numa_find_cpu+0x225/0x670
>>> [69069.878368]  [<ffffffff860b79f0>] task_numa_migrate+0x550/0x950
>>> [69069.936059]  [<ffffffff863d9138>] ? find_next_bit+0x18/0x20
>>> [69069.993262]  [<ffffffff860b7e6d>] numa_migrate_preferred+0x7d/0x90
>>> [69070.050528]  [<ffffffff860b89a5>] task_numa_fault+0x7c5/0xaa0
>>> [69070.106544]  [<ffffffff861a2c0b>] ? mpol_misplaced+0x16b/0x1b0
>>> [69070.163705]  [<ffffffff8618104e>] __handle_mm_fault+0x9ae/0x11f0
>>> [69070.220013]  [<ffffffff865e4c52>] ? inet_recvmsg+0x72/0x90
>>> [69070.276558]  [<ffffffff8655240b>] ? SYSC_recvfrom+0x12b/0x170
>>> [69070.332283]  [<ffffffff8618196f>] handle_mm_fault+0xdf/0x180
>>> [69070.388515]  [<ffffffff8604f324>] __do_page_fault+0x164/0x380
>>> [69070.443897]  [<ffffffff860b25c3>] ? account_user_time+0x73/0x80
>>> [69070.498534]  [<ffffffff860b2b3e>] ? vtime_account_user+0x4e/0x70
>>> [69070.552598]  [<ffffffff8604f5a7>] do_page_fault+0x37/0x90
>>> [69070.605960]  [<ffffffff86002a23>] ? syscall_return_slowpath+0x83/0xf0
>>> [69070.660705]  [<ffffffff866b32f8>] page_fault+0x28/0x30
>>> [69070.715707] Code: 86 b8 00 00 00 48 89 86 b0 00 00 00 48 85 c9 75
>>> ca 49
>>> 8b 81 b0 00 00 00 49 8b 49 78 31 d2 48 0f af 87 d8 01 00 00 5d 48 83
>>> c1 01
>>> <48> f7 f1 c3 4c 89 ce 48 8b 8e c0 00 00 00 48 8b 46 78 4c 89 86
>>> [69070.835144] RIP  [<ffffffff860b372c>] task_h_load+0xcc/0x100
>>> [69070.894095]  RSP <ffff880fda2c7b50>
>>> [69070.953213] ---[ end trace 8d6f449a03dacfd4 ]---
>>>
>>> Would be nice if we can fix this in 4.4?
>>
>> Does this also happen in 4.5?  Did this work on 4.4.5?  Some previous
>> release?  If you can find the offending patch using 'git bisect', that
>> would be great.
>>
>> thanks,
>>
>> greg k-h
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-03-20 21:27     ` Stefan Priebe
@ 2016-03-20 21:41       ` Greg KH
  2016-03-21 10:52         ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 19+ messages in thread
From: Greg KH @ 2016-03-20 21:41 UTC (permalink / raw)
  To: Stefan Priebe
  Cc: Vlastimil Babka, LKML, stable, linux-mm, linux-mm,
	Peter Zijlstra, Ingo Molnar, Rik van Riel

On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
> 
> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
> >On 03/17/2016 07:45 PM, Greg KH wrote:
> >>On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
> >>>Hi,
> >>>
> >>>while running qemu 2.5 on a host running 4.4.6 the host system has
> >>>crashed
> >>>(load > 200) 3 times in the last 3 days.
> >>>
> >>>Always with this stack trace: (copy left here:
> >>>http://pastebin.com/raw/bCWTLKyt)
> >>>
> >>>[69068.874268] divide error: 0000 [#1] SMP
> >>>[69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
> >>>nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> >>>ip6_tables
> >>>ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
> >>>nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
> >>>nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
> >>>kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
> >>>xt_multiport
> >>>iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
> >>>scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
> >>>8021q
> >>>garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
> >>>raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
> >>>sd_mod
> >>>ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
> >>>i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
> >>>mdio ptp
> >>>pps_core
> >>>[69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
> >>>4.4.6+7-ph #1
> >>>[69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
> >>>03/17/2012
> >>>[69068.898578] task: ffff880fc7f28000 ti: ffff880fda2c4000 task.ti:
> >>>ffff880fda2c4000
> >>>[69068.900377] RIP: 0010:[<ffffffff860b372c>]  [<ffffffff860b372c>]
> >>>task_h_load+0xcc/0x100
> >
> >decodecode says:
> >
> >   27:   48 83 c1 01             add    $0x1,%rcx
> >   2b:*  48 f7 f1                div    %rcx             <-- trapping
> >instruction
> >
> >This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
> >
> >         update_cfs_rq_h_load(cfs_rq);
> >         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> >                         cfs_rq_load_avg(cfs_rq) + 1);
> >
> >So the load avg is -1, thus after adding 1 we get division by 0, huh?
> 
> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up
> to 4.5 for fair.c:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
> 
> It didn't happen again with v4.4.6 + 4.5 patches for fair.c

Ok, that's a lot of patches, how about figuring out which single patch,
or shortest number of patches, makes things work again?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-03-20 21:41       ` Greg KH
@ 2016-03-21 10:52         ` Stefan Priebe - Profihost AG
  2016-03-21 13:38           ` Greg KH
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Priebe - Profihost AG @ 2016-03-21 10:52 UTC (permalink / raw)
  To: Greg KH
  Cc: Vlastimil Babka, LKML, stable, linux-mm, linux-mm,
	Peter Zijlstra, Ingo Molnar, Rik van Riel


Am 20.03.2016 um 22:41 schrieb Greg KH:
> On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
>>
>> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
>>> On 03/17/2016 07:45 PM, Greg KH wrote:
>>>> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
>>>>> Hi,
>>>>>
>>>>> while running qemu 2.5 on a host running 4.4.6 the host system has
>>>>> crashed
>>>>> (load > 200) 3 times in the last 3 days.
>>>>>
>>>>> Always with this stack trace: (copy left here:
>>>>> http://pastebin.com/raw/bCWTLKyt)
>>>>>
>>>>> [69068.874268] divide error: 0000 [#1] SMP
>>>>> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
>>>>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
>>>>> ip6_tables
>>>>> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
>>>>> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
>>>>> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
>>>>> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
>>>>> xt_multiport
>>>>> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
>>>>> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
>>>>> 8021q
>>>>> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
>>>>> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
>>>>> sd_mod
>>>>> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
>>>>> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
>>>>> mdio ptp
>>>>> pps_core
>>>>> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
>>>>> 4.4.6+7-ph #1
>>>>> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
>>>>> 03/17/2012
>>>>> [69068.898578] task: ffff880fc7f28000 ti: ffff880fda2c4000 task.ti:
>>>>> ffff880fda2c4000
>>>>> [69068.900377] RIP: 0010:[<ffffffff860b372c>]  [<ffffffff860b372c>]
>>>>> task_h_load+0xcc/0x100
>>>
>>> decodecode says:
>>>
>>>   27:   48 83 c1 01             add    $0x1,%rcx
>>>   2b:*  48 f7 f1                div    %rcx             <-- trapping
>>> instruction
>>>
>>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>>>
>>>         update_cfs_rq_h_load(cfs_rq);
>>>         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>>>                         cfs_rq_load_avg(cfs_rq) + 1);
>>>
>>> So the load avg is -1, thus after adding 1 we get division by 0, huh?
>>
>> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up
>> to 4.5 for fair.c:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
>>
>> It didn't happen again with v4.4.6 + 4.5 patches for fair.c
> 
> Ok, that's a lot of patches, how about figuring out which single patch,
> or shortest number of patches, makes things work again?

will do so but it seems most out of those 9 patches are based on each
other. So it wouldn't be easy.

Stefan

> 
> thanks,
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-03-21 10:52         ` Stefan Priebe - Profihost AG
@ 2016-03-21 13:38           ` Greg KH
  2016-05-17  6:01             ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 19+ messages in thread
From: Greg KH @ 2016-03-21 13:38 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Vlastimil Babka, LKML, stable, linux-mm, linux-mm,
	Peter Zijlstra, Ingo Molnar, Rik van Riel

On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG wrote:
> 
> Am 20.03.2016 um 22:41 schrieb Greg KH:
> > On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
> >>
> >> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
> >>> On 03/17/2016 07:45 PM, Greg KH wrote:
> >>>> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
> >>>>> Hi,
> >>>>>
> >>>>> while running qemu 2.5 on a host running 4.4.6 the host system has
> >>>>> crashed
> >>>>> (load > 200) 3 times in the last 3 days.
> >>>>>
> >>>>> Always with this stack trace: (copy left here:
> >>>>> http://pastebin.com/raw/bCWTLKyt)
> >>>>>
> >>>>> [69068.874268] divide error: 0000 [#1] SMP
> >>>>> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
> >>>>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> >>>>> ip6_tables
> >>>>> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
> >>>>> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
> >>>>> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
> >>>>> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
> >>>>> xt_multiport
> >>>>> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
> >>>>> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
> >>>>> 8021q
> >>>>> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
> >>>>> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
> >>>>> sd_mod
> >>>>> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
> >>>>> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
> >>>>> mdio ptp
> >>>>> pps_core
> >>>>> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
> >>>>> 4.4.6+7-ph #1
> >>>>> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
> >>>>> 03/17/2012
> >>>>> [69068.898578] task: ffff880fc7f28000 ti: ffff880fda2c4000 task.ti:
> >>>>> ffff880fda2c4000
> >>>>> [69068.900377] RIP: 0010:[<ffffffff860b372c>]  [<ffffffff860b372c>]
> >>>>> task_h_load+0xcc/0x100
> >>>
> >>> decodecode says:
> >>>
> >>>   27:   48 83 c1 01             add    $0x1,%rcx
> >>>   2b:*  48 f7 f1                div    %rcx             <-- trapping
> >>> instruction
> >>>
> >>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
> >>>
> >>>         update_cfs_rq_h_load(cfs_rq);
> >>>         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> >>>                         cfs_rq_load_avg(cfs_rq) + 1);
> >>>
> >>> So the load avg is -1, thus after adding 1 we get division by 0, huh?
> >>
> >> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up
> >> to 4.5 for fair.c:
> >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
> >>
> >> It didn't happen again with v4.4.6 + 4.5 patches for fair.c
> > 
> > Ok, that's a lot of patches, how about figuring out which single patch,
> > or shortest number of patches, makes things work again?
> 
> will do so but it seems most out of those 9 patches are based on each
> other. So it wouldn't be easy.

Worst case, we take all 9.  Best case, we only need one :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-03-21 13:38           ` Greg KH
@ 2016-05-17  6:01             ` Stefan Priebe - Profihost AG
  2016-05-17  9:21               ` Campbell Steven
  0 siblings, 1 reply; 19+ messages in thread
From: Stefan Priebe - Profihost AG @ 2016-05-17  6:01 UTC (permalink / raw)
  To: Greg KH
  Cc: Vlastimil Babka, LKML, Campbell Steven, linux-mm, linux-mm,
	Peter Zijlstra, Ingo Molnar, Rik van Riel

Am 21.03.2016 um 14:38 schrieb Greg KH:
> On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG wrote:
>>
>> Am 20.03.2016 um 22:41 schrieb Greg KH:
>>> On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
>>>>
>>>> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
>>>>> On 03/17/2016 07:45 PM, Greg KH wrote:
>>>>>> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> while running qemu 2.5 on a host running 4.4.6 the host system has
>>>>>>> crashed
>>>>>>> (load > 200) 3 times in the last 3 days.
>>>>>>>
>>>>>>> Always with this stack trace: (copy left here:
>>>>>>> http://pastebin.com/raw/bCWTLKyt)
>>>>>>>
>>>>>>> [69068.874268] divide error: 0000 [#1] SMP
>>>>>>> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
>>>>>>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
>>>>>>> ip6_tables
>>>>>>> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
>>>>>>> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
>>>>>>> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
>>>>>>> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
>>>>>>> xt_multiport
>>>>>>> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
>>>>>>> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
>>>>>>> 8021q
>>>>>>> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
>>>>>>> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
>>>>>>> sd_mod
>>>>>>> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
>>>>>>> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
>>>>>>> mdio ptp
>>>>>>> pps_core
>>>>>>> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
>>>>>>> 4.4.6+7-ph #1
>>>>>>> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
>>>>>>> 03/17/2012
>>>>>>> [69068.898578] task: ffff880fc7f28000 ti: ffff880fda2c4000 task.ti:
>>>>>>> ffff880fda2c4000
>>>>>>> [69068.900377] RIP: 0010:[<ffffffff860b372c>]  [<ffffffff860b372c>]
>>>>>>> task_h_load+0xcc/0x100
>>>>>
>>>>> decodecode says:
>>>>>
>>>>>   27:   48 83 c1 01             add    $0x1,%rcx
>>>>>   2b:*  48 f7 f1                div    %rcx             <-- trapping
>>>>> instruction
>>>>>
>>>>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>>>>>
>>>>>         update_cfs_rq_h_load(cfs_rq);
>>>>>         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>>>>>                         cfs_rq_load_avg(cfs_rq) + 1);
>>>>>
>>>>> So the load avg is -1, thus after adding 1 we get division by 0, huh?
>>>>
>>>> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up
>>>> to 4.5 for fair.c:
>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
>>>>
>>>> It didn't happen again with v4.4.6 + 4.5 patches for fair.c
>>>
>>> Ok, that's a lot of patches, how about figuring out which single patch,
>>> or shortest number of patches, makes things work again?
>>
>> will do so but it seems most out of those 9 patches are based on each
>> other. So it wouldn't be easy.
> 
> Worst case, we take all 9.  Best case, we only need one :)

Today i got another report from: casteven (at) gmail.com having the same
issue. I wasn't able to figure out which of those 9 patches helps - i
still can just tell that it does not happen with this series on top of 4.4.

Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-05-17  6:01             ` Stefan Priebe - Profihost AG
@ 2016-05-17  9:21               ` Campbell Steven
  2016-06-22  1:19                 ` Campbell Steven
  0 siblings, 1 reply; 19+ messages in thread
From: Campbell Steven @ 2016-05-17  9:21 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Greg KH, Vlastimil Babka, LKML, linux-mm, linux-mm,
	Peter Zijlstra, Ingo Molnar, Rik van Riel

Thanks Stefan,

I am seeing this on 4.5.0 and 4.5.4 both are compiled from mainline
neither include any patches over and above the tree. I ran for well
over a month in production on 4.5.0 with no issues at all on a single
socket server (E5-2670 v3 @ 2.30GHz) but as soon as we try to run
either 4.5.0 or 4.5.4 on a dual socket box we get these traces after
12 or so hours of uptime and the same sky rocketing load. Given you
have stopped your issues by applying those patches to 4.4 it's
possible I have different issues, but the symptoms are too similar to
ignore, our trace is here:

http://pastebin.com/Q7nd8nfP

I am using the cfg scheduler, and not the fair scheduler as Stefan
reported so that is another difference.

Happy to proceed with any suggestions or further information that may
help isolate this, use case on these servers are 12 x CEPH osd's with
bcache so they are seeing alot of disk activity via NVMe and SATA
buses.

Thanks

Campbell

On 17 May 2016 at 18:01, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> Am 21.03.2016 um 14:38 schrieb Greg KH:
>> On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG wrote:
>>>
>>> Am 20.03.2016 um 22:41 schrieb Greg KH:
>>>> On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
>>>>>
>>>>> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
>>>>>> On 03/17/2016 07:45 PM, Greg KH wrote:
>>>>>>> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> while running qemu 2.5 on a host running 4.4.6 the host system has
>>>>>>>> crashed
>>>>>>>> (load > 200) 3 times in the last 3 days.
>>>>>>>>
>>>>>>>> Always with this stack trace: (copy left here:
>>>>>>>> http://pastebin.com/raw/bCWTLKyt)
>>>>>>>>
>>>>>>>> [69068.874268] divide error: 0000 [#1] SMP
>>>>>>>> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
>>>>>>>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
>>>>>>>> ip6_tables
>>>>>>>> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
>>>>>>>> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
>>>>>>>> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
>>>>>>>> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
>>>>>>>> xt_multiport
>>>>>>>> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
>>>>>>>> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
>>>>>>>> 8021q
>>>>>>>> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
>>>>>>>> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
>>>>>>>> sd_mod
>>>>>>>> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
>>>>>>>> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
>>>>>>>> mdio ptp
>>>>>>>> pps_core
>>>>>>>> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
>>>>>>>> 4.4.6+7-ph #1
>>>>>>>> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
>>>>>>>> 03/17/2012
>>>>>>>> [69068.898578] task: ffff880fc7f28000 ti: ffff880fda2c4000 task.ti:
>>>>>>>> ffff880fda2c4000
>>>>>>>> [69068.900377] RIP: 0010:[<ffffffff860b372c>]  [<ffffffff860b372c>]
>>>>>>>> task_h_load+0xcc/0x100
>>>>>>
>>>>>> decodecode says:
>>>>>>
>>>>>>   27:   48 83 c1 01             add    $0x1,%rcx
>>>>>>   2b:*  48 f7 f1                div    %rcx             <-- trapping
>>>>>> instruction
>>>>>>
>>>>>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>>>>>>
>>>>>>         update_cfs_rq_h_load(cfs_rq);
>>>>>>         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>>>>>>                         cfs_rq_load_avg(cfs_rq) + 1);
>>>>>>
>>>>>> So the load avg is -1, thus after adding 1 we get division by 0, huh?
>>>>>
>>>>> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up
>>>>> to 4.5 for fair.c:
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
>>>>>
>>>>> It didn't happen again with v4.4.6 + 4.5 patches for fair.c
>>>>
>>>> Ok, that's a lot of patches, how about figuring out which single patch,
>>>> or shortest number of patches, makes things work again?
>>>
>>> will do so but it seems most out of those 9 patches are based on each
>>> other. So it wouldn't be easy.
>>
>> Worst case, we take all 9.  Best case, we only need one :)
>
> Today i got another report from: casteven (at) gmail.com having the same
> issue. I wasn't able to figure out which of those 9 patches helps - i
> still can just tell that it does not happen with this series on top of 4.4.
>
> Stefan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-05-17  9:21               ` Campbell Steven
@ 2016-06-22  1:19                 ` Campbell Steven
  2016-06-22  6:13                   ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Campbell Steven @ 2016-06-22  1:19 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Greg KH, Vlastimil Babka, LKML, linux-mm, linux-mm,
	Peter Zijlstra, Ingo Molnar, Rik van Riel

On 17 May 2016 at 21:21, Campbell Steven <casteven@gmail.com> wrote:
> Thanks Stefan,
>
> I am seeing this on 4.5.0 and 4.5.4 both are compiled from mainline
> neither include any patches over and above the tree. I ran for well
> over a month in production on 4.5.0 with no issues at all on a single
> socket server (E5-2670 v3 @ 2.30GHz) but as soon as we try to run
> either 4.5.0 or 4.5.4 on a dual socket box we get these traces after
> 12 or so hours of uptime and the same sky rocketing load. Given you
> have stopped your issues by applying those patches to 4.4 it's
> possible I have different issues, but the symptoms are too similar to
> ignore, our trace is here:
>
> http://pastebin.com/Q7nd8nfP
>
> I am using the cfg scheduler, and not the fair scheduler as Stefan
> reported so that is another difference.
>
> Happy to proceed with any suggestions or further information that may
> help isolate this, use case on these servers are 12 x CEPH osd's with
> bcache so they are seeing alot of disk activity via NVMe and SATA
> buses.
>
> Thanks
>
> Campbell
>
> On 17 May 2016 at 18:01, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
>> Am 21.03.2016 um 14:38 schrieb Greg KH:
>>> On Mon, Mar 21, 2016 at 11:52:23AM +0100, Stefan Priebe - Profihost AG wrote:
>>>>
>>>> Am 20.03.2016 um 22:41 schrieb Greg KH:
>>>>> On Sun, Mar 20, 2016 at 10:27:23PM +0100, Stefan Priebe wrote:
>>>>>>
>>>>>> Am 19.03.2016 um 23:26 schrieb Vlastimil Babka:
>>>>>>> On 03/17/2016 07:45 PM, Greg KH wrote:
>>>>>>>> On Thu, Mar 17, 2016 at 07:38:03PM +0100, Stefan Priebe wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> while running qemu 2.5 on a host running 4.4.6 the host system has
>>>>>>>>> crashed
>>>>>>>>> (load > 200) 3 times in the last 3 days.
>>>>>>>>>
>>>>>>>>> Always with this stack trace: (copy left here:
>>>>>>>>> http://pastebin.com/raw/bCWTLKyt)
>>>>>>>>>
>>>>>>>>> [69068.874268] divide error: 0000 [#1] SMP
>>>>>>>>> [69068.875242] Modules linked in: ebtable_filter ebtables ip6t_REJECT
>>>>>>>>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
>>>>>>>>> ip6_tables
>>>>>>>>> ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment nf_conntrack_ipv4
>>>>>>>>> nf_defrag_ipv4 xt_tcpudp xt_mark xt_set xt_addrtype xt_conntrack
>>>>>>>>> nf_conntrack ip_set_hash_net ip_set vhost_net tun vhost macvtap macvlan
>>>>>>>>> kvm_intel nfnetlink_log kvm nfnetlink irqbypass netconsole dlm
>>>>>>>>> xt_multiport
>>>>>>>>> iptable_filter ip_tables x_tables iscsi_tcp libiscsi_tcp libiscsi
>>>>>>>>> scsi_transport_iscsi nfsd auth_rpcgss oid_registry bonding coretemp
>>>>>>>>> 8021q
>>>>>>>>> garp fuse i2c_i801 i7core_edac edac_core i5500_temp button btrfs xor
>>>>>>>>> raid6_pq dm_mod raid1 md_mod usb_storage ohci_hcd bcache sg usbhid
>>>>>>>>> sd_mod
>>>>>>>>> ata_generic uhci_hcd ehci_pci ehci_hcd usbcore ata_piix usb_common igb
>>>>>>>>> i2c_algo_bit mpt3sas raid_class ixgbe scsi_transport_sas i2c_core
>>>>>>>>> mdio ptp
>>>>>>>>> pps_core
>>>>>>>>> [69068.895604] CPU: 14 PID: 6673 Comm: ceph-osd Not tainted
>>>>>>>>> 4.4.6+7-ph #1
>>>>>>>>> [69068.897052] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1
>>>>>>>>> 03/17/2012
>>>>>>>>> [69068.898578] task: ffff880fc7f28000 ti: ffff880fda2c4000 task.ti:
>>>>>>>>> ffff880fda2c4000
>>>>>>>>> [69068.900377] RIP: 0010:[<ffffffff860b372c>]  [<ffffffff860b372c>]
>>>>>>>>> task_h_load+0xcc/0x100
>>>>>>>
>>>>>>> decodecode says:
>>>>>>>
>>>>>>>   27:   48 83 c1 01             add    $0x1,%rcx
>>>>>>>   2b:*  48 f7 f1                div    %rcx             <-- trapping
>>>>>>> instruction
>>>>>>>
>>>>>>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>>>>>>>
>>>>>>>         update_cfs_rq_h_load(cfs_rq);
>>>>>>>         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>>>>>>>                         cfs_rq_load_avg(cfs_rq) + 1);
>>>>>>>
>>>>>>> So the load avg is -1, thus after adding 1 we get division by 0, huh?
>>>>>>
>>>>>> Yes CONFIG_FAIR_GROUP_SCHED is set. I cherry picked now all those commits up
>>>>>> to 4.5 for fair.c:
>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/kernel/sched/fair.c?h=v4.5
>>>>>>
>>>>>> It didn't happen again with v4.4.6 + 4.5 patches for fair.c
>>>>>
>>>>> Ok, that's a lot of patches, how about figuring out which single patch,
>>>>> or shortest number of patches, makes things work again?
>>>>
>>>> will do so but it seems most out of those 9 patches are based on each
>>>> other. So it wouldn't be easy.
>>>
>>> Worst case, we take all 9.  Best case, we only need one :)
>>
>> Today i got another report from: casteven (at) gmail.com having the same
>> issue. I wasn't able to figure out which of those 9 patches helps - i
>> still can just tell that it does not happen with this series on top of 4.4.
>>
>> Stefan

Just reinvigorating this thread based off yesterdays post by Yannis
here: https://lkml.org/lkml/headers/2016/6/21/491 of which the trace
is more or less is identical to what we see.

In addition to this we have since tested the 4.6.x releases and have
seen the same issues as we have in all the 4.4 and 4.5 releases we
tried. I have noticed a few commits in this area of the kernel into
the 4.7 series, curious if this may be solved already in 4.7 and if
there is a patch to 4.6 we could possibly test, our constraint is that
we are only able to reproduce this in our production environment it
seems, we never see it in our test/dev systems.

We have also found we can work around the issue by disabling all cores
on the second CPU socket (e.g. echo 1 >
/sys/devices/system/cpu/cpuX/online), doing this "fixes" this problem
for us but is obviously not ideal or long term but gives stability if
like us you need to run kernels in these series for various other
unrelated reasons and can do without your second socket.

Campbell

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-06-22  1:19                 ` Campbell Steven
@ 2016-06-22  6:13                   ` Peter Zijlstra
  2016-07-06 23:20                     ` Campbell Steven
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2016-06-22  6:13 UTC (permalink / raw)
  To: Campbell Steven
  Cc: Stefan Priebe - Profihost AG, Greg KH, Vlastimil Babka, LKML,
	linux-mm, linux-mm, Ingo Molnar, Rik van Riel

On Wed, Jun 22, 2016 at 01:19:54PM +1200, Campbell Steven wrote:
> >>>>>>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
> >>>>>>>
> >>>>>>>         update_cfs_rq_h_load(cfs_rq);
> >>>>>>>         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> >>>>>>>                         cfs_rq_load_avg(cfs_rq) + 1);
> >>>>>>>


---
commit 8974189222159154c55f24ddad33e3613960521a
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Jun 16 10:50:40 2016 +0200

    sched/fair: Fix cfs_rq avg tracking underflow
    
    As per commit:
    
      b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")
    
    > the code generated from update_cfs_rq_load_avg():
    >
    > 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    > 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    > 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
    > 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    > 		removed_load = 1;
    > 	}
    >
    > turns into:
    >
    > ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
    > ffffffff8108706b:       48 85 c0                test   %rax,%rax
    > ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
    > ffffffff81087070:       4c 89 f8                mov    %r15,%rax
    > ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
    > ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
    > ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
    > ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
    > ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
    > ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
    >
    > Which you'll note ends up with sa->load_avg -= r in memory at
    > ffffffff8108707a.
    
    So I _should_ have looked at other unserialized users of ->load_avg,
    but alas. Luckily nikbor reported a similar /0 from task_h_load() which
    instantly triggered recollection of this here problem.
    
    Aside from the intermediate value hitting memory and causing problems,
    there's another problem: the underflow detection relies on the signed
    bit. This reduces the effective width of the variables, IOW its
    effectively the same as having these variables be of signed type.
    
    This patch changes to a different means of unsigned underflow
    detection to not rely on the signed bit. This allows the variables to
    use the 'full' unsigned range. And it does so with explicit LOAD -
    STORE to ensure any intermediate value will never be visible in
    memory, allowing these unserialized loads.
    
    Note: GCC generates crap code for this, might warrant a look later.
    
    Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
           maybe we should do clamping on add too.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Yuyang Du <yuyang.du@intel.com>
    Cc: bsegall@google.com
    Cc: kernel@kyup.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
    Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2348deab7a3..2ae68f0e3bf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2904,6 +2904,23 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	}
 }
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
@@ -2913,15 +2930,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->load_avg, r);
+		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed_load = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->util_avg, r);
+		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 		removed_util = 1;
 	}
 
@@ -2994,10 +3011,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
-	cfs_rq->avg.load_sum = max_t(s64,  cfs_rq->avg.load_sum - se->avg.load_sum, 0);
-	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
-	cfs_rq->avg.util_sum = max_t(s32,  cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 
 	cfs_rq_util_change(cfs_rq);
 }

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-06-22  6:13                   ` Peter Zijlstra
@ 2016-07-06 23:20                     ` Campbell Steven
  2016-07-07  7:42                       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Campbell Steven @ 2016-07-06 23:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stefan Priebe - Profihost AG, Greg KH, Vlastimil Babka, LKML,
	linux-mm, linux-mm, Ingo Molnar, Rik van Riel

On 22 June 2016 at 18:13, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 22, 2016 at 01:19:54PM +1200, Campbell Steven wrote:
>> >>>>>>> This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:
>> >>>>>>>
>> >>>>>>>         update_cfs_rq_h_load(cfs_rq);
>> >>>>>>>         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>> >>>>>>>                         cfs_rq_load_avg(cfs_rq) + 1);
>> >>>>>>>
>
>
> ---
> commit 8974189222159154c55f24ddad33e3613960521a
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Thu Jun 16 10:50:40 2016 +0200
>
>     sched/fair: Fix cfs_rq avg tracking underflow
>
>     As per commit:
>
>       b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")
>
>     > the code generated from update_cfs_rq_load_avg():
>     >
>     >   if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>     >           s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
>     >           sa->load_avg = max_t(long, sa->load_avg - r, 0);
>     >           sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
>     >           removed_load = 1;
>     >   }


Hi Peter,

I just wanted to report back to say thanks for this, and we have (and
others) have tested this out in 4.7 rc6 and have not been able to
repeat the issue. It seems that anyone running busy ceph osd's or high
load KVM instances is able to trigger this on a dual socket box pretty
easily.

Since these early reports from Stefan and I it looks like it's been
hit but alot more folks now so I'd like to ask what the process is for
getting this backported into 4.6, 4.5 and 4.4 as in our testing all
those versions for their latest point release seem to have the same
problem.

Thanks

Campbell

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-07-06 23:20                     ` Campbell Steven
@ 2016-07-07  7:42                       ` Peter Zijlstra
  2016-07-09  5:21                         ` Greg KH
  2016-07-11 22:33                         ` Greg KH
  0 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2016-07-07  7:42 UTC (permalink / raw)
  To: Campbell Steven
  Cc: Stefan Priebe - Profihost AG, Greg KH, Vlastimil Babka, LKML,
	linux-mm, linux-mm, Ingo Molnar, Rik van Riel

On Thu, Jul 07, 2016 at 11:20:36AM +1200, Campbell Steven wrote:

> > commit 8974189222159154c55f24ddad33e3613960521a
> > Author: Peter Zijlstra <peterz@infradead.org>
> > Date:   Thu Jun 16 10:50:40 2016 +0200

> Since these early reports from Stefan and I it looks like it's been
> hit but alot more folks now so I'd like to ask what the process is for
> getting this backported into 4.6, 4.5 and 4.4 as in our testing all
> those versions for their latest point release seem to have the same
> problem.

I think this should do; Greg is on Cc and will mark the commit
somewhere. It is already in Linus' tree and should indeed be sufficient.

It has a Fixes tag referring the commit that introduced it, which IIRC
is somewhere around v4.2.

Greg, anything else required?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-07-07  7:42                       ` Peter Zijlstra
@ 2016-07-09  5:21                         ` Greg KH
  2016-07-11 22:33                         ` Greg KH
  1 sibling, 0 replies; 19+ messages in thread
From: Greg KH @ 2016-07-09  5:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Campbell Steven, Stefan Priebe - Profihost AG, Vlastimil Babka,
	LKML, linux-mm, linux-mm, Ingo Molnar, Rik van Riel

On Thu, Jul 07, 2016 at 09:42:32AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 07, 2016 at 11:20:36AM +1200, Campbell Steven wrote:
> 
> > > commit 8974189222159154c55f24ddad33e3613960521a
> > > Author: Peter Zijlstra <peterz@infradead.org>
> > > Date:   Thu Jun 16 10:50:40 2016 +0200
> 
> > Since these early reports from Stefan and I it looks like it's been
> > hit but alot more folks now so I'd like to ask what the process is for
> > getting this backported into 4.6, 4.5 and 4.4 as in our testing all
> > those versions for their latest point release seem to have the same
> > problem.
> 
> I think this should do; Greg is on Cc and will mark the commit
> somewhere. It is already in Linus' tree and should indeed be sufficient.
> 
> It has a Fixes tag referring the commit that introduced it, which IIRC
> is somewhere around v4.2.
> 
> Greg, anything else required?

Nope, that should be fine.

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-07-07  7:42                       ` Peter Zijlstra
  2016-07-09  5:21                         ` Greg KH
@ 2016-07-11 22:33                         ` Greg KH
  2016-07-12 13:12                           ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Greg KH @ 2016-07-11 22:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Campbell Steven, Stefan Priebe - Profihost AG, Vlastimil Babka,
	LKML, linux-mm, linux-mm, Ingo Molnar, Rik van Riel

On Thu, Jul 07, 2016 at 09:42:32AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 07, 2016 at 11:20:36AM +1200, Campbell Steven wrote:
> 
> > > commit 8974189222159154c55f24ddad33e3613960521a
> > > Author: Peter Zijlstra <peterz@infradead.org>
> > > Date:   Thu Jun 16 10:50:40 2016 +0200
> 
> > Since these early reports from Stefan and I it looks like it's been
> > hit but alot more folks now so I'd like to ask what the process is for
> > getting this backported into 4.6, 4.5 and 4.4 as in our testing all
> > those versions for their latest point release seem to have the same
> > problem.
> 
> I think this should do; Greg is on Cc and will mark the commit
> somewhere. It is already in Linus' tree and should indeed be sufficient.
> 
> It has a Fixes tag referring the commit that introduced it, which IIRC
> is somewhere around v4.2.
> 
> Greg, anything else required?

Oops, this commit does not apply cleanly to 4.6 or 4.4-stable trees.
Can someone send me the backported verision that they have tested to
work properly so I can queue it up?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-07-11 22:33                         ` Greg KH
@ 2016-07-12 13:12                           ` Peter Zijlstra
  2016-07-13  0:26                             ` Greg KH
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2016-07-12 13:12 UTC (permalink / raw)
  To: Greg KH
  Cc: Campbell Steven, Stefan Priebe - Profihost AG, Vlastimil Babka,
	LKML, linux-mm, linux-mm, Ingo Molnar, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 454 bytes --]

On Mon, Jul 11, 2016 at 03:33:53PM -0700, Greg KH wrote:

> Oops, this commit does not apply cleanly to 4.6 or 4.4-stable trees.
> Can someone send me the backported verision that they have tested to
> work properly so I can queue it up?

I've never actually been able to reproduce, but the attached patches
apply, the reject was trivial.

They seem to compile and boot on my main test rig, but nothing else was
done but build the next kernel with it.



[-- Attachment #2: sched-stable-4.4.patch --]
[-- Type: text/x-diff, Size: 5885 bytes --]

commit 8974189222159154c55f24ddad33e3613960521a
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Jun 16 10:50:40 2016 +0200

    sched/fair: Fix cfs_rq avg tracking underflow
    
    As per commit:
    
      b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")
    
    > the code generated from update_cfs_rq_load_avg():
    >
    > 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    > 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    > 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
    > 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    > 		removed_load = 1;
    > 	}
    >
    > turns into:
    >
    > ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
    > ffffffff8108706b:       48 85 c0                test   %rax,%rax
    > ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
    > ffffffff81087070:       4c 89 f8                mov    %r15,%rax
    > ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
    > ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
    > ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
    > ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
    > ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
    > ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
    >
    > Which you'll note ends up with sa->load_avg -= r in memory at
    > ffffffff8108707a.
    
    So I _should_ have looked at other unserialized users of ->load_avg,
    but alas. Luckily nikbor reported a similar /0 from task_h_load() which
    instantly triggered recollection of this here problem.
    
    Aside from the intermediate value hitting memory and causing problems,
    there's another problem: the underflow detection relies on the signed
    bit. This reduces the effective width of the variables, IOW its
    effectively the same as having these variables be of signed type.
    
    This patch changes to a different means of unsigned underflow
    detection to not rely on the signed bit. This allows the variables to
    use the 'full' unsigned range. And it does so with explicit LOAD -
    STORE to ensure any intermediate value will never be visible in
    memory, allowing these unserialized loads.
    
    Note: GCC generates crap code for this, might warrant a look later.
    
    Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
           maybe we should do clamping on add too.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Yuyang Du <yuyang.du@intel.com>
    Cc: bsegall@google.com
    Cc: kernel@kyup.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
    Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

---
 kernel/sched/fair.c |   33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2682,6 +2682,23 @@ static inline void update_tg_load_avg(st
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
@@ -2690,15 +2707,15 @@ static inline int update_cfs_rq_load_avg
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->load_avg, r);
+		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->util_avg, r);
+		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 	}
 
 	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
@@ -2764,10 +2781,10 @@ static void detach_entity_load_avg(struc
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
-	cfs_rq->avg.load_sum = max_t(s64,  cfs_rq->avg.load_sum - se->avg.load_sum, 0);
-	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
-	cfs_rq->avg.util_sum = max_t(s32,  cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 }
 
 /* Add the load generated by se into cfs_rq's load average */

[-- Attachment #3: sched-stable-4.6.patch --]
[-- Type: text/x-diff, Size: 5732 bytes --]

commit 8974189222159154c55f24ddad33e3613960521a
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Jun 16 10:50:40 2016 +0200

    sched/fair: Fix cfs_rq avg tracking underflow

    As per commit:

      b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

    > the code generated from update_cfs_rq_load_avg():
    >
    > 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    > 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    > 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
    > 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    > 		removed_load = 1;
    > 	}
    >
    > turns into:
    >
    > ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
    > ffffffff8108706b:       48 85 c0                test   %rax,%rax
    > ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
    > ffffffff81087070:       4c 89 f8                mov    %r15,%rax
    > ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
    > ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
    > ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
    > ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
    > ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
    > ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
    >
    > Which you'll note ends up with sa->load_avg -= r in memory at
    > ffffffff8108707a.

    So I _should_ have looked at other unserialized users of ->load_avg,
    but alas. Luckily nikbor reported a similar /0 from task_h_load() which
    instantly triggered recollection of this here problem.

    Aside from the intermediate value hitting memory and causing problems,
    there's another problem: the underflow detection relies on the signed
    bit. This reduces the effective width of the variables, IOW its
    effectively the same as having these variables be of signed type.

    This patch changes to a different means of unsigned underflow
    detection to not rely on the signed bit. This allows the variables to
    use the 'full' unsigned range. And it does so with explicit LOAD -
    STORE to ensure any intermediate value will never be visible in
    memory, allowing these unserialized loads.

    Note: GCC generates crap code for this, might warrant a look later.

    Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
           maybe we should do clamping on add too.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Yuyang Du <yuyang.du@intel.com>
    Cc: bsegall@google.com
    Cc: kernel@kyup.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
    Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2821,6 +2821,23 @@ static inline void update_tg_load_avg(st
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 /* Group cfs_rq's load_avg is used for task_h_load and update_cfs_share */
 static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
@@ -2829,15 +2846,15 @@ static inline int update_cfs_rq_load_avg
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
-		sa->load_avg = max_t(long, sa->load_avg - r, 0);
-		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->load_avg, r);
+		sub_positive(&sa->load_sum, r * LOAD_AVG_MAX);
 		removed = 1;
 	}
 
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
-		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+		sub_positive(&sa->util_avg, r);
+		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
 	}
 
 	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
@@ -2927,10 +2944,10 @@ static void detach_entity_load_avg(struc
 			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
 
-	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
-	cfs_rq->avg.load_sum = max_t(s64,  cfs_rq->avg.load_sum - se->avg.load_sum, 0);
-	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
-	cfs_rq->avg.util_sum = max_t(s32,  cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+	sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
+	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 }
 
 /* Add the load generated by se into cfs_rq's load average */

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-07-12 13:12                           ` Peter Zijlstra
@ 2016-07-13  0:26                             ` Greg KH
  0 siblings, 0 replies; 19+ messages in thread
From: Greg KH @ 2016-07-13  0:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Campbell Steven, Stefan Priebe - Profihost AG, Vlastimil Babka,
	LKML, linux-mm, linux-mm, Ingo Molnar, Rik van Riel

On Tue, Jul 12, 2016 at 03:12:35PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 11, 2016 at 03:33:53PM -0700, Greg KH wrote:
> 
> > Oops, this commit does not apply cleanly to 4.6 or 4.4-stable trees.
> > Can someone send me the backported verision that they have tested to
> > work properly so I can queue it up?
> 
> I've never actually been able to reproduce, but the attached patches
> apply, the reject was trivial.
> 
> They seem to compile and boot on my main test rig, but nothing else was
> done but build the next kernel with it.

Thanks for these, now applied.

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6
  2016-06-21 12:13 Yannis Aribaud
@ 2016-06-22 15:42 ` Yannis Aribaud
  0 siblings, 0 replies; 19+ messages in thread
From: Yannis Aribaud @ 2016-06-22 15:42 UTC (permalink / raw)
  To: linux-kernel

21 juin 2016 14:13 "Yannis Aribaud" <bugs@d6bell.net> a écrit:
> Hi everyone,
> 
> I recently it this bug in the kernel using a vanilla 4.6.2 release.
> It seems that somewhere in the load average calculation a division by 0 occurs (see the stack trace
> at the end).
>
> [snipped]
> 
> I'm not an expert at all but I suspect that is the issue's origin. Shouldn't the function
> cfs_rq_load_avg use an atomic_long_read() to avoid this ?

After digging a bit more, this can't be the problem as this function obviously can't return negative value.

I found that it can maybe come from the update_cfs_rq_load_avg function in the following block:

	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
		sa->load_avg = max_t(long, sa->load_avg - r, 0);
		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
		removed_load = 1;
	}

The max_t(long, sa->load_avg - r, 0) can result in a negative value keeped by the max_t function as the long would wrap up then generate a division by zero in task_h_load function.

Best regards,
--
Yannis Aribaud

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: divide error: 0000 [#1] SMP in task_numa_migrate -   handle_mm_fault vanilla 4.4.6
@ 2016-06-21 12:13 Yannis Aribaud
  2016-06-22 15:42 ` Yannis Aribaud
  0 siblings, 1 reply; 19+ messages in thread
From: Yannis Aribaud @ 2016-06-21 12:13 UTC (permalink / raw)
  To: linux-kernel

Hi everyone,

I recently it this bug in the kernel using a vanilla 4.6.2 release.
It seems that somewhere in the load average calculation a division by 0 occurs (see the stack trace
at the end).

After digging a bit (be fair it's my first time) in the kernel sources, I found that we "recently"
added the function cfs_rq_load_avg (commit 6f2b04524f0b38bfbb8413f98d2d6af234508309) and started
using it in the function task_h_load which do a division with the value returned
(kernel/sched/fair.c) like this:

static unsigned long task_h_load(struct task_struct *p)
{
    struct cfs_rq *cfs_rq = task_cfs_rq(p);

    update_cfs_rq_h_load(cfs_rq);
    return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
        cfs_rq_load_avg(cfs_rq) + 1);
}

But the load_avg filed from sched_avg struct is an atomic_long_t and the cfs_rq_load_avg returns
this field as an unsigned long without doing any type conversion.

static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
{
    return cfs_rq->avg.load_avg;
}

I'm not an expert at all but I suspect that is the issue's origin. Shouldn't the function
cfs_rq_load_avg use an atomic_long_read() to avoid this ?

Here is the stack trace:

[534814.112500] divide error: 0000 [#1] SMP
[534814.112550] Modules linked in: vhost_net vhost macvtap macvlan ipmi_si mpt3sas raid_class
scsi_transport_sas ipmi_devintf dell_rbu tun nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace
fscache sunrpc bridge 8021q garp mrp stp llc bonding xfs libcrc32c bcache usbhid hid uhci_hcd
ohci_hcd x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass ghash_clmulni_intel iTCO_wdt
iTCO_vendor_support sha256_generic hmac drbg dcdbas ansi_cprng aesni_intel aes_x86_64 ablk_helper
cryptd lrw gf128mul glue_helper shpchp evdev sb_edac edac_core ehci_pci ehci_hcd lpc_ich usbcore
mfd_core usb_common ipmi_msghandler acpi_cpufreq wmi tpm_tis tpm processor acpi_power_meter button
ext4 crc16 jbd2 mbcache sg sd_mod dm_mod crc32c_intel igb megaraid_sas i2c_algo_bit i2c_core dca
ptp scsi_mod pps_core [last unloaded: ipmi_si]
[534814.113345] CPU: 10 PID: 38568 Comm: ceph-osd Not tainted 4.6.2-ig1virt #16
[534814.113390] Hardware name: Dell Inc. PowerEdge R730xd/0H21J3, BIOS 1.1.4 11/03/2014
[534814.113458] task: ffff88100cf5ef00 ti: ffff8814827e0000 task.ti: ffff8814827e0000
[534814.113525] RIP: 0010:[<ffffffff8106cfd7>] [<ffffffff8106cfd7>] task_h_load+0x4f/0xc7
[534814.113613] RSP: 0000:ffff8814827e3c00 EFLAGS: 00010256
[534814.113654] RAX: 0000000000000000 RBX: 00000000000000d7 RCX: 0000000000000000
[534814.113720] RDX: 0000000000000000 RSI: ffff88103d8a5f00 RDI: ffff88100cf5ef00
[534814.113786] RBP: ffff8814827e3c90 R08: 0000000107f70c76 R09: 0000000000000000
[534814.113851] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
[534814.113917] R13: 0000000000000015 R14: 0000000000000000 R15: ffff88207ec14580
[534814.113984] FS: 00007eff83cbb700(0000) GS:ffff88107f4a0000(0000) knlGS:0000000000000000
[534814.114053] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[534814.114095] CR2: 00000000146b07c0 CR3: 000000103f605000 CR4: 00000000001426e0
[534814.119456] Stack:
[534814.119488] ffffffff8106fc4c ffff88103d639400 00000000000000d7 0000000000014580
[534814.119571] fffffffffffffe19 ffff88107f4a0000 00000000000000d7 0000000000000027
[534814.119653] ffff88100cf5ef00 000000000000025f 0000000000000100 0000000000000188
[534814.119736] Call Trace:
[534814.119774] [<ffffffff8106fc4c>] ? task_numa_find_cpu+0x1d2/0x2ec
[534814.119819] [<ffffffff8106fe86>] ? task_numa_migrate+0x120/0x328
[534814.119864] [<ffffffff81067829>] ? ttwu_do_wakeup+0xf/0xcd
[534814.119907] [<ffffffff81071176>] ? task_numa_fault+0x912/0x9a9
[534814.119954] [<ffffffff81128568>] ? mpol_misplaced+0x138/0x14a
[534814.120001] [<ffffffff8110f39d>] ? handle_mm_fault+0xe28/0xf31
[534814.120046] [<ffffffff8113db0b>] ? fput+0xd/0x81
[534814.120087] [<ffffffff8103cd91>] ? __do_page_fault+0x425/0x485
[534814.120131] [<ffffffff813c85a2>] ? page_fault+0x22/0x30
[534814.120171] Code: 63 92 38 09 00 00 48 8b 80 b8 00 00 00 48 8b 04 d0 75 1c 48 8b 86 b0 00 00 00
48 8b 4e 78 31 d2 48 0f af 87 58 01 00 00 48 ff c1 <48> f7 f1 c3 48 c7 86 c0 00 00 00 00 00 00 00
48 89 f1 eb 18 48
[534814.120582] RIP [<ffffffff8106cfd7>] task_h_load+0x4f/0xc7
[534814.120628] RSP <ffff8814827e3c00>
[534814.121242] ---[ end trace ca72a3c25fb6f0dc ]---

Best regards,
--
Yannis Aribaud
-- 
Yannis Aribaud

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-07-13  0:28 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-17 18:38 divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault vanilla 4.4.6 Stefan Priebe
2016-03-17 18:45 ` Greg KH
2016-03-19 22:26   ` Vlastimil Babka
2016-03-20 21:27     ` Stefan Priebe
2016-03-20 21:41       ` Greg KH
2016-03-21 10:52         ` Stefan Priebe - Profihost AG
2016-03-21 13:38           ` Greg KH
2016-05-17  6:01             ` Stefan Priebe - Profihost AG
2016-05-17  9:21               ` Campbell Steven
2016-06-22  1:19                 ` Campbell Steven
2016-06-22  6:13                   ` Peter Zijlstra
2016-07-06 23:20                     ` Campbell Steven
2016-07-07  7:42                       ` Peter Zijlstra
2016-07-09  5:21                         ` Greg KH
2016-07-11 22:33                         ` Greg KH
2016-07-12 13:12                           ` Peter Zijlstra
2016-07-13  0:26                             ` Greg KH
2016-06-21 12:13 Yannis Aribaud
2016-06-22 15:42 ` Yannis Aribaud

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).