* some 4.4 issues on 4S Xeon servers
From: Luck, Tony @ 2015-12-07 18:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: ville.syrjala, jmoyer, airlied

4.4 isn't going smoothly on my 4-socket Xeon servers (18 cores per
socket, if that is important). User space is RHEL 7.2. The kernel config
is the RHEL one (with whatever mods result from running "make oldconfig"
and hitting <RETURN> at every question).

1) There was a problem in drm_calc_timestamping_constants();
this was nominally fixed in:
  commit 0c545ac4815657e0b062344c690ea35a11eeaec8
  drm: Don't oops in drm_calc_timestamping_constants() if drm_vblank_init() wasn't called

but even with this commit I see issues when rebooting:

[90740.859356] mgag200 0000:08:00.0: Video card doesn't support cursors with partial transparency.
[90740.869076] mgag200 0000:08:00.0: Not enabling hardware cursor.
[90741.160249] ixgbe 0000:03:00.1: removed PHC on ens1f1
[90742.533306] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[90742.533318] IP: [<ffffffff816921fc>] _raw_spin_lock+0xc/0x30
[90742.533319] PGD 0
[90742.533321] Oops: 0002 [#1] SMP
[90742.533345] Modules linked in: fuse(E) xt_CHECKSUM(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) tun(E) ip6t_rpfilter(E) ip6t_REJECT(E) nf_reject_ipv6(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_conntrack(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) ebtable_filter(E) ebtables(E) ip6table_nat(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) nf_nat_ipv6(E) ip6table_mangle(E) ip6table_security(E) ip6table_raw(E) ip6table_filter(E) ip6_tables(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) iptable_mangle(E) iptable_security(E) iptable_raw(E) iptable_filter(E) vfat(E) fat(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) aesni_intel(E) lrw(E) iTCO_wdt(E) gf128mul(E) glue_helper(E) iTCO_vendor_support(E) mei_me(E) ablk_helper(E) cryptd(E) sb_edac(E) lpc_ich(E) ipmi_ssif(E) pcspkr(E) edac_core(E) mei(E) sg(E) shpchp(E) i2c_i801(E) mfd_core(E) wmi(E) ipmi_si(E) ipmi_msghandler(E) acpi_pad(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ip_tables(E) xfs(E) libcrc32c(E) sr_mod(E) sd_mod(E) cdrom(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ixgbe(E) ttm(E) ahci(E) mdio(E) drm(E) ptp(E) libahci(E) crc32c_intel(E) mpt2sas(E) pps_core(E) raid_class(E) libata(E) i2c_core(E) scsi_transport_sas(E) dca(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
[90742.533368] CPU: 93 PID: 3641 Comm: Xorg Tainted: G            E   4.3.0 #1
[90742.533369] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRBDXSD1.86B.0329.R00.1510031403 10/03/2015
[90742.533370] task: ffff885fe8ad6a40 ti: ffff885fef220000 task.ti: ffff885fef220000
[90742.533373] RIP: 0010:[<ffffffff816921fc>]  [<ffffffff816921fc>] _raw_spin_lock+0xc/0x30
[90742.533374] RSP: 0018:ffff885fef223930  EFLAGS: 00010246
[90742.533374] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[90742.533375] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000060
[90742.533376] RBP: ffff885fef223950 R08: 0000000000000000 R09: ffff887fed175600
[90742.533376] R10: ffffffffa063fe37 R11: ffff883fecd30538 R12: 0000000000000060
[90742.533376] R13: 0000000000000000 R14: ffff885fee041418 R15: 0000000000000080
[90742.533377] FS:  00007f9847d5da00(0000) GS:ffff887ffef40000(0000) knlGS:0000000000000000
[90742.533378] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[90742.533379] CR2: 0000000000000060 CR3: 0000001fde949000 CR4: 00000000003406e0
[90742.533380] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[90742.533380] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[90742.533380] Stack:
[90742.533382]  ffffffffa0625b10 ffff883fee322800 0000000000000000 ffff885ff1de0b60
[90742.533383]  ffff885fef223a40 ffffffffa024974a 0000000000000000 0000000002408240
[90742.533384]  ffff881ffec07ac0 ffff885ff1de0800 ffff883fee322800 ffff883fee322800
[90742.533384] Call Trace:
[90742.533403]  [<ffffffffa0625b10>] ? drm_gem_object_lookup+0x20/0xb0 [drm]
[90742.533408]  [<ffffffffa024974a>] mga_crtc_cursor_set+0xda/0xbb0 [mgag200]
[90742.533410]  [<ffffffff8169040b>] ? __ww_mutex_lock+0x1b/0x97
[90742.533419]  [<ffffffffa02213e3>] restore_fbdev_mode+0xf3/0x2b0 [drm_kms_helper]
[90742.533423]  [<ffffffffa0223673>] drm_fb_helper_restore_fbdev_mode_unlocked+0x33/0x80 [drm_kms_helper]
[90742.533425]  [<ffffffffa02236ec>] drm_fb_helper_set_par+0x2c/0x50 [drm_kms_helper]
[90742.533429]  [<ffffffff81395627>] fb_set_var+0x197/0x410
[90742.533458]  [<ffffffffa02b83c6>] ? xfs_trans_add_item+0x46/0x60 [xfs]
[90742.533472]  [<ffffffffa02c8015>] ? _xfs_trans_bjoin+0x45/0x60 [xfs]
[90742.533484]  [<ffffffffa02c841b>] ? xfs_trans_read_buf_map+0x16b/0x2a0 [xfs]
[90742.533487]  [<ffffffff8138c41e>] fbcon_blank+0x1de/0x2e0
[90742.533487]  [<ffffffff8138c41e>] fbcon_blank+0x1de/0x2e0
[90742.533499]  [<ffffffffa02bf88b>] ? xfs_buf_item_unlock+0xdb/0x1d0 [xfs]
[90742.533504]  [<ffffffff8140d388>] do_unblank_screen+0xb8/0x1c0
[90742.533506]  [<ffffffff81404748>] vt_ioctl+0x1078/0x1100
[90742.533511]  [<ffffffff811e2f89>] ? __slab_free+0x69/0x230
[90742.533521]  [<ffffffffa029123a>] ? xfs_perag_get+0x2a/0xa0 [xfs]
[90742.533523]  [<ffffffff813f7c24>] tty_ioctl+0x3d4/0xbb0
[90742.533526]  [<ffffffff8144b3d4>] ? vga_arb_release+0xc4/0x120
[90742.533530]  [<ffffffff81216ee2>] do_vfs_ioctl+0x2d2/0x4b0
[90742.533535]  [<ffffffff8112874f>] ? __audit_syscall_entry+0xaf/0x100
[90742.533539]  [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
[90742.533540]  [<ffffffff81217139>] SyS_ioctl+0x79/0x90
[90742.533542]  [<ffffffff816925ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[90742.533552] Code: 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 cd 27 af ff 5d
[90742.533554] RIP  [<ffffffff816921fc>] _raw_spin_lock+0xc/0x30
[90742.533555]  RSP <ffff885fef223930>
[90742.533555] CR2: 0000000000000060
[90742.533557] ---[ end trace cf82a3bf5f6d06ee ]---
[90742.533559] Kernel panic - not syncing: Fatal exception
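
A note on reading the oops: CR2, RDI and R12 all hold 0x60, and the faulting
instruction is inside _raw_spin_lock(). That is the pattern you get when code
takes the address of a member at offset 0x60 through a NULL structure pointer
and hands it to a lock function. The userspace sketch below only reproduces
that arithmetic. `struct fake_obj`, its fields and both helper functions are
hypothetical names made up for illustration; the trace does not establish
which real structure was NULL in mga_crtc_cursor_set().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 0x60 bytes of other fields, then a lock word.
 * This is NOT a real kernel structure; it only reproduces the offset. */
struct fake_obj {
	unsigned char other_fields[0x60];
	uint32_t lock;
};

/* Address that "&obj->lock" would evaluate to for a given base address.
 * Integer arithmetic is used so that a NULL base is never dereferenced. */
uintptr_t lock_addr_for_base(uintptr_t base)
{
	/* Sanity check on the assumed layout above. */
	assert(offsetof(struct fake_obj, lock) == 0x60);
	return base + offsetof(struct fake_obj, lock);
}

/* Fault address to expect when the base pointer is NULL. */
uintptr_t fault_addr_for_null_base(void)
{
	return lock_addr_for_base((uintptr_t)0);
}
```

With a NULL base the computed address is just the member offset, which matches
the 0000000000000060 reported in CR2.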

2) commit 0c545ac4815657e0b062344c690ea35a11eeaec8 is the newest
kernel that I can actually boot.  Anything newer (that I have tried) hangs at:

[   13.767231] ixgbe 0000:03:00.1: a0:36:9f:24:43:c2
[   13.923091] ixgbe 0000:03:00.1: Intel(R) 10 Gigabit Network Connection
[   13.932878] ixgbe 0000:03:00.1 ens1f1: renamed from eth1
[   13.949557] ixgbe 0000:03:00.0 ens1f0: renamed from eth0
[  138.402071] dracut-initqueue[1469]: Warning: dracut-initqueue timeout - starting timeout scripts

Note that in v4.3 there is a six-second pause at this point:
[   14.825759] ixgbe 0000:03:00.1: a0:36:9f:24:43:c2
[   14.882227] mpt2sas0: host_add: handle(0x0001), sas_addr(0x500605b006c4e4a0), phys(8)
[   15.077844] ixgbe 0000:03:00.1: Intel(R) 10 Gigabit Network Connection
[   15.087688] ixgbe 0000:03:00.0 ens1f0: renamed from eth0
[   15.103608] ixgbe 0000:03:00.1 ens1f1: renamed from eth1
[   21.013069] mpt2sas0: port enable: SUCCESS
[   21.019024] scsi 0:1:0:0: Direct-Access     LSI      Logical Volume   3000 PQ: 0 ANSI: 6

I tried to bisect this ... but everything was "bad" (hung at this point)
and the blame was placed on 2dbe5495 "brd: Refuse improperly aligned discard requests"
which looks rather improbable (i.e. I messed up the bisection).


3) This one predates v4.3 (it might have been OK in v4.2 ... not sure). The
system hangs during reboot:

[   43.509007] Adjusting hpet more than 11% (2075134332 vs 2601267907)
[  284.274254] mgag200 0000:08:00.0: Video card doesn't support cursors with partial transparency.
[  284.283982] mgag200 0000:08:00.0: Not enabling hardware cursor.
[  284.309770] XFS (dm-2): Unmounting Filesystem
[  284.359860] ixgbe 0000:03:00.1: removed PHC on ens1f1

4) Even when I'm working from some base kernel that does boot,
sometimes an innocuous change will result in:

[   14.193566] usb 2-1.4: new full-speed USB device number 4 using ehci-pci
[   14.224008] md: ... autorun DONE.
[   14.240038] List of all partitions:
[   14.256245] No filesystem could mount root, tried:
[   14.273929] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

Adding some extra code/data:

void perturb(void)
{
	printk("This is never used, but will make kernel boot\n");
}

will usually make things work again. So it looks like something is
sensitive to alignment or to where code/data ends up.
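
As a toy illustration of why an unused function could matter at all: items in
a section are placed one after another, each rounded up to its own alignment,
so inserting anything shifts the address of everything placed after it. The
sketch below models only that arithmetic. `struct item`, `align_up()`,
`layout()` and any sizes fed to them are made up for illustration, not
measured from this kernel, and nothing here identifies which object is
actually sensitive to its placement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical section item: a size and a power-of-two alignment. */
struct item {
	size_t size;
	size_t align;
};

/* Round addr up to the next multiple of align (a power of two). */
size_t align_up(size_t addr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* Place n items back to back starting at base, storing each item's
 * start address in out[]. Returns the first address past the last item. */
size_t layout(const struct item *items, size_t n, size_t base, size_t *out)
{
	size_t addr = base;

	for (size_t i = 0; i < n; i++) {
		addr = align_up(addr, items[i].align);
		out[i] = addr;
		addr += items[i].size;
	}
	return addr;
}
```

Running layout() on the same list with and without one small extra item in
the middle shows every later start address moving, which is all that adding
perturb() is known to do here.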

-Tony
