* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
       [not found] <HE1PR0401MB25389DFE7BFEC86453C3CB21EDBE0@HE1PR0401MB2538.eurprd04.prod.outlook.com>
@ 2018-04-11  9:45 ` Daniel Sobe
  2018-04-11 10:43   ` Larry Chen
  2018-04-13  1:54   ` [Ocfs2-devel] " Changwei Ge
  0 siblings, 2 replies; 32+ messages in thread
From: Daniel Sobe @ 2018-04-11  9:45 UTC (permalink / raw)
  To: ocfs2-devel

Hi,

Having used OCFS2 successfully for a while on Debian 8 with its default kernel "3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08)", I'm now facing issues trying to do the same with newer kernels on Debian 9. The problems are shown below; they seem to be the same even though the kernels differ.

One trace is from the then-current stock kernel of Debian 9 (4.9.82), the other from a very recent kernel (4.16-rc6). In the latter case, the OOM killer was triggered shortly before the bug appeared (about 85 seconds earlier, going by the log timestamps), so it may be related. The OOM killer's call trace is appended after the two BUG traces.

In both cases, only one machine was active. The cluster is configured for two machines, but the second system has not been set up for the cluster yet. Only one OCFS2 file system was mounted, and that mount was shared with several namespaces (using LXC). Although the mount was read-write, the users/containers only read from this file system.

Please let me know what I can do to get rid of this issue. I can provide more information about my use case if required.

I already posted this to ocfs2-users; only afterwards did I see that it is now recommended to report bugs on ocfs2-devel.

Regards,

Daniel


Mar 22 19:26:55 drs1s005 kernel: [ 7545.707568] ------------[ cut here ]------------
Mar 22 19:26:55 drs1s005 kernel: [ 7545.707600] kernel BUG at /build/linux-YDazDa/linux-4.9.82/fs/ocfs2/dlmglue.c:825!
Mar 22 19:26:55 drs1s005 kernel: [ 7545.707635] invalid opcode: 0000 [#1] SMP
Mar 22 19:26:55 drs1s005 kernel: [ 7545.707654] Modules linked in: appletalk ax25 ipx p8023 p8022 psnap veth ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree nls_utf8 cifs sha256_ssse3 cmac md4 des_generic arc4 dns_resolver fscache iptable_filter bridge stp llc bonding fuse intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt iTCO_vendor_support kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nls_ascii nls_cp437 vfat fat intel_cstate intel_uncore intel_rapl_perf efi_pstore efivars pcspkr mgag200 ttm sg drm_kms_helper lpc_ich mfd_core drm i2c_algo_bit hpwdt hpilo ioatdma evdev dca shpchp wmi ipmi_si acpi_power_meter ipmi_msghandler pcc_cpufreq button drbd lru_cache libcrc32c efivarfs ip_tables x_tables autofs4 uas usb_storage
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708060]  ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache dm_mod sd_mod crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd xhci_pci uhci_hcd ehci_pci xhci_hcd ehci_hcd i2c_i801 i2c_smbus i40e tg3 hpsa usbcore ptp scsi_transport_sas usb_common pps_core libphy scsi_mod [last unloaded: configfs]
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708231] CPU: 24 PID: 64700 Comm: perl Not tainted 4.9.0-6-amd64 #1 Debian 4.9.82-1+deb9u3
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708268] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708304] task: ffff990fda6ef100 task.stack: ffffb62f36464000
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708331] RIP: 0010:[<ffffffffc0b1180d>]  [<ffffffffc0b1180d>] __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708422] RSP: 0018:ffffb62f36467b38  EFLAGS: 00010046
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708447] RAX: 0000000000000292 RBX: ffff990fda6c5618 RCX: 0000000000000001
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708479] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff990fda6c5694
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708510] RBP: 0000000000000003 R08: 0000000000000101 R09: 0000000000000000
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708541] R10: 0000000000000038 R11: 000000000000007c R12: ffff990fda6c5694
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708572] R13: ffff991bb0f76000 R14: 0000000000000000 R15: ffffffffc0ba5080
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708604] FS:  0000000000000000(0000) GS:ffff991bbea80000(0063) knlGS:00000000f7462700
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708640] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708666] CR2: ffffffffff600000 CR3: 000000341a7b6000 CR4: 0000000000360670
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708697] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708728] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708759] Stack:
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708771]  ffffffffc0b12b45 0000000000000000 ffff99101a537300 ffff99101a51c4c8
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708812]  ffff990fda6c5e00 ffffffffc0b02274 ffff99101a537180 ffff99101a4e04c8
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708849]  0000000000000000 ffff99101a537300 dad51186f40d61bf ffff99101a4e04c8
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708886] Call Trace:
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708919]  [<ffffffffc0b12b45>] ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Mar 22 19:26:55 drs1s005 kernel: [ 7545.708964]  [<ffffffffc0b02274>] ? ocfs2_dentry_attach_lock+0x2d4/0x430 [ocfs2]
Mar 22 19:26:55 drs1s005 kernel: [ 7545.709014]  [<ffffffffc0b2b6f1>] ? ocfs2_lookup+0x1a1/0x2e0 [ocfs2]
Mar 22 19:26:55 drs1s005 kernel: [ 7545.709046]  [<ffffffff8861eb56>] ? d_invalidate+0xb6/0x120
Mar 22 19:26:55 drs1s005 kernel: [ 7545.710690]  [<ffffffff88611a79>] ? lookup_slow+0xa9/0x170
Mar 22 19:26:55 drs1s005 kernel: [ 7545.713068]  [<ffffffff88612199>] ? walk_component+0x1f9/0x330
Mar 22 19:26:55 drs1s005 kernel: [ 7545.717146]  [<ffffffff88612d62>] ? link_path_walk+0x1b2/0x670
Mar 22 19:26:55 drs1s005 kernel: [ 7545.718276]  [<ffffffff88613326>] ? path_lookupat+0x86/0x120
Mar 22 19:26:55 drs1s005 kernel: [ 7545.720158]  [<ffffffff88615d81>] ? filename_lookup+0xb1/0x180
Mar 22 19:26:55 drs1s005 kernel: [ 7545.721266]  [<ffffffff886018da>] ? __check_object_size+0xfa/0x1d8
Mar 22 19:26:55 drs1s005 kernel: [ 7545.722799]  [<ffffffff8875c678>] ? strncpy_from_user+0x48/0x160
Mar 22 19:26:55 drs1s005 kernel: [ 7545.725064]  [<ffffffff886159ba>] ? getname_flags+0x6a/0x1e0
Mar 22 19:26:55 drs1s005 kernel: [ 7545.729544]  [<ffffffff8860a899>] ? vfs_fstatat+0x59/0xb0
Mar 22 19:26:55 drs1s005 kernel: [ 7545.732126]  [<ffffffff8846ccc5>] ? sys32_stat64+0x25/0x60
Mar 22 19:26:55 drs1s005 kernel: [ 7545.733194]  [<ffffffff884033e7>] ? syscall_trace_enter+0x117/0x2c0
Mar 22 19:26:55 drs1s005 kernel: [ 7545.734247]  [<ffffffff88403d5c>] ? do_fast_syscall_32+0xac/0x180
Mar 22 19:26:55 drs1s005 kernel: [ 7545.737170]  [<ffffffff88a12cdf>] ? entry_SYSENTER_compat+0x6f/0x7e
Mar 22 19:26:55 drs1s005 kernel: [ 7545.739819] Code: 89 c6 5b 5d 41 5c 41 5d e9 61 f7 ef c7 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
Mar 22 19:26:55 drs1s005 kernel: [ 7545.742738] RIP  [<ffffffffc0b1180d>] __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
Mar 22 19:26:55 drs1s005 kernel: [ 7545.745017]  RSP <ffffb62f36467b38>
Mar 22 19:26:55 drs1s005 kernel: [ 7545.753445] ---[ end trace 9a87ef237c626c21 ]---
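
A note on reading the trace above: on x86, BUG() is compiled to the two-byte ud2 instruction (0f 0b), which is why the oops is reported as "invalid opcode". The byte in angle brackets on the Code: line marks the faulting RIP. The following is a minimal Python sketch, using the Code: bytes copied from the 4.9 trace; the helper name faulting_bytes is only for illustration and is not part of any kernel tooling.

```python
# Code: bytes copied from the 4.9.82 oops above (wrapped line rejoined).
CODE = (
    "89 c6 5b 5d 41 5c 41 5d e9 61 f7 ef c7 0f 0b 8b 53 68 85 d2 74 15 "
    "83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> "
    "0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f"
)


def faulting_bytes(code_line, count=2):
    """Return `count` bytes starting at the <xx> marker (the faulting RIP)."""
    tokens = code_line.split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            # Strip the angle brackets from the marked byte, then take
            # the bytes that follow it.
            return [tok[1:-1]] + tokens[i + 1:i + count]
    raise ValueError("no <xx> marker found in Code: line")


insn = faulting_bytes(CODE)
# 0f 0b is the encoding of ud2, the instruction BUG() emits on x86.
print(insn, "ud2" if insn == ["0f", "0b"] else "not ud2")
```

So the crash is a deliberate BUG_ON()/BUG() assertion at dlmglue.c:825 rather than a wild jump into garbage.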



Apr 10 13:01:13 drs1s006 kernel: [338194.523354] ------------[ cut here ]------------

Apr 10 13:01:13 drs1s006 kernel: [338194.523358] kernel BUG at /build/linux-sEiGsi/linux-4.16~rc6/fs/ocfs2/dlmglue.c:848!

Apr 10 13:01:13 drs1s006 kernel: [338194.533415] invalid opcode: 0000 [#1] SMP NOPTI

Apr 10 13:01:13 drs1s006 kernel: [338194.538568] Modules linked in: nfnetlink_log nfnetlink cmac arc4 md4 nls_utf8 cifs ccm dns_resolver fscache appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree btrfs zstd_decompress zstd_compress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter bridge stp llc joydev hid_generic usbhid hid fuse nls_ascii nls_cp437 vfat fat uas usb_storage kvm mgag200 ttm drm_kms_helper irqbypass drm efi_pstore crct10dif_pclmul crc32_pclmul pcspkr efivars evdev sg ghash_clmulni_intel ccp(+) hpilo i2c_algo_bit k10temp hpwdt rng_core sp5100_tco shpchp ipmi_si wmi ipmi_devintf ipmi_msghandler acpi_cpufreq button drbd lru_cache libcrc32c efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2

Apr 10 13:01:13 drs1s006 kernel: [338194.613006]  crc32c_generic fscrypto ecb dm_mod ses sd_mod enclosure crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper xhci_pci ehci_pci smartpqi xhci_hcd ehci_hcd scsi_transport_sas scsi_mod tg3 usbcore i2c_piix4 usb_common i40e libphy

Apr 10 13:01:13 drs1s006 kernel: [338194.637598] CPU: 24 PID: 53861 Comm: java Tainted: G         C       4.16.0-rc6-amd64 #1 Debian 4.16~rc6-1~exp1

Apr 10 13:01:13 drs1s006 kernel: [338194.648582] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 12/12/2017

Apr 10 13:01:13 drs1s006 kernel: [338194.659815] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2]

Apr 10 13:01:13 drs1s006 kernel: [338194.667246] RSP: 0018:ffffb25924153ac8 EFLAGS: 00010046

Apr 10 13:01:13 drs1s006 kernel: [338194.673320] RAX: 0000000000000292 RBX: ffff9b1220245e18 RCX: 0000000000000001

Apr 10 13:01:13 drs1s006 kernel: [338194.682963] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9b1220245e94

Apr 10 13:01:13 drs1s006 kernel: [338194.692527] RBP: ffff9b1220245e94 R08: 0000000000000101 R09: 0000000000000000

Apr 10 13:01:13 drs1s006 kernel: [338194.700939] R10: ffffb25924153ab0 R11: 0000000000000073 R12: 0000000000000003

Apr 10 13:01:13 drs1s006 kernel: [338194.709376] R13: ffff9b181850b000 R14: 0000000000000000 R15: ffffffffc1157900

Apr 10 13:01:13 drs1s006 kernel: [338194.717571] FS:  00007f3f3f3f6700(0000) GS:ffff9af81fa00000(0000) knlGS:0000000000000000

Apr 10 13:01:13 drs1s006 kernel: [338194.726842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

Apr 10 13:01:13 drs1s006 kernel: [338194.735192] CR2: 00007f3f9aa9e330 CR3: 00000003483c0000 CR4: 00000000003406e0

Apr 10 13:01:13 drs1s006 kernel: [338194.747998] Call Trace:

Apr 10 13:01:13 drs1s006 kernel: [338194.754363]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]

Apr 10 13:01:13 drs1s006 kernel: [338194.763538]  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]

Apr 10 13:01:13 drs1s006 kernel: [338194.771745]  ocfs2_lookup+0x233/0x2c0 [ocfs2]

Apr 10 13:01:13 drs1s006 kernel: [338194.778048]  lookup_slow+0xa9/0x170

Apr 10 13:01:13 drs1s006 kernel: [338194.782898]  walk_component+0x1c4/0x470

Apr 10 13:01:13 drs1s006 kernel: [338194.788000]  ? inode_permission+0xbe/0x180

Apr 10 13:01:13 drs1s006 kernel: [338194.793238]  link_path_walk+0x2a6/0x510

Apr 10 13:01:13 drs1s006 kernel: [338194.799117]  ? path_init+0x177/0x2f0

Apr 10 13:01:13 drs1s006 kernel: [338194.804087]  path_lookupat+0x56/0x1f0

Apr 10 13:01:13 drs1s006 kernel: [338194.808830]  ? page_cache_tree_insert+0xe0/0xe0

Apr 10 13:01:13 drs1s006 kernel: [338194.814529]  filename_lookup+0xb6/0x190

Apr 10 13:01:13 drs1s006 kernel: [338194.819516]  ? filemap_map_pages+0x228/0x340

Apr 10 13:01:13 drs1s006 kernel: [338194.824993]  ? seccomp_run_filters+0x59/0xc0

Apr 10 13:01:13 drs1s006 kernel: [338194.830396]  ? __check_object_size+0xa7/0x1a0

Apr 10 13:01:13 drs1s006 kernel: [338194.836156]  ? strncpy_from_user+0x48/0x160

Apr 10 13:01:13 drs1s006 kernel: [338194.841458]  ? getname_flags+0x6a/0x1e0

Apr 10 13:01:13 drs1s006 kernel: [338194.846457]  ? vfs_statx+0x73/0xe0

Apr 10 13:01:13 drs1s006 kernel: [338194.851105]  vfs_statx+0x73/0xe0

Apr 10 13:01:13 drs1s006 kernel: [338194.855426]  SYSC_newstat+0x39/0x70

Apr 10 13:01:13 drs1s006 kernel: [338194.860227]  ? syscall_trace_enter+0x145/0x2e0

Apr 10 13:01:13 drs1s006 kernel: [338194.865909]  do_syscall_64+0x6c/0x130

Apr 10 13:01:13 drs1s006 kernel: [338194.870977]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Apr 10 13:01:13 drs1s006 kernel: [338194.877075] RIP: 0033:0x7f3f9b54a4c5

Apr 10 13:01:13 drs1s006 kernel: [338194.882556] RSP: 002b:00007f3f3f3f50c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000004

Apr 10 13:01:13 drs1s006 kernel: [338194.892712] RAX: ffffffffffffffda RBX: 00007f3f945fb9e0 RCX: 00007f3f9b54a4c5

Apr 10 13:01:13 drs1s006 kernel: [338194.901419] RDX: 00007f3f3f3f50d0 RSI: 00007f3f3f3f50d0 RDI: 00007f3f300491e0

Apr 10 13:01:13 drs1s006 kernel: [338194.909732] RBP: 00007f3f3f3f5190 R08: 00007f3f300491e0 R09: 0000000000000028

Apr 10 13:01:13 drs1s006 kernel: [338194.918291] R10: 00007f3f85052380 R11: 0000000000000246 R12: 0000000000000000

Apr 10 13:01:13 drs1s006 kernel: [338194.926582] R13: 00007f3f300491e0 R14: 00007f3f30000a30 R15: 00007f3f945fb800

Apr 10 13:01:13 drs1s006 kernel: [338194.935371] Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 be 9c e1 d4 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f

Apr 10 13:01:13 drs1s006 kernel: [338194.957500] RIP: __ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2] RSP: ffffb25924153ac8

Apr 10 13:01:13 drs1s006 kernel: [338194.967146] ---[ end trace 3743be945c8eeed8 ]---
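
To make the "same problem on both kernels" claim easier to check, here is a small Python sketch that pulls the BUG location and the faulting function out of the two oops headers. The header lines are copied from the traces above; the regular expressions and the summarize helper are assumptions made for this illustration, not an existing tool.

```python
import posixpath
import re

# Header lines copied from the two oopses in this message.
OOPS_4_9 = [
    "kernel BUG at /build/linux-YDazDa/linux-4.9.82/fs/ocfs2/dlmglue.c:825!",
    "RIP: 0010:[<ffffffffc0b1180d>]  [<ffffffffc0b1180d>] "
    "__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]",
]
OOPS_4_16 = [
    "kernel BUG at /build/linux-sEiGsi/linux-4.16~rc6/fs/ocfs2/dlmglue.c:848!",
    "RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2]",
]

BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
RIP_RE = re.compile(r"RIP: .*?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize(lines):
    """Extract source file, line number and faulting function from an oops."""
    info = {"file": None, "line": None, "function": None}
    for line in lines:
        m = BUG_RE.search(line)
        if m:
            info["file"] = posixpath.basename(m.group(1))
            info["line"] = int(m.group(2))
        m = RIP_RE.search(line)
        if m:
            # Drop compiler-generated suffixes such as ".isra.36".
            info["function"] = m.group(1).split(".")[0]
    return info


old, new = summarize(OOPS_4_9), summarize(OOPS_4_16)
print(old)
print(new)
print("same file and function:",
      (old["file"], old["function"]) == (new["file"], new["function"]))
```

Both traces resolve to __ocfs2_cluster_unlock in dlmglue.c; only the line number differs (825 vs. 848), which is expected between kernel versions.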


Apr 10 12:59:47 drs1s006 kernel: [338109.416347] Call Trace:

Apr 10 12:59:47 drs1s006 kernel: [338109.416358]  dump_stack+0x5c/0x85

Apr 10 12:59:47 drs1s006 kernel: [338109.416363]  dump_header+0x6b/0x289

Apr 10 12:59:47 drs1s006 kernel: [338109.416367]  ? apparmor_capable+0xa4/0xe0

Apr 10 12:59:47 drs1s006 kernel: [338109.416369]  oom_kill_process+0x228/0x470

Apr 10 12:59:47 drs1s006 kernel: [338109.416372]  out_of_memory+0x2ab/0x4b0

Apr 10 12:59:47 drs1s006 kernel: [338109.416374]  __alloc_pages_slowpath+0x9f5/0xd80

Apr 10 12:59:47 drs1s006 kernel: [338109.416376]  __alloc_pages_nodemask+0x236/0x250

Apr 10 12:59:47 drs1s006 kernel: [338109.416378]  filemap_fault+0x206/0x650

Apr 10 12:59:47 drs1s006 kernel: [338109.416382]  ? recalc_sigpending+0x17/0x50

Apr 10 12:59:47 drs1s006 kernel: [338109.416384]  ? __set_task_blocked+0x38/0x90

Apr 10 12:59:47 drs1s006 kernel: [338109.416385]  ? __set_current_blocked+0x3d/0x60

Apr 10 12:59:47 drs1s006 kernel: [338109.416426]  ocfs2_fault+0x39/0xe0 [ocfs2]

Apr 10 12:59:47 drs1s006 kernel: [338109.416431]  __do_fault+0x1f/0xb0

Apr 10 12:59:47 drs1s006 kernel: [338109.416433]  __handle_mm_fault+0xca6/0x1220

Apr 10 12:59:47 drs1s006 kernel: [338109.416436]  handle_mm_fault+0xdc/0x210

Apr 10 12:59:47 drs1s006 kernel: [338109.416438]  __do_page_fault+0x256/0x4e0

Apr 10 12:59:47 drs1s006 kernel: [338109.416442]  ? page_fault+0x2f/0x50

Apr 10 12:59:47 drs1s006 kernel: [338109.416444]  page_fault+0x45/0x50

Apr 10 12:59:47 drs1s006 kernel: [338109.416446] RIP: 2dc42c40:0x7fc528cc4401

Apr 10 12:59:47 drs1s006 kernel: [338109.416447] RSP: 28009490:00007fc4eb3f28e0 EFLAGS: 7fc528cc4400

Apr 10 12:59:47 drs1s006 kernel: [338109.416449] Mem-Info:

Apr 10 12:59:47 drs1s006 kernel: [338109.416475] active_anon:7327303 inactive_anon:201509 isolated_anon:0

Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  active_file:24608843 inactive_file:24653281 isolated_file:0

Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  unevictable:0 dirty:5 writeback:0 unstable:0

Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  slab_reclaimable:3415816 slab_unreclaimable:1736671

Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  mapped:210421 shmem:211438 pagetables:30887 bounce:0

Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  free:3783897 free_pcp:100 free_cma:0
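
The Mem-Info counters above are page counts, which are hard to read at a glance. The Python sketch below converts them to GiB. It assumes 4 KiB base pages (the x86-64 default; the log itself does not state the page size), and the helper names are made up for this example. It is only a unit conversion, not a diagnosis of the OOM.

```python
import re

# Mem-Info counters copied from the OOM report above (values are in pages).
MEM_INFO = """
active_anon:7327303 inactive_anon:201509 isolated_anon:0
active_file:24608843 inactive_file:24653281 isolated_file:0
unevictable:0 dirty:5 writeback:0 unstable:0
slab_reclaimable:3415816 slab_unreclaimable:1736671
mapped:210421 shmem:211438 pagetables:30887 bounce:0
free:3783897 free_pcp:100 free_cma:0
"""

PAGE_SIZE = 4096  # assumption: 4 KiB base pages (x86-64 default)


def parse_mem_info(text):
    """Turn 'name:count' pairs into a dict of integer page counts."""
    return {name: int(count) for name, count in re.findall(r"(\w+):(\d+)", text)}


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


counters = parse_mem_info(MEM_INFO)
for name in ("free", "active_anon", "active_file", "inactive_file",
             "slab_reclaimable", "slab_unreclaimable"):
    print(f"{name:20s} {pages_to_gib(counters[name]):8.1f} GiB")
```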


Daniel Sobe
Engineer

BL Car Infotainment & Driver Assistance
NXP Semiconductors Germany GmbH




* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-04-11  9:45 ` [Ocfs2-devel] OCFS2 BUG with 2 different kernels Daniel Sobe
@ 2018-04-11 10:43   ` Larry Chen
  2018-04-11 11:17     ` Daniel Sobe
  2018-04-13  1:54   ` [Ocfs2-devel] " Changwei Ge
  1 sibling, 1 reply; 32+ messages in thread
From: Larry Chen @ 2018-04-11 10:43 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,
If you run mkfs, mount that file system on only one node,
and then share the mount with several namespaces, does the
issue recur?

And could you please show us how you shared the mount with
the other namespaces?

Thanks
Larry

On 04/11/2018 05:45 PM, Daniel Sobe wrote:
>
> Hi,
>
> having used OCFS2 successfully for a while using Debian 8 with its
> default kernel "3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1
> (2018-01-08)", I'm now facing issues trying to accomplish the same
> with newer kernels and Debian 9. Below are the problems that occur,
> they seem to be the same although the kernel is different.
>
> One trace is from the stock kernel of Debian 9 (at that time), the
> other is from a very fresh kernel (4.16-rc6). In the latter case, the
> OOM killer was triggered "shortly" before the bug appeared - it maybe
> related. The call trace is appended below.
>
> In both cases, only one machine was active. The cluster is configured
> for 2 machines, but the cluster is not even configured yet at the 2nd
> system. Only one OCFS2 file system was mounted, and the mount shared
> to several namespaces (using LXC). Although the mount was R/W, the
> users/containers just read from this file system.
>
> Please let me know what I can do to get rid of this issue. I can 
> provide more information about my use case if required.
>
> I already posted to ocfs2-users, only then I saw that it is now 
> recommended to post bugs on ocfs2-devel.
>
> Regards,
>
> Daniel
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707568] ------------[ cut here 
> ]------------
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707600] kernel BUG at 
> /build/linux-YDazDa/linux-4.9.82/fs/ocfs2/dlmglue.c:825!
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707635] invalid opcode: 0000 
> [#1] SMP
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707654] Modules linked in: 
> appletalk ax25 ipx p8023 p8022 psnap veth ocfs2_dlmfs ocfs2_stack_o2cb 
> ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree 
> nls_ut
>
> f8 cifs sha256_ssse3 cmac md4 des_generic arc4 dns_resolver fscache 
> iptable_filter bridge stp llc bonding fuse intel_rapl sb_edac 
> edac_core x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt 
> iTCO_vendor_suppor
>
> t kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel nls_ascii nls_cp437 vfat fat intel_cstate 
> intel_uncore intel_rapl_perf efi_pstore efivars pcspkr mgag200 ttm sg 
> drm_kms_helper lpc_ich mfd
>
> _core drm i2c_algo_bit hpwdt hpilo ioatdma evdev dca shpchp wmi 
> ipmi_si acpi_power_meter ipmi_msghandler pcc_cpufreq button drbd 
> lru_cache libcrc32c efivarfs ip_tables x_tables autofs4 uas usb_storage
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708060]? ext4 crc16 jbd2 
> crc32c_generic fscrypto ecb mbcache dm_mod sd_mod crc32c_intel 
> aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd 
> xhci_pci uhci_hcd e
>
> hci_pci xhci_hcd ehci_hcd i2c_i801 i2c_smbus i40e tg3 hpsa usbcore ptp 
> scsi_transport_sas usb_common pps_core libphy scsi_mod [last unloaded: 
> configfs]
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708231] CPU: 24 PID: 64700 
> Comm: perl Not tainted 4.9.0-6-amd64 #1 Debian 4.9.82-1+deb9u3
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708268] Hardware name: HP 
> ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708304] task: ffff990fda6ef100 
> task.stack: ffffb62f36464000
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708331] RIP: 
> 0010:[<ffffffffc0b1180d>]? [<ffffffffc0b1180d>] 
> __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708422] RSP: 
> 0018:ffffb62f36467b38? EFLAGS: 00010046
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708447] RAX: 0000000000000292 
> RBX: ffff990fda6c5618 RCX: 0000000000000001
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708479] RDX: 0000000000000000 
> RSI: 0000000000000001 RDI: ffff990fda6c5694
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708510] RBP: 0000000000000003 
> R08: 0000000000000101 R09: 0000000000000000
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708541] R10: 0000000000000038 
> R11: 000000000000007c R12: ffff990fda6c5694
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708572] R13: ffff991bb0f76000 
> R14: 0000000000000000 R15: ffffffffc0ba5080
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708604] FS: 
> 0000000000000000(0000) GS:ffff991bbea80000(0063) knlGS:00000000f7462700
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708640] CS:? 0010 DS: 002b ES: 
> 002b CR0: 0000000080050033
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708666] CR2: ffffffffff600000 
> CR3: 000000341a7b6000 CR4: 0000000000360670
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708697] DR0: 0000000000000000 
> DR1: 0000000000000000 DR2: 0000000000000000
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708728] DR3: 0000000000000000 
> DR6: 00000000fffe0ff0 DR7: 0000000000000400
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708759] Stack:
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708771]? ffffffffc0b12b45 
> 0000000000000000 ffff99101a537300 ffff99101a51c4c8
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708812]? ffff990fda6c5e00 
> ffffffffc0b02274 ffff99101a537180 ffff99101a4e04c8
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708849]? 0000000000000000 
> ffff99101a537300 dad51186f40d61bf ffff99101a4e04c8
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708886] Call Trace:
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708919] [<ffffffffc0b12b45>] ? 
> ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708964] [<ffffffffc0b02274>] ? 
> ocfs2_dentry_attach_lock+0x2d4/0x430 [ocfs2]
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.709014] [<ffffffffc0b2b6f1>] ? 
> ocfs2_lookup+0x1a1/0x2e0 [ocfs2]
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.709046] [<ffffffff8861eb56>] ? 
> d_invalidate+0xb6/0x120
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.710690] [<ffffffff88611a79>] ? 
> lookup_slow+0xa9/0x170
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.713068] [<ffffffff88612199>] ? 
> walk_component+0x1f9/0x330
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.717146] [<ffffffff88612d62>] ? 
> link_path_walk+0x1b2/0x670
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.718276] [<ffffffff88613326>] ? 
> path_lookupat+0x86/0x120
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.720158] [<ffffffff88615d81>] ? 
> filename_lookup+0xb1/0x180
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.721266] [<ffffffff886018da>] ? 
> __check_object_size+0xfa/0x1d8
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.722799] [<ffffffff8875c678>] ? 
> strncpy_from_user+0x48/0x160
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.725064] [<ffffffff886159ba>] ? 
> getname_flags+0x6a/0x1e0
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.729544] [<ffffffff8860a899>] ? 
> vfs_fstatat+0x59/0xb0
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.732126] [<ffffffff8846ccc5>] ? 
> sys32_stat64+0x25/0x60
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.733194] [<ffffffff884033e7>] ? 
> syscall_trace_enter+0x117/0x2c0
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.734247] [<ffffffff88403d5c>] ? 
> do_fast_syscall_32+0xac/0x180
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.737170] [<ffffffff88a12cdf>] ? 
> entry_SYSENTER_compat+0x6f/0x7e
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.739819] Code: 89 c6 5b 5d 41 
> 5c 41 5d e9 61 f7 ef c7 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 
> eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 
> 00 66
>
> 2e 0f 1f 84 00 00 00 00 00 0f 1f
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.742738] RIP 
> [<ffffffffc0b1180d>] __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.745017]? RSP <ffffb62f36467b38>
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.753445] ---[ end trace 
> 9a87ef237c626c21 ]---
>
> Apr 10 13:01:13 drs1s006 kernel: [338194.523354] ------------[ cut 
> here ]------------
> Apr 10 13:01:13 drs1s006 kernel: [338194.523358] kernel BUG at 
> /build/linux-sEiGsi/linux-4.16~rc6/fs/ocfs2/dlmglue.c:848!
> Apr 10 13:01:13 drs1s006 kernel: [338194.533415] invalid opcode: 0000 
> [#1] SMP NOPTI
> Apr 10 13:01:13 drs1s006 kernel: [338194.538568] Modules linked in: 
> nfnetlink_log nfnetlink cmac arc4 md4 nls_utf8 cifs ccm dns_resolver 
> fscache appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree 
> btrfs zstd_decompress zstd_compress xxhash xor raid6_pq ufs qnx4 
> hfsplus hfs minix ntfs msdos jfs xfs ocfs2_dlmfs ocfs2_stack_o2cb 
> ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter 
> bridge stp llc joydev hid_generic usbhid hid fuse nls_ascii nls_cp437 
> vfat fat uas usb_storage kvm mgag200 ttm drm_kms_helper irqbypass drm 
> efi_pstore crct10dif_pclmul crc32_pclmul pcspkr efivars evdev sg 
> ghash_clmulni_intel ccp(+) hpilo i2c_algo_bit k10temp hpwdt rng_core 
> sp5100_tco shpchp ipmi_si wmi ipmi_devintf ipmi_msghandler 
> acpi_cpufreq button drbd lru_cache libcrc32c efivarfs ip_tables 
> x_tables autofs4 ext4 crc16 mbcache jbd2
> Apr 10 13:01:13 drs1s006 kernel: [338194.613006]  crc32c_generic fscrypto ecb dm_mod ses sd_mod enclosure crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper xhci_pci ehci_pci smartpqi xhci_hcd ehci_hcd scsi_transport_sas scsi_mod tg3 usbcore i2c_piix4 usb_common i40e libphy
> Apr 10 13:01:13 drs1s006 kernel: [338194.637598] CPU: 24 PID: 53861 Comm: java Tainted: G         C       4.16.0-rc6-amd64 #1 Debian 4.16~rc6-1~exp1
> Apr 10 13:01:13 drs1s006 kernel: [338194.648582] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 12/12/2017
> Apr 10 13:01:13 drs1s006 kernel: [338194.659815] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.667246] RSP: 0018:ffffb25924153ac8 EFLAGS: 00010046
> Apr 10 13:01:13 drs1s006 kernel: [338194.673320] RAX: 0000000000000292 RBX: ffff9b1220245e18 RCX: 0000000000000001
> Apr 10 13:01:13 drs1s006 kernel: [338194.682963] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9b1220245e94
> Apr 10 13:01:13 drs1s006 kernel: [338194.692527] RBP: ffff9b1220245e94 R08: 0000000000000101 R09: 0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.700939] R10: ffffb25924153ab0 R11: 0000000000000073 R12: 0000000000000003
> Apr 10 13:01:13 drs1s006 kernel: [338194.709376] R13: ffff9b181850b000 R14: 0000000000000000 R15: ffffffffc1157900
> Apr 10 13:01:13 drs1s006 kernel: [338194.717571] FS:  00007f3f3f3f6700(0000) GS:ffff9af81fa00000(0000) knlGS:0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.726842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 10 13:01:13 drs1s006 kernel: [338194.735192] CR2: 00007f3f9aa9e330 CR3: 00000003483c0000 CR4: 00000000003406e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.747998] Call Trace:
> Apr 10 13:01:13 drs1s006 kernel: [338194.754363]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.763538]  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.771745]  ocfs2_lookup+0x233/0x2c0 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.778048]  lookup_slow+0xa9/0x170
> Apr 10 13:01:13 drs1s006 kernel: [338194.782898]  walk_component+0x1c4/0x470
> Apr 10 13:01:13 drs1s006 kernel: [338194.788000]  ? inode_permission+0xbe/0x180
> Apr 10 13:01:13 drs1s006 kernel: [338194.793238]  link_path_walk+0x2a6/0x510
> Apr 10 13:01:13 drs1s006 kernel: [338194.799117]  ? path_init+0x177/0x2f0
> Apr 10 13:01:13 drs1s006 kernel: [338194.804087]  path_lookupat+0x56/0x1f0
> Apr 10 13:01:13 drs1s006 kernel: [338194.808830]  ? page_cache_tree_insert+0xe0/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.814529]  filename_lookup+0xb6/0x190
> Apr 10 13:01:13 drs1s006 kernel: [338194.819516]  ? filemap_map_pages+0x228/0x340
> Apr 10 13:01:13 drs1s006 kernel: [338194.824993]  ? seccomp_run_filters+0x59/0xc0
> Apr 10 13:01:13 drs1s006 kernel: [338194.830396]  ? __check_object_size+0xa7/0x1a0
> Apr 10 13:01:13 drs1s006 kernel: [338194.836156]  ? strncpy_from_user+0x48/0x160
> Apr 10 13:01:13 drs1s006 kernel: [338194.841458]  ? getname_flags+0x6a/0x1e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.846457]  ? vfs_statx+0x73/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.851105]  vfs_statx+0x73/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.855426]  SYSC_newstat+0x39/0x70
> Apr 10 13:01:13 drs1s006 kernel: [338194.860227]  ? syscall_trace_enter+0x145/0x2e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.865909]  do_syscall_64+0x6c/0x130
> Apr 10 13:01:13 drs1s006 kernel: [338194.870977]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> Apr 10 13:01:13 drs1s006 kernel: [338194.877075] RIP: 0033:0x7f3f9b54a4c5
> Apr 10 13:01:13 drs1s006 kernel: [338194.882556] RSP: 002b:00007f3f3f3f50c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
> Apr 10 13:01:13 drs1s006 kernel: [338194.892712] RAX: ffffffffffffffda RBX: 00007f3f945fb9e0 RCX: 00007f3f9b54a4c5
> Apr 10 13:01:13 drs1s006 kernel: [338194.901419] RDX: 00007f3f3f3f50d0 RSI: 00007f3f3f3f50d0 RDI: 00007f3f300491e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.909732] RBP: 00007f3f3f3f5190 R08: 00007f3f300491e0 R09: 0000000000000028
> Apr 10 13:01:13 drs1s006 kernel: [338194.918291] R10: 00007f3f85052380 R11: 0000000000000246 R12: 0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.926582] R13: 00007f3f300491e0 R14: 00007f3f30000a30 R15: 00007f3f945fb800
> Apr 10 13:01:13 drs1s006 kernel: [338194.935371] Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 be 9c e1 d4 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f
> Apr 10 13:01:13 drs1s006 kernel: [338194.957500] RIP: __ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2] RSP: ffffb25924153ac8
> Apr 10 13:01:13 drs1s006 kernel: [338194.967146] ---[ end trace 3743be945c8eeed8 ]---
>
> Apr 10 12:59:47 drs1s006 kernel: [338109.416347] Call Trace:
> Apr 10 12:59:47 drs1s006 kernel: [338109.416358]  dump_stack+0x5c/0x85
> Apr 10 12:59:47 drs1s006 kernel: [338109.416363]  dump_header+0x6b/0x289
> Apr 10 12:59:47 drs1s006 kernel: [338109.416367]  ? apparmor_capable+0xa4/0xe0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416369]  oom_kill_process+0x228/0x470
> Apr 10 12:59:47 drs1s006 kernel: [338109.416372]  out_of_memory+0x2ab/0x4b0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416374]  __alloc_pages_slowpath+0x9f5/0xd80
> Apr 10 12:59:47 drs1s006 kernel: [338109.416376]  __alloc_pages_nodemask+0x236/0x250
> Apr 10 12:59:47 drs1s006 kernel: [338109.416378]  filemap_fault+0x206/0x650
> Apr 10 12:59:47 drs1s006 kernel: [338109.416382]  ? recalc_sigpending+0x17/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416384]  ? __set_task_blocked+0x38/0x90
> Apr 10 12:59:47 drs1s006 kernel: [338109.416385]  ? __set_current_blocked+0x3d/0x60
> Apr 10 12:59:47 drs1s006 kernel: [338109.416426]  ocfs2_fault+0x39/0xe0 [ocfs2]
> Apr 10 12:59:47 drs1s006 kernel: [338109.416431]  __do_fault+0x1f/0xb0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416433]  __handle_mm_fault+0xca6/0x1220
> Apr 10 12:59:47 drs1s006 kernel: [338109.416436]  handle_mm_fault+0xdc/0x210
> Apr 10 12:59:47 drs1s006 kernel: [338109.416438]  __do_page_fault+0x256/0x4e0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416442]  ? page_fault+0x2f/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416444]  page_fault+0x45/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416446] RIP: 2dc42c40:0x7fc528cc4401
> Apr 10 12:59:47 drs1s006 kernel: [338109.416447] RSP: 28009490:00007fc4eb3f28e0 EFLAGS: 7fc528cc4400
> Apr 10 12:59:47 drs1s006 kernel: [338109.416449] Mem-Info:
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475] active_anon:7327303 inactive_anon:201509 isolated_anon:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  active_file:24608843 inactive_file:24653281 isolated_file:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  unevictable:0 dirty:5 writeback:0 unstable:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  slab_reclaimable:3415816 slab_unreclaimable:1736671
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  mapped:210421 shmem:211438 pagetables:30887 bounce:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  free:3783897 free_pcp:100 free_cma:0
>
> Daniel Sobe
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-04-11 10:43   ` Larry Chen
@ 2018-04-11 11:17     ` Daniel Sobe
  2018-04-11 11:31       ` Larry Chen
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2018-04-11 11:17 UTC (permalink / raw)
  To: ocfs2-devel

Hi Larry,

That is what I was doing. The 2nd node, although "declared" in cluster.conf, does not exist yet, so everything was happening on one node only.

I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or somehow unshares the mount from the original namespace afterwards).
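
The assumed bind-mount step can be sketched roughly as follows. This only illustrates the guess above and is not a description of what LXC actually does: it builds a command line that uses unshare(1) from util-linux to create a new mount namespace and then runs "mount --bind" inside it. The function names bind_mount_argv and run_bind_mount, the "ls -l" follow-up command and the paths are hypothetical, and actually executing the command needs root.

```python
import shlex
import subprocess


def bind_mount_argv(source, target, command=("ls", "-l")):
    """Build an argv that bind-mounts `source` at `target` inside a new
    mount namespace and then runs `command` on `target`.

    Nothing is executed here; the list is only constructed.
    """
    inner = "mount --bind {src} {dst} && exec {cmd} {dst}".format(
        src=shlex.quote(source),
        dst=shlex.quote(target),
        cmd=" ".join(shlex.quote(part) for part in command),
    )
    # unshare --mount starts the given program in its own mount namespace,
    # so the bind mount is not visible in the original namespace.
    return ["unshare", "--mount", "sh", "-c", inner]


def run_bind_mount(source, target):
    """Run the command built above; requires root (CAP_SYS_ADMIN)."""
    return subprocess.run(bind_mount_argv(source, target), check=True)
```

Calling run_bind_mount("/original/mount/point", "/new/mount/point") as root would list the bind-mounted directory from inside the new namespace; the mount disappears again when that process exits.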

Regards,

Daniel

-----Original Message-----
From: Larry Chen [mailto:lchen at suse.com] 
Sent: Mittwoch, 11. April 2018 12:43
To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Daniel,
If you execute mkfs and mount that fs on only one node, and then share the mount to several namespaces, will the issue recur?

And could you please show us how you shared the mount to other namespaces?

Thanks
Larry

On 04/11/2018 05:45 PM, Daniel Sobe wrote:
>
> Hi,
>
> having used OCFS2 successfully for a while using Debian 8 with its 
> default kernel "3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1 
> (2018-01-08)", I'm now facing issues trying to accomplish the same 
> with newer kernels and Debian 9. Below are the problems that occur, 
> they seem to be the same although the kernel is different.
>
> One trace is from the stock kernel of Debian 9 (at that time), the
> other is from a very fresh kernel (4.16-rc6). In the latter case, the
> OOM killer was triggered "shortly" before the bug appeared - it may be
> related. The call trace is appended below.
>
> In both cases, only one machine was active. The cluster is configured
> for 2 machines, but the cluster is not even configured yet on the 2nd
> system. Only one OCFS2 file system was mounted, and the mount shared
> to several namespaces (using LXC). Although the mount was R/W, the
> users/containers just read from this file system.
>
> Please let me know what I can do to get rid of this issue. I can 
> provide more information about my use case if required.
>
> I already posted to ocfs2-users, only then I saw that it is now 
> recommended to post bugs on ocfs2-devel.
>
> Regards,
>
> Daniel
>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707568] ------------[ cut here ]------------
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707600] kernel BUG at /build/linux-YDazDa/linux-4.9.82/fs/ocfs2/dlmglue.c:825!
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707635] invalid opcode: 0000 [#1] SMP
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707654] Modules linked in: appletalk ax25 ipx p8023 p8022 psnap veth ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree nls_utf8 cifs sha256_ssse3 cmac md4 des_generic arc4 dns_resolver fscache iptable_filter bridge stp llc bonding fuse intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt iTCO_vendor_support kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nls_ascii nls_cp437 vfat fat intel_cstate intel_uncore intel_rapl_perf efi_pstore efivars pcspkr mgag200 ttm sg drm_kms_helper lpc_ich mfd_core drm i2c_algo_bit hpwdt hpilo ioatdma evdev dca shpchp wmi ipmi_si acpi_power_meter ipmi_msghandler pcc_cpufreq button drbd lru_cache libcrc32c efivarfs ip_tables x_tables autofs4 uas usb_storage
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708060]  ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache dm_mod sd_mod crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd xhci_pci uhci_hcd ehci_pci xhci_hcd ehci_hcd i2c_i801 i2c_smbus i40e tg3 hpsa usbcore ptp scsi_transport_sas usb_common pps_core libphy scsi_mod [last unloaded: configfs]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708231] CPU: 24 PID: 64700 Comm: perl Not tainted 4.9.0-6-amd64 #1 Debian 4.9.82-1+deb9u3
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708268] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708304] task: ffff990fda6ef100 task.stack: ffffb62f36464000
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708331] RIP: 0010:[<ffffffffc0b1180d>]  [<ffffffffc0b1180d>] __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708422] RSP: 0018:ffffb62f36467b38  EFLAGS: 00010046
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708447] RAX: 0000000000000292 RBX: ffff990fda6c5618 RCX: 0000000000000001
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708479] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff990fda6c5694
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708510] RBP: 0000000000000003 R08: 0000000000000101 R09: 0000000000000000
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708541] R10: 0000000000000038 R11: 000000000000007c R12: ffff990fda6c5694
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708572] R13: ffff991bb0f76000 R14: 0000000000000000 R15: ffffffffc0ba5080
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708604] FS:  0000000000000000(0000) GS:ffff991bbea80000(0063) knlGS:00000000f7462700
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708640] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708666] CR2: ffffffffff600000 CR3: 000000341a7b6000 CR4: 0000000000360670
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708697] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708728] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708759] Stack:
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708771]  ffffffffc0b12b45 0000000000000000 ffff99101a537300 ffff99101a51c4c8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708812]  ffff990fda6c5e00 ffffffffc0b02274 ffff99101a537180 ffff99101a4e04c8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708849]  0000000000000000 ffff99101a537300 dad51186f40d61bf ffff99101a4e04c8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708886] Call Trace:
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708919] [<ffffffffc0b12b45>] ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708964] [<ffffffffc0b02274>] ? ocfs2_dentry_attach_lock+0x2d4/0x430 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.709014] [<ffffffffc0b2b6f1>] ? ocfs2_lookup+0x1a1/0x2e0 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.709046] [<ffffffff8861eb56>] ? d_invalidate+0xb6/0x120
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.710690] [<ffffffff88611a79>] ? lookup_slow+0xa9/0x170
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.713068] [<ffffffff88612199>] ? walk_component+0x1f9/0x330
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.717146] [<ffffffff88612d62>] ? link_path_walk+0x1b2/0x670
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.718276] [<ffffffff88613326>] ? path_lookupat+0x86/0x120
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.720158] [<ffffffff88615d81>] ? filename_lookup+0xb1/0x180
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.721266] [<ffffffff886018da>] ? __check_object_size+0xfa/0x1d8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.722799] [<ffffffff8875c678>] ? strncpy_from_user+0x48/0x160
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.725064] [<ffffffff886159ba>] ? getname_flags+0x6a/0x1e0
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.729544] [<ffffffff8860a899>] ? vfs_fstatat+0x59/0xb0
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.732126] [<ffffffff8846ccc5>] ? sys32_stat64+0x25/0x60
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.733194] [<ffffffff884033e7>] ? syscall_trace_enter+0x117/0x2c0
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.734247] [<ffffffff88403d5c>] ? do_fast_syscall_32+0xac/0x180
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.737170] [<ffffffff88a12cdf>] ? entry_SYSENTER_compat+0x6f/0x7e
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.739819] Code: 89 c6 5b 5d 41 5c 41 5d e9 61 f7 ef c7 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.742738] RIP [<ffffffffc0b1180d>] __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.745017]  RSP <ffffb62f36467b38>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.753445] ---[ end trace 9a87ef237c626c21 ]---
>
> Apr 10 13:01:13 drs1s006 kernel: [338194.523354] ------------[ cut here ]------------
> Apr 10 13:01:13 drs1s006 kernel: [338194.523358] kernel BUG at /build/linux-sEiGsi/linux-4.16~rc6/fs/ocfs2/dlmglue.c:848!
> Apr 10 13:01:13 drs1s006 kernel: [338194.533415] invalid opcode: 0000 [#1] SMP NOPTI
> Apr 10 13:01:13 drs1s006 kernel: [338194.538568] Modules linked in: nfnetlink_log nfnetlink cmac arc4 md4 nls_utf8 cifs ccm dns_resolver fscache appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree btrfs zstd_decompress zstd_compress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter bridge stp llc joydev hid_generic usbhid hid fuse nls_ascii nls_cp437 vfat fat uas usb_storage kvm mgag200 ttm drm_kms_helper irqbypass drm efi_pstore crct10dif_pclmul crc32_pclmul pcspkr efivars evdev sg ghash_clmulni_intel ccp(+) hpilo i2c_algo_bit k10temp hpwdt rng_core sp5100_tco shpchp ipmi_si wmi ipmi_devintf ipmi_msghandler acpi_cpufreq button drbd lru_cache libcrc32c efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2
> Apr 10 13:01:13 drs1s006 kernel: [338194.613006]  crc32c_generic fscrypto ecb dm_mod ses sd_mod enclosure crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper xhci_pci ehci_pci smartpqi xhci_hcd ehci_hcd scsi_transport_sas scsi_mod tg3 usbcore i2c_piix4 usb_common i40e libphy
> Apr 10 13:01:13 drs1s006 kernel: [338194.637598] CPU: 24 PID: 53861 Comm: java Tainted: G         C       4.16.0-rc6-amd64 #1 Debian 4.16~rc6-1~exp1
> Apr 10 13:01:13 drs1s006 kernel: [338194.648582] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 12/12/2017
> Apr 10 13:01:13 drs1s006 kernel: [338194.659815] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.667246] RSP: 0018:ffffb25924153ac8 EFLAGS: 00010046
> Apr 10 13:01:13 drs1s006 kernel: [338194.673320] RAX: 0000000000000292 RBX: ffff9b1220245e18 RCX: 0000000000000001
> Apr 10 13:01:13 drs1s006 kernel: [338194.682963] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9b1220245e94
> Apr 10 13:01:13 drs1s006 kernel: [338194.692527] RBP: ffff9b1220245e94 R08: 0000000000000101 R09: 0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.700939] R10: ffffb25924153ab0 R11: 0000000000000073 R12: 0000000000000003
> Apr 10 13:01:13 drs1s006 kernel: [338194.709376] R13: ffff9b181850b000 R14: 0000000000000000 R15: ffffffffc1157900
> Apr 10 13:01:13 drs1s006 kernel: [338194.717571] FS:  00007f3f3f3f6700(0000) GS:ffff9af81fa00000(0000) knlGS:0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.726842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 10 13:01:13 drs1s006 kernel: [338194.735192] CR2: 00007f3f9aa9e330 CR3: 00000003483c0000 CR4: 00000000003406e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.747998] Call Trace:
> Apr 10 13:01:13 drs1s006 kernel: [338194.754363]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.763538]  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.771745]  ocfs2_lookup+0x233/0x2c0 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.778048]  lookup_slow+0xa9/0x170
> Apr 10 13:01:13 drs1s006 kernel: [338194.782898]  walk_component+0x1c4/0x470
> Apr 10 13:01:13 drs1s006 kernel: [338194.788000]  ? inode_permission+0xbe/0x180
> Apr 10 13:01:13 drs1s006 kernel: [338194.793238]  link_path_walk+0x2a6/0x510
> Apr 10 13:01:13 drs1s006 kernel: [338194.799117]  ? path_init+0x177/0x2f0
> Apr 10 13:01:13 drs1s006 kernel: [338194.804087]  path_lookupat+0x56/0x1f0
> Apr 10 13:01:13 drs1s006 kernel: [338194.808830]  ? page_cache_tree_insert+0xe0/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.814529]  filename_lookup+0xb6/0x190
> Apr 10 13:01:13 drs1s006 kernel: [338194.819516]  ? filemap_map_pages+0x228/0x340
> Apr 10 13:01:13 drs1s006 kernel: [338194.824993]  ? seccomp_run_filters+0x59/0xc0
> Apr 10 13:01:13 drs1s006 kernel: [338194.830396]  ? __check_object_size+0xa7/0x1a0
> Apr 10 13:01:13 drs1s006 kernel: [338194.836156]  ? strncpy_from_user+0x48/0x160
> Apr 10 13:01:13 drs1s006 kernel: [338194.841458]  ? getname_flags+0x6a/0x1e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.846457]  ? vfs_statx+0x73/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.851105]  vfs_statx+0x73/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.855426]  SYSC_newstat+0x39/0x70
> Apr 10 13:01:13 drs1s006 kernel: [338194.860227]  ? syscall_trace_enter+0x145/0x2e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.865909]  do_syscall_64+0x6c/0x130
> Apr 10 13:01:13 drs1s006 kernel: [338194.870977]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> Apr 10 13:01:13 drs1s006 kernel: [338194.877075] RIP: 0033:0x7f3f9b54a4c5
> Apr 10 13:01:13 drs1s006 kernel: [338194.882556] RSP: 002b:00007f3f3f3f50c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
> Apr 10 13:01:13 drs1s006 kernel: [338194.892712] RAX: ffffffffffffffda RBX: 00007f3f945fb9e0 RCX: 00007f3f9b54a4c5
> Apr 10 13:01:13 drs1s006 kernel: [338194.901419] RDX: 00007f3f3f3f50d0 RSI: 00007f3f3f3f50d0 RDI: 00007f3f300491e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.909732] RBP: 00007f3f3f3f5190 R08: 00007f3f300491e0 R09: 0000000000000028
> Apr 10 13:01:13 drs1s006 kernel: [338194.918291] R10: 00007f3f85052380 R11: 0000000000000246 R12: 0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.926582] R13: 00007f3f300491e0 R14: 00007f3f30000a30 R15: 00007f3f945fb800
> Apr 10 13:01:13 drs1s006 kernel: [338194.935371] Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 be 9c e1 d4 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f
> Apr 10 13:01:13 drs1s006 kernel: [338194.957500] RIP: __ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2] RSP: ffffb25924153ac8
> Apr 10 13:01:13 drs1s006 kernel: [338194.967146] ---[ end trace 3743be945c8eeed8 ]---
>
> Apr 10 12:59:47 drs1s006 kernel: [338109.416347] Call Trace:
> Apr 10 12:59:47 drs1s006 kernel: [338109.416358]  dump_stack+0x5c/0x85
> Apr 10 12:59:47 drs1s006 kernel: [338109.416363]  dump_header+0x6b/0x289
> Apr 10 12:59:47 drs1s006 kernel: [338109.416367]  ? apparmor_capable+0xa4/0xe0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416369]  oom_kill_process+0x228/0x470
> Apr 10 12:59:47 drs1s006 kernel: [338109.416372]  out_of_memory+0x2ab/0x4b0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416374]  __alloc_pages_slowpath+0x9f5/0xd80
> Apr 10 12:59:47 drs1s006 kernel: [338109.416376]  __alloc_pages_nodemask+0x236/0x250
> Apr 10 12:59:47 drs1s006 kernel: [338109.416378]  filemap_fault+0x206/0x650
> Apr 10 12:59:47 drs1s006 kernel: [338109.416382]  ? recalc_sigpending+0x17/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416384]  ? __set_task_blocked+0x38/0x90
> Apr 10 12:59:47 drs1s006 kernel: [338109.416385]  ? __set_current_blocked+0x3d/0x60
> Apr 10 12:59:47 drs1s006 kernel: [338109.416426]  ocfs2_fault+0x39/0xe0 [ocfs2]
> Apr 10 12:59:47 drs1s006 kernel: [338109.416431]  __do_fault+0x1f/0xb0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416433]  __handle_mm_fault+0xca6/0x1220
> Apr 10 12:59:47 drs1s006 kernel: [338109.416436]  handle_mm_fault+0xdc/0x210
> Apr 10 12:59:47 drs1s006 kernel: [338109.416438]  __do_page_fault+0x256/0x4e0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416442]  ? page_fault+0x2f/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416444]  page_fault+0x45/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416446] RIP: 2dc42c40:0x7fc528cc4401
> Apr 10 12:59:47 drs1s006 kernel: [338109.416447] RSP: 28009490:00007fc4eb3f28e0 EFLAGS: 7fc528cc4400
> Apr 10 12:59:47 drs1s006 kernel: [338109.416449] Mem-Info:
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475] active_anon:7327303 inactive_anon:201509 isolated_anon:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  active_file:24608843 inactive_file:24653281 isolated_file:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  unevictable:0 dirty:5 writeback:0 unstable:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  slab_reclaimable:3415816 slab_unreclaimable:1736671
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  mapped:210421 shmem:211438 pagetables:30887 bounce:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  free:3783897 free_pcp:100 free_cma:0
>
> Daniel Sobe
>
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__emea01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Fur&d=DwIFAw&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=C7gAd4uDxlAvTdc0vmU6X8CMk6L2iDY8-HD0qT6Fo7Y&m=cKAlLA8HjQPeauiHhI0SBUbTaFrM7MhEckm_EkKfVWA&s=3KkLhA6plILtb2r8aGfjNxA8s-TH8EdR7PFiJ2mFIok&e=
> ldefense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttp-3A__weibo.com_nxpsemicon
> ductors%26d%3DDwMFAw%26c%3DRoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE
> %26r%3DC7gAd4uDxlAvTdc0vmU6X8CMk6L2iDY8-HD0qT6Fo7Y%26m%3DQ30Z8J9hmQC7g
> xk63OxdHtyQv73UXHlOrqxod0srOwA%26s%3DstPkqaGv8HSZKZwJAkN6Z5EJc8L_rLarW
> z3ouN0rDLY%26e&data=02%7C01%7Cdaniel.sobe%40nxp.com%7C7cf319e6eda04cff
> 836708d59f990f4c%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C63659040
> 2107194368&sdata=KK8RelHNsHZRjzTMqiVs4mjZ2R1fYtTzWi7yQhoF1O4%3D&reserv
> ed=0=>_
> _
> The information contained in this message is confidential and may be 
> legally privileged. The message is intended solely for the 
> addressee(s). If you are not the intended recipient, you are hereby 
> notified that any use, dissemination, or reproduction is strictly 
> prohibited and may be unlawful. If you are not the intended recipient, 
> please contact the sender by return e-mail and destroy all copies of 
> the original message.
>
> Unless otherwise recorded in a written agreement, all sales 
> transactions by NXP Semiconductors are subject to our general terms 
> and conditions of commercial sale. These are published at:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__emea01.safelinks.protection.outlook.com_-3Furl-3Dwww.nxp.com-252Fpro&d=DwIFAw&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=C7gAd4uDxlAvTdc0vmU6X8CMk6L2iDY8-HD0qT6Fo7Y&m=cKAlLA8HjQPeauiHhI0SBUbTaFrM7MhEckm_EkKfVWA&s=sIwpUlFomPTGm82vMXP3sTRrw3L17ScyRIsAu9YVpsI&e=
> file%2Fterms%2Findex.html&data=02%7C01%7Cdaniel.sobe%40nxp.com%7C7cf31
> 9e6eda04cff836708d59f990f4c%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0
> %7C636590402107194368&sdata=r9f%2Bodi5ZxCtCYzqlE%2F26U%2FucR8i4RlCl%2B
> kGX8APD0Y%3D&reserved=0 
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__emea01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Fur&d=DwIFAw&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=C7gAd4uDxlAvTdc0vmU6X8CMk6L2iDY8-HD0qT6Fo7Y&m=cKAlLA8HjQPeauiHhI0SBUbTaFrM7MhEckm_EkKfVWA&s=3KkLhA6plILtb2r8aGfjNxA8s-TH8EdR7PFiJ2mFIok&e=
> ldefense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttp-3A__www.nxp.com_profile_
> terms_index.html%26d%3DDwMFAw%26c%3DRoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB6
> 5eapI_JnE%26r%3DC7gAd4uDxlAvTdc0vmU6X8CMk6L2iDY8-HD0qT6Fo7Y%26m%3DQ30Z
> 8J9hmQC7gxk63OxdHtyQv73UXHlOrqxod0srOwA%26s%3DwUJAX3H8pEbFPrJdGwnCGdtB
> 6q2Qir9GywAIsTDwFQw%26e&data=02%7C01%7Cdaniel.sobe%40nxp.com%7C7cf319e
> 6eda04cff836708d59f990f4c%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7
> C636590402107194368&sdata=gK%2BhVbvB6Q3sVglSEOUBXq3XEqNkewLOGDUU3%2BB%
> 2F3U0%3D&reserved=0=>__
>
>
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://urldefense.proofpoint.com/v2/url?u=https-3A__emea01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Foss&d=DwIFAw&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=C7gAd4uDxlAvTdc0vmU6X8CMk6L2iDY8-HD0qT6Fo7Y&m=cKAlLA8HjQPeauiHhI0SBUbTaFrM7MhEckm_EkKfVWA&s=SMTpSINi0gGZFO2H-D4TiRR7YMD_221Sh2WDotom_Y8&e=
> .oracle.com%2Fmailman%2Flistinfo%2Focfs2-devel&data=02%7C01%7Cdaniel.s
> obe%40nxp.com%7C7cf319e6eda04cff836708d59f990f4c%7C686ea1d3bc2b4c6fa92
> cd99c5c301635%7C0%7C0%7C636590402107194368&sdata=8jx8BNWXabhtnnQn6jWI%
> 2FwMiH6KOV4BUExVnyC24CNc%3D&reserved=0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-04-11 11:17     ` Daniel Sobe
@ 2018-04-11 11:31       ` Larry Chen
  2018-04-11 12:24         ` Daniel Sobe
  0 siblings, 1 reply; 32+ messages in thread
From: Larry Chen @ 2018-04-11 11:31 UTC (permalink / raw)
  To: ocfs2-devel



On 04/11/2018 07:17 PM, Daniel Sobe wrote:
> Hi Larry,
>
> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>
> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
I was thinking of the way a directory is shared between the host and a docker
container, like
    docker run -v /host/directory:/container/directory -other -options image_name command_to_run
That's different from yours.

How did you set up your lxc or container?

If you could show me the procedure, I'll try to reproduce it.

And by the way, if you get rid of lxc and just mount ocfs2 on several
different mount points of the local host, will the problem recur?
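
[Editorial sketch, not part of the original message.] One possible reading of
the test suggested above is to bind-mount the already-mounted OCFS2 directory
onto several local mount points within a single namespace. The source path is
taken from the LXC config posted later in this thread; the target paths, the
helper names and the DRY_RUN switch are assumptions made for illustration. By
default the script only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch: print (or, with DRY_RUN=0 and root privileges, execute)
# bind mounts of one OCFS2 directory onto several local mount points.
# Target paths below are hypothetical.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# bind_all SOURCE TARGET...
# Bind-mount SOURCE onto every TARGET, creating the target directory first.
bind_all() {
    src="$1"
    shift
    for target in "$@"; do
        run mkdir -p "$target"
        run mount --bind "$src" "$target"
    done
}

bind_all /storage/ocfs2/sw /mnt/sw1 /mnt/sw2 /mnt/sw3
```

Reading from the bind-mounted directories in parallel would then approximate
the container workload without LXC, user namespaces or cgroup limits.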

Regards,
Larry
> Regards,
>
> Daniel
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-04-11 11:31       ` Larry Chen
@ 2018-04-11 12:24         ` Daniel Sobe
  2018-04-12  3:17           ` Larry Chen
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2018-04-11 12:24 UTC (permalink / raw)
  To: ocfs2-devel

Hi Larry,

below is an example config file as I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see at the "lxc.mount.entry" line.

Meanwhile I'm testing whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success; I will report once anything happens.

Regards,

Daniel

----

# Distribution configuration
lxc.include = /usr/share/lxc/config/debian.common.conf
lxc.include = /usr/share/lxc/config/debian.userns.conf
lxc.arch = x86_64

# Container specific configuration
lxc.id_map = u 0 624288 65536
lxc.id_map = g 0 624288 65536

lxc.utsname = container1
lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs

lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = bridge1
lxc.network.name = eth0
lxc.network.veth.pair = aabbccddeeff
lxc.network.ipv4 = XX.XX.XX.XX/YY
lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ

lxc.cgroup.cpuset.cpus = 63-86

lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0

lxc.cgroup.memory.limit_in_bytes       = 240G
lxc.cgroup.memory.memsw.limit_in_bytes = 240G

lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf

----


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-04-11 12:24         ` Daniel Sobe
@ 2018-04-12  3:17           ` Larry Chen
  2018-04-12  7:45             ` Daniel Sobe
  0 siblings, 1 reply; 32+ messages in thread
From: Larry Chen @ 2018-04-12  3:17 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,

Thanks for your report.
I'll try to reproduce this bug as you did.

I'm afraid there may be a bug in the interaction between cgroups and ocfs2.

Thanks
Larry


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-04-12  3:17           ` Larry Chen
@ 2018-04-12  7:45             ` Daniel Sobe
       [not found]               ` <8c66f7cd-2de3-ad9e-8ec5-dc6ab934f16a@suse.com>
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2018-04-12  7:45 UTC (permalink / raw)
  To: ocfs2-devel

Hi Larry,

not sure if it helps: the issue wasn't there with Debian 8 and kernel 3.16, but that is a long history to search through. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.

Regards,

Daniel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-04-11  9:45 ` [Ocfs2-devel] OCFS2 BUG with 2 different kernels Daniel Sobe
  2018-04-11 10:43   ` Larry Chen
@ 2018-04-13  1:54   ` Changwei Ge
  2018-04-13  7:10     ` Daniel Sobe
  1 sibling, 1 reply; 32+ messages in thread
From: Changwei Ge @ 2018-04-13  1:54 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,

It's not easy to analyze your problem unless you can provide the *ocfs2_lock_res*, *dlm_lock_resource* and *dlm_ctxt* structures, dumped with the _crash_ tool.
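
[Editorial sketch, not part of the original message.] One way such dumps might
be collected is to feed crash a command file. The sketch below only generates
that file; it assumes a vmcore (or access to the live system) plus matching
debug symbols for the kernel and the ocfs2 modules are available. The helper
name, the file name and the address arguments are placeholders; the real
addresses have to be found in the dump itself.

```shell
#!/bin/sh
# Sketch: write a crash command file that dumps the three requested structures.
# The ADDR_* arguments are placeholders, not real addresses.

# write_crash_cmds OUTFILE LOCKRES_ADDR DLM_RES_ADDR DLM_CTXT_ADDR
write_crash_cmds() {
    out="$1"
    {
        echo "mod -s ocfs2"                  # load symbols of the ocfs2 module
        echo "mod -s ocfs2_dlm"              # load symbols of the ocfs2_dlm module
        echo "bt"                            # backtrace of the current context
        echo "struct ocfs2_lock_res $2"
        echo "struct dlm_lock_resource $3"
        echo "struct dlm_ctxt $4"
    } > "$out"
}

write_crash_cmds crash-cmds.txt ADDR_LOCKRES ADDR_DLM_RES ADDR_DLM_CTXT
cat crash-cmds.txt
# Afterwards, on a machine holding the dump (paths are placeholders):
#   crash -i crash-cmds.txt /path/to/vmlinux /path/to/vmcore
```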

Thanks,
Changwei

On 2018/4/11 17:46, Daniel Sobe wrote:
> Hi,
> 
> having used OCFS2 successfully for a while using Debian 8 with its default kernel "3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08)", I'm now facing issues trying to accomplish the same with newer kernels and Debian 9. Below are the problems that occur, they seem to be the same although the kernel is different.
>
> One trace is from the stock kernel of Debian 9 (at that time), the other is from a very fresh kernel (4.16-rc6). In the latter case, the OOM killer was triggered "shortly" before the bug appeared - it maybe related. The call trace is appended below.
>
> In both cases, only one machine was active. The cluster is configured for 2 machines, but the cluster is not even configured yet at the 2nd system. Only one OCFS2 file system was mounted, and the mount shared to several namespaces (using LXC). Although the mount was R/W, the users/containers just read from this file system.
> 
> Please let me know what I can do to get rid of this issue. I can provide more information about my use case if required.
> 
> I already posted to ocfs2-users, only then I saw that it is now recommended to post bugs on ocfs2-devel.
> 
> Regards,
> 
> Daniel
> 
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707568] ------------[ cut here ]------------
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707600] kernel BUG at /build/linux-YDazDa/linux-4.9.82/fs/ocfs2/dlmglue.c:825!
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707635] invalid opcode: 0000 [#1] SMP
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707654] Modules linked in: appletalk ax25 ipx p8023 p8022 psnap veth ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree nls_utf8 cifs sha256_ssse3 cmac md4 des_generic arc4 dns_resolver fscache iptable_filter bridge stp llc bonding fuse intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt iTCO_vendor_support kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nls_ascii nls_cp437 vfat fat intel_cstate intel_uncore intel_rapl_perf efi_pstore efivars pcspkr mgag200 ttm sg drm_kms_helper lpc_ich mfd_core drm i2c_algo_bit hpwdt hpilo ioatdma evdev dca shpchp wmi ipmi_si acpi_power_meter ipmi_msghandler pcc_cpufreq button drbd lru_cache libcrc32c efivarfs ip_tables x_tables autofs4 uas usb_storage
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708060]  ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache dm_mod sd_mod crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd xhci_pci uhci_hcd ehci_pci xhci_hcd ehci_hcd i2c_i801 i2c_smbus i40e tg3 hpsa usbcore ptp scsi_transport_sas usb_common pps_core libphy scsi_mod [last unloaded: configfs]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708231] CPU: 24 PID: 64700 Comm: perl Not tainted 4.9.0-6-amd64 #1 Debian 4.9.82-1+deb9u3
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708268] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708304] task: ffff990fda6ef100 task.stack: ffffb62f36464000
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708331] RIP: 0010:[<ffffffffc0b1180d>]  [<ffffffffc0b1180d>] __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708422] RSP: 0018:ffffb62f36467b38  EFLAGS: 00010046
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708447] RAX: 0000000000000292 RBX: ffff990fda6c5618 RCX: 0000000000000001
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708479] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff990fda6c5694
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708510] RBP: 0000000000000003 R08: 0000000000000101 R09: 0000000000000000
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708541] R10: 0000000000000038 R11: 000000000000007c R12: ffff990fda6c5694
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708572] R13: ffff991bb0f76000 R14: 0000000000000000 R15: ffffffffc0ba5080
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708604] FS:  0000000000000000(0000) GS:ffff991bbea80000(0063) knlGS:00000000f7462700
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708640] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708666] CR2: ffffffffff600000 CR3: 000000341a7b6000 CR4: 0000000000360670
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708697] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708728] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708759] Stack:
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708771]  ffffffffc0b12b45 0000000000000000 ffff99101a537300 ffff99101a51c4c8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708812]  ffff990fda6c5e00 ffffffffc0b02274 ffff99101a537180 ffff99101a4e04c8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708849]  0000000000000000 ffff99101a537300 dad51186f40d61bf ffff99101a4e04c8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708886] Call Trace:
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708919]  [<ffffffffc0b12b45>] ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708964]  [<ffffffffc0b02274>] ? ocfs2_dentry_attach_lock+0x2d4/0x430 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.709014]  [<ffffffffc0b2b6f1>] ? ocfs2_lookup+0x1a1/0x2e0 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.709046]  [<ffffffff8861eb56>] ? d_invalidate+0xb6/0x120
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.710690]  [<ffffffff88611a79>] ? lookup_slow+0xa9/0x170
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.713068]  [<ffffffff88612199>] ? walk_component+0x1f9/0x330
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.717146]  [<ffffffff88612d62>] ? link_path_walk+0x1b2/0x670
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.718276]  [<ffffffff88613326>] ? path_lookupat+0x86/0x120
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.720158]  [<ffffffff88615d81>] ? filename_lookup+0xb1/0x180
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.721266]  [<ffffffff886018da>] ? __check_object_size+0xfa/0x1d8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.722799]  [<ffffffff8875c678>] ? strncpy_from_user+0x48/0x160
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.725064]  [<ffffffff886159ba>] ? getname_flags+0x6a/0x1e0
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.729544]  [<ffffffff8860a899>] ? vfs_fstatat+0x59/0xb0
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.732126]  [<ffffffff8846ccc5>] ? sys32_stat64+0x25/0x60
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.733194]  [<ffffffff884033e7>] ? syscall_trace_enter+0x117/0x2c0
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.734247]  [<ffffffff88403d5c>] ? do_fast_syscall_32+0xac/0x180
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.737170]  [<ffffffff88a12cdf>] ? entry_SYSENTER_compat+0x6f/0x7e
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.739819] Code: 89 c6 5b 5d 41 5c 41 5d e9 61 f7 ef c7 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.742738] RIP  [<ffffffffc0b1180d>] __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.745017]  RSP <ffffb62f36467b38>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.753445] ---[ end trace 9a87ef237c626c21 ]---
> 
> Apr 10 13:01:13 drs1s006 kernel: [338194.523354] ------------[ cut here ]------------
> Apr 10 13:01:13 drs1s006 kernel: [338194.523358] kernel BUG at /build/linux-sEiGsi/linux-4.16~rc6/fs/ocfs2/dlmglue.c:848!
> Apr 10 13:01:13 drs1s006 kernel: [338194.533415] invalid opcode: 0000 [#1] SMP NOPTI
> Apr 10 13:01:13 drs1s006 kernel: [338194.538568] Modules linked in: nfnetlink_log nfnetlink cmac arc4 md4 nls_utf8 cifs ccm dns_resolver fscache appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree btrfs zstd_decompress zstd_compress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter bridge stp llc joydev hid_generic usbhid hid fuse nls_ascii nls_cp437 vfat fat uas usb_storage kvm mgag200 ttm drm_kms_helper irqbypass drm efi_pstore crct10dif_pclmul crc32_pclmul pcspkr efivars evdev sg ghash_clmulni_intel ccp(+) hpilo i2c_algo_bit k10temp hpwdt rng_core sp5100_tco shpchp ipmi_si wmi ipmi_devintf ipmi_msghandler acpi_cpufreq button drbd lru_cache libcrc32c efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2
> Apr 10 13:01:13 drs1s006 kernel: [338194.613006]  crc32c_generic fscrypto ecb dm_mod ses sd_mod enclosure crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper xhci_pci ehci_pci smartpqi xhci_hcd ehci_hcd scsi_transport_sas scsi_mod tg3 usbcore i2c_piix4 usb_common i40e libphy
> Apr 10 13:01:13 drs1s006 kernel: [338194.637598] CPU: 24 PID: 53861 Comm: java Tainted: G         C       4.16.0-rc6-amd64 #1 Debian 4.16~rc6-1~exp1
> Apr 10 13:01:13 drs1s006 kernel: [338194.648582] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 12/12/2017
> Apr 10 13:01:13 drs1s006 kernel: [338194.659815] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.667246] RSP: 0018:ffffb25924153ac8 EFLAGS: 00010046
> Apr 10 13:01:13 drs1s006 kernel: [338194.673320] RAX: 0000000000000292 RBX: ffff9b1220245e18 RCX: 0000000000000001
> Apr 10 13:01:13 drs1s006 kernel: [338194.682963] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9b1220245e94
> Apr 10 13:01:13 drs1s006 kernel: [338194.692527] RBP: ffff9b1220245e94 R08: 0000000000000101 R09: 0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.700939] R10: ffffb25924153ab0 R11: 0000000000000073 R12: 0000000000000003
> Apr 10 13:01:13 drs1s006 kernel: [338194.709376] R13: ffff9b181850b000 R14: 0000000000000000 R15: ffffffffc1157900
> Apr 10 13:01:13 drs1s006 kernel: [338194.717571] FS:  00007f3f3f3f6700(0000) GS:ffff9af81fa00000(0000) knlGS:0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.726842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 10 13:01:13 drs1s006 kernel: [338194.735192] CR2: 00007f3f9aa9e330 CR3: 00000003483c0000 CR4: 00000000003406e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.747998] Call Trace:
> Apr 10 13:01:13 drs1s006 kernel: [338194.754363]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.763538]  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.771745]  ocfs2_lookup+0x233/0x2c0 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.778048]  lookup_slow+0xa9/0x170
> Apr 10 13:01:13 drs1s006 kernel: [338194.782898]  walk_component+0x1c4/0x470
> Apr 10 13:01:13 drs1s006 kernel: [338194.788000]  ? inode_permission+0xbe/0x180
> Apr 10 13:01:13 drs1s006 kernel: [338194.793238]  link_path_walk+0x2a6/0x510
> Apr 10 13:01:13 drs1s006 kernel: [338194.799117]  ? path_init+0x177/0x2f0
> Apr 10 13:01:13 drs1s006 kernel: [338194.804087]  path_lookupat+0x56/0x1f0
> Apr 10 13:01:13 drs1s006 kernel: [338194.808830]  ? page_cache_tree_insert+0xe0/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.814529]  filename_lookup+0xb6/0x190
> Apr 10 13:01:13 drs1s006 kernel: [338194.819516]  ? filemap_map_pages+0x228/0x340
> Apr 10 13:01:13 drs1s006 kernel: [338194.824993]  ? seccomp_run_filters+0x59/0xc0
> Apr 10 13:01:13 drs1s006 kernel: [338194.830396]  ? __check_object_size+0xa7/0x1a0
> Apr 10 13:01:13 drs1s006 kernel: [338194.836156]  ? strncpy_from_user+0x48/0x160
> Apr 10 13:01:13 drs1s006 kernel: [338194.841458]  ? getname_flags+0x6a/0x1e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.846457]  ? vfs_statx+0x73/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.851105]  vfs_statx+0x73/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.855426]  SYSC_newstat+0x39/0x70
> Apr 10 13:01:13 drs1s006 kernel: [338194.860227]  ? syscall_trace_enter+0x145/0x2e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.865909]  do_syscall_64+0x6c/0x130
> Apr 10 13:01:13 drs1s006 kernel: [338194.870977]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> Apr 10 13:01:13 drs1s006 kernel: [338194.877075] RIP: 0033:0x7f3f9b54a4c5
> Apr 10 13:01:13 drs1s006 kernel: [338194.882556] RSP: 002b:00007f3f3f3f50c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
> Apr 10 13:01:13 drs1s006 kernel: [338194.892712] RAX: ffffffffffffffda RBX: 00007f3f945fb9e0 RCX: 00007f3f9b54a4c5
> Apr 10 13:01:13 drs1s006 kernel: [338194.901419] RDX: 00007f3f3f3f50d0 RSI: 00007f3f3f3f50d0 RDI: 00007f3f300491e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.909732] RBP: 00007f3f3f3f5190 R08: 00007f3f300491e0 R09: 0000000000000028
> Apr 10 13:01:13 drs1s006 kernel: [338194.918291] R10: 00007f3f85052380 R11: 0000000000000246 R12: 0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.926582] R13: 00007f3f300491e0 R14: 00007f3f30000a30 R15: 00007f3f945fb800
> Apr 10 13:01:13 drs1s006 kernel: [338194.935371] Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 be 9c e1 d4 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f
> Apr 10 13:01:13 drs1s006 kernel: [338194.957500] RIP: __ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2] RSP: ffffb25924153ac8
> Apr 10 13:01:13 drs1s006 kernel: [338194.967146] ---[ end trace 3743be945c8eeed8 ]---
> 
> Apr 10 12:59:47 drs1s006 kernel: [338109.416347] Call Trace:
> Apr 10 12:59:47 drs1s006 kernel: [338109.416358]  dump_stack+0x5c/0x85
> Apr 10 12:59:47 drs1s006 kernel: [338109.416363]  dump_header+0x6b/0x289
> Apr 10 12:59:47 drs1s006 kernel: [338109.416367]  ? apparmor_capable+0xa4/0xe0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416369]  oom_kill_process+0x228/0x470
> Apr 10 12:59:47 drs1s006 kernel: [338109.416372]  out_of_memory+0x2ab/0x4b0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416374]  __alloc_pages_slowpath+0x9f5/0xd80
> Apr 10 12:59:47 drs1s006 kernel: [338109.416376]  __alloc_pages_nodemask+0x236/0x250
> Apr 10 12:59:47 drs1s006 kernel: [338109.416378]  filemap_fault+0x206/0x650
> Apr 10 12:59:47 drs1s006 kernel: [338109.416382]  ? recalc_sigpending+0x17/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416384]  ? __set_task_blocked+0x38/0x90
> Apr 10 12:59:47 drs1s006 kernel: [338109.416385]  ? __set_current_blocked+0x3d/0x60
> Apr 10 12:59:47 drs1s006 kernel: [338109.416426]  ocfs2_fault+0x39/0xe0 [ocfs2]
> Apr 10 12:59:47 drs1s006 kernel: [338109.416431]  __do_fault+0x1f/0xb0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416433]  __handle_mm_fault+0xca6/0x1220
> Apr 10 12:59:47 drs1s006 kernel: [338109.416436]  handle_mm_fault+0xdc/0x210
> Apr 10 12:59:47 drs1s006 kernel: [338109.416438]  __do_page_fault+0x256/0x4e0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416442]  ? page_fault+0x2f/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416444]  page_fault+0x45/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416446] RIP: 2dc42c40:0x7fc528cc4401
> Apr 10 12:59:47 drs1s006 kernel: [338109.416447] RSP: 28009490:00007fc4eb3f28e0 EFLAGS: 7fc528cc4400
> Apr 10 12:59:47 drs1s006 kernel: [338109.416449] Mem-Info:
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475] active_anon:7327303 inactive_anon:201509 isolated_anon:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  active_file:24608843 inactive_file:24653281 isolated_file:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  unevictable:0 dirty:5 writeback:0 unstable:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  slab_reclaimable:3415816 slab_unreclaimable:1736671
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  mapped:210421 shmem:211438 pagetables:30887 bounce:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  free:3783897 free_pcp:100 free_cma:0
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-04-13  1:54   ` [Ocfs2-devel] " Changwei Ge
@ 2018-04-13  7:10     ` Daniel Sobe
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Sobe @ 2018-04-13  7:10 UTC (permalink / raw)
  To: ocfs2-devel

Hi Changwei,

I will try to reproduce the crash again as soon as I understand how to create the reports you are asking for.

I have now installed the "crash" tool for Debian Stretch (7.1.7-1). I looked at the man page and tried to start the tool, but I have to admit that I did not succeed.

Can you give me a pointer on how the tool is used and what kind of information you are asking for? Are these global symbols? Do they point to a primitive data type or are they structs? Are they single symbols or lists/arrays/...?

Regards,

Daniel
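
For reference, a minimal sketch (not part of the original exchange) of how candidate addresses can be pulled out of the register dump of an oops. In the kernel source the three names requested are struct types, and crash's "struct" command takes a type name plus an address, so the register values are a natural starting point. The names parse_registers, kernel_pointers and SAMPLE_OOPS are hypothetical, the pointer test is only a heuristic, and nothing in this thread says which register (if any) holds the lock resource; that would have to be confirmed by disassembling the faulting function.

```python
import re

# "NAME: value" pairs as printed in an x86-64 oops register dump,
# e.g. "RBX: ffff9b1220245e18" or "CR2: 00007f3f9aa9e330".
REGISTER_RE = re.compile(r"\b(R[A-Z0-9]{1,2}|CR\d): ([0-9a-f]{16})\b")

# Register lines copied from the 4.16-rc6 report in this thread.
SAMPLE_OOPS = """\
Apr 10 13:01:13 drs1s006 kernel: [338194.673320] RAX: 0000000000000292 RBX: ffff9b1220245e18 RCX: 0000000000000001
Apr 10 13:01:13 drs1s006 kernel: [338194.682963] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9b1220245e94
Apr 10 13:01:13 drs1s006 kernel: [338194.692527] RBP: ffff9b1220245e94 R08: 0000000000000101 R09: 0000000000000000
Apr 10 13:01:13 drs1s006 kernel: [338194.700939] R10: ffffb25924153ab0 R11: 0000000000000073 R12: 0000000000000003
Apr 10 13:01:13 drs1s006 kernel: [338194.709376] R13: ffff9b181850b000 R14: 0000000000000000 R15: ffffffffc1157900
"""


def parse_registers(oops_text):
    """Return {register name: integer value}, keeping the first value seen."""
    registers = {}
    for line in oops_text.splitlines():
        for name, value in REGISTER_RE.findall(line):
            registers.setdefault(name, int(value, 16))
    return registers


def kernel_pointers(registers):
    """Heuristic (an assumption): keep values whose top 16 bits are all set,
    which is where x86-64 kernel addresses live."""
    return {name: value for name, value in registers.items()
            if value >> 48 == 0xFFFF}


if __name__ == "__main__":
    candidates = kernel_pointers(parse_registers(SAMPLE_OOPS))
    for name, value in sorted(candidates.items()):
        print(f"{name}: {value:#018x}")
```

Each address printed this way is only a candidate; it becomes useful once the disassembly shows which register the code dereferences as the lock resource.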


-----Original Message-----
From: Changwei Ge [mailto:ge.changwei at h3c.com] 
Sent: Freitag, 13. April 2018 03:54
To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Daniel,

It's not easy to analyze your problem unless you can provide *ocfs2_lock_res*, *dlm_lock_resource* and *dlm_ctxt* through the _crash_ tool.

Thanks,
Changwei

On 2018/4/11 17:46, Daniel Sobe wrote:
> Hi,
> 
> having used OCFS2 successfully for a while using Debian 8 with its default kernel "3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08)", I'm now facing issues trying to accomplish the same with newer kernels and Debian 9. Below are the problems that occur; they seem to be the same although the kernel is different.
> 
> One trace is from the stock kernel of Debian 9 (at that time), the other is from a very fresh kernel (4.16-rc6). In the latter case, the OOM killer was triggered "shortly" before the bug appeared - it may be related. The call trace is appended below.
> 
> In both cases, only one machine was active. The cluster is configured for 2 machines, but the cluster is not even configured yet on the 2nd system. Only one OCFS2 file system was mounted, and the mount was shared to several namespaces (using LXC). Although the mount was R/W, the users/containers only read from this file system.
> 
> Please let me know what I can do to get rid of this issue. I can provide more information about my use case if required.
> 
> I already posted to ocfs2-users, only then I saw that it is now recommended to post bugs on ocfs2-devel.
> 
> Regards,
> 
> Daniel
> 
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707568] ------------[ cut here ]------------
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707600] kernel BUG at /build/linux-YDazDa/linux-4.9.82/fs/ocfs2/dlmglue.c:825!
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707635] invalid opcode: 0000 [#1] SMP
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.707654] Modules linked in: appletalk ax25 ipx p8023 p8022 psnap veth ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree nls_utf8 cifs sha256_ssse3 cmac md4 des_generic arc4 dns_resolver fscache iptable_filter bridge stp llc bonding fuse intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt iTCO_vendor_support kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nls_ascii nls_cp437 vfat fat intel_cstate intel_uncore intel_rapl_perf efi_pstore efivars pcspkr mgag200 ttm sg drm_kms_helper lpc_ich mfd_core drm i2c_algo_bit hpwdt hpilo ioatdma evdev dca shpchp wmi ipmi_si acpi_power_meter ipmi_msghandler pcc_cpufreq button drbd lru_cache libcrc32c efivarfs ip_tables x_tables autofs4 uas usb_storage
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708060]  ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache dm_mod sd_mod crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd xhci_pci uhci_hcd ehci_pci xhci_hcd ehci_hcd i2c_i801 i2c_smbus i40e tg3 hpsa usbcore ptp scsi_transport_sas usb_common pps_core libphy scsi_mod [last unloaded: configfs]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708231] CPU: 24 PID: 64700 Comm: perl Not tainted 4.9.0-6-amd64 #1 Debian 4.9.82-1+deb9u3
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708268] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708304] task: ffff990fda6ef100 task.stack: ffffb62f36464000
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708331] RIP: 0010:[<ffffffffc0b1180d>]  [<ffffffffc0b1180d>] __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708422] RSP: 0018:ffffb62f36467b38  EFLAGS: 00010046
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708447] RAX: 0000000000000292 RBX: ffff990fda6c5618 RCX: 0000000000000001
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708479] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff990fda6c5694
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708510] RBP: 0000000000000003 R08: 0000000000000101 R09: 0000000000000000
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708541] R10: 0000000000000038 R11: 000000000000007c R12: ffff990fda6c5694
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708572] R13: ffff991bb0f76000 R14: 0000000000000000 R15: ffffffffc0ba5080
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708604] FS:  0000000000000000(0000) GS:ffff991bbea80000(0063) knlGS:00000000f7462700
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708640] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708666] CR2: ffffffffff600000 CR3: 000000341a7b6000 CR4: 0000000000360670
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708697] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708728] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708759] Stack:
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708771]  ffffffffc0b12b45 0000000000000000 ffff99101a537300 ffff99101a51c4c8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708812]  ffff990fda6c5e00 ffffffffc0b02274 ffff99101a537180 ffff99101a4e04c8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708849]  0000000000000000 ffff99101a537300 dad51186f40d61bf ffff99101a4e04c8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708886] Call Trace:
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708919]  [<ffffffffc0b12b45>] ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.708964]  [<ffffffffc0b02274>] ? ocfs2_dentry_attach_lock+0x2d4/0x430 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.709014]  [<ffffffffc0b2b6f1>] ? ocfs2_lookup+0x1a1/0x2e0 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.709046]  [<ffffffff8861eb56>] ? d_invalidate+0xb6/0x120
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.710690]  [<ffffffff88611a79>] ? lookup_slow+0xa9/0x170
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.713068]  [<ffffffff88612199>] ? walk_component+0x1f9/0x330
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.717146]  [<ffffffff88612d62>] ? link_path_walk+0x1b2/0x670
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.718276]  [<ffffffff88613326>] ? path_lookupat+0x86/0x120
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.720158]  [<ffffffff88615d81>] ? filename_lookup+0xb1/0x180
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.721266]  [<ffffffff886018da>] ? __check_object_size+0xfa/0x1d8
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.722799]  [<ffffffff8875c678>] ? strncpy_from_user+0x48/0x160
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.725064]  [<ffffffff886159ba>] ? getname_flags+0x6a/0x1e0
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.729544]  [<ffffffff8860a899>] ? vfs_fstatat+0x59/0xb0
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.732126]  [<ffffffff8846ccc5>] ? sys32_stat64+0x25/0x60
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.733194]  [<ffffffff884033e7>] ? syscall_trace_enter+0x117/0x2c0
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.734247]  [<ffffffff88403d5c>] ? do_fast_syscall_32+0xac/0x180
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.737170]  [<ffffffff88a12cdf>] ? entry_SYSENTER_compat+0x6f/0x7e
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.739819] Code: 89 c6 5b 5d 41 5c 41 5d e9 61 f7 ef c7 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.742738] RIP  [<ffffffffc0b1180d>] __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.745017]  RSP <ffffb62f36467b38>
> Mar 22 19:26:55 drs1s005 kernel: [ 7545.753445] ---[ end trace 9a87ef237c626c21 ]---
> 
> Apr 10 13:01:13 drs1s006 kernel: [338194.523354] ------------[ cut here ]------------
> Apr 10 13:01:13 drs1s006 kernel: [338194.523358] kernel BUG at /build/linux-sEiGsi/linux-4.16~rc6/fs/ocfs2/dlmglue.c:848!
> Apr 10 13:01:13 drs1s006 kernel: [338194.533415] invalid opcode: 0000 [#1] SMP NOPTI
> Apr 10 13:01:13 drs1s006 kernel: [338194.538568] Modules linked in: nfnetlink_log nfnetlink cmac arc4 md4 nls_utf8 cifs ccm dns_resolver fscache appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree btrfs zstd_decompress zstd_compress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter bridge stp llc joydev hid_generic usbhid hid fuse nls_ascii nls_cp437 vfat fat uas usb_storage kvm mgag200 ttm drm_kms_helper irqbypass drm efi_pstore crct10dif_pclmul crc32_pclmul pcspkr efivars evdev sg ghash_clmulni_intel ccp(+) hpilo i2c_algo_bit k10temp hpwdt rng_core sp5100_tco shpchp ipmi_si wmi ipmi_devintf ipmi_msghandler acpi_cpufreq button drbd lru_cache libcrc32c efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2
> Apr 10 13:01:13 drs1s006 kernel: [338194.613006]  crc32c_generic fscrypto ecb dm_mod ses sd_mod enclosure crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper xhci_pci ehci_pci smartpqi xhci_hcd ehci_hcd scsi_transport_sas scsi_mod tg3 usbcore i2c_piix4 usb_common i40e libphy
> Apr 10 13:01:13 drs1s006 kernel: [338194.637598] CPU: 24 PID: 53861 Comm: java Tainted: G         C       4.16.0-rc6-amd64 #1 Debian 4.16~rc6-1~exp1
> Apr 10 13:01:13 drs1s006 kernel: [338194.648582] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 12/12/2017
> Apr 10 13:01:13 drs1s006 kernel: [338194.659815] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.667246] RSP: 0018:ffffb25924153ac8 EFLAGS: 00010046
> Apr 10 13:01:13 drs1s006 kernel: [338194.673320] RAX: 0000000000000292 RBX: ffff9b1220245e18 RCX: 0000000000000001
> Apr 10 13:01:13 drs1s006 kernel: [338194.682963] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9b1220245e94
> Apr 10 13:01:13 drs1s006 kernel: [338194.692527] RBP: ffff9b1220245e94 R08: 0000000000000101 R09: 0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.700939] R10: ffffb25924153ab0 R11: 0000000000000073 R12: 0000000000000003
> Apr 10 13:01:13 drs1s006 kernel: [338194.709376] R13: ffff9b181850b000 R14: 0000000000000000 R15: ffffffffc1157900
> Apr 10 13:01:13 drs1s006 kernel: [338194.717571] FS:  00007f3f3f3f6700(0000) GS:ffff9af81fa00000(0000) knlGS:0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.726842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 10 13:01:13 drs1s006 kernel: [338194.735192] CR2: 00007f3f9aa9e330 CR3: 00000003483c0000 CR4: 00000000003406e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.747998] Call Trace:
> Apr 10 13:01:13 drs1s006 kernel: [338194.754363]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.763538]  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.771745]  ocfs2_lookup+0x233/0x2c0 [ocfs2]
> Apr 10 13:01:13 drs1s006 kernel: [338194.778048]  lookup_slow+0xa9/0x170
> Apr 10 13:01:13 drs1s006 kernel: [338194.782898]  walk_component+0x1c4/0x470
> Apr 10 13:01:13 drs1s006 kernel: [338194.788000]  ? inode_permission+0xbe/0x180
> Apr 10 13:01:13 drs1s006 kernel: [338194.793238]  link_path_walk+0x2a6/0x510
> Apr 10 13:01:13 drs1s006 kernel: [338194.799117]  ? path_init+0x177/0x2f0
> Apr 10 13:01:13 drs1s006 kernel: [338194.804087]  path_lookupat+0x56/0x1f0
> Apr 10 13:01:13 drs1s006 kernel: [338194.808830]  ? page_cache_tree_insert+0xe0/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.814529]  filename_lookup+0xb6/0x190
> Apr 10 13:01:13 drs1s006 kernel: [338194.819516]  ? filemap_map_pages+0x228/0x340
> Apr 10 13:01:13 drs1s006 kernel: [338194.824993]  ? seccomp_run_filters+0x59/0xc0
> Apr 10 13:01:13 drs1s006 kernel: [338194.830396]  ? __check_object_size+0xa7/0x1a0
> Apr 10 13:01:13 drs1s006 kernel: [338194.836156]  ? strncpy_from_user+0x48/0x160
> Apr 10 13:01:13 drs1s006 kernel: [338194.841458]  ? getname_flags+0x6a/0x1e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.846457]  ? vfs_statx+0x73/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.851105]  vfs_statx+0x73/0xe0
> Apr 10 13:01:13 drs1s006 kernel: [338194.855426]  SYSC_newstat+0x39/0x70
> Apr 10 13:01:13 drs1s006 kernel: [338194.860227]  ? syscall_trace_enter+0x145/0x2e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.865909]  do_syscall_64+0x6c/0x130
> Apr 10 13:01:13 drs1s006 kernel: [338194.870977]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> Apr 10 13:01:13 drs1s006 kernel: [338194.877075] RIP: 0033:0x7f3f9b54a4c5
> Apr 10 13:01:13 drs1s006 kernel: [338194.882556] RSP: 002b:00007f3f3f3f50c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
> Apr 10 13:01:13 drs1s006 kernel: [338194.892712] RAX: ffffffffffffffda RBX: 00007f3f945fb9e0 RCX: 00007f3f9b54a4c5
> Apr 10 13:01:13 drs1s006 kernel: [338194.901419] RDX: 00007f3f3f3f50d0 RSI: 00007f3f3f3f50d0 RDI: 00007f3f300491e0
> Apr 10 13:01:13 drs1s006 kernel: [338194.909732] RBP: 00007f3f3f3f5190 R08: 00007f3f300491e0 R09: 0000000000000028
> Apr 10 13:01:13 drs1s006 kernel: [338194.918291] R10: 00007f3f85052380 R11: 0000000000000246 R12: 0000000000000000
> Apr 10 13:01:13 drs1s006 kernel: [338194.926582] R13: 00007f3f300491e0 R14: 00007f3f30000a30 R15: 00007f3f945fb800
> Apr 10 13:01:13 drs1s006 kernel: [338194.935371] Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 be 9c e1 d4 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f
> Apr 10 13:01:13 drs1s006 kernel: [338194.957500] RIP: __ocfs2_cluster_unlock.isra.36+0x9c/0xb0 [ocfs2] RSP: ffffb25924153ac8
> Apr 10 13:01:13 drs1s006 kernel: [338194.967146] ---[ end trace 3743be945c8eeed8 ]---
> 
> Apr 10 12:59:47 drs1s006 kernel: [338109.416347] Call Trace:
> Apr 10 12:59:47 drs1s006 kernel: [338109.416358]  dump_stack+0x5c/0x85
> Apr 10 12:59:47 drs1s006 kernel: [338109.416363]  dump_header+0x6b/0x289
> Apr 10 12:59:47 drs1s006 kernel: [338109.416367]  ? apparmor_capable+0xa4/0xe0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416369]  oom_kill_process+0x228/0x470
> Apr 10 12:59:47 drs1s006 kernel: [338109.416372]  out_of_memory+0x2ab/0x4b0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416374]  __alloc_pages_slowpath+0x9f5/0xd80
> Apr 10 12:59:47 drs1s006 kernel: [338109.416376]  __alloc_pages_nodemask+0x236/0x250
> Apr 10 12:59:47 drs1s006 kernel: [338109.416378]  filemap_fault+0x206/0x650
> Apr 10 12:59:47 drs1s006 kernel: [338109.416382]  ? recalc_sigpending+0x17/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416384]  ? __set_task_blocked+0x38/0x90
> Apr 10 12:59:47 drs1s006 kernel: [338109.416385]  ? __set_current_blocked+0x3d/0x60
> Apr 10 12:59:47 drs1s006 kernel: [338109.416426]  ocfs2_fault+0x39/0xe0 [ocfs2]
> Apr 10 12:59:47 drs1s006 kernel: [338109.416431]  __do_fault+0x1f/0xb0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416433]  __handle_mm_fault+0xca6/0x1220
> Apr 10 12:59:47 drs1s006 kernel: [338109.416436]  handle_mm_fault+0xdc/0x210
> Apr 10 12:59:47 drs1s006 kernel: [338109.416438]  __do_page_fault+0x256/0x4e0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416442]  ? page_fault+0x2f/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416444]  page_fault+0x45/0x50
> Apr 10 12:59:47 drs1s006 kernel: [338109.416446] RIP: 2dc42c40:0x7fc528cc4401
> Apr 10 12:59:47 drs1s006 kernel: [338109.416447] RSP: 28009490:00007fc4eb3f28e0 EFLAGS: 7fc528cc4400
> Apr 10 12:59:47 drs1s006 kernel: [338109.416449] Mem-Info:
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475] active_anon:7327303 inactive_anon:201509 isolated_anon:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  active_file:24608843 inactive_file:24653281 isolated_file:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  unevictable:0 dirty:5 writeback:0 unstable:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  slab_reclaimable:3415816 slab_unreclaimable:1736671
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  mapped:210421 shmem:211438 pagetables:30887 bounce:0
> Apr 10 12:59:47 drs1s006 kernel: [338109.416475]  free:3783897 free_pcp:100 free_cma:0
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
       [not found]                 ` <HE1PR0401MB2538AE0195A130AD527C2B31EDBC0@HE1PR0401MB2538.eurprd04.prod.outlook.com>
@ 2018-05-11  7:01                   ` Larry Chen
  2018-07-12 14:24                     ` Daniel Sobe
  0 siblings, 1 reply; 32+ messages in thread
From: Larry Chen @ 2018-05-11  7:01 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,

On 04/12/2018 08:20 PM, Daniel Sobe wrote:
> Hi Larry,
>
> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>
> * Install the LXC packages from the distribution
> * run the command "lxc-create -n test1 -t download"
> ** first run might prompt you to generate a ~/.config/lxc/default.conf to define UID mappings
> ** in a corporate environment it might be tricky to set the http_proxy (and maybe even https_proxy) environment variables correctly
> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
> * the container downloads to ~/.local/share/lxc/
> * adapt the "config" file in that directory to add the shared ocfs2 mount like in my example below
> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
> ** if you want to finally enable networking, most distributions arrange a dedicated bridge (lxcbr0) which you can configure similar to my example below
> ** in my case I had to install cgroup related tools and reboot to have all cgroups available, and to allow use of lxcbr0 bridge in /etc/lxc/lxc-usernet
>
> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>
> Regards,
>
> Daniel
>
>
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Thursday, April 12, 2018 11:20
> To: Daniel Sobe <daniel.sobe@nxp.com>
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Hi Daniel,
>
> Quite an interesting issue.
>
> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>
> Do you have a script to build up your lxc environment?
> Because I want to make sure that my environment is quite the same as yours.
>
> Thanks,
> Larry
>
>
> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> not sure if it helps, but the issue wasn't there with Debian 8 and kernel 3.16, though that's a long history. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Thursday, April 12, 2018 05:17
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Thanks for your report.
>> I'll try to reproduce this bug as you did.
>>
>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>
>> Thanks
>> Larry
>>
>>
>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> below is an example config file like the one I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>
>>> Meanwhile I'm testing whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far without success; I will report once anything happens.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> ----
>>>
>>> # Distribution configuration
>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>> lxc.arch = x86_64
>>>
>>> # Container specific configuration
>>> lxc.id_map = u 0 624288 65536
>>> lxc.id_map = g 0 624288 65536
>>>
>>> lxc.utsname = container1
>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>
>>> lxc.network.type = veth
>>> lxc.network.flags = up
>>> lxc.network.link = bridge1
>>> lxc.network.name = eth0
>>> lxc.network.veth.pair = aabbccddeeff
>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>
>>> lxc.cgroup.cpuset.cpus = 63-86
>>>
>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>
>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>
>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>
>>> ----
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Wednesday, April 11, 2018 13:31
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>>
>>>
>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>
>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>> I thought there was a way to share a directory between the host and a docker container, like
>>>     docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>> That's different from yours.
>>>
>>> How did you set up your lxc or container?
>>>
>>> If you could show me the procedure, I'll try to reproduce it.
>>>
>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points on the local host, does the problem recur?
>>>
>>> Regards,
>>> Larry
>>>> Regards,
>>>>
>>>> Daniel
>>>>

Sorry for the delayed reply.

I tried lxc + ocfs2 with the mount shared the way you described, but I
cannot reproduce your bug.

I am using openSUSE Tumbleweed.

The procedure I used to try to reproduce it:
0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt
   with the command
       mount /dev/xxx /mnt
   It then shows
       207 65 254:16 / /mnt rw,relatime shared:94
   I think this *shared* is what you want, and this mount point will be
   shared among multiple namespaces.

1. Start Virtual Machine Manager.
2. Add a local LXC connection by clicking File > Add Connection.
   Select LXC (Linux Containers) as the hypervisor and click Connect.
3. Select the localhost (LXC) connection and click File > New Virtual
   Machine.
4. Activate Application container and click Forward.
   Set the path to the application to be launched. As an example, the
   field is filled with /bin/sh, which is fine to create a first
   container. Click Forward.
5. Choose the maximum amount of memory and CPUs to allocate to the
   container. Click Forward.
6. Type in a name for the container. This name will be used for all
   virsh commands on the container.
   Click Advanced options. Select the network to connect the container
   to and click Finish. The container will be created and started. A
   console will be opened automatically.
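
A note on step 0: the "shared:94" tag is the propagation field of a
mountinfo-style line. A minimal Python sketch of how that field could be
checked follows. It assumes the line in step 0 was taken from
/proc/self/mountinfo (the mail does not say so), and propagation_of is a
hypothetical helper name, not part of any existing tool.

```python
def propagation_of(mountinfo_text, mount_point):
    """Return the optional propagation fields (e.g. ['shared:94']) of mount_point.

    Returns None if the mount point is not listed in mountinfo_text.
    """
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[4] != mount_point:
            continue
        # Optional fields start at index 6 and run up to the "-" separator.
        # A shortened line without the separator is read to its end.
        optional = fields[6:]
        if "-" in optional:
            optional = optional[:optional.index("-")]
        return optional
    return None


# The line as posted in step 0 (it omits the trailing "- fstype source options" part).
sample = "207 65 254:16 / /mnt rw,relatime shared:94"
print(propagation_of(sample, "/mnt"))
```

On a live system the helper could be given the contents of
/proc/self/mountinfo, read once on the host and once inside a container,
to compare how the same mount is propagated in each namespace.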

If possible, could you please provide a shell script showing what you
did with your mount point?

Thanks
Larry

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-05-11  7:01                   ` Larry Chen
@ 2018-07-12 14:24                     ` Daniel Sobe
  2018-07-13  9:35                       ` Daniel Sobe
  2018-07-13  9:48                       ` Larry Chen
  0 siblings, 2 replies; 32+ messages in thread
From: Daniel Sobe @ 2018-07-12 14:24 UTC (permalink / raw)
  To: ocfs2-devel

Hi Larry,

Sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:

Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 
Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------

As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing but rather a multicore issue. That would be in line with the other report I found (I referenced it when reporting my issue), whose author said the issue went away after he restricted the machine to 1 active CPU core.
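
To test the multicore theory without a full compile job, something like the following Python sketch might serve as a load generator. It is only an assumption that concurrent path lookups are the trigger (the trace above reaches ocfs2_lookup from a stat-type syscall); stat_tree and hammer are hypothetical helper names. Pinning via os.sched_setaffinity is Linux-only and only roughly approximates a machine restricted to one active core.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def stat_tree(root):
    """lstat() every entry below root once and return the number visited."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                # Entries may vanish while other jobs run; skip them.
                pass
    return count


def hammer(root, workers=4, cpu=None):
    """Walk root from `workers` threads at once; return one count per worker.

    If cpu is given, pin the calling thread to it first (Linux only).
    Worker threads started afterwards inherit that affinity.
    """
    if cpu is not None:
        os.sched_setaffinity(0, {cpu})
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(stat_tree, [root] * workers))
```

A comparison run could call hammer(path, workers=8) and then hammer(path, workers=8, cpu=0) on a directory of the OCFS2 mount. Repeated walks are probably answered from the dentry cache, so a freshly mounted file system is likely needed for the lookups to reach ocfs2 at all.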

Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.

I have now updated to the Debian 4.16.16 kernel package backported for Debian 9. I expect to hit the bug again and will let you know.

Regards,

Daniel



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-12 14:24                     ` Daniel Sobe
@ 2018-07-13  9:35                       ` Daniel Sobe
  2018-07-13  9:51                         ` Larry Chen
  2018-07-13  9:48                       ` Larry Chen
  1 sibling, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2018-07-13  9:35 UTC (permalink / raw)
  To: ocfs2-devel


This is a stack trace from 4.16.16. All I was doing this time was a "git checkout", which probably led to a lot of file system activity.


Jul 13 11:31:00 drs1p001 kernel: [  849.213765] ------------[ cut here ]------------
Jul 13 11:31:00 drs1p001 kernel: [  849.213766] kernel BUG at /build/linux-Sci2oS/linux-4.16.16/fs/ocfs2/dlmglue.c:848!
Jul 13 11:31:00 drs1p001 kernel: [  849.213774] invalid opcode: 0000 [#1] SMP PTI
Jul 13 11:31:00 drs1p001 kernel: [  849.213776] Modules linked in: tcp_diag inet_diag unix_diag veth ocfs2 quota_tree bridge stp llc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul dell_wmi crc32_pclmul sparse_keymap wmi_bmof dell_smbios dell_wmi_descriptor ghash_clmulni_intel snd_hda_intel evdev snd_hda_codec intel_cstate dcdbas drm_kms_helper snd_hda_core snd_hwdep intel_uncore intel_rapl_perf snd_pcm snd_timer drm mei_me iTCO_wdt snd pcspkr mei soundcore iTCO_vendor_support i2c_algo_bit sg shpchp intel_pch_thermal wmi serio_raw button video acpi_pad drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4
Jul 13 11:31:00 drs1p001 kernel: [  849.213808]  crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci e1000e libata xhci_pci e1000 xhci_hcd i2c_i801 scsi_mod usbcore usb_common fan thermal
Jul 13 11:31:00 drs1p001 kernel: [  849.213823] CPU: 1 PID: 4266 Comm: git Not tainted 4.16.0-0.bpo.2-amd64 #1 Debian 4.16.16-2~bpo9+1
Jul 13 11:31:00 drs1p001 kernel: [  849.213825] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
Jul 13 11:31:00 drs1p001 kernel: [  849.213851] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
Jul 13 11:31:00 drs1p001 kernel: [  849.213865] RSP: 0000:ffffab4243c73b20 EFLAGS: 00010046
Jul 13 11:31:00 drs1p001 kernel: [  849.213867] RAX: 0000000000000282 RBX: ffff9b5fb19d1818 RCX: 0000000000000000
Jul 13 11:31:00 drs1p001 kernel: [  849.213869] RDX: 0000000000000000 RSI: ffff9b5fb19d1818 RDI: ffff9b5fb19d1894
Jul 13 11:31:00 drs1p001 kernel: [  849.213870] RBP: 0000000000000003 R08: ffff9b5fd9ca22e0 R09: ffff9b5fcf1ac400
Jul 13 11:31:00 drs1p001 kernel: [  849.213872] R10: ffffab4243c73b08 R11: 0000000000000000 R12: ffff9b5fb19d1894
Jul 13 11:31:00 drs1p001 kernel: [  849.213874] R13: ffff9b5fcd2cd000 R14: 0000000000000000 R15: ffffffffc0b2d940
Jul 13 11:31:00 drs1p001 kernel: [  849.213876] FS:  00007f62f1fa4700(0000) GS:ffff9b5fd9c80000(0000) knlGS:0000000000000000
Jul 13 11:31:00 drs1p001 kernel: [  849.213878] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 13 11:31:00 drs1p001 kernel: [  849.213879] CR2: 00007f62cc000010 CR3: 000000022abd2003 CR4: 00000000003606e0
Jul 13 11:31:00 drs1p001 kernel: [  849.213881] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 13 11:31:00 drs1p001 kernel: [  849.213883] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 13 11:31:00 drs1p001 kernel: [  849.213884] Call Trace:
Jul 13 11:31:00 drs1p001 kernel: [  849.213897]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Jul 13 11:31:00 drs1p001 kernel: [  849.213908]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
Jul 13 11:31:00 drs1p001 kernel: [  849.213921]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
Jul 13 11:31:00 drs1p001 kernel: [  849.213925]  ? _cond_resched+0x16/0x40
Jul 13 11:31:00 drs1p001 kernel: [  849.213928]  lookup_slow+0xa9/0x170
Jul 13 11:31:00 drs1p001 kernel: [  849.213930]  walk_component+0x1c6/0x350
Jul 13 11:31:00 drs1p001 kernel: [  849.213932]  path_lookupat+0x73/0x220
Jul 13 11:31:00 drs1p001 kernel: [  849.213935]  ? ___bpf_prog_run+0xba7/0x1260
Jul 13 11:31:00 drs1p001 kernel: [  849.213937]  filename_lookup+0xb8/0x1a0
Jul 13 11:31:00 drs1p001 kernel: [  849.213940]  ? seccomp_run_filters+0x58/0xb0
Jul 13 11:31:00 drs1p001 kernel: [  849.213942]  ? __check_object_size+0x98/0x1a0
Jul 13 11:31:00 drs1p001 kernel: [  849.213945]  ? strncpy_from_user+0x48/0x160
Jul 13 11:31:00 drs1p001 kernel: [  849.213947]  ? getname_flags+0x6a/0x1e0
Jul 13 11:31:00 drs1p001 kernel: [  849.213950]  ? vfs_statx+0x73/0xe0
Jul 13 11:31:00 drs1p001 kernel: [  849.213952]  vfs_statx+0x73/0xe0
Jul 13 11:31:00 drs1p001 kernel: [  849.213954]  SYSC_newlstat+0x39/0x70
Jul 13 11:31:00 drs1p001 kernel: [  849.213957]  ? syscall_trace_enter+0x117/0x2c0
Jul 13 11:31:00 drs1p001 kernel: [  849.213959]  do_syscall_64+0x6c/0x130
Jul 13 11:31:00 drs1p001 kernel: [  849.213961]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jul 13 11:31:00 drs1p001 kernel: [  849.213964] RIP: 0033:0x7f62f20800f5
Jul 13 11:31:00 drs1p001 kernel: [  849.213965] RSP: 002b:00007f62f1fa3d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Jul 13 11:31:00 drs1p001 kernel: [  849.213967] RAX: ffffffffffffffda RBX: 00007f62f1fa3e50 RCX: 00007f62f20800f5
Jul 13 11:31:00 drs1p001 kernel: [  849.213969] RDX: 00007f62f1fa3d40 RSI: 00007f62f1fa3d40 RDI: 00007f62e80008c0
Jul 13 11:31:00 drs1p001 kernel: [  849.213971] RBP: 0000000000000033 R08: 0000000000000003 R09: 0000000000000000
Jul 13 11:31:00 drs1p001 kernel: [  849.213972] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
Jul 13 11:31:00 drs1p001 kernel: [  849.213974] R13: 0000000000000000 R14: 0000000000000003 R15: 0000564ea29f2878
Jul 13 11:31:00 drs1p001 kernel: [  849.213976] Code: 89 c6 5b 5d 41 5c 41 5d e9 01 b8 e4 ea 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
Jul 13 11:31:00 drs1p001 kernel: [  849.214007] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffab4243c73b20
Jul 13 11:31:00 drs1p001 kernel: [  849.214010] ---[ end trace 99c07b7b69ee7717 ]---
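
For comparing reports across kernels (the first report in this thread hit dlmglue.c:825 on 4.9.82, the ones here hit line 848), the BUG location and faulting function can be pulled out of a log mechanically. The following Python sketch is one way to do that; summarize_bug is a hypothetical helper, and the patterns are only derived from the log lines in this thread, not from any specification of the oops format.

```python
import re

BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")
# Kernel-mode RIP lines here look like "RIP: 0010:func+0xOFF/0xSIZE [module]".
RIP_RE = re.compile(
    r"RIP: (?:[0-9a-f]{4}:)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_bug(log_text):
    """Return (source_path, line_number, function) for the first BUG in log_text.

    Any element is None if the corresponding log line is not found.
    """
    bug = BUG_RE.search(log_text)
    rip = RIP_RE.search(log_text)
    path = bug.group("path") if bug else None
    line = int(bug.group("line")) if bug else None
    func = rip.group("func") if rip else None
    return path, line, func


# Two lines copied from the trace above (syslog prefix shortened).
sample = """\
kernel: [  849.213766] kernel BUG at /build/linux-Sci2oS/linux-4.16.16/fs/ocfs2/dlmglue.c:848!
kernel: [  849.213851] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
"""
print(summarize_bug(sample))
```

The user-space "RIP: 0033:0x..." line further down the trace has no function+offset part, so the second pattern skips it.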

I'll try to get a backported 4.17 installed soon to check whether this still happens with newer kernels.

Regards,

Daniel

-----Original Message-----
From: ocfs2-devel-bounces@oss.oracle.com [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
Sent: Thursday, July 12, 2018 16:24
To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Larry,

Sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:

Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------

As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), whose author claimed the issue went away after he restricted the machine to 1 active CPU core.
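
A minimal sketch of a load generator for testing the multicore hypothesis: both traces end in a stat-type syscall going through ocfs2_lookup, so the idea is to stat many paths from several processes at once. The function name `parallel_stat` and its defaults are hypothetical, GNU findutils and coreutils are assumed, and nothing in this thread confirms that this load triggers the BUG.

```shell
#!/bin/sh
# Hypothetical load generator (not from the thread): stat every entry below
# a directory from several worker processes at once, then print how many
# entries were visited.
# $1: directory to walk, e.g. a tree on the OCFS2 mount
# $2: number of parallel workers (default 4)
parallel_stat() {
    find "$1" -print0 |
        xargs -0 -n 64 -P "${2:-4}" stat -c '%n' -- |
        wc -l
}
```

For example, `parallel_stat /storage/ocfs2/sw 8` would walk the shared directory from the example config with 8 workers. Lookups only reach the file system while the dentries are not yet cached, so a freshly mounted volume is probably the more interesting case.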

Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.

I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.

Regards,

Daniel


-----Original Message-----
From: Larry Chen [mailto:lchen at suse.com]
Sent: Freitag, 11. Mai 2018 09:01
To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Daniel,

On 04/12/2018 08:20 PM, Daniel Sobe wrote:
> Hi Larry,
>
> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>
> * Install the LXC packages from the distribution
> * run the command "lxc-create -n test1 -t download"
> ** first run might prompt you to generate a ~/.config/lxc/default.conf 
> to define UID mappings
> ** in a corporate environment it might be tricky to set the http_proxy 
> (and maybe even https_proxy) environment variables correctly
> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
> * the container downloads to ~/.local/share/lxc/
> * adapt the "config" file in that directory to add the shared ocfs2 
> mount like in my example below
> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
> ** if you want to finally enable networking, most distributions 
> arrange a dedicated bridge (lxcbr0) which you can configure similar to 
> my example below
> ** in my case I had to install cgroup related tools and reboot to have 
> all cgroups available, and to allow use of lxcbr0 bridge in 
> /etc/lxc/lxc-usernet
>
> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. I don't know the conditions under which this will occur, unfortunately.
>
> Regards,
>
> Daniel
>
>
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Donnerstag, 12. April 2018 11:20
> To: Daniel Sobe <daniel.sobe@nxp.com>
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Hi Daniel,
>
> Quite an interesting issue.
>
> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>
> Do you have a script to build up your lxc environment?
> Because I want to make sure that my environment is quite the same as yours.
>
> Thanks,
> Larry
>
>
> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> not sure if it helps, the issue wasn't there with Debian 8 and kernel
>> 3.16 - but that's a long history. Unfortunately, the only machine 
>> where I could try to bisect, does not run any kernel < 4.16 without 
>> other issues.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Donnerstag, 12. April 2018 05:17
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Thanks for your report.
>> I'll try to reproduce this bug as you did.
>>
>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>
>> Thanks
>> Larry
>>
>>
>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> below is an example config file of the kind I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory as you can see at the "lxc.mount.entry" line.
>>>
>>> Meanwhile I'm trying whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success, will report once anything happens.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> ----
>>>
>>> # Distribution configuration
>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>> lxc.arch = x86_64
>>>
>>> # Container specific configuration
>>> lxc.id_map = u 0 624288 65536
>>> lxc.id_map = g 0 624288 65536
>>>
>>> lxc.utsname = container1
>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>
>>> lxc.network.type = veth
>>> lxc.network.flags = up
>>> lxc.network.link = bridge1
>>> lxc.network.name = eth0
>>> lxc.network.veth.pair = aabbccddeeff
>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>
>>> lxc.cgroup.cpuset.cpus = 63-86
>>>
>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>
>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>
>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>
>>> ----
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Mittwoch, 11. April 2018 13:31
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>>
>>>
>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>
>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>> I was thinking of the way to share a directory between host and docker container, like
>>>     docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>> That's different from yours.
>>>
>>> How did you setup your lxc or container?
>>>
>>> If you could, show me the procedure, I'll try to reproduce it.
>>>
>>> And by the way, if you get rid of lxc, and just mount ocfs2 on several different mount points of the local host, will the problem recur?
>>>
>>> Regards,
>>> Larry
>>>> Regards,
>>>>
>>>> Daniel
>>>>

Sorry for this delayed reply.

I tried with lxc + ocfs2 in your mount-shared way.

But I cannot reproduce your bugs.

I am using openSUSE Tumbleweed.

The procedure I used to try to reproduce your bugs:
0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
     mount /dev/xxx /mnt
   then it shows
     207 65 254:16 / /mnt rw,relatime shared:94
   I think this *shared* is what you want. This mount point will be shared within multiple namespaces.

1. Start Virtual Machine Manager.
2. Add a local LXC connection by clicking File > Add Connection.
   Select LXC (Linux Containers) as the hypervisor and click Connect.
3. Select the localhost (LXC) connection and click File > New Virtual Machine.
4. Activate Application container and click Forward.
   Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container. Click Forward.
5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
6. Type in a name for the container. This name will be used for all virsh commands on the container.
   Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
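
The "shared:94" tag in step 0 looks like a (truncated) line from /proc/self/mountinfo, where the propagation tags sit between the mount options and a "-" separator. A small sketch for comparing the propagation state of a mount point on two machines, assuming that file format; the helper name `mount_propagation` is made up for illustration:

```shell
#!/bin/sh
# Print the propagation tags (e.g. "shared:94") of a mount point, or
# "private" when the mountinfo entry carries no tags at all.
# $1: mount point
# $2: mountinfo file to read (defaults to /proc/self/mountinfo)
mount_propagation() {
    awk -v mp="$1" '
        $5 == mp {
            tags = ""
            # Optional fields start at field 7 and end at the "-" separator.
            for (i = 7; i <= NF && $i != "-"; i++)
                tags = tags (tags == "" ? "" : " ") $i
            print (tags == "" ? "private" : tags)
        }
    ' "${2:-/proc/self/mountinfo}"
}
```

Running `mount_propagation /mnt` on the host and again inside a container (with the container's path to the bind mount) should show whether both sides see the same kind of mount.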

If possible, could you please provide a shell script to show what you did with your mount point?

Thanks
Larry


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel at oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-12 14:24                     ` Daniel Sobe
  2018-07-13  9:35                       ` Daniel Sobe
@ 2018-07-13  9:48                       ` Larry Chen
  2018-07-13 10:06                         ` Larry Chen
  2018-07-13 11:55                         ` Daniel Sobe
  1 sibling, 2 replies; 32+ messages in thread
From: Larry Chen @ 2018-07-13  9:48 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,

Thanks for your effort to reproduce the bug.
I can confirm that there is more than one bug.
I'll focus on this interesting issue.


On 07/12/2018 10:24 PM, Daniel Sobe wrote:
> Hi Larry,
> 
> sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
> 
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]

This BUG is caused by ocfs2_dentry_lock failing.
I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when
ocfs2_dentry_lock has failed.

But why it failed still confuses me.


> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
> 
> As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), whose author claimed the issue went away after he restricted the machine to 1 active CPU core.
> 
> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
> 
> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
> 
> Regards,
> 
> Daniel
> 
> 
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Freitag, 11. Mai 2018 09:01
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>
>> * Install the LXC packages from the distribution
>> * run the command "lxc-create -n test1 -t download"
>> ** first run might prompt you to generate a ~/.config/lxc/default.conf
>> to define UID mappings
>> ** in a corporate environment it might be tricky to set the http_proxy
>> (and maybe even https_proxy) environment variables correctly
>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>> * the container downloads to ~/.local/share/lxc/
>> * adapt the "config" file in that directory to add the shared ocfs2
>> mount like in my example below
>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>> ** if you want to finally enable networking, most distributions
>> arrange a dedicated bridge (lxcbr0) which you can configure similar to
>> my example below
>> ** in my case I had to install cgroup related tools and reboot to have
>> all cgroups available, and to allow use of lxcbr0 bridge in
>> /etc/lxc/lxc-usernet
>>
>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. I don't know the conditions under which this will occur, unfortunately.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Donnerstag, 12. April 2018 11:20
>> To: Daniel Sobe <daniel.sobe@nxp.com>
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Quite an interesting issue.
>>
>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>
>> Do you have a script to build up your lxc environment?
>> Because I want to make sure that my environment is quite the same as yours.
>>
>> Thanks,
>> Larry
>>
>>
>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> not sure if it helps, the issue wasn't there with Debian 8 and kernel
>>> 3.16 - but that's a long history. Unfortunately, the only machine
>>> where I could try to bisect, does not run any kernel < 4.16 without
>>> other issues.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Donnerstag, 12. April 2018 05:17
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Thanks for your report.
>>> I'll try to reproduce this bug as you did.
>>>
>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>
>>> Thanks
>>> Larry
>>>
>>>
>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> below is an example config file of the kind I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory as you can see at the "lxc.mount.entry" line.
>>>>
>>>> Meanwhile I'm trying whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success, will report once anything happens.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>> ----
>>>>
>>>> # Distribution configuration
>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>> lxc.arch = x86_64
>>>>
>>>> # Container specific configuration
>>>> lxc.id_map = u 0 624288 65536
>>>> lxc.id_map = g 0 624288 65536
>>>>
>>>> lxc.utsname = container1
>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>
>>>> lxc.network.type = veth
>>>> lxc.network.flags = up
>>>> lxc.network.link = bridge1
>>>> lxc.network.name = eth0
>>>> lxc.network.veth.pair = aabbccddeeff
>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>
>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>
>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>
>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>
>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>
>>>> ----
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Mittwoch, 11. April 2018 13:31
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>>
>>>>
>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>
>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>> I was thinking of the way to share a directory between host and docker container, like
>>>>     docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>> That's different from yours.
>>>>
>>>> How did you setup your lxc or container?
>>>>
>>>> If you could, show me the procedure, I'll try to reproduce it.
>>>>
>>>> And by the way, if you get rid of lxc, and just mount ocfs2 on several different mount points of the local host, will the problem recur?
>>>>
>>>> Regards,
>>>> Larry
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
> 
> Sorry for this delayed reply.
> 
> I tried with lxc + ocfs2 in your mount-shared way.
> 
> But I cannot reproduce your bugs.
> 
> I am using openSUSE Tumbleweed.
> 
> The procedure I used to try to reproduce your bugs:
> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>      mount /dev/xxx /mnt
>    then it shows
>      207 65 254:16 / /mnt rw,relatime shared:94
>    I think this *shared* is what you want. This mount point will be shared within multiple namespaces.
> 
> 1. Start Virtual Machine Manager.
> 2. Add a local LXC connection by clicking File > Add Connection.
>    Select LXC (Linux Containers) as the hypervisor and click Connect.
> 3. Select the localhost (LXC) connection and click File > New Virtual Machine.
> 4. Activate Application container and click Forward.
>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container. Click Forward.
> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
> 
> If possible, could you please provide a shell script to show what you did with your mount point?
> 
> Thanks
> Larry
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-13  9:35                       ` Daniel Sobe
@ 2018-07-13  9:51                         ` Larry Chen
  0 siblings, 0 replies; 32+ messages in thread
From: Larry Chen @ 2018-07-13  9:51 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,

Could you please describe your environment and the way to reproduce the bug?

Thanks
Larry
On 07/13/2018 05:35 PM, Daniel Sobe wrote:
> 
> This is a stacktrace from 4.16.16. All I was doing this time was a "git checkout" which probably led to a lot of file system activity.
> 
> 
> Jul 13 11:31:00 drs1p001 kernel: [  849.213765] ------------[ cut here ]------------
> Jul 13 11:31:00 drs1p001 kernel: [  849.213766] kernel BUG at /build/linux-Sci2oS/linux-4.16.16/fs/ocfs2/dlmglue.c:848!
> Jul 13 11:31:00 drs1p001 kernel: [  849.213774] invalid opcode: 0000 [#1] SMP PTI
> Jul 13 11:31:00 drs1p001 kernel: [  849.213776] Modules linked in: tcp_diag inet_diag unix_diag veth ocfs2 quota_tree bridge stp llc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul dell_wmi crc32_pclmul sparse_keymap wmi_bmof dell_smbios dell_wmi_descriptor ghash_clmulni_intel snd_hda_intel evdev snd_hda_codec intel_cstate dcdbas drm_kms_helper snd_hda_core snd_hwdep intel_uncore intel_rapl_perf snd_pcm snd_timer drm mei_me iTCO_wdt snd pcspkr mei soundcore iTCO_vendor_support i2c_algo_bit sg shpchp intel_pch_thermal wmi serio_raw button video acpi_pad drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4
> Jul 13 11:31:00 drs1p001 kernel: [  849.213808]  crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci e1000e libata xhci_pci e1000 xhci_hcd i2c_i801 scsi_mod usbcore usb_common fan thermal
> Jul 13 11:31:00 drs1p001 kernel: [  849.213823] CPU: 1 PID: 4266 Comm: git Not tainted 4.16.0-0.bpo.2-amd64 #1 Debian 4.16.16-2~bpo9+1
> Jul 13 11:31:00 drs1p001 kernel: [  849.213825] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Jul 13 11:31:00 drs1p001 kernel: [  849.213851] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Jul 13 11:31:00 drs1p001 kernel: [  849.213865] RSP: 0000:ffffab4243c73b20 EFLAGS: 00010046
> Jul 13 11:31:00 drs1p001 kernel: [  849.213867] RAX: 0000000000000282 RBX: ffff9b5fb19d1818 RCX: 0000000000000000
> Jul 13 11:31:00 drs1p001 kernel: [  849.213869] RDX: 0000000000000000 RSI: ffff9b5fb19d1818 RDI: ffff9b5fb19d1894
> Jul 13 11:31:00 drs1p001 kernel: [  849.213870] RBP: 0000000000000003 R08: ffff9b5fd9ca22e0 R09: ffff9b5fcf1ac400
> Jul 13 11:31:00 drs1p001 kernel: [  849.213872] R10: ffffab4243c73b08 R11: 0000000000000000 R12: ffff9b5fb19d1894
> Jul 13 11:31:00 drs1p001 kernel: [  849.213874] R13: ffff9b5fcd2cd000 R14: 0000000000000000 R15: ffffffffc0b2d940
> Jul 13 11:31:00 drs1p001 kernel: [  849.213876] FS:  00007f62f1fa4700(0000) GS:ffff9b5fd9c80000(0000) knlGS:0000000000000000
> Jul 13 11:31:00 drs1p001 kernel: [  849.213878] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul 13 11:31:00 drs1p001 kernel: [  849.213879] CR2: 00007f62cc000010 CR3: 000000022abd2003 CR4: 00000000003606e0
> Jul 13 11:31:00 drs1p001 kernel: [  849.213881] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jul 13 11:31:00 drs1p001 kernel: [  849.213883] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Jul 13 11:31:00 drs1p001 kernel: [  849.213884] Call Trace:
> Jul 13 11:31:00 drs1p001 kernel: [  849.213897]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Jul 13 11:31:00 drs1p001 kernel: [  849.213908]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
> Jul 13 11:31:00 drs1p001 kernel: [  849.213921]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Jul 13 11:31:00 drs1p001 kernel: [  849.213925]  ? _cond_resched+0x16/0x40
> Jul 13 11:31:00 drs1p001 kernel: [  849.213928]  lookup_slow+0xa9/0x170
> Jul 13 11:31:00 drs1p001 kernel: [  849.213930]  walk_component+0x1c6/0x350
> Jul 13 11:31:00 drs1p001 kernel: [  849.213932]  path_lookupat+0x73/0x220
> Jul 13 11:31:00 drs1p001 kernel: [  849.213935]  ? ___bpf_prog_run+0xba7/0x1260
> Jul 13 11:31:00 drs1p001 kernel: [  849.213937]  filename_lookup+0xb8/0x1a0
> Jul 13 11:31:00 drs1p001 kernel: [  849.213940]  ? seccomp_run_filters+0x58/0xb0
> Jul 13 11:31:00 drs1p001 kernel: [  849.213942]  ? __check_object_size+0x98/0x1a0
> Jul 13 11:31:00 drs1p001 kernel: [  849.213945]  ? strncpy_from_user+0x48/0x160
> Jul 13 11:31:00 drs1p001 kernel: [  849.213947]  ? getname_flags+0x6a/0x1e0
> Jul 13 11:31:00 drs1p001 kernel: [  849.213950]  ? vfs_statx+0x73/0xe0
> Jul 13 11:31:00 drs1p001 kernel: [  849.213952]  vfs_statx+0x73/0xe0
> Jul 13 11:31:00 drs1p001 kernel: [  849.213954]  SYSC_newlstat+0x39/0x70
> Jul 13 11:31:00 drs1p001 kernel: [  849.213957]  ? syscall_trace_enter+0x117/0x2c0
> Jul 13 11:31:00 drs1p001 kernel: [  849.213959]  do_syscall_64+0x6c/0x130
> Jul 13 11:31:00 drs1p001 kernel: [  849.213961]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> Jul 13 11:31:00 drs1p001 kernel: [  849.213964] RIP: 0033:0x7f62f20800f5
> Jul 13 11:31:00 drs1p001 kernel: [  849.213965] RSP: 002b:00007f62f1fa3d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Jul 13 11:31:00 drs1p001 kernel: [  849.213967] RAX: ffffffffffffffda RBX: 00007f62f1fa3e50 RCX: 00007f62f20800f5
> Jul 13 11:31:00 drs1p001 kernel: [  849.213969] RDX: 00007f62f1fa3d40 RSI: 00007f62f1fa3d40 RDI: 00007f62e80008c0
> Jul 13 11:31:00 drs1p001 kernel: [  849.213971] RBP: 0000000000000033 R08: 0000000000000003 R09: 0000000000000000
> Jul 13 11:31:00 drs1p001 kernel: [  849.213972] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> Jul 13 11:31:00 drs1p001 kernel: [  849.213974] R13: 0000000000000000 R14: 0000000000000003 R15: 0000564ea29f2878
> Jul 13 11:31:00 drs1p001 kernel: [  849.213976] Code: 89 c6 5b 5d 41 5c 41 5d e9 01 b8 e4 ea 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> Jul 13 11:31:00 drs1p001 kernel: [  849.214007] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffab4243c73b20
> Jul 13 11:31:00 drs1p001 kernel: [  849.214010] ---[ end trace 99c07b7b69ee7717 ]---
> 
> I'll try to get a backported 4.17 installed soon to verify whether this happens with newer kernels at all.
> 
> Regards,
> 
> Daniel
> 
> -----Original Message-----
> From: ocfs2-devel-bounces at oss.oracle.com [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
> Sent: Thursday, 12 July 2018 16:24
> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Larry,
> 
> sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
> 
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
> 
> As this occurred while compiling C code with "-j" I think we were on the wrong track, it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), who claimed the issue went away after he restricted to 1 active CPU core.
> 
> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
> 
> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
> 
> Regards,
> 
> Daniel
> 
> 
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Friday, 11 May 2018 09:01
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>
>> * Install the LXC packages from the distribution
>> * run the command "lxc-create -n test1 -t download"
>> ** first run might prompt you to generate a ~/.config/lxc/default.conf
>> to define UID mappings
>> ** in a corporate environment it might be tricky to set the http_proxy
>> (and maybe even https_proxy) environment variables correctly
>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>> * the container downloads to ~/.local/share/lxc/
>> * adapt the "config" file in that directory to add the shared ocfs2
>> mount like in my example below
>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>> ** if you want to finally enable networking, most distributions
>> arrange a dedicated bridge (lxcbr0) which you can configure similar to
>> my example below
>> ** in my case I had to install cgroup related tools and reboot to have
>> all cgroups available, and to allow use of lxcbr0 bridge in
>> /etc/lxc/lxc-usernet
>>
>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Thursday, 12 April 2018 11:20
>> To: Daniel Sobe <daniel.sobe@nxp.com>
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Quite an interesting issue.
>>
>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>
>> Do you have a script to build up your lxc environment?
>> Because I want to make sure that my environment is quite the same as yours.
>>
>> Thanks,
>> Larry
>>
>>
>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> not sure if it helps, the issue wasn't there with Debian 8 and kernel
>>> 3.16 - but that's a long history. Unfortunately, the only machine
>>> where I could try to bisect, does not run any kernel < 4.16 without
>>> other issues.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Thursday, 12 April 2018 05:17
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Thanks for your report.
>>> I'll try to reproduce this bug as you did.
>>>
>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>
>>> Thanks
>>> Larry
>>>
>>>
>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> below is an example config file as I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory as you can see at the "lxc.mount.entry" line.
>>>>
>>>> Meanwhile I'm testing whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far without success; I will report once anything happens.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>> ----
>>>>
>>>> # Distribution configuration
>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>> lxc.arch = x86_64
>>>>
>>>> # Container specific configuration
>>>> lxc.id_map = u 0 624288 65536
>>>> lxc.id_map = g 0 624288 65536
>>>>
>>>> lxc.utsname = container1
>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>
>>>> lxc.network.type = veth
>>>> lxc.network.flags = up
>>>> lxc.network.link = bridge1
>>>> lxc.network.name = eth0
>>>> lxc.network.veth.pair = aabbccddeeff
>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>
>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>
>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>
>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>
>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>
>>>> ----
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Wednesday, 11 April 2018 13:31
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>>
>>>>
>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>
>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>> I thought there was a way to share a directory between the host and a docker container, like
>>>>      docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>> That's different from yours.
>>>>
>>>> How did you setup your lxc or container?
>>>>
>>>> If you could, show me the procedure, I'll try to reproduce it.
>>>>
>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points on the local host, will the problem recur?
>>>>
>>>> Regards,
>>>> Larry
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
> 
> Sorry for this delayed reply.
> 
> I tried with lxc + ocfs2 in your mount-shared way.
> 
> But I can not reproduce your bugs.
> 
> What I use is opensuse tumbleweed.
> 
> The procedure I used to try to reproduce your bugs:
> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>      mount /dev/xxx /mnt
>    then it shows
>      207 65 254:16 / /mnt rw,relatime shared:94
>    I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
> 
> 1. Start Virtual Machine Manager.
> 2. Add a local LXC connection by clicking File > Add Connection.
>    Select LXC (Linux Containers) as the hypervisor and click Connect.
> 3. Select the localhost (LXC) connection and click the File > New Virtual Machine menu.
> 4. Activate Application container and click Forward.
>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>    Click Forward.
> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
> 
> If possible, could you please provide a shell script to show what you did with your mount point?
> 
> Thanks
> Larry
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-13  9:48                       ` Larry Chen
@ 2018-07-13 10:06                         ` Larry Chen
  2018-07-13 11:55                         ` Daniel Sobe
  1 sibling, 0 replies; 32+ messages in thread
From: Larry Chen @ 2018-07-13 10:06 UTC (permalink / raw)
  To: ocfs2-devel

Hi list,

I made a mistake in my previous analysis.

On 07/13/2018 05:48 PM, Larry Chen wrote:
> Hi Daniel,
> 
> Thanks for your effort to reproduce the bug.
> I can confirm that there is more than one bug.
> I'll focus on this interesting issue.
> 
> 
> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
> 
> This is caused by ocfs2_dentry_lock failing.
> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when
> ocfs2_dentry_lock fails.
> 
Here, ocfs2_dentry_lock may instead have a counter-leak bug.
I'll review the logic in the source code.
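
To make the suspected failure mode concrete, here is a minimal userspace sketch in C. It is a toy model, not the code from dlmglue.c: it assumes the BUG in __ocfs2_cluster_unlock is a sanity check that a holder count is non-zero before it is decremented, and every name in it (toy_lockres, toy_lock, toy_unlock, the two counters) is hypothetical. It only shows how an unlock that is not matched by a successful lock would trip such a check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a cluster lock resource; NOT the kernel struct. */
enum toy_level { TOY_PR, TOY_EX };

struct toy_lockres {
    unsigned int ro_holders;  /* current holders at the shared (PR) level */
    unsigned int ex_holders;  /* current holders at the exclusive (EX) level */
};

/* Pick the counter that belongs to the requested lock level. */
unsigned int *toy_counter(struct toy_lockres *res, enum toy_level level)
{
    assert(res != NULL);
    return level == TOY_EX ? &res->ex_holders : &res->ro_holders;
}

/*
 * Take a hold. When the (simulated) lock request fails, nothing is counted,
 * so the caller must not call toy_unlock() afterwards.
 */
int toy_lock(struct toy_lockres *res, enum toy_level level, bool simulate_failure)
{
    if (simulate_failure)
        return -1;
    (*toy_counter(res, level))++;
    return 0;
}

/*
 * Drop a hold. Returns -1 in the situation where, by the assumption stated
 * above, a kernel-side check would fire: the holder count is already zero.
 */
int toy_unlock(struct toy_lockres *res, enum toy_level level)
{
    unsigned int *holders = toy_counter(res, level);

    if (*holders == 0) {
        fprintf(stderr, "unbalanced unlock: holder count is already zero\n");
        return -1;
    }
    (*holders)--;
    return 0;
}
```

In this model, calling toy_unlock() after a toy_lock() that failed returns -1, while a balanced lock/unlock pair returns 0. Whether the real imbalance comes from unlocking after a failed lock or from a counter leak elsewhere is exactly what still needs to be reviewed.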

Regards,
Larry
> But why it failed still confuses me.
> 
> 
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>
>> As this occurred while compiling C code with "-j" I think we were on the wrong track, it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), who claimed the issue went away after he restricted to 1 active CPU core.
>>
>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>
>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Friday, 11 May 2018 09:01
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>
>>> * Install the LXC packages from the distribution
>>> * run the command "lxc-create -n test1 -t download"
>>> ** first run might prompt you to generate a ~/.config/lxc/default.conf
>>> to define UID mappings
>>> ** in a corporate environment it might be tricky to set the http_proxy
>>> (and maybe even https_proxy) environment variables correctly
>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>> * the container downloads to ~/.local/share/lxc/
>>> * adapt the "config" file in that directory to add the shared ocfs2
>>> mount like in my example below
>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>> ** if you want to finally enable networking, most distributions
>>> arrange a dedicated bridge (lxcbr0) which you can configure similar to
>>> my example below
>>> ** in my case I had to install cgroup related tools and reboot to have
>>> all cgroups available, and to allow use of lxcbr0 bridge in
>>> /etc/lxc/lxc-usernet
>>>
>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Thursday, 12 April 2018 11:20
>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Quite an interesting issue.
>>>
>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>
>>> Do you have a script to build up your lxc environment?
>>> Because I want to make sure that my environment is quite the same as yours.
>>>
>>> Thanks,
>>> Larry
>>>
>>>
>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> not sure if it helps, the issue wasn't there with Debian 8 and kernel
>>>> 3.16 - but that's a long history. Unfortunately, the only machine
>>>> where I could try to bisect, does not run any kernel < 4.16 without
>>>> other issues.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Thursday, 12 April 2018 05:17
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Thanks for your report.
>>>> I'll try to reproduce this bug as you did.
>>>>
>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>
>>>> Thanks
>>>> Larry
>>>>
>>>>
>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> below is an example config file as I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory as you can see at the "lxc.mount.entry" line.
>>>>>
>>>>> Meanwhile I'm testing whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far without success; I will report once anything happens.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>> ----
>>>>>
>>>>> # Distribution configuration
>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>> lxc.arch = x86_64
>>>>>
>>>>> # Container specific configuration
>>>>> lxc.id_map = u 0 624288 65536
>>>>> lxc.id_map = g 0 624288 65536
>>>>>
>>>>> lxc.utsname = container1
>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>
>>>>> lxc.network.type = veth
>>>>> lxc.network.flags = up
>>>>> lxc.network.link = bridge1
>>>>> lxc.network.name = eth0
>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>
>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>
>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>
>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>
>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>
>>>>> ----
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Wednesday, 11 April 2018 13:31
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>>
>>>>>
>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>
>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>> I thought there was a way to share a directory between the host and a docker container, like
>>>>>       docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>> That's different from yours.
>>>>>
>>>>> How did you setup your lxc or container?
>>>>>
>>>>> If you could, show me the procedure, I'll try to reproduce it.
>>>>>
>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points on the local host, will the problem recur?
>>>>>
>>>>> Regards,
>>>>> Larry
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>
>> Sorry for this delayed reply.
>>
>> I tried with lxc + ocfs2 in your mount-shared way.
>>
>> But I can not reproduce your bugs.
>>
>> What I use is opensuse tumbleweed.
>>
>> The procedure I used to try to reproduce your bugs:
>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>      mount /dev/xxx /mnt
>>    then it shows
>>      207 65 254:16 / /mnt rw,relatime shared:94
>>    I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
>>
>> 1. Start Virtual Machine Manager.
>> 2. Add a local LXC connection by clicking File > Add Connection.
>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>> 3. Select the localhost (LXC) connection and click the File > New Virtual Machine menu.
>> 4. Activate Application container and click Forward.
>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>    Click Forward.
>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>
>> If possible, could you please provide a shell script to show what you did with your mount point?
>>
>> Thanks
>> Larry
>>
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-13  9:48                       ` Larry Chen
  2018-07-13 10:06                         ` Larry Chen
@ 2018-07-13 11:55                         ` Daniel Sobe
  2018-07-16 11:49                           ` Daniel Sobe
  1 sibling, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2018-07-13 11:55 UTC (permalink / raw)
  To: ocfs2-devel

Hi Larry,

I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).

Again I was using an LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" in my OCFS2 home directory. The next crash I reported happened while doing "git checkout", which triggered a lot of changes to workspace files.
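
The workload described above (many parallel path lookups on the OCFS2 volume) could be approximated with a small stress script. This is only a sketch under assumptions: `lstat_storm`, `collect_paths` and `lstat_all` are hypothetical helper names, the worker and round counts are arbitrary, and nothing in this thread shows that such a loop actually triggers the BUG. Repeated passes over the same tree will mostly be served from the dentry cache, so a fresh mount is probably needed for each run to reach the slow lookup path seen in the traces.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def collect_paths(root):
    """Return every file and directory path below root."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def lstat_all(paths):
    """lstat() each path once and return how many calls succeeded."""
    ok = 0
    for path in paths:
        try:
            os.lstat(path)
            ok += 1
        except OSError:
            pass  # entries may vanish while the tree is being modified
    return ok


def lstat_storm(root, workers=8, rounds=10):
    """Run workers * rounds passes over the tree, `workers` at a time."""
    paths = collect_paths(root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lstat_all, [paths] * (workers * rounds)))
    return sum(results)
```

It would be pointed at a directory on the OCFS2 mount, for example `lstat_storm("/path/on/ocfs2")`, where the path is a placeholder.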

Next I will be using kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to rule out that the issue is related to namespaces, control groups, or anything else that is only present in a container.

Regards,

Daniel

-----Original Message-----
From: Larry Chen [mailto:lchen at suse.com] 
Sent: Friday, 13 July 2018 11:49
To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Daniel,

Thanks for your effort to reproduce the bug.
I can confirm that there is more than one bug.
I'll focus on this interesting issue.


On 07/12/2018 10:24 PM, Daniel Sobe wrote:
> Hi Larry,
> 
> sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
> 
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]

This BUG is caused by a failed ocfs2_dentry_lock call.
I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.

But why it failed still confuses me.
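
The failure mode described here, unlocking after a lock attempt that never succeeded, can be illustrated with a toy model. This is a sketch only, not the ocfs2 code: `LockRes`, `cluster_lock`, `cluster_unlock`, `attach_buggy` and `attach_fixed` are hypothetical names, and the `RuntimeError` stands in for the kernel BUG on the assumption that the check at dlmglue.c:848 guards against dropping a holder that was never recorded.

```python
class LockRes:
    """Toy lock resource that only counts its current holders."""

    def __init__(self):
        self.holders = 0


def cluster_lock(res, fail=False):
    """Take the lock; return 0 on success or a negative value on failure."""
    if fail:
        return -5  # simulated failure: no holder is recorded
    res.holders += 1
    return 0


def cluster_unlock(res):
    """Drop one holder; raising here plays the role of the kernel BUG."""
    if res.holders == 0:
        raise RuntimeError("unlock without a matching successful lock")
    res.holders -= 1


def attach_buggy(res, fail):
    """Error path unlocks unconditionally, even if the lock was never taken."""
    ret = cluster_lock(res, fail)
    cluster_unlock(res)
    return ret


def attach_fixed(res, fail):
    """Only unlock when the lock call actually succeeded."""
    ret = cluster_lock(res, fail)
    if ret == 0:
        cluster_unlock(res)
    return ret
```

In this model `attach_fixed` simply returns the error to its caller, while `attach_buggy` raises as soon as the lock call fails. The model says nothing about why the lock call fails in the first place.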


> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
> 
> As this occurred while compiling C code with "-j" I think we were on the wrong track, it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), who claimed the issue went away after he restricted to 1 active CPU core.
> 
> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
> 
> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
> 
> Regards,
> 
> Daniel
> 
> 
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Friday, 11 May 2018 09:01
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> this is, in a nutshell, what I do to create an LXC container as "ordinary user":
>>
>> * Install the LXC packages from the distribution
>> * run the command "lxc-create -n test1 -t download"
>> ** first run might prompt you to generate a ~/.config/lxc/default.conf to define UID mappings
>> ** in a corporate environment it might be tricky to set the http_proxy (and maybe even https_proxy) environment variables correctly
>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>> * the container downloads to ~/.local/share/lxc/
>> * adapt the "config" file in that directory to add the shared ocfs2 mount like in my example below
>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>> ** if you want to finally enable networking, most distributions arrange a dedicated bridge (lxcbr0) which you can configure similar to my example below
>> ** in my case I had to install cgroup related tools and reboot to have all cgroups available, and to allow use of lxcbr0 bridge in /etc/lxc/lxc-usernet
>>
>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Thursday, 12 April 2018 11:20
>> To: Daniel Sobe <daniel.sobe@nxp.com>
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Quite an interesting issue.
>>
>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>
>> Do you have a script to build up your lxc environment?
>> Because I want to make sure that my environment is quite the same as yours.
>>
>> Thanks,
>> Larry
>>
>>
>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> not sure if it helps: the issue wasn't there with Debian 8 and kernel 3.16, but that's a long history. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Thursday, 12 April 2018 05:17
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Thanks for your report.
>>> I'll try to reproduce this bug as you did.
>>>
>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>
>>> Thanks
>>> Larry
>>>
>>>
>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> below is an example config file like the one I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory as you can see at the "lxc.mount.entry" line.
>>>>
>>>> Meanwhile I'm trying whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success, will report once anything happens.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>> ----
>>>>
>>>> # Distribution configuration
>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>> lxc.arch = x86_64
>>>>
>>>> # Container specific configuration
>>>> lxc.id_map = u 0 624288 65536
>>>> lxc.id_map = g 0 624288 65536
>>>>
>>>> lxc.utsname = container1
>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>
>>>> lxc.network.type = veth
>>>> lxc.network.flags = up
>>>> lxc.network.link = bridge1
>>>> lxc.network.name = eth0
>>>> lxc.network.veth.pair = aabbccddeeff
>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>
>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>
>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>
>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>
>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>
>>>> ----
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Wednesday, 11 April 2018 13:31
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>>
>>>>
>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>
>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>> I thought there was a way to share a directory between host and docker container, like
>>>>       docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>> That's different from yours.
>>>>
>>>> How did you setup your lxc or container?
>>>>
>>>> If you could, show me the procedure, I'll try to reproduce it.
>>>>
>>>> And by the way, if you get rid of lxc, and just mount ocfs2 on several different mount points on the local host, will the problem recur?
>>>>
>>>> Regards,
>>>> Larry
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
> 
> Sorry for this delayed reply.
> 
> I tried with lxc + ocfs2 in your mount-shared way.
> 
> But I can not reproduce your bugs.
> 
> What I use is opensuse tumbleweed.
> 
> The procedure I used to try to reproduce your bugs:
> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>      mount /dev/xxx /mnt
>    then it shows
>      207 65 254:16 / /mnt rw,relatime shared:94
>    I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
> 
> 1. Start Virtual Machine Manager.
> 2. Add a local LXC connection by clicking File > Add Connection.
>    Select LXC (Linux Containers) as the hypervisor and click Connect.
> 3. Select the localhost (LXC) connection and click File > New Virtual Machine.
> 4. Activate Application container and click Forward.
>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>    Click Forward.
> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
> 
> If possible, could you please provide a shell script to show what you did with your mount point?
> 
> Thanks
> Larry
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-13 11:55                         ` Daniel Sobe
@ 2018-07-16 11:49                           ` Daniel Sobe
  2018-07-17  2:54                             ` Larry Chen
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2018-07-16 11:49 UTC (permalink / raw)
  To: ocfs2-devel

Hi,

the same issue happens with 4.17.6 kernel from Debian unstable.

This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.

All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume. 

After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.

Regards,

Daniel


Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
Jul 16 13:40:24 drs1p002 kernel: Call Trace:
Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
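
To compare reports like the one above across kernels, the key fields of a BUG report can be pulled out of the syslog lines mechanically. The following is a rough sketch, not an established tool: `summarize_bug` and the regular expressions are hypothetical and only assume the line layout visible in this thread. Frames marked with "?" are skipped because the unwinder does not consider them reliable.

```python
import re

BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
COMM_RE = re.compile(r"Comm: (\S+)")
# Kernel-mode RIP lines name a symbol; user-mode ones carry a raw address.
RIP_RE = re.compile(r"RIP: (?:[0-9a-f]{4}:)?([A-Za-z_][\w.]*)\+0x")
# A call-trace entry, with or without the "[seconds.micros]" timestamp.
FRAME_RE = re.compile(
    r"kernel:(?: \[\s*\d+\.\d+\])?\s+(\? )?"
    r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_bug(lines):
    """Return source file, line number, command, RIP symbol and reliable frames."""
    info = {"file": None, "line": None, "comm": None, "rip": None, "frames": []}
    in_trace = False
    for text in lines:
        m = BUG_RE.search(text)
        if m:
            info["file"], info["line"] = m.group(1), int(m.group(2))
        m = COMM_RE.search(text)
        if m and info["comm"] is None:
            info["comm"] = m.group(1)
        m = RIP_RE.search(text)
        if m and info["rip"] is None:
            info["rip"] = m.group(1)
        if "Call Trace:" in text:
            in_trace = True
            continue
        if in_trace:
            m = FRAME_RE.search(text)
            if m is None:
                in_trace = False  # first non-frame line ends the trace
            elif not m.group(1):
                info["frames"].append(m.group(2))
    return info
```

Fed the lines of a saved kernel log, it returns a small dictionary per report, which makes it easy to see whether two crashes share the same source line and call path.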

-----Original Message-----
From: ocfs2-devel-bounces@oss.oracle.com [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
Sent: Friday, 13 July 2018 13:56
To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Larry,

I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).

Again I was using an LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" in my OCFS2 home directory. The next crash I reported happened while doing "git checkout", which triggered a lot of changes to workspace files.

Next I will be using kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to rule out that the issue is related to namespaces, control groups, or anything else that is only present in a container.

Regards,

Daniel

-----Original Message-----
From: Larry Chen [mailto:lchen at suse.com]
Sent: Friday, 13 July 2018 11:49
To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Daniel,

Thanks for your effort to reproduce the bug.
I can confirm that there is more than one bug.
I'll focus on this interesting issue.


On 07/12/2018 10:24 PM, Daniel Sobe wrote:
> Hi Larry,
> 
> sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
> 
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]

This BUG is caused by a failed ocfs2_dentry_lock call.
I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.

But why it failed still confuses me.


> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
> 
> As this occurred while compiling C code with "-j" I think we were on the wrong track, it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), who claimed the issue went away after he restricted to 1 active CPU core.
> 
> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
> 
> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
> 
> Regards,
> 
> Daniel
> 
> 
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Friday, 11 May 2018 09:01
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> this is, in a nutshell, what I do to create an LXC container as "ordinary user":
>>
>> * Install the LXC packages from the distribution
>> * run the command "lxc-create -n test1 -t download"
>> ** first run might prompt you to generate a ~/.config/lxc/default.conf to define UID mappings
>> ** in a corporate environment it might be tricky to set the http_proxy (and maybe even https_proxy) environment variables correctly
>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>> * the container downloads to ~/.local/share/lxc/
>> * adapt the "config" file in that directory to add the shared ocfs2 mount like in my example below
>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>> ** if you want to finally enable networking, most distributions arrange a dedicated bridge (lxcbr0) which you can configure similar to my example below
>> ** in my case I had to install cgroup related tools and reboot to have all cgroups available, and to allow use of lxcbr0 bridge in /etc/lxc/lxc-usernet
>>
>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Donnerstag, 12. April 2018 11:20
>> To: Daniel Sobe <daniel.sobe@nxp.com>
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Quite an interesting issue.
>>
>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>
>> Do you have a script to build up your lxc environment?
>> Because I want to make sure that my environment is quite the same as yours.
>>
>> Thanks,
>> Larry
>>
>>
>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> not sure if it helps, the issue wasn't there with Debian 8 and kernel 3.16 - but that's a long history. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Donnerstag, 12. April 2018 05:17
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Thanks for your report.
>>> I'll try to reproduce this bug as you did.
>>>
>>> I'm afraid there may be some bugs on the collaboration of cgroups and ocfs2.
>>>
>>> Thanks
>>> Larry
>>>
>>>
>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> below is an example config file like the one I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>
>>>> Meanwhile I'm trying whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success, will report once anything happens.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>> ----
>>>>
>>>> # Distribution configuration
>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>> lxc.arch = x86_64
>>>>
>>>> # Container specific configuration
>>>> lxc.id_map = u 0 624288 65536
>>>> lxc.id_map = g 0 624288 65536
>>>>
>>>> lxc.utsname = container1
>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>
>>>> lxc.network.type = veth
>>>> lxc.network.flags = up
>>>> lxc.network.link = bridge1
>>>> lxc.network.name = eth0
>>>> lxc.network.veth.pair = aabbccddeeff
>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>
>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>
>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>
>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>
>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>
>>>> ----
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Mittwoch, 11. April 2018 13:31
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>>
>>>>
>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>
>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>> I thought there was a way to share a directory between the host and a docker container, like
>>>>      docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>> That's different from yours.
>>>>
>>>> How did you setup your lxc or container?
>>>>
>>>> If you could, show me the procedure, I'll try to reproduce it.
>>>>
>>>> And by the way, if you get rid of lxc, and just mount ocfs2 on several different mount point of local host, will the problem recur?
>>>>
>>>> Regards,
>>>> Larry
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
> 
> Sorry for this delayed reply.
> 
> I tried with lxc + ocfs2 in your mount-shared way.
> 
> But I can not reproduce your bugs.
> 
> What I use is opensuse tumbleweed.
> 
> The procedure I try to reproduce your bugs:
> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>      mount /dev/xxx /mnt
>    then it shows
>      207 65 254:16 / /mnt rw,relatime shared:94
>    I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
> 
> 1. Start Virtual Machine Manager.
> 2. Add a local LXC connection by clicking File -> Add Connection.
>    Select LXC (Linux Containers) as the hypervisor and click Connect.
> 3. Select the localhost (LXC) connection and click the File -> New Virtual Machine menu.
> 4. Activate Application container and click Forward.
>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>    Click Forward.
> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
> 
> If possible, could you please provide a shell script to show what you did with your mount point?
> 
> Thanks
> Larry
> 


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel at oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-16 11:49                           ` Daniel Sobe
@ 2018-07-17  2:54                             ` Larry Chen
  2018-07-17  8:11                               ` Daniel Sobe
  0 siblings, 1 reply; 32+ messages in thread
From: Larry Chen @ 2018-07-17  2:54 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,

Could you please simplify your environment?
Can I use several virtual machines to reproduce the bug?

Thanks
Larry

On 07/16/2018 07:49 PM, Daniel Sobe wrote:
> Hi,
> 
> the same issue happens with 4.17.6 kernel from Debian unstable.
> 
> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
> 
> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
> 
> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
> 
> Regards,
> 
> Daniel
> 
> 
> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
> 
> -----Original Message-----
> From: ocfs2-devel-bounces at oss.oracle.com [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
> Sent: Freitag, 13. Juli 2018 13:56
> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Larry,
> 
> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
> 
> Again I was using a LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported was while doing "git checkout", triggering a lot of change in workspace files.
> 
> Next I will be using kernel 4.17.6 now as it was recently packed for Debian unstable. Additionally I will work on the PC directly, to exclude that the issue is related to namespaces, control groups and what else that is only present in a container.
> 
> Regards,
> 
> Daniel
> 
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Freitag, 13. Juli 2018 11:49
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> Thanks for your effort to reproduce the bug.
> I can confirm that there is more than one bug.
> I'll focus on this interesting issue.
> 
> 
> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's todays kernel BUG log:
>>
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
> 
> This is caused by ocfs2_dentry_lock failing.
> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
> 
> But why it failed still confuses me.
> 
> 
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>
>> As this occurred while compiling C code with "-j" I think we were on the wrong track, it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), who claimed the issue went away after he restricted to 1 active CPU core.
>>
>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>
>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Freitag, 11. Mai 2018 09:01
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>
>>> * Install the LXC packages from the distribution
>>> * run the command "lxc-create -n test1 -t download"
>>> ** first run might prompt you to generate a
>>> ~/.config/lxc/default.conf to define UID mappings
>>> ** in a corporate environment it might be tricky to set the
>>> http_proxy (and maybe even https_proxy) environment variables
>>> correctly
>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>> * the container downloads to ~/.local/share/lxc/
>>> * adapt the "config" file in that directory to add the shared ocfs2
>>> mount like in my example below
>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>> ** if you want to finally enable networking, most distributions
>>> arrange a dedicated bridge (lxcbr0) which you can configure similar
>>> to my example below
>>> ** in my case I had to install cgroup related tools and reboot to
>>> have all cgroups available, and to allow use of lxcbr0 bridge in
>>> /etc/lxc/lxc-usernet
>>>
>>> Now if you access the mount-shared OCFS2 file system from with several containers, the bug will (hopefully) trigger on your side as well. I don't know the conditions under which this will occur, unfortunately.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Donnerstag, 12. April 2018 11:20
>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Quite an interesting issue.
>>>
>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>
>>> Do you have a script to build up your lxc environment?
>>> Because I want to make sure that my environment is quite the same as yours.
>>>
>>> Thanks,
>>> Larry
>>>
>>>
>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> not sure if it helps, the issue wasn't there with Debian 8 and kernel 3.16 - but that's a long history. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Donnerstag, 12. April 2018 05:17
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Thanks for your report.
>>>> I'll try to reproduce this bug as you did.
>>>>
>>>> I'm afraid there may be some bugs on the collaboration of cgroups and ocfs2.
>>>>
>>>> Thanks
>>>> Larry
>>>>
>>>>
>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> below is an example config file like the one I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>
>>>>> Meanwhile I'm trying whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success, will report once anything happens.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>> ----
>>>>>
>>>>> # Distribution configuration
>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>> lxc.arch = x86_64
>>>>>
>>>>> # Container specific configuration
>>>>> lxc.id_map = u 0 624288 65536
>>>>> lxc.id_map = g 0 624288 65536
>>>>>
>>>>> lxc.utsname = container1
>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>
>>>>> lxc.network.type = veth
>>>>> lxc.network.flags = up
>>>>> lxc.network.link = bridge1
>>>>> lxc.network.name = eth0
>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>
>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>
>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>
>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>
>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>
>>>>> ----
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Mittwoch, 11. April 2018 13:31
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>>
>>>>>
>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>
>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>> I thought there was a way to share a directory between the host and a docker container, like
>>>>>      docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>> That's different from yours.
>>>>>
>>>>> How did you setup your lxc or container?
>>>>>
>>>>> If you could, show me the procedure, I'll try to reproduce it.
>>>>>
>>>>> And by the way, if you get rid of lxc, and just mount ocfs2 on several different mount point of local host, will the problem recur?
>>>>>
>>>>> Regards,
>>>>> Larry
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>
>> Sorry for this delayed reply.
>>
>> I tried with lxc + ocfs2 in your mount-shared way.
>>
>> But I can not reproduce your bugs.
>>
>> What I use is opensuse tumbleweed.
>>
>> The procedure I try to reproduce your bugs:
>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>      mount /dev/xxx /mnt
>>    then it shows
>>      207 65 254:16 / /mnt rw,relatime shared:94
>>    I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
>>
>> 1. Start Virtual Machine Manager.
>> 2. Add a local LXC connection by clicking File -> Add Connection.
>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>> 3. Select the localhost (LXC) connection and click the File -> New Virtual Machine menu.
>> 4. Activate Application container and click Forward.
>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>    Click Forward.
>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>
>> If possible, could you please provide a shell script to show what you did with your mount point?
>>
>> Thanks
>> Larry
>>
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-17  2:54                             ` Larry Chen
@ 2018-07-17  8:11                               ` Daniel Sobe
  2018-07-18  8:08                                 ` Larry Chen
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2018-07-17  8:11 UTC (permalink / raw)
  To: ocfs2-devel

Hi Larry,

I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved. 

The only additional simplification I can think of is to remove some of the layers on top of the SSD. Currently I have:

SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2

I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think either of these makes sense to try?

Regards,

Daniel


-----Original Message-----
From: Larry Chen [mailto:lchen at suse.com] 
Sent: Tuesday, 17 July 2018 04:54
To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Daniel,

Could you please simplify your environment?
Can I use several virtual machines to reproduce the bug?

Thanks
Larry

On 07/16/2018 07:49 PM, Daniel Sobe wrote:
> Hi,
> 
> the same issue happens with 4.17.6 kernel from Debian unstable.
> 
> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
> 
> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
> 
> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
> 
> Regards,
> 
> Daniel
> 
> 
> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
> 
> -----Original Message-----
> From: ocfs2-devel-bounces at oss.oracle.com 
> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
> Sent: Friday, 13 July 2018 13:56
> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Larry,
> 
> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
> 
> Again I was using an LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported was while doing "git checkout", triggering a lot of changes in workspace files.
> 
> Next I will be using kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to rule out that the issue is related to namespaces, control groups and whatever else is only present in a container.
> 
> Regards,
> 
> Daniel
> 
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Friday, 13 July 2018 11:49
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> Thanks for your effort to reproduce the bug.
> I can confirm that there is more than one bug.
> I'll focus on this interesting issue.
> 
> 
> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
> 
> This is caused by a failed ocfs2_dentry_lock.
> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
> 
> But why it failed still confuses me.
> 
> 
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>
>> As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), whose author claimed the issue went away after he restricted the system to 1 active CPU core.
>>
>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>
>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Friday, 11 May 2018 09:01
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> this is, in a nutshell, what I do to create an LXC container as "ordinary user":
>>>
>>> * Install the LXC packages from the distribution
>>> * run the command "lxc-create -n test1 -t download"
>>> ** first run might prompt you to generate a 
>>> ~/.config/lxc/default.conf to define UID mappings
>>> ** in a corporate environment it might be tricky to set the 
>>> http_proxy (and maybe even https_proxy) environment variables 
>>> correctly
>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>> * the container downloads to ~/.local/share/lxc/
>>> * adapt the "config" file in that directory to add the shared ocfs2 
>>> mount like in my example below
>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>> ** if you want to finally enable networking, most distributions 
>>> arrange a dedicated bridge (lxcbr0) which you can configure similar 
>>> to my example below
>>> ** in my case I had to install cgroup related tools and reboot to 
>>> have all cgroups available, and to allow use of lxcbr0 bridge in 
>>> /etc/lxc/lxc-usernet
>>>
>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. I don't know the conditions under which this will occur, unfortunately.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Thursday, 12 April 2018 11:20
>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Quite an interesting issue.
>>>
>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>
>>> Do you have a script to build up your lxc environment?
>>> Because I want to make sure that my environment is quite the same as yours.
>>>
>>> Thanks,
>>> Larry
>>>
>>>
>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> not sure if it helps: the issue wasn't there with Debian 8 and kernel
>>>> 3.16, but that's a long history. Unfortunately, the only machine
>>>> where I could try to bisect does not run any kernel < 4.16 without
>>>> other issues.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Thursday, 12 April 2018 05:17
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Thanks for your report.
>>>> I'll try to reproduce this bug as you did.
>>>>
>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>
>>>> Thanks
>>>> Larry
>>>>
>>>>
>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> below is an example config file as I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>
>>>>> Meanwhile I'm testing whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success, will report once anything happens.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>> ----
>>>>>
>>>>> # Distribution configuration
>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>> lxc.arch = x86_64
>>>>>
>>>>> # Container specific configuration
>>>>> lxc.id_map = u 0 624288 65536
>>>>> lxc.id_map = g 0 624288 65536
>>>>>
>>>>> lxc.utsname = container1
>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>
>>>>> lxc.network.type = veth
>>>>> lxc.network.flags = up
>>>>> lxc.network.link = bridge1
>>>>> lxc.network.name = eth0
>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>
>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>
>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>
>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>
>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>
>>>>> ----
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Wednesday, 11 April 2018 13:31
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>>
>>>>>
>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>
>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>> I was thinking of the way a directory is shared between the host and a Docker container, like
>>>>>       docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>> That's different from yours.
>>>>>
>>>>> How did you set up your LXC container?
>>>>>
>>>>> If you could show me the procedure, I'll try to reproduce it.
>>>>>
>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points of the local host, will the problem recur?
>>>>>
>>>>> Regards,
>>>>> Larry
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>
>> Sorry for this delayed reply.
>>
>> I tried with lxc + ocfs2 in your mount-shared way.
>>
>> But I cannot reproduce your bugs.
>>
>> What I use is openSUSE Tumbleweed.
>>
>> The procedure I used to try to reproduce your bugs:
>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>    mount /dev/xxx /mnt
>>    then it shows
>>    207 65 254:16 / /mnt rw,relatime shared:94
>>    I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
>>
>> 1. Start Virtual Machine Manager.
>> 2. Add a local LXC connection by clicking File > Add Connection.
>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>> 3. Select the localhost (LXC) connection and click File > New Virtual Machine.
>> 4. Activate Application container and click Forward.
>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>    Click Forward.
>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>
>> If possible, could you please provide a shell script to show what you did with your mount point?
>>
>> Thanks
>> Larry
>>
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-17  8:11                               ` Daniel Sobe
@ 2018-07-18  8:08                                 ` Larry Chen
  2018-07-19 12:36                                   ` Daniel Sobe
  2018-09-11 11:35                                   ` Daniel Sobe
  0 siblings, 2 replies; 32+ messages in thread
From: Larry Chen @ 2018-07-18  8:08 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,

Which stack do you use, dlm or o2cb?
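
One way to check this on a running node, as a minimal sketch: the ocfs2 stack glue reports the active stack in sysfs. The default path and the function name `ocfs2_cluster_stack` below are assumptions for illustration; the file should only exist once the ocfs2_stackglue module is loaded, and with a pacemaker-based dlm setup the value is typically `pcmk` rather than `dlm`.

```shell
#!/bin/bash
# Print the cluster stack that the ocfs2 stack glue reports as active.
# The default path is an assumption; pass another file to override it.
ocfs2_cluster_stack() {
        local stack_file="${1:-/sys/fs/ocfs2/cluster_stack}"
        if [ -r "$stack_file" ]; then
                cat "$stack_file"
        else
                echo "unknown (cannot read $stack_file)" >&2
                return 1
        fi
}
```

Calling `ocfs2_cluster_stack` without arguments on a node with the volume mounted should print the stack name, for example `o2cb`.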

I tried to reproduce the bug.

I have set up 2 virtual machines that share one block device (a qcow2
file on the host). I was using the dlm stack instead of o2cb. The kernel
version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed
the following shell script.

#!/bin/bash
# Check out every tag in turn to generate a lot of lookups and disk I/O.
for tag in $(git tag)
do
         echo "$tag"
         git checkout "$tag"
done

The bug could not be reproduced.
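
Both reported traces die in a stat-family syscall that goes through ocfs2_lookup (from cc1 during "make -j" and from git), so a workload that issues many path lookups in parallel may get closer to the trigger than a serial checkout loop. The following is only a sketch under that assumption; the function name `parallel_stat`, the default job count and the batch size are made up for illustration, and nothing here is confirmed to reproduce the bug.

```shell
#!/bin/bash
# Stat every entry below a directory from several processes at once.
# Usage: parallel_stat DIR [JOBS]
parallel_stat() {
        local dir="$1" jobs="${2:-8}"
        # -print0 / -0 keep unusual file names intact; -P runs up to JOBS
        # stat processes concurrently, each handling up to 64 paths.
        find "$dir" -print0 | xargs -0 -P "$jobs" -n 64 stat > /dev/null
}
```

A run could look like `parallel_stat /home/user/linux 16`. Lookups served from the dentry cache never reach the filesystem, so dropping the cache between runs (as root, `echo 2 > /proc/sys/vm/drop_caches`) should force them back into ocfs2.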

Judging from the backtrace, I think the bug is caused by the logic of
holding a lock.

If so, I think the bug will recur even without DRBD, LVM or other
components.
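
To make the suspected failure mode concrete, here is a toy model in shell, not the kernel code. It assumes, as discussed earlier in this thread, that the BUG fires when an unlock is issued for a lock that was never successfully taken, so a holder count is dropped without having been raised. All names (`holders`, `toy_lock`, `toy_unlock`, `attach_buggy`, `attach_fixed`) are hypothetical.

```shell
#!/bin/bash
# Toy model of a lock resource with a holder count (hypothetical names).
holders=0

# Take the lock; a non-empty first argument simulates a failed request,
# in which case the holder count is left untouched.
toy_lock() {
        [ -n "${1:-}" ] && return 1
        holders=$((holders + 1))
}

# Drop the lock; refusing to go below zero stands in for the kernel BUG.
toy_unlock() {
        if [ "$holders" -eq 0 ]; then
                echo "BUG: unlock without a holder" >&2
                return 1
        fi
        holders=$((holders - 1))
}

# Buggy pattern: the failure is noticed, but unlock still runs.
attach_buggy() {
        toy_lock "${1:-}" || echo "lock failed" >&2
        toy_unlock
}

# Fixed pattern: only unlock what was successfully locked.
attach_fixed() {
        toy_lock "${1:-}" || return 1
        toy_unlock
}
```

With a simulated failure, `attach_buggy fail` prints the BUG message, while `attach_fixed fail` returns an error without touching the holder count. This only models the symptom; it says nothing about why the lock request fails in the first place.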

Regards,
Larry

On 07/17/2018 04:11 PM, Daniel Sobe wrote:
> Hi Larry,
> 
> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
> 
> The only additional simplification I can think of is to remove some of the layers on top of the SSD. Currently I have:
> 
> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
> 
> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think either of these makes sense to try?
> 
> Regards,
> 
> Daniel
> 
> 
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Tuesday, 17 July 2018 04:54
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> Could you please simplify your environment?
> Can I use several virtual machines to reproduce the bug?
> 
> Thanks
> Larry
> 
> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>> Hi,
>>
>> the same issue happens with 4.17.6 kernel from Debian unstable.
>>
>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>
>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>
>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
>> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
>> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
>> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
>> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
>> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
>> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
>> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
>> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
>> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
>> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
>> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
>> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
>> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
>> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
>> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
>> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
>> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
>> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
>> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
>>
>> -----Original Message-----
>> From: ocfs2-devel-bounces at oss.oracle.com
>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
>> Sent: Friday, 13 July 2018 13:56
>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Larry,
>>
>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
>>
>> Again I was using an LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported was while doing "git checkout", triggering a lot of changes in workspace files.
>>
>> Next I will be using kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to rule out that the issue is related to namespaces, control groups and whatever else is only present in a container.
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Freitag, 13. Juli 2018 11:49
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Thanks for your effort to reproduce the bug.
>> I can confirm that there is more than one bug.
>> I'll focus on this interesting issue.
>>
>>
>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>
>> This is caused by ocfs2_dentry_lock failing.
>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
>>
>> But why it failed still confuses me.
>>
>>
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>
>>> As this occurred while compiling C code with "-j" I think we were on the wrong track, it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), who claimed the issue went away after he restricted to 1 active CPU core.
>>>
>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>
>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Freitag, 11. Mai 2018 09:01
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>
>>>> * Install the LXC packages from the distribution
>>>> * run the command "lxc-create -n test1 -t download"
>>>> ** first run might prompt you to generate a
>>>> ~/.config/lxc/default.conf to define UID mappings
>>>> ** in a corporate environment it might be tricky to set the
>>>> http_proxy (and maybe even https_proxy) environment variables
>>>> correctly
>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>> * the container downloads to ~/.local/share/lxc/
>>>> * adapt the "config" file in that directory to add the shared ocfs2
>>>> mount like in my example below
>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>> ** if you want to finally enable networking, most distributions
>>>> arrange a dedicated bridge (lxcbr0) which you can configure similar
>>>> to my example below
>>>> ** in my case I had to install cgroup related tools and reboot to
>>>> have all cgroups available, and to allow use of lxcbr0 bridge in
>>>> /etc/lxc/lxc-usernet
>>>>
>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Donnerstag, 12. April 2018 11:20
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Quite an interesting issue.
>>>>
>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>
>>>> Do you have a script to build up your lxc environment?
>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>
>>>> Thanks,
>>>> Larry
>>>>
>>>>
>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> not sure if it helps, but the issue wasn't there with Debian 8 and
>>>>> kernel 3.16 - that's a long history, though. Unfortunately, the only
>>>>> machine where I could try to bisect does not run any kernel < 4.16
>>>>> without other issues.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Donnerstag, 12. April 2018 05:17
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Thanks for your report.
>>>>> I'll try to reproduce this bug as you did.
>>>>>
>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>
>>>>> Thanks
>>>>> Larry
>>>>>
>>>>>
>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> below is an example config file as I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>>
>>>>>> Meanwhile I'm trying whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success, will report once anything happens.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> ----
>>>>>>
>>>>>> # Distribution configuration
>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>> lxc.arch = x86_64
>>>>>>
>>>>>> # Container specific configuration
>>>>>> lxc.id_map = u 0 624288 65536
>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>
>>>>>> lxc.utsname = container1
>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>
>>>>>> lxc.network.type = veth
>>>>>> lxc.network.flags = up
>>>>>> lxc.network.link = bridge1
>>>>>> lxc.network.name = eth0
>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>
>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>
>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>
>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>
>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>
>>>>>> ----
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Mittwoch, 11. April 2018 13:31
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>
>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>> I was thinking of the way a directory is shared between host and docker container, like
>>>>>>        docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>> That's different from yours.
>>>>>>
>>>>>> How did you set up your lxc or container?
>>>>>>
>>>>>> If you could show me the procedure, I'll try to reproduce it.
>>>>>>
>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points of the local host, will the problem recur?
>>>>>>
>>>>>> Regards,
>>>>>> Larry
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>
>>> Sorry for this delayed reply.
>>>
>>> I tried with lxc + ocfs2 in your mount-shared way.
>>>
>>> But I cannot reproduce your bugs.
>>>
>>> I am using openSUSE Tumbleweed.
>>>
>>> The procedure I used to try to reproduce your bugs:
>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>        mount /dev/xxx /mnt
>>>     then it shows
>>>        207 65 254:16 / /mnt rw,relatime shared:94
>>>     I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
>>>
>>> 1. Start Virtual Machine Manager.
>>> 2. Add a local LXC connection by clicking File > Add Connection.
>>>     Select LXC (Linux Containers) as the hypervisor and click Connect.
>>> 3. Select the localhost (LXC) connection and click the File > New Virtual Machine menu.
>>> 4. Activate Application container and click Forward.
>>>     Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>>     Click Forward.
>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>     Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>>
>>> If possible, could you please provide a shell script to show what you did with your mount point?
>>>
>>> Thanks
>>> Larry
>>>
>>
>>
>> _______________________________________________
>> Ocfs2-devel mailing list
>> Ocfs2-devel at oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-18  8:08                                 ` Larry Chen
@ 2018-07-19 12:36                                   ` Daniel Sobe
  2018-09-11 11:35                                   ` Daniel Sobe
  1 sibling, 0 replies; 32+ messages in thread
From: Daniel Sobe @ 2018-07-19 12:36 UTC (permalink / raw)
  To: ocfs2-devel

Hi Larry,

I was not aware that I can pick between 2 alternatives.

I'm probably using o2cb because I start the cluster with "/etc/init.d/o2cb enable && /etc/init.d/o2cb start". 
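
One way to confirm this rather than guess is to read the stack name from sysfs. The following is only a minimal sketch: it assumes the ocfs2 stack glue module is loaded and exposes the active stack at /sys/fs/ocfs2/cluster_stack, and the function name ocfs2_active_stack and its optional path argument are made up for illustration.

```shell
#!/bin/bash
# Print the name of the active OCFS2 cluster stack.
# Assumption: the kernel exposes it in /sys/fs/ocfs2/cluster_stack.
# A different file can be passed as $1, which is handy for testing.
ocfs2_active_stack() {
    local f="${1:-/sys/fs/ocfs2/cluster_stack}"
    if [ -r "$f" ]; then
        # The file holds a short stack name (for example "o2cb");
        # strip the trailing newline and any other whitespace.
        tr -d '[:space:]' < "$f"
        echo
    else
        # Module not loaded or file not present on this kernel.
        echo "unknown"
    fi
}
```

Calling ocfs2_active_stack without arguments on a node with the volume mounted should then print the stack in use; "unknown" means the file could not be read at all.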

I'll need to learn how to use dlm to check whether the crash happens with that one as well.

Regards,

Daniel

-----Original Message-----
From: Larry Chen [mailto:lchen at suse.com] 
Sent: Mittwoch, 18. Juli 2018 10:09
To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Daniel,

Which stack do you use? dlm or o2cb?

I tried to reproduce the bug.

I have set up 2 virtual machines that share one block device (a qcow2 file on the host), and I was using the dlm stack instead of o2cb. The kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script.

#!/bin/bash
# Check out every tag in turn to generate a lot of file churn on the volume.
for i in $(git tag)
do
         echo "$i"
         git checkout "$i"
done

Bug could not be reproduced.

According to the back trace, I think the bug is caused by the logic of holding a lock.

If so, I think the bug will recur even without DRBD, LVM or other components.

Regards,
Larry

On 07/17/2018 04:11 PM, Daniel Sobe wrote:
> Hi Larry,
> 
> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
> 
> The only additional simplification that I can think of concerns the layers on top of the SSD. Currently I have:
> 
> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
> 
> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
> 
> Regards,
> 
> Daniel
> 
> 
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Dienstag, 17. Juli 2018 04:54
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> Could you please simplify your environment?
> Can I use several virtual machines to reproduce the bug?
> 
> Thanks
> Larry
> 
> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>> Hi,
>>
>> the same issue happens with 4.17.6 kernel from Debian unstable.
>>
>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>
>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>
>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
>> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
>> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
>> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
>> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
>> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
>> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
>> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
>> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
>> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
>> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
>> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
>> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
>> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
>> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
>> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
>> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
>> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
>> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
>> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
>>
>> -----Original Message-----
>> From: ocfs2-devel-bounces at oss.oracle.com
>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
>> Sent: Freitag, 13. Juli 2018 13:56
>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Larry,
>>
>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
>>
>> Again I was using a LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported was while doing "git checkout", triggering a lot of change in workspace files.
>>
>> Next I will be using kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally, I will work on the PC directly, to rule out that the issue is related to namespaces, control groups or anything else that is only present in a container.
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Freitag, 13. Juli 2018 11:49
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Thanks for your effort to reproduce the bug.
>> I can confirm that there is more than one bug.
>> I'll focus on this interesting issue.
>>
>>
>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>
>> This is caused by ocfs2_dentry_lock failing.
>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
>>
>> But why it failed still confuses me.
>>
>>
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>
>>> As this occurred while compiling C code with "-j" I think we were on the wrong track, it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), who claimed the issue went away after he restricted to 1 active CPU core.
>>>
>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>
>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Freitag, 11. Mai 2018 09:01
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>
>>>> * Install the LXC packages from the distribution
>>>> * run the command "lxc-create -n test1 -t download"
>>>> ** first run might prompt you to generate a 
>>>> ~/.config/lxc/default.conf to define UID mappings
>>>> ** in a corporate environment it might be tricky to set the 
>>>> http_proxy (and maybe even https_proxy) environment variables 
>>>> correctly
>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>> * the container downloads to ~/.local/share/lxc/
>>>> * adapt the "config" file in that directory to add the shared ocfs2 
>>>> mount like in my example below
>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>> ** if you want to finally enable networking, most distributions 
>>>> arrange a dedicated bridge (lxcbr0) which you can configure similar 
>>>> to my example below
>>>> ** in my case I had to install cgroup related tools and reboot to 
>>>> have all cgroups available, and to allow use of lxcbr0 bridge in 
>>>> /etc/lxc/lxc-usernet
>>>>
>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. I don't know the conditions under which this will occur, unfortunately.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Donnerstag, 12. April 2018 11:20
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Quite an interesting issue.
>>>>
>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>
>>>> Do you have a script to build up your lxc environment?
>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>
>>>> Thanks,
>>>> Larry
>>>>
>>>>
>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> not sure if it helps, the issue wasn't there with Debian 8 and
>>>>> kernel 3.16 - but that's a long history. Unfortunately, the only
>>>>> machine where I could try to bisect does not run any kernel < 4.16
>>>>> without other issues.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Donnerstag, 12. April 2018 05:17
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Thanks for your report.
>>>>> I'll try to reproduce this bug as you did.
>>>>>
>>>>> I'm afraid there may be some bugs on the collaboration of cgroups and ocfs2.
>>>>>
>>>>> Thanks
>>>>> Larry
>>>>>
>>>>>
>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> below is an example config file like I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory as you can see at the "lxc.mount.entry" line.
>>>>>>
>>>>>> Meanwhile I'm trying whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success, will report once anything happens.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> ----
>>>>>>
>>>>>> # Distribution configuration
>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>> lxc.arch = x86_64
>>>>>>
>>>>>> # Container specific configuration
>>>>>> lxc.id_map = u 0 624288 65536
>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>
>>>>>> lxc.utsname = container1
>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>
>>>>>> lxc.network.type = veth
>>>>>> lxc.network.flags = up
>>>>>> lxc.network.link = bridge1
>>>>>> lxc.network.name = eth0
>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>
>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>
>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>
>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>
>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>
>>>>>> ----
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Mittwoch, 11. April 2018 13:31
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>
>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>> I thought there was a way to share a directory between host and docker container, like
>>>>>>        docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>> That's different from yours.
>>>>>>
>>>>>> How did you setup your lxc or container?
>>>>>>
>>>>>> If you could, show me the procedure, I'll try to reproduce it.
>>>>>>
>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points of the local host, will the problem recur?
>>>>>>
>>>>>> Regards,
>>>>>> Larry
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>
>>> Sorry for this delayed reply.
>>>
>>> I tried with lxc + ocfs2 in your mount-shared way.
>>>
>>> But I can not reproduce your bugs.
>>>
>>> What I use is opensuse tumbleweed.
>>>
>>> The procedure I try to reproduce your bugs:
>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>        mount /dev/xxx /mnt
>>>    then it shows
>>>        207 65 254:16 / /mnt rw,relatime shared:94
>>>    I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
>>>
>>> 1. Start Virtual Machine Manager.
>>> 2. Add a local LXC connection by clicking File > Add Connection.
>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>> 3. Select the localhost (LXC) connection and click the File > New Virtual Machine menu.
>>> 4. Activate Application container and click Forward.
>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>>    Click Forward.
>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>>
>>> If possible, could you please provide a shell script to show what you did with your mount point?
>>>
>>> Thanks
>>> Larry
>>>
>>
>>
>> _______________________________________________
>> Ocfs2-devel mailing list
>> Ocfs2-devel at oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>
> 


* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-07-18  8:08                                 ` Larry Chen
  2018-07-19 12:36                                   ` Daniel Sobe
@ 2018-09-11 11:35                                   ` Daniel Sobe
  2018-09-12  7:03                                     ` Larry Chen
  2019-02-20  8:48                                     ` Daniel Sobe
  1 sibling, 2 replies; 32+ messages in thread
From: Daniel Sobe @ 2018-09-11 11:35 UTC (permalink / raw)
  To: ocfs2-devel

Hi Larry,

I tested your script and indeed it does not provoke the error. Meanwhile I switched to a newer kernel, which makes the bug harder to provoke. Here is the stack trace:

Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
Sep 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted 4.18.0-1-amd64 #1 Debian 4.18.6-1
Sep 11 13:08:51 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
Sep 11 13:08:51 drs1p002 kernel: Call Trace:
Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel:  ? d_splice_alias+0x299/0x410
Sep 11 13:08:51 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel:  __lookup_slow+0x97/0x150
Sep 11 13:08:51 drs1p002 kernel:  lookup_slow+0x35/0x50
Sep 11 13:08:51 drs1p002 kernel:  walk_component+0x1c4/0x480
Sep 11 13:08:51 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
Sep 11 13:08:51 drs1p002 kernel:  ? path_init+0x177/0x2f0
Sep 11 13:08:51 drs1p002 kernel:  path_lookupat+0x84/0x1f0
Sep 11 13:08:51 drs1p002 kernel:  filename_lookup+0xb6/0x190
Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
Sep 11 13:08:51 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
Sep 11 13:08:51 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
Sep 11 13:08:51 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
Sep 11 13:08:51 drs1p002 kernel:  vfs_statx+0x73/0xe0
Sep 11 13:08:51 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
Sep 11 13:08:51 drs1p002 kernel:  do_syscall_64+0x55/0x110
Sep 11 13:08:51 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995
Sep 11 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4 0c 00 f7 d8 64 89
Sep 11 13:08:51 drs1p002 kernel: RSP: 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Sep 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f2434e20390 RCX: 00007f24b6cc5995
Sep 11 13:08:51 drs1p002 kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI: 00007f24640dd9d0
Sep 11 13:08:51 drs1p002 kernel: RBP: 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800
Sep 11 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246 R12: 00007f24640dd9d0
Sep 11 13:08:51 drs1p002 kernel: R13: 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800
Sep 11 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me drm_kms_helper snd_timer snd soundcore pcspkr serio_raw efivars drm mei lpc_ich i2c_algo_bit sg ie31200_edac video pcc_cpufreq button drbd lru_cache libcrc32c parport_pc sunrpc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16
Sep 11 13:08:51 drs1p002 kernel:  mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801 libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod usbcore e1000e usb_common thermal
Sep 11 13:08:51 drs1p002 kernel: ---[ end trace feba92ba6e432478 ]---
Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0


All I can say is that I was using Git heavily when this happened (in Eclipse, synchronizing the Git workspace). It took me around 30 minutes to see the bug again.
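
For comparing this trace with the earlier ones, it can help to reduce each oops to its list of function names. Below is a minimal sketch, not a tool used in this thread: trace_frames is a hypothetical helper name, and it assumes the syslog-style "... kernel: " prefix seen in the logs here. Frames that the kernel marks with "?" are stack entries it could not verify, so they are skipped unless "all" is passed.

```shell
#!/bin/bash
# Print the function names from the "Call Trace:" part of a kernel oops
# read on stdin, one name per line, innermost frame first.
# Frames marked "?" are skipped unless the first argument is "all".
trace_frames() {
    awk -v mode="${1:-reliable}" '
        /Call Trace:/ { in_trace = 1; next }
        in_trace {
            line = $0
            sub(/^.*kernel: /, "", line)        # drop the syslog prefix
            sub(/^\[ *[0-9.]+\] */, "", line)   # drop a [timestamp] if present
            sub(/^ +/, "", line)                # drop leading blanks
            guess = sub(/^\? /, "", line)       # 1 if the frame was marked "?"
            # Anything that is not symbol+0xoff/0xlen ends the call trace.
            if (line !~ /^[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+\/0x[0-9a-f]+/) {
                in_trace = 0
                next
            }
            if (guess && mode != "all") next
            sub(/\+0x.*/, "", line)             # keep only the symbol name
            print line
        }'
}
```

A possible use is "journalctl -k | trace_frames", or feeding it a saved log file. For the trace above, the unmarked frames run from the lstat system call entry down through ocfs2_lookup to ocfs2_dentry_attach_lock.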

Regards,

Daniel

-----Original Message-----
From: Larry Chen <lchen@suse.com> 
Sent: Mittwoch, 18. Juli 2018 10:09
To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Daniel,

Which stack do you use? dlm or o2cb?

I tried to reproduce the bug.

I have set up 2 virtual machines that share one block device (a qcow2 file on the host). I was using the dlm stack instead of o2cb. The kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script.

#!/bin/bash
# Check out every tag in turn to generate heavy file system activity.
for i in $(git tag)
do
         echo "$i"
         git checkout "$i"
done

Bug could not be reproduced.
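
The loop above runs one git process at a time, while the reported crashes all happened in a stat/lstat path (vfs_statx -> ocfs2_lookup) and were seen under parallel load such as "make -j". A variant that issues lstat() calls from several processes at once might be worth trying. The following is only a sketch under that assumption: stat_storm is a hypothetical name, it relies on GNU find, xargs and stat, and whether it triggers the BUG is untested.

```shell
#!/bin/bash
# Walk a directory tree and lstat() every entry from several processes
# at once. "stat" without -L does not follow symlinks, so it uses lstat().
# Arguments: directory, number of parallel workers, number of passes.
stat_storm() {
    local dir=$1 jobs=${2:-$(nproc)} passes=${3:-1} i
    for ((i = 0; i < passes; i++)); do
        # -P runs up to $jobs stat processes in parallel, 64 paths each.
        find "$dir" -print0 |
            xargs -0 -r -n 64 -P "$jobs" stat --format=%n >/dev/null || return 1
    done
}
```

For example, "stat_storm /path/to/linux 8 100" on the OCFS2 volume. Repeated passes are likely to be answered from the dentry cache without reaching ocfs2_lookup, so it may be necessary to drop the cache between passes (as root: echo 2 > /proc/sys/vm/drop_caches) or to run the script alongside the git checkout loop.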

According to the back trace, I think the bug is caused by the logic of holding a lock.

If so, I think the bug will recur even without DRBD, LVM or other components.

Regards,
Larry

On 07/17/2018 04:11 PM, Daniel Sobe wrote:
> Hi Larry,
> 
> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
> 
> The only additional simplification that I can think of are the layers on top of the SSD. Currently I have:
> 
> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
> 
> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
> 
> Regards,
> 
> Daniel
> 
> 
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Dienstag, 17. Juli 2018 04:54
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> Could you please simplify your environment?
> Can I use several virtual machines to reproduce the bug?
> 
> Thanks
> Larry
> 
> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>> Hi,
>>
>> the same issue happens with 4.17.6 kernel from Debian unstable.
>>
>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>
>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>
>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
>> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
>> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
>> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
>> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
>> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
>> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
>> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
>> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
>> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
>> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
>> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
>> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
>> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
>> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
>> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
>> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
>> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
>> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
>> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
>>
>> -----Original Message-----
>> From: ocfs2-devel-bounces at oss.oracle.com
>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
>> Sent: Freitag, 13. Juli 2018 13:56
>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Larry,
>>
>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSD and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess since I did not change the default and did not do anything explicitly).
>>
>> Again I was using a LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported was while doing "git checkout", triggering a lot of change in workspace files.
>>
>> Next I will be using kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to exclude that the issue is related to namespaces, control groups and whatever else is only present in a container.
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Freitag, 13. Juli 2018 11:49
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Thanks for your effort to reproduce the bug.
>> I can confirm that there exists more than one bug.
>> I'll focus on this interesting issue.
>>
>>
>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>
>> This is caused by ocfs2_dentry_lock failing.
>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
>>
>> But why it failed still confuses me.
>>
>>
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>
>>> As this occurred while compiling C code with "-j" I think we were on the wrong track, it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), who claimed the issue went away after he restricted to 1 active CPU core.
>>>
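One way to probe the multicore hypothesis above is to generate many concurrent lstat() calls against a directory tree on the OCFS2 volume, since the quoted traces enter the filesystem through a stat-type syscall. The following is a minimal Python sketch of such a load generator. It is an assumption that this kind of load is relevant at all; it is not a confirmed reproducer, and lookups served from the dentry cache may never reach the filesystem's lookup code. The function names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_paths(root):
    """Collect every directory and file path below root."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def lstat_storm(root, workers=8, rounds=10):
    """Run `rounds` passes of lstat() over the tree on `workers` threads.

    Returns the total number of successful lstat() calls.
    """
    paths = list_paths(root)

    def one_pass(_round):
        done = 0
        for path in paths:
            try:
                os.lstat(path)
                done += 1
            except FileNotFoundError:
                # The tree may change underneath us (e.g. during a checkout).
                pass
        return done

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(one_pass, range(rounds)))
```

For example, lstat_storm("/path/to/repo", workers=16, rounds=100) could be run while a checkout or a parallel build is in progress; the path is a placeholder.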
>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>
>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Friday, 11 May 2018 09:01
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>
>>>> * Install the LXC packages from the distribution
>>>> * run the command "lxc-create -n test1 -t download"
>>>> ** first run might prompt you to generate a ~/.config/lxc/default.conf to define UID mappings
>>>> ** in a corporate environment it might be tricky to set the http_proxy (and maybe even https_proxy) environment variables correctly
>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>> * the container downloads to ~/.local/share/lxc/
>>>> * adapt the "config" file in that directory to add the shared ocfs2 mount like in my example below
>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>> ** if you want to finally enable networking, most distributions arrange a dedicated bridge (lxcbr0) which you can configure similar to my example below
>>>> ** in my case I had to install cgroup related tools and reboot to have all cgroups available, and to allow use of lxcbr0 bridge in /etc/lxc/lxc-usernet
>>>>
>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. I don't know the conditions under which this will occur, unfortunately.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Thursday, 12 April 2018 11:20
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Quite an interesting issue.
>>>>
>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>
>>>> Do you have a script to build up your lxc environment?
>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>
>>>> Thanks,
>>>> Larry
>>>>
>>>>
>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> not sure if it helps, the issue wasn't there with Debian 8 and kernel 3.16 - but that's a long history. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Thursday, 12 April 2018 05:17
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Thanks for your report.
>>>>> I'll try to reproduce this bug as you did.
>>>>>
>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>
>>>>> Thanks
>>>>> Larry
>>>>>
>>>>>
>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> below is an example config file like the one I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory as you can see at the "lxc.mount.entry" line.
>>>>>>
>>>>>> Meanwhile I'm trying whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success, will report once anything happens.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> ----
>>>>>>
>>>>>> # Distribution configuration
>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>> lxc.arch = x86_64
>>>>>>
>>>>>> # Container specific configuration
>>>>>> lxc.id_map = u 0 624288 65536
>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>
>>>>>> lxc.utsname = container1
>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>
>>>>>> lxc.network.type = veth
>>>>>> lxc.network.flags = up
>>>>>> lxc.network.link = bridge1
>>>>>> lxc.network.name = eth0
>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>
>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>
>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>
>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>
>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>
>>>>>> ----
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Wednesday, 11 April 2018 13:31
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>
>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>> I thought there was a way to share a directory between the host and a docker container, like
>>>>>>        docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>> That's different from yours.
>>>>>>
>>>>>> How did you set up your lxc container?
>>>>>>
>>>>>> If you could show me the procedure, I'll try to reproduce it.
>>>>>>
>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points of the local host, will the problem recur?
>>>>>>
>>>>>> Regards,
>>>>>> Larry
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>
>>> Sorry for this delayed reply.
>>>
>>> I tried lxc + ocfs2 with the mount shared the way you described.
>>>
>>> But I cannot reproduce your bug.
>>>
>>> I am using openSUSE Tumbleweed.
>>>
>>> The procedure I used to try to reproduce your bug:
>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>        mount /dev/xxx /mnt
>>>    then it shows
>>>        207 65 254:16 / /mnt rw,relatime shared:94
>>>    I think this *shared* is what you want. This mount point will be shared within multiple namespaces.
>>>
>>> 1. Start Virtual Machine Manager.
>>> 2. Add a local LXC connection by clicking File > Add Connection.
>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>> 3. Select the localhost (LXC) connection and click File > New Virtual Machine.
>>> 4. Activate Application container and click Forward.
>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container. Click Forward.
>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>>
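The "shared:94" tag in step 0 looks like the optional propagation field of a /proc/self/mountinfo entry. Assuming that is where the quoted line comes from, a small Python sketch can pull the peer group out of such a line. The function name is hypothetical, and the parser accepts both a full mountinfo line and a line cut off before the " - " separator like the one quoted in step 0.

```python
def parse_mountinfo_line(line):
    """Parse the leading fields of one /proc/<pid>/mountinfo line.

    Field order: mount ID, parent ID, major:minor, root, mount point,
    mount options, zero or more optional fields, then a "-" separator.
    """
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("not a mountinfo line: %r" % line)

    # Optional fields run from index 6 up to the "-" separator (or the
    # end of the line if the separator was cut off).
    optional = []
    for field in fields[6:]:
        if field == "-":
            break
        optional.append(field)

    shared = None
    for tag in optional:
        if tag.startswith("shared:"):
            shared = int(tag.split(":", 1)[1])

    return {
        "mount_id": int(fields[0]),
        "parent_id": int(fields[1]),
        "device": fields[2],
        "root": fields[3],
        "mount_point": fields[4],
        "options": fields[5].split(","),
        "shared_peer_group": shared,
    }
```

Calling it on the line from step 0 gives mount point /mnt and peer group 94; a mount without a "shared:" tag yields None for the peer group.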
>>> If possible, could you please provide a shell script to show what you did with your mount point?
>>>
>>> Thanks
>>> Larry
>>>
>>
>>
>> _______________________________________________
>> Ocfs2-devel mailing list
>> Ocfs2-devel at oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-09-11 11:35                                   ` Daniel Sobe
@ 2018-09-12  7:03                                     ` Larry Chen
  2019-02-20  8:48                                     ` Daniel Sobe
  1 sibling, 0 replies; 32+ messages in thread
From: Larry Chen @ 2018-09-12  7:03 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,

Thanks for your report.

I'm looking into this bug.

Regards,
Larry


On 09/11/2018 07:35 PM, Daniel Sobe wrote:
> Hi Larry,
> 
> I tested your script and indeed it does not provoke the error. In the meantime I switched to a newer kernel, which makes the bug harder to provoke; here is the stack trace:
> 
> Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
> Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
> Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
> Sep 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted 4.18.0-1-amd64 #1 Debian 4.18.6-1
> Sep 11 13:08:51 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
> Sep 11 13:08:51 drs1p002 kernel: Call Trace:
> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  ? d_splice_alias+0x299/0x410
> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  __lookup_slow+0x97/0x150
> Sep 11 13:08:51 drs1p002 kernel:  lookup_slow+0x35/0x50
> Sep 11 13:08:51 drs1p002 kernel:  walk_component+0x1c4/0x480
> Sep 11 13:08:51 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
> Sep 11 13:08:51 drs1p002 kernel:  ? path_init+0x177/0x2f0
> Sep 11 13:08:51 drs1p002 kernel:  path_lookupat+0x84/0x1f0
> Sep 11 13:08:51 drs1p002 kernel:  filename_lookup+0xb6/0x190
> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
> Sep 11 13:08:51 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
> Sep 11 13:08:51 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
> Sep 11 13:08:51 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
> Sep 11 13:08:51 drs1p002 kernel:  vfs_statx+0x73/0xe0
> Sep 11 13:08:51 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
> Sep 11 13:08:51 drs1p002 kernel:  do_syscall_64+0x55/0x110
> Sep 11 13:08:51 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995
> Sep 11 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4 0c 00 f7 d8 64 89
> Sep 11 13:08:51 drs1p002 kernel: RSP: 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Sep 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f2434e20390 RCX: 00007f24b6cc5995
> Sep 11 13:08:51 drs1p002 kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI: 00007f24640dd9d0
> Sep 11 13:08:51 drs1p002 kernel: RBP: 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800
> Sep 11 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246 R12: 00007f24640dd9d0
> Sep 11 13:08:51 drs1p002 kernel: R13: 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800
> Sep 11 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me drm_kms_helper snd_timer snd soundcore pcspkr serio_raw efivars drm mei lpc_ich i2c_algo_bit sg ie31200_edac video pcc_cpufreq button drbd lru_cache libcrc32c parport_pc sunrpc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16
> Sep 11 13:08:51 drs1p002 kernel:  mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801 libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod usbcore e1000e usb_common thermal
> Sep 11 13:08:51 drs1p002 kernel: ---[ end trace feba92ba6e432478 ]---
> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
> 
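Since the thread compares traces from several kernels, it can help to reduce each one to its BUG location and faulting symbol. Below is a minimal Python sketch; the regular expressions are written against the syslog lines quoted in this thread only, and the function name is hypothetical.

```python
import re

# "kernel BUG at <source path>:<line>!"
BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")

# "RIP: 0010:symbol+0x9c/0xb0" or the older "RIP: symbol+0x9d/0xb0".
# A user-space RIP such as "RIP: 0033:0x7f24b6cc5995" has no symbol
# and is deliberately not matched.
RIP_RE = re.compile(
    r"RIP: (?:[0-9a-f]{4}:)?(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_oops(lines):
    """Return the first BUG location and first kernel RIP symbol found."""
    summary = {"bug_at": None, "rip_symbol": None}
    for line in lines:
        if summary["bug_at"] is None:
            match = BUG_RE.search(line)
            if match:
                summary["bug_at"] = (match.group("path"), int(match.group("line")))
        if summary["rip_symbol"] is None:
            match = RIP_RE.search(line)
            if match:
                summary["rip_symbol"] = match.group("symbol")
    return summary
```

Feeding it the lines of the trace quoted above yields the dlmglue.c path with line 847 and the symbol __ocfs2_cluster_unlock.isra.39. Line numbers and the .isra suffix differ between kernel builds, so they are only comparable within one build.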
> 
> All I can say is that I was using Git heavily when this happened (in Eclipse, synchronizing the Git workspace). It took me around 30 minutes to see the bug again.
> 
> Regards,
> 
> Daniel
> 
> -----Original Message-----
> From: Larry Chen <lchen@suse.com>
> Sent: Wednesday, 18 July 2018 10:09
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> Which stack do you use? dlm or o2cb?
> 
> I tried to reproduce the bug.
> 
> I have set up 2 virtual machines that share one block device (a qcow2 file on the host). I was using the dlm stack instead of o2cb. The kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script.
> 
> #!/bin/bash
> # Check out every tag in turn to generate heavy churn in the work tree.
> for i in $(git tag)
> do
>     echo "$i"
>     git checkout "$i"
> done
> 
> Bug could not be reproduced.
> 
> According to the back trace, I think the bug is caused by the logic of holding a lock.
> 
> If so, I think the bug will recur even without DRBD, LVM or other components.
> 
> Regards,
> Larry
> 
> On 07/17/2018 04:11 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
>>
>> The only additional simplification that I can think of are the layers on top of the SSD. Currently I have:
>>
>> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
>>
>> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Tuesday, 17 July 2018 04:54
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Could you please simplify your environment?
>> Can I use several virtual machines to reproduce the bug?
>>
>> Thanks
>> Larry
>>
>> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>>> Hi,
>>>
>>> the same issue happens with 4.17.6 kernel from Debian unstable.
>>>
>>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>>
>>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>>
>>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
>>> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
>>> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
>>> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
>>> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
>>> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
>>> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
>>> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
>>> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
>>> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
>>> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>>> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
>>> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
>>> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
>>> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
>>> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
>>> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
>>> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
>>>
>>> -----Original Message-----
>>> From: ocfs2-devel-bounces at oss.oracle.com
>>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
>>> Sent: Friday, 13 July 2018 13:56
>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Larry,
>>>
>>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
>>>
>>> Again I was using a LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported was while doing "git checkout", triggering a lot of change in workspace files.
>>>
>>> Next I will use kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to rule out that the issue is related to namespaces, control groups, or anything else that is only present in a container.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Friday, 13 July 2018 11:49
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Thanks for your effort to reproduce the bug.
>>> I can confirm that there is more than one bug.
>>> I'll focus on this interesting issue.
>>>
>>>
>>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>>
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>>
>>> This is caused by ocfs2_dentry_lock failing.
>>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
>>>
>>> But why it failed still confuses me.
>>>
>>>
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>>
>>>> As this occurred while compiling C code with "-j" I think we were on the wrong track, it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), who claimed the issue went away after he restricted to 1 active CPU core.
>>>>
>>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>>
>>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Freitag, 11. Mai 2018 09:01
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>>
>>>>> * Install the LXC packages from the distribution
>>>>> * run the command "lxc-create -n test1 -t download"
>>>>> ** first run might prompt you to generate a
>>>>> ~/.config/lxc/default.conf to define UID mappings
>>>>> ** in a corporate environment it might be tricky to set the
>>>>> http_proxy (and maybe even https_proxy) environment variables
>>>>> correctly
>>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>>> * the container downloads to ~/.local/share/lxc/
>>>>> * adapt the "config" file in that directory to add the shared ocfs2
>>>>> mount like in my example below
>>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>>> ** if you want to finally enable networking, most distributions
>>>>> arrange a dedicated bridge (lxcbr0) which you can configure similar
>>>>> to my example below
>>>>> ** in my case I had to install cgroup related tools and reboot to
>>>>> have all cgroups available, and to allow use of lxcbr0 bridge in
>>>>> /etc/lxc/lxc-usernet
>>>>>
>>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which it occurs.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Donnerstag, 12. April 2018 11:20
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Quite an interesting issue.
>>>>>
>>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>>
>>>>> Do you have a script to build up your lxc environment?
>>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>>
>>>>> Thanks,
>>>>> Larry
>>>>>
>>>>>
>>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> not sure if it helps: the issue wasn't there with Debian 8 and
>>>>>> kernel 3.16, but that's a long history. Unfortunately, the only
>>>>>> machine where I could try to bisect does not run any kernel < 4.16
>>>>>> without other issues.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Donnerstag, 12. April 2018 05:17
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> Thanks for your report.
>>>>>> I'll try to reproduce this bug as you did.
>>>>>>
>>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>>
>>>>>> Thanks
>>>>>> Larry
>>>>>>
>>>>>>
>>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> below is an example config file of the kind I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>>>
>>>>>>> Meanwhile I'm checking whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far without success; I will report once anything happens.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>> # Distribution configuration
>>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>>> lxc.arch = x86_64
>>>>>>>
>>>>>>> # Container specific configuration
>>>>>>> lxc.id_map = u 0 624288 65536
>>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>>
>>>>>>> lxc.utsname = container1
>>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>>
>>>>>>> lxc.network.type = veth
>>>>>>> lxc.network.flags = up
>>>>>>> lxc.network.link = bridge1
>>>>>>> lxc.network.name = eth0
>>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>>
>>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>>
>>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>>
>>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>>
>>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>> Sent: Mittwoch, 11. April 2018 13:31
>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>>> Hi Larry,
>>>>>>>>
>>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>>
>>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>>> I was thinking of the way a directory is shared between the host and a docker container, like
>>>>>>>         docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>>> That's different from yours.
>>>>>>>
>>>>>>> How did you set up your lxc or container?
>>>>>>>
>>>>>>> If you could show me the procedure, I'll try to reproduce it.
>>>>>>>
>>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points of the local host, will the problem recur?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Larry
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>
>>>> Sorry for this delayed reply.
>>>>
>>>> I tried with lxc + ocfs2 in your mount-shared way.
>>>>
>>>> But I can not reproduce your bugs.
>>>>
>>>> What I use is openSUSE Tumbleweed.
>>>>
>>>> The procedure I used to try to reproduce your bug:
>>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>>        mount /dev/xxx /mnt
>>>>    Then it shows
>>>>        207 65 254:16 / /mnt rw,relatime shared:94
>>>>    I think this *shared* is what you want. This mount point will be shared within multiple namespaces.
>>>>
>>>> 1. Start Virtual Machine Manager.
>>>> 2. Add a local LXC connection by clicking File > Add Connection.
>>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>>> 3. Select the localhost (LXC) connection and click the File > New Virtual Machine menu.
>>>> 4. Activate Application container and click Forward.
>>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container. Click Forward.
>>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>>>
>>>> If possible, could you please provide a shell script that shows what you did with your mount point?
>>>>
>>>> Thanks
>>>> Larry
>>>>
>>>
>>>
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel at oss.oracle.com
>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>
>>
> 


* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2018-09-11 11:35                                   ` Daniel Sobe
  2018-09-12  7:03                                     ` Larry Chen
@ 2019-02-20  8:48                                     ` Daniel Sobe
  2019-03-18 17:45                                       ` Wengang
  1 sibling, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2019-02-20  8:48 UTC (permalink / raw)
  To: ocfs2-devel

Hi Larry,

The issue still happens with 4.19 as well, but it took quite a while to trigger it:

Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24018 Comm: git Not tainted 4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc258ff8d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc258ff8e50 RCX: 00007fc2622d80f5
Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc258ff8d40 RSI: 00007fc258ff8d40 RDI: 00007fc2300008c0
Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000045 R08: 0000000000000003 R09: 0000000000000000
Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000d R14: 0000000000000015 R15: 000055ec17f94d58
Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe1 ]---
Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#2] SMP PTI
Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24024 Comm: git Tainted: G      D           4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c8177af8 EFLAGS: 00010046
Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fdf3b23418 RCX: 0000000000000000
Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fdf3b23418 RDI: ffff95fdf3b23494
Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: ffff95fe91aa2620 R09: 0000000000000089
Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c8177ae0 R11: ffff95fe6e3efb40 R12: ffff95fdf3b23494
Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 20 09:37:56 drs1p001 kernel: Call Trace:
Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel:  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x29d/0x3f0
Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc24d7f9d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc24d7f9e50 RCX: 00007fc2622d80f5
Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc24d7f9d40 RSI: 00007fc24d7f9d40 RDI: 00007fc2100008c0
Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000044 R08: 0000000000000003 R09: 0000000000000000
Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000f R14: 000000000000001d R15: 000055ec18015008
Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe2 ]---
Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
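
Since the BUG line number in dlmglue.c differs between the kernels reported in this thread (825, 847, 849), a small filter can help compare logs. The following is only a sketch: the function name summarize_bugs is made up for illustration, and it assumes syslog-style input in which the source location is the last field of each "kernel BUG at" line.

```shell
#!/bin/bash
# Hypothetical helper: count "kernel BUG at" reports per source location.
# Reads syslog lines on stdin and prints "<count> <file>:<line>" per location.
summarize_bugs() {
    awk '/kernel BUG at / {
             loc = $NF              # last field, e.g. ".../dlmglue.c:849!"
             sub(/!$/, "", loc)     # drop the trailing "!"
             n[loc]++
         }
         END { for (l in n) print n[l], l }' | sort -k2
}
```

It could be fed from a saved log, for example `summarize_bugs < /var/log/kern.log`; that path is an assumption and depends on the local syslog configuration.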

Regards,

Daniel

-----Original Message-----
From: Daniel Sobe 
Sent: Dienstag, 11. September 2018 13:36
To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
Subject: RE: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Larry,

I tested your script and indeed it does not provoke the error. Meanwhile I have switched to a newer kernel, which makes the error harder to provoke. Here is the stack trace:

Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
Sep 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted 4.18.0-1-amd64 #1 Debian 4.18.6-1
Sep 11 13:08:51 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
Sep 11 13:08:51 drs1p002 kernel: Call Trace:
Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel:  ? d_splice_alias+0x299/0x410
Sep 11 13:08:51 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel:  __lookup_slow+0x97/0x150
Sep 11 13:08:51 drs1p002 kernel:  lookup_slow+0x35/0x50
Sep 11 13:08:51 drs1p002 kernel:  walk_component+0x1c4/0x480
Sep 11 13:08:51 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
Sep 11 13:08:51 drs1p002 kernel:  ? path_init+0x177/0x2f0
Sep 11 13:08:51 drs1p002 kernel:  path_lookupat+0x84/0x1f0
Sep 11 13:08:51 drs1p002 kernel:  filename_lookup+0xb6/0x190
Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
Sep 11 13:08:51 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
Sep 11 13:08:51 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
Sep 11 13:08:51 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
Sep 11 13:08:51 drs1p002 kernel:  vfs_statx+0x73/0xe0
Sep 11 13:08:51 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
Sep 11 13:08:51 drs1p002 kernel:  do_syscall_64+0x55/0x110
Sep 11 13:08:51 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995
Sep 11 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4 0c 00 f7 d8 64 89
Sep 11 13:08:51 drs1p002 kernel: RSP: 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Sep 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f2434e20390 RCX: 00007f24b6cc5995
Sep 11 13:08:51 drs1p002 kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI: 00007f24640dd9d0
Sep 11 13:08:51 drs1p002 kernel: RBP: 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800
Sep 11 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246 R12: 00007f24640dd9d0
Sep 11 13:08:51 drs1p002 kernel: R13: 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800
Sep 11 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me drm_kms_helper snd_timer snd soundcore pcspkr serio_raw efivars drm mei lpc_ich i2c_algo_bit sg ie31200_edac video pcc_cpufreq button drbd lru_cache libcrc32c parport_pc sunrpc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16
Sep 11 13:08:51 drs1p002 kernel:  mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801 libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod usbcore e1000e usb_common thermal
Sep 11 13:08:51 drs1p002 kernel: ---[ end trace feba92ba6e432478 ]---
Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0


All I can say is that I was using Git heavily when this happened (in Eclipse, synchronizing the Git workspace). It took me around 30 minutes to see the bug again.
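
A rough sketch of how a comparable load might be generated without Eclipse: several workers list a directory tree in parallel with "ls -lR", which has to stat every entry, similar to the stat-heavy path visible in the trace. The function name stress_lstat, the worker and round counts, and the example path are all assumptions, and it is not confirmed that this triggers the BUG (entries already in the dentry cache may never reach ocfs2_lookup).

```shell
#!/bin/bash
# Hypothetical helper: stress_lstat DIR [WORKERS] [ROUNDS]
# Starts WORKERS background jobs; each one lists DIR recursively ROUNDS times.
stress_lstat() {
    dir=$1
    workers=${2:-4}
    rounds=${3:-10}
    pids=""
    w=0
    while [ "$w" -lt "$workers" ]; do
        (
            r=0
            while [ "$r" -lt "$rounds" ]; do
                # ls -l must stat each entry it prints; the listing is discarded
                ls -lR "$dir" > /dev/null || exit 1
                r=$((r + 1))
            done
        ) &
        pids="$pids $!"
        w=$((w + 1))
    done
    rc=0
    for pid in $pids; do
        # a failing worker makes the whole run report failure
        wait "$pid" || rc=1
    done
    return "$rc"
}
```

For example, `stress_lstat /home/user/repo 8 100` would run eight workers with 100 passes each over a repository on the OCFS2 volume (the path is a placeholder).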

Regards,

Daniel

-----Original Message-----
From: Larry Chen <lchen@suse.com> 
Sent: Mittwoch, 18. Juli 2018 10:09
To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi Daniel,

Which stack do you use, dlm or o2cb?

I tried to reproduce the bug.

I have set up 2 virtual machines that share one block device(as a qcow2 file on host). And I was using dlm stack instead of o2cb. Kernel version is 4.12.14. I clone linux kernel tree from github and execute the following shell script.

#!/bin/bash
# Check out every tag of the kernel tree in turn to generate heavy file I/O.
for i in $(git tag)
do
    echo "$i"
    git checkout "$i"
done

Bug could not be reproduced.

Judging from the backtrace, I think the bug is caused by the lock-holding logic.

If that is right, the bug should recur even without DRBD, LVM, or other components.

Regards,
Larry

On 07/17/2018 04:11 PM, Daniel Sobe wrote:
> Hi Larry,
> 
> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
> 
> The only additional simplification I can think of is to remove some of the layers on top of the SSD. Currently I have:
> 
> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
> 
> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
> 
> Regards,
> 
> Daniel
> 
> 
> -----Original Message-----
> From: Larry Chen [mailto:lchen at suse.com]
> Sent: Tuesday, 17 July 2018 04:54
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
> 
> Hi Daniel,
> 
> Could you please simplify your environment?
> Can I use several virtual machines to reproduce the bug?
> 
> Thanks
> Larry
> 
> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>> Hi,
>>
>> the same issue happens with 4.17.6 kernel from Debian unstable.
>>
>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>
>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>
>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>
>> Regards,
>>
>> Daniel
>>
>>
>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
>> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
>> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
>> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
>> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
>> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
>> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
>> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
>> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
>> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
>> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
>> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
>> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
>> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
>> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
>> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
>> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
>> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
>> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
>> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
>>
>> -----Original Message-----
>> From: ocfs2-devel-bounces at oss.oracle.com
>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
>> Sent: Friday, 13 July 2018 13:56
>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Larry,
>>
>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
>>
>> Again I was using an LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported happened while doing "git checkout", which triggers a lot of changes in workspace files.
>>
>> Next I will be using kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to rule out that the issue is related to namespaces, control groups and whatever else is only present in a container.
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Friday, 13 July 2018 11:49
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Thanks for your effort to reproduce the bug.
>> I can confirm that there is more than one bug.
>> I'll focus on this interesting issue.
>>
>>
>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>
>> This is caused by a failed ocfs2_dentry_lock call.
>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
>>
>> Why it failed still confuses me, though.
>>
>>
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>
>>> As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), whose author claimed the issue went away after he restricted the system to 1 active CPU core.
>>>
>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>
>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Friday, 11 May 2018 09:01
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> this is, in a nutshell, what I do to create an LXC container as an "ordinary user":
>>>>
>>>> * Install the LXC packages from the distribution
>>>> * run the command "lxc-create -n test1 -t download"
>>>> ** the first run might prompt you to generate a ~/.config/lxc/default.conf to define UID mappings
>>>> ** in a corporate environment it might be tricky to set the http_proxy (and maybe even https_proxy) environment variables correctly
>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>> * the container downloads to ~/.local/share/lxc/
>>>> * adapt the "config" file in that directory to add the shared ocfs2 mount like in my example below
>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm with "lxc-ls --fancy", and you can attach to the container with "lxc-attach -n test1"
>>>> ** if you want to finally enable networking, most distributions arrange a dedicated bridge (lxcbr0) which you can configure similar to my example below
>>>> ** in my case I had to install cgroup related tools and reboot to have all cgroups available, and to allow use of the lxcbr0 bridge in /etc/lxc/lxc-usernet
>>>>
>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Thursday, 12 April 2018 11:20
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Quite an interesting issue.
>>>>
>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>
>>>> Do you have a script to build up your lxc environment?
>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>
>>>> Thanks,
>>>> Larry
>>>>
>>>>
>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> not sure if it helps, but the issue wasn't there with Debian 8 and kernel 3.16 - though that's a long history. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Thursday, 12 April 2018 05:17
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Thanks for your report.
>>>>> I'll try to reproduce this bug as you did.
>>>>>
>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>
>>>>> Thanks
>>>>> Larry
>>>>>
>>>>>
>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> below is an example config file of the kind I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>>
>>>>>> Meanwhile I'm checking whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far without success; I will report once anything happens.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> ----
>>>>>>
>>>>>> # Distribution configuration
>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>> lxc.arch = x86_64
>>>>>>
>>>>>> # Container specific configuration
>>>>>> lxc.id_map = u 0 624288 65536
>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>
>>>>>> lxc.utsname = container1
>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>
>>>>>> lxc.network.type = veth
>>>>>> lxc.network.flags = up
>>>>>> lxc.network.link = bridge1
>>>>>> lxc.network.name = eth0
>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>
>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>
>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>
>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>
>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>
>>>>>> ----
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Wednesday, 11 April 2018 13:31
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>
>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>> I thought there was a way to share a directory between the host and a Docker container, like
>>>>>>     docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>> That's different from yours.
>>>>>>
>>>>>> How did you set up your LXC container?
>>>>>>
>>>>>> If you could show me the procedure, I'll try to reproduce it.
>>>>>>
>>>>>> And by the way, if you get rid of LXC and just mount ocfs2 on several different mount points on the local host, will the problem recur?
>>>>>>
>>>>>> Regards,
>>>>>> Larry
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>
>>> Sorry for the delayed reply.
>>>
>>> I tried LXC + ocfs2 with the mount shared the way you describe.
>>>
>>> But I cannot reproduce your bug.
>>>
>>> I am using openSUSE Tumbleweed.
>>>
>>> The procedure I used to try to reproduce your bug:
>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>        mount /dev/xxx /mnt
>>>    It then shows
>>>        207 65 254:16 / /mnt rw,relatime shared:94
>>>    I think this *shared* is what you want. This mount point will be shared across multiple namespaces.
>>>
>>> 1. Start Virtual Machine Manager.
>>> 2. Add a local LXC connection by clicking File > Add Connection.
>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>> 3. Select the localhost (LXC) connection and click File > New Virtual Machine.
>>> 4. Activate Application container and click Forward.
>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>>    Click Forward.
>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>>
>>> If possible, could you please provide a shell script showing what you did with your mount point?
>>>
>>> Thanks
>>> Larry
>>>
>>
>>
>> _______________________________________________
>> Ocfs2-devel mailing list
>> Ocfs2-devel at oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2019-02-20  8:48                                     ` Daniel Sobe
@ 2019-03-18 17:45                                       ` Wengang
  2019-03-26 12:27                                         ` Daniel Sobe
  0 siblings, 1 reply; 32+ messages in thread
From: Wengang @ 2019-03-18 17:45 UTC (permalink / raw)
  To: ocfs2-devel

Hi,

I also see this problem on an older kernel, 4.1.12.xxx.

The l_ro_holders counter is changed in the ocfs2 layer, not the DLM layer. Another 
thing is that the dentry lock is only used to get notified when a file is 
deleted remotely. So the code requests a PR lock and then releases 
it once PR is granted. I am not sure, but I feel this is not a DLM 
issue but a memory issue on the ocfs2_lock_res. Do you have a vmcore 
available for this problem?
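
To illustrate the kind of check that fires in these traces, here is a user-space toy model in C. It is not the code from fs/ocfs2/dlmglue.c. It assumes that an unlock decrements a per-level holder counter and that the BUG corresponds to finding that counter already at zero, which would fit RDX being 0 in the register dumps. All names (toy_lock_res, toy_lock, toy_unlock, fail_next_lock and the two toy_attach functions) are invented for illustration. The model only shows the unbalanced-unlock case; it says nothing about whether the real counter was zero because a lock attempt failed or because the memory was damaged.

```c
#include <stdbool.h>

/* Toy lock levels; the values follow the usual DLM numbering (PR = 3, EX = 5). */
enum toy_level { TOY_PR = 3, TOY_EX = 5 };

struct toy_lock_res {
    unsigned int ro_holders;    /* plays the role of l_ro_holders */
    unsigned int ex_holders;    /* plays the role of l_ex_holders */
    bool fail_next_lock;        /* hypothetical hook: make the next lock fail */
};

/* Take a holder reference. Returns 0 on success, -1 on simulated failure. */
static int toy_lock(struct toy_lock_res *res, enum toy_level level)
{
    if (res->fail_next_lock) {
        res->fail_next_lock = false;
        return -1;              /* failure path: no holder was counted */
    }
    if (level == TOY_EX)
        res->ex_holders++;
    else
        res->ro_holders++;
    return 0;
}

/* Drop a holder reference. Returns false where a kernel assertion would fire. */
static bool toy_unlock(struct toy_lock_res *res, enum toy_level level)
{
    unsigned int *count =
        (level == TOY_EX) ? &res->ex_holders : &res->ro_holders;

    if (*count == 0)
        return false;           /* unlock without a matching holder */
    (*count)--;
    return true;
}

/* Unbalanced: unlocks even when the lock attempt failed. */
static bool toy_attach_unbalanced(struct toy_lock_res *res)
{
    (void)toy_lock(res, TOY_PR);
    return toy_unlock(res, TOY_PR);
}

/* Balanced: unlocks only what was actually locked. */
static bool toy_attach_balanced(struct toy_lock_res *res)
{
    if (toy_lock(res, TOY_PR) != 0)
        return true;            /* nothing was taken, so nothing to undo */
    return toy_unlock(res, TOY_PR);
}
```

With fail_next_lock set, toy_attach_unbalanced() reports the zero-holder condition while toy_attach_balanced() does not; without it, both leave ro_holders at 0 and report no problem.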

Thanks,
Wengang


On 02/20/2019 12:48 AM, Daniel Sobe wrote:
> Hi Larry,
>
> The issue still happens with 4.19 as well, but it took quite a while to trigger it:
>
> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24018 Comm: git Not tainted 4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
> Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
> Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc258ff8d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc258ff8e50 RCX: 00007fc2622d80f5
> Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc258ff8d40 RSI: 00007fc258ff8d40 RDI: 00007fc2300008c0
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000045 R08: 0000000000000003 R09: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000d R14: 0000000000000015 R15: 000055ec17f94d58
> Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe1 ]---
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#2] SMP PTI
> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24024 Comm: git Tainted: G      D           4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c8177af8 EFLAGS: 00010046
> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fdf3b23418 RCX: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fdf3b23418 RDI: ffff95fdf3b23494
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: ffff95fe91aa2620 R09: 0000000000000089
> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c8177ae0 R11: ffff95fe6e3efb40 R12: ffff95fdf3b23494
> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Feb 20 09:37:56 drs1p001 kernel: Call Trace:
> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x29d/0x3f0
> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
> Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
> Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc24d7f9d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc24d7f9e50 RCX: 00007fc2622d80f5
> Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc24d7f9d40 RSI: 00007fc24d7f9d40 RDI: 00007fc2100008c0
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000044 R08: 0000000000000003 R09: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000f R14: 000000000000001d R15: 000055ec18015008
> Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe2 ]---
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>
> Regards,
>
> Daniel
>
> -----Original Message-----
> From: Daniel Sobe
> Sent: Tuesday, 11 September 2018 13:36
> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
> Subject: RE: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Hi Larry,
>
> I tested your script, and indeed it does not provoke the error. Meanwhile I have switched to a newer kernel, which makes the bug harder to provoke. Here is the stack trace:
>
> Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
> Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
> Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
> Sep 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted 4.18.0-1-amd64 #1 Debian 4.18.6-1
> Sep 11 13:08:51 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
> Sep 11 13:08:51 drs1p002 kernel: Call Trace:
> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  ? d_splice_alias+0x299/0x410
> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  __lookup_slow+0x97/0x150
> Sep 11 13:08:51 drs1p002 kernel:  lookup_slow+0x35/0x50
> Sep 11 13:08:51 drs1p002 kernel:  walk_component+0x1c4/0x480
> Sep 11 13:08:51 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
> Sep 11 13:08:51 drs1p002 kernel:  ? path_init+0x177/0x2f0
> Sep 11 13:08:51 drs1p002 kernel:  path_lookupat+0x84/0x1f0
> Sep 11 13:08:51 drs1p002 kernel:  filename_lookup+0xb6/0x190
> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
> Sep 11 13:08:51 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
> Sep 11 13:08:51 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
> Sep 11 13:08:51 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
> Sep 11 13:08:51 drs1p002 kernel:  vfs_statx+0x73/0xe0
> Sep 11 13:08:51 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
> Sep 11 13:08:51 drs1p002 kernel:  do_syscall_64+0x55/0x110
> Sep 11 13:08:51 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995
> Sep 11 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4 0c 00 f7 d8 64 89
> Sep 11 13:08:51 drs1p002 kernel: RSP: 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Sep 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f2434e20390 RCX: 00007f24b6cc5995
> Sep 11 13:08:51 drs1p002 kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI: 00007f24640dd9d0
> Sep 11 13:08:51 drs1p002 kernel: RBP: 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800
> Sep 11 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246 R12: 00007f24640dd9d0
> Sep 11 13:08:51 drs1p002 kernel: R13: 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800
> Sep 11 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me drm_kms_helper snd_timer snd soundcore pcspkr serio_raw efivars drm mei lpc_ich i2c_algo_bit sg ie31200_edac video pcc_cpufreq button drbd lru_cache libcrc32c parport_pc sunrpc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16
> Sep 11 13:08:51 drs1p002 kernel:  mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801 libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod usbcore e1000e usb_common thermal
> Sep 11 13:08:51 drs1p002 kernel: ---[ end trace feba92ba6e432478 ]---
> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>
>
> All I can say is that I was using Git heavily when this happened (synchronizing a Git workspace in Eclipse). It took around 30 minutes for the bug to show up again.
>
> Regards,
>
> Daniel
>
> -----Original Message-----
> From: Larry Chen <lchen@suse.com>
> Sent: Wednesday, 18 July 2018 10:09
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Hi Daniel,
>
> Which stack do you use, dlm or o2cb?
>
> I tried to reproduce the bug.
>
> I have set up 2 virtual machines that share one block device (a qcow2 file on the host), using the dlm stack instead of o2cb. The kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script.
>
> #!/bin/bash
> # Check out every tag in turn to generate heavy file system I/O.
> for i in $(git tag)
> do
>           echo "$i"
>           git checkout "$i"
> done
>
> The bug could not be reproduced.
>
> According to the back trace, I think the bug is caused by the lock-holding logic.
>
> I think the bug will most likely recur even without DRBD, LVM or other components.
>
> Regards,
> Larry
>
> On 07/17/2018 04:11 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> I think that with the most recent crash, I already have a pretty simple environment. All it takes is an OCFS2-formatted /home volume and a Git repository on that volume, which generates a lot of disk I/O upon "git checkout" to switch branches. VMs or containers are no longer involved.
>>
>> The only further simplification I can think of is removing layers on top of the SSD. Currently I have:
>>
>> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
>>
>> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Tuesday, 17 July 2018 04:54
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Could you please simplify your environment?
>> Can I use several virtual machines to reproduce the bug?
>>
>> Thanks
>> Larry
>>
>> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>>> Hi,
>>>
>>> the same issue happens with the 4.17.6 kernel from Debian unstable.
>>>
>>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers, or the like.
>>>
>>> All I did was to run "git checkout" again on a Git repository that is placed on an OCFS2 volume.
>>>
>>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
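One thing that fits into a window of about two minutes is saving the kernel log to a file system other than the affected OCFS2 volume. The sketch below is only an illustration under assumptions: the function name capture_oops, the output directory, and the LOG_CMD override are hypothetical, and dmesg may need root privileges on some systems.

```shell
#!/bin/bash
# capture_oops OUTDIR: save the kernel log to a timestamped file below OUTDIR
# and print the resulting file name.
# OUTDIR should live on a local, non-OCFS2 file system.
# LOG_CMD is a hypothetical override for the log source; it defaults to dmesg.
capture_oops() {
    local outdir=$1
    local out
    out="$outdir/oops-$(date +%Y%m%d-%H%M%S).log"
    mkdir -p "$outdir" || return 1
    ${LOG_CMD:-dmesg} > "$out" 2>&1 || return 1
    sync            # flush to disk before the node hangs or reboots
    echo "$out"     # print the file name so it can be copied elsewhere
}
```

For example, capture_oops /var/tmp/ocfs2-oops writes the current kernel ring buffer to a new file in that directory; the path is an arbitrary example.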
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
>>> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
>>> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
>>> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
>>> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
>>> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
>>> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
>>> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
>>> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
>>> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
>>> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>>> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
>>> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
>>> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
>>> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
>>> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
>>> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
>>> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
>>>
>>> -----Original Message-----
>>> From: ocfs2-devel-bounces at oss.oracle.com
>>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
>>> Sent: Friday, 13 July 2018 13:56
>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Larry,
>>>
>>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs partitioned with LVM. I have added 2 logical volumes on each system and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not configure anything explicitly).
>>>
>>> Again I was using an LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" in my OCFS2 home directory. The next crash I reported occurred during "git checkout", which triggers a lot of changes in workspace files.
>>>
>>> Next I will be using kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally, I will work on the PC directly, to rule out that the issue is related to namespaces, control groups, or anything else that is only present in a container.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Friday, 13 July 2018 11:49
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Thanks for your effort to reproduce the bug.
>>> I can confirm that more than one bug exists.
>>> I'll focus on this interesting issue.
>>>
>>>
>>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>>
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>> The BUG here is triggered because ocfs2_dentry_lock failed.
>>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
>>>
>>> But why it failed still puzzles me.
>>>
>>>
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>>
>>>> As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when reporting my issue), whose author claimed the issue went away after restricting the system to 1 active CPU core.
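The traces in this thread all enter OCFS2 through a stat-style lookup issued by a process that runs several threads or jobs (cc1 under "make -j", git, java). A rough way to exercise that pattern is to run several concurrent stat passes over a directory tree. This is only a sketch and not a known reproducer: the function names, worker count, and round count are assumptions, and repeated lookups may be served from the dentry cache without ever reaching ocfs2_lookup.

```shell
#!/bin/bash
# stat_pass DIR: lstat every entry below DIR once (via ls -ld) and print
# the number of entries visited, including DIR itself.
stat_pass() {
    find "$1" -exec ls -ld {} + 2>/dev/null | wc -l
}

# parallel_stat DIR WORKERS ROUNDS: start WORKERS background jobs, each
# running ROUNDS stat passes over DIR, then wait for all of them.
parallel_stat() {
    local dir=$1 workers=$2 rounds=$3 w r
    for ((w = 0; w < workers; w++)); do
        (
            for ((r = 0; r < rounds; r++)); do
                stat_pass "$dir" > /dev/null
            done
        ) &
    done
    wait    # block until every background worker has finished
}
```

For example, parallel_stat /home/user/repo 8 100 would run 8 workers with 100 passes each; the path and numbers are placeholders.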
>>>>
>>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>>
>>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Friday, 11 May 2018 09:01
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> this is, in a nutshell, what I do to create an LXC container as "ordinary user":
>>>>>
>>>>> * Install the LXC packages from the distribution
>>>>> * run the command "lxc-create -n test1 -t download"
>>>>> ** first run might prompt you to generate a ~/.config/lxc/default.conf to define UID mappings
>>>>> ** in a corporate environment it might be tricky to set the http_proxy (and maybe even https_proxy) environment variables correctly
>>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>>> * the container downloads to ~/.local/share/lxc/
>>>>> * adapt the "config" file in that directory to add the shared ocfs2 mount like in my example below
>>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>>> ** if you want to finally enable networking, most distributions arrange a dedicated bridge (lxcbr0) which you can configure similar to my example below
>>>>> ** in my case I had to install cgroup related tools and reboot to have all cgroups available, and to allow use of lxcbr0 bridge in /etc/lxc/lxc-usernet
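The command sequence from the list above can be collected into a small script. This is a sketch only: the commands and the container name test1 are taken from the list, while the run and setup_container helpers and the DRY_RUN switch are assumptions added for illustration. The config file edit for the shared OCFS2 mount remains a manual step.

```shell
#!/bin/bash
# Sketch of the container setup steps. Set DRY_RUN=1 to only print commands.
NAME=${NAME:-test1}   # container name used in the list above

# run CMD...: execute CMD, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_container() {
    # Interactive: asks for distribution, release and architecture.
    run lxc-create -n "$NAME" -t download
    # Manual step: edit ~/.local/share/lxc/$NAME/config and add the shared
    # OCFS2 mount entry before starting the container.
    run lxc-start -d -n "$NAME"
    run lxc-ls --fancy    # confirm that the container is running
}
```

Running DRY_RUN=1 setup_container prints the three commands without touching the system; after a real run, "lxc-attach -n test1" opens a shell inside the container.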
>>>>>
>>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Thursday, 12 April 2018 11:20
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Quite an interesting issue.
>>>>>
>>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>>
>>>>> Do you have a script to build up your lxc environment?
>>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>>
>>>>> Thanks,
>>>>> Larry
>>>>>
>>>>>
>>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> not sure if it helps: the issue wasn't there with Debian 8 and kernel 3.16, but that's a long history. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Thursday, 12 April 2018 05:17
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> Thanks for your report.
>>>>>> I'll try to reproduce this bug as you did.
>>>>>>
>>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>>
>>>>>> Thanks
>>>>>> Larry
>>>>>>
>>>>>>
>>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> below is an example config file as I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>>>
>>>>>>> Meanwhile I'm testing whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far without success; I will report once anything happens.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>> # Distribution configuration
>>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>>> lxc.arch = x86_64
>>>>>>>
>>>>>>> # Container specific configuration
>>>>>>> lxc.id_map = u 0 624288 65536
>>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>>
>>>>>>> lxc.utsname = container1
>>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>>
>>>>>>> lxc.network.type = veth
>>>>>>> lxc.network.flags = up
>>>>>>> lxc.network.link = bridge1
>>>>>>> lxc.network.name = eth0
>>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>>
>>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>>
>>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>>
>>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>>
>>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>> Sent: Wednesday, 11 April 2018 13:31
>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>>> Hi Larry,
>>>>>>>>
>>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>>
>>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>>> I thought there was a way to share a directory between host and docker container, like
>>>>>>>         docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>>> That's different from yours.
>>>>>>>
>>>>>>> How did you set up your lxc or container?
>>>>>>>
>>>>>>> If you could show me the procedure, I'll try to reproduce it.
>>>>>>>
>>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points on the local host, does the problem recur?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Larry
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>> Sorry for this delayed reply.
>>>>
>>>> I tried lxc + ocfs2 with the mount shared the way you describe.
>>>>
>>>> But I cannot reproduce your bug.
>>>>
>>>> I am using openSUSE Tumbleweed.
>>>>
>>>> The procedure I used to try to reproduce your bug:
>>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>>         mount /dev/xxx /mnt
>>>>    Then it shows
>>>>         207 65 254:16 / /mnt rw,relatime shared:94
>>>>    I think this *shared* is what you want. This mount point will be shared across multiple namespaces.
>>>>
>>>> 1. Start Virtual Machine Manager.
>>>> 2. Add a local LXC connection by clicking File > Add Connection.
>>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>>> 3. Select the localhost (LXC) connection and click File > New Virtual Machine.
>>>> 4. Activate Application container and click Forward.
>>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine for creating a first container.
>>>>    Click Forward.
>>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>>>
>>>> If possible, could you please provide a shell script that shows what you did with your mount point?
>>>>
>>>> Thanks
>>>> Larry
>>>>
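The line quoted in step 0 above looks like an entry from /proc/self/mountinfo, where the optional fields after the mount options (for example "shared:94") carry the propagation type. As a minimal sketch for checking this on another machine, the hypothetical helper below (not part of the thread) prints those optional fields for a given mount point. It assumes the standard mountinfo layout, in which the optional fields start at column 7 and end at a "-" separator.

```bash
#!/bin/bash
# Hypothetical helper: print the propagation fields (e.g. "shared:94") of one
# mount point, reading mountinfo-formatted lines from stdin.
# A mount without optional fields is reported as "private".
propagation_of() {
    awk -v mp="$1" '
        $5 == mp {
            out = ""
            # Optional fields start at column 7 and end at the "-" separator.
            for (i = 7; i <= NF && $i != "-"; i++)
                out = out (out == "" ? "" : " ") $i
            print (out == "" ? "private" : out)
        }'
}

# Usage on a live system:
#   propagation_of /mnt < /proc/self/mountinfo
```

Running it on the host and again inside a container shows whether both see the OCFS2 mount as a member of the same peer group.
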
>>>
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel at oss.oracle.com
>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2019-03-18 17:45                                       ` Wengang
@ 2019-03-26 12:27                                         ` Daniel Sobe
  2019-03-26 21:24                                           ` Wengang Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2019-03-26 12:27 UTC (permalink / raw)
  To: ocfs2-devel

Hi Wengang,

Thanks for confirming that this bug is reproducible! For a long time I was under the impression that I was the only one facing this issue.

Unfortunately, I do not know what a "vmcore" is. Let me google it and then check whether I can reproduce the bug again easily and provide what you request.

Regards,

Daniel
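
For readers in the same position: a vmcore is the memory image of a crashed kernel, usually saved by kdump through a second "crash" kernel. The sketch below is only a rough pre-check, not a kdump setup guide. It assumes the usual prerequisites, namely a crashkernel= reservation on the kernel command line and a loaded crash kernel reported by /sys/kernel/kexec_crash_loaded. The function name has_crashkernel is made up for this example.

```bash
#!/bin/bash
# Hypothetical helper: succeed if a kernel command line (passed as $1)
# reserves memory for a crash kernel.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the running system when the proc/sysfs files are readable.
if [ -r /proc/cmdline ]; then
    if has_crashkernel "$(cat /proc/cmdline)"; then
        echo "crashkernel= is set"
    else
        echo "crashkernel= is not set; kdump cannot capture a vmcore"
    fi
fi
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "crash kernel loaded: $(cat /sys/kernel/kexec_crash_loaded)"
fi
```
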

-----Original Message-----
From: ocfs2-devel-bounces@oss.oracle.com <ocfs2-devel-bounces@oss.oracle.com> On Behalf Of Wengang
Sent: Monday, 18 March 2019 18:46
To: ocfs2-devel at oss.oracle.com
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Hi,

I also see this problem on an older kernel, 4.1.12.xxx.

The l_ro_holders count is changed in the ocfs2 layer, not the DLM layer. Another thing is that the dentry lock is only used to get notified when a file is deleted remotely. So the code requests a PR lock and then releases it once PR is granted. I am not sure, but I feel this is not a DLM issue but a memory issue on the ocfs2_lock_res. Do you have a vmcore available for this problem?

Thanks,
Wengang


On 02/20/2019 12:48 AM, Daniel Sobe wrote:
> Hi Larry,
>
> The issue still happens with 4.19 as well, but it took quite a while to trigger it:
>
> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24018 Comm: git Not tainted 4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
> Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
> Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc258ff8d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc258ff8e50 RCX: 00007fc2622d80f5
> Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc258ff8d40 RSI: 00007fc258ff8d40 RDI: 00007fc2300008c0
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000045 R08: 0000000000000003 R09: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000d R14: 0000000000000015 R15: 000055ec17f94d58
> Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe1 ]---
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#2] SMP PTI
> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24024 Comm: git Tainted: G      D           4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c8177af8 EFLAGS: 00010046
> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fdf3b23418 RCX: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fdf3b23418 RDI: ffff95fdf3b23494
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: ffff95fe91aa2620 R09: 0000000000000089
> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c8177ae0 R11: ffff95fe6e3efb40 R12: ffff95fdf3b23494
> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Feb 20 09:37:56 drs1p001 kernel: Call Trace:
> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x29d/0x3f0
> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
> Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
> Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc24d7f9d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc24d7f9e50 RCX: 00007fc2622d80f5
> Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc24d7f9d40 RSI: 00007fc24d7f9d40 RDI: 00007fc2100008c0
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000044 R08: 0000000000000003 R09: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000f R14: 000000000000001d R15: 000055ec18015008
> Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe2 ]---
> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>
> Regards,
>
> Daniel
>
> -----Original Message-----
> From: Daniel Sobe
> Sent: Tuesday, 11 September 2018 13:36
> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
> Subject: RE: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Hi Larry,
>
> I tested your script and indeed it does not provoke the error. Meanwhile I have moved to a newer kernel, which makes the error harder to provoke. Here is the stack trace:
>
> Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
> Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
> Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
> Sep 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted 4.18.0-1-amd64 #1 Debian 4.18.6-1
> Sep 11 13:08:51 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
> Sep 11 13:08:51 drs1p002 kernel: Call Trace:
> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  ? d_splice_alias+0x299/0x410
> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  __lookup_slow+0x97/0x150
> Sep 11 13:08:51 drs1p002 kernel:  lookup_slow+0x35/0x50
> Sep 11 13:08:51 drs1p002 kernel:  walk_component+0x1c4/0x480
> Sep 11 13:08:51 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
> Sep 11 13:08:51 drs1p002 kernel:  ? path_init+0x177/0x2f0
> Sep 11 13:08:51 drs1p002 kernel:  path_lookupat+0x84/0x1f0
> Sep 11 13:08:51 drs1p002 kernel:  filename_lookup+0xb6/0x190
> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
> Sep 11 13:08:51 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
> Sep 11 13:08:51 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
> Sep 11 13:08:51 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
> Sep 11 13:08:51 drs1p002 kernel:  vfs_statx+0x73/0xe0
> Sep 11 13:08:51 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
> Sep 11 13:08:51 drs1p002 kernel:  do_syscall_64+0x55/0x110
> Sep 11 13:08:51 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995
> Sep 11 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4 0c 00 f7 d8 64 89
> Sep 11 13:08:51 drs1p002 kernel: RSP: 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Sep 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f2434e20390 RCX: 00007f24b6cc5995
> Sep 11 13:08:51 drs1p002 kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI: 00007f24640dd9d0
> Sep 11 13:08:51 drs1p002 kernel: RBP: 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800
> Sep 11 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246 R12: 00007f24640dd9d0
> Sep 11 13:08:51 drs1p002 kernel: R13: 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800
> Sep 11 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me drm_kms_helper snd_timer snd soundcore pcspkr serio_raw efivars drm mei lpc_ich i2c_algo_bit sg ie31200_edac video pcc_cpufreq button drbd lru_cache libcrc32c parport_pc sunrpc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16
> Sep 11 13:08:51 drs1p002 kernel:  mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801 libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod usbcore e1000e usb_common thermal
> Sep 11 13:08:51 drs1p002 kernel: ---[ end trace feba92ba6e432478 ]---
> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>
>
> All I can say is that I was using Git heavily when this happened (in Eclipse, synchronizing a Git workspace). It took around 30 minutes for the bug to show up again.
>
> Regards,
>
> Daniel
>
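The traces in this thread all stop at a BUG in fs/ocfs2/dlmglue.c, but at a different line for each kernel (825 on 4.9.82, 847 on 4.18.6, 848 on 4.17.6, 849 on 4.19.12). As a small sketch for comparing reports, the hypothetical helper below (not something posted in the thread) pulls the file:line of every "kernel BUG at" entry out of a log. It assumes the log format shown in the traces here.

```bash
#!/bin/bash
# Hypothetical helper: list the distinct "file:line" sites of kernel BUG
# reports found in a log read from stdin.
bug_sites() {
    grep -oE 'kernel BUG at [^ !]+:[0-9]+' |
        sed 's|.*/||' |        # keep only the file name and line number
        LC_ALL=C sort -u
}

# Usage on a live system (either source of kernel messages works):
#   journalctl -k | bug_sites
#   bug_sites < /var/log/kern.log
```
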
> -----Original Message-----
> From: Larry Chen <lchen@suse.com>
> Sent: Wednesday, 18 July 2018 10:09
> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Hi Daniel,
>
> Which stack do you use? dlm or o2cb?
>
> I tried to reproduce the bug.
>
> I have set up 2 virtual machines that share one block device (a qcow2 file on the host), and I was using the dlm stack instead of o2cb. The kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script.
>
> #! /bin/bash
> for i in $(git tag)
> do
>           echo $i
>           git checkout $i
> done
>
> The bug could not be reproduced.
>
> According to the back trace, I think the bug is caused by the lock-holding logic.
>
> If so, I think the bug should recur even without DRBD, LVM or other components.
>
> Regards,
> Larry
>
> On 07/17/2018 04:11 PM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
>>
>> The only additional simplification that I can think of is to remove layers on top of the SSD. Currently I have:
>>
>> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
>>
>> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
>>
>> Regards,
>>
>> Daniel
>>
>>
>> -----Original Message-----
>> From: Larry Chen [mailto:lchen at suse.com]
>> Sent: Tuesday, 17 July 2018 04:54
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Could you please simplify your environment?
>> Can I use several virtual machines to reproduce the bug?
>>
>> Thanks
>> Larry
>>
>> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>>> Hi,
>>>
>>> the same issue happens with 4.17.6 kernel from Debian unstable.
>>>
>>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>>
>>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>>
>>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
>>> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
>>> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
>>> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
>>> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
>>> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
>>> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
>>> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
>>> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
>>> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
>>> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>>> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
>>> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
>>> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
>>> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
>>> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
>>> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
>>> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
>>>
>>> -----Original Message-----
>>> From: ocfs2-devel-bounces at oss.oracle.com
>>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
>>> Sent: Friday, 13 July 2018 13:56
>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Larry,
>>>
>>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
>>>
>>> Again I was using a LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported was while doing "git checkout", triggering a lot of change in workspace files.
>>>
>>> Next I will use kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to rule out that the issue is related to namespaces, control groups or anything else that is only present in a container.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Friday, 13 July 2018 11:49
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Thanks for your effort to reproduce the bug.
>>> I can confirm that there is more than one bug.
>>> I'll focus on this interesting issue.
>>>
>>>
>>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>>
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>> This is caused by ocfs2_dentry_lock failing.
>>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
>>>
>>> But why it failed still confuses me.
>>>
>>>
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>>
>>>> As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), whose author claimed the issue went away after he restricted the system to 1 active CPU core.
>>>>
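One cheap way to probe the multicore hypothesis without rebooting is to confine the workload to a single CPU and see whether the BUG still triggers. The following is only a sketch, assuming Linux and Python 3; `run_pinned` is a hypothetical helper name, not an existing tool. Pinning a process tree is also weaker than running the whole system with one active core, since kernel threads and other processes still use the remaining CPUs.

```python
import os
import subprocess
import sys


def run_pinned(cmd, cpu=None):
    """Run cmd with this process (and therefore the child) confined to one CPU."""
    allowed = os.sched_getaffinity(0)      # CPUs this process may currently use
    if cpu is None:
        cpu = min(allowed)                 # any permitted CPU will do
    print(f"pinning {cmd[0]} to CPU {cpu}", file=sys.stderr)
    os.sched_setaffinity(0, {cpu})         # child processes inherit this mask
    try:
        return subprocess.run(cmd, check=False).returncode
    finally:
        os.sched_setaffinity(0, allowed)   # restore the original mask
```

For example, `run_pinned(["make", "-j"])` started in the OCFS2 home directory would keep the whole build on one core. If the crash no longer appears under that restriction, that would support the multicore theory, but it would not prove it.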
>>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>>
>>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Friday, 11 May 2018 09:01
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>>
>>>>> * Install the LXC packages from the distribution
>>>>> * run the command "lxc-create -n test1 -t download"
>>>>> ** first run might prompt you to generate a ~/.config/lxc/default.conf to define UID mappings
>>>>> ** in a corporate environment it might be tricky to set the http_proxy (and maybe even https_proxy) environment variables correctly
>>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>>> * the container downloads to ~/.local/share/lxc/
>>>>> * adapt the "config" file in that directory to add the shared ocfs2 mount like in my example below
>>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>>> ** if you want to finally enable networking, most distributions arrange a dedicated bridge (lxcbr0) which you can configure similar to my example below
>>>>> ** in my case I had to install cgroup related tools and reboot to have all cgroups available, and to allow use of lxcbr0 bridge in /etc/lxc/lxc-usernet
>>>>>
>>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Thursday, 12 April 2018 11:20
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Quite an interesting issue.
>>>>>
>>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>>
>>>>> Do you have a script to build up your lxc environment?
>>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>>
>>>>> Thanks,
>>>>> Larry
>>>>>
>>>>>
>>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> not sure if it helps, the issue wasn't there with Debian 8 and kernel 3.16 - but that's a long history. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Thursday, 12 April 2018 05:17
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> Thanks for your report.
>>>>>> I'll try to reproduce this bug as you did.
>>>>>>
>>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>>
>>>>>> Thanks
>>>>>> Larry
>>>>>>
>>>>>>
>>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> below is an example config file like the one I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory as you can see at the "lxc.mount.entry" line.
>>>>>>>
>>>>>>> Meanwhile I'm checking whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success; I will report once anything happens.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>> # Distribution configuration
>>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>>> lxc.arch = x86_64
>>>>>>>
>>>>>>> # Container specific configuration
>>>>>>> lxc.id_map = u 0 624288 65536
>>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>>
>>>>>>> lxc.utsname = container1
>>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>>
>>>>>>> lxc.network.type = veth
>>>>>>> lxc.network.flags = up
>>>>>>> lxc.network.link = bridge1
>>>>>>> lxc.network.name = eth0
>>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>>
>>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>>
>>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>>
>>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>>
>>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>> Sent: Wednesday, 11 April 2018 13:31
>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; 
>>>>>>> ocfs2-devel at oss.oracle.com
>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>>> Hi Larry,
>>>>>>>>
>>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>>
>>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>>> I was thinking of the way a directory is shared between host and docker container, like
>>>>>>>         docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>>> That's different from yours.
>>>>>>>
>>>>>>> How did you set up your lxc or container?
>>>>>>>
>>>>>>> If you could, show me the procedure, I'll try to reproduce it.
>>>>>>>
>>>>>>> And by the way, if you get rid of lxc, and just mount ocfs2 on several different mount points on the local host, will the problem recur?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Larry
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>> Sorry for this delayed reply.
>>>>
>>>> I tried with lxc + ocfs2 in your mount-shared way.
>>>>
>>>> But I can not reproduce your bugs.
>>>>
>>>> What I use is openSUSE Tumbleweed.
>>>>
>>>> The procedure I used to try to reproduce your bug:
>>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>>      mount /dev/xxx /mnt
>>>>    then it shows
>>>>      207 65 254:16 / /mnt rw,relatime shared:94
>>>>    I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
>>>>
>>>> 1. Start Virtual Machine Manager.
>>>> 2. Add a local LXC connection by clicking File > Add Connection.
>>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>>> 3. Select the localhost (LXC) connection and click File > New Virtual Machine.
>>>> 4. Activate Application container and click Forward.
>>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container. Click Forward.
>>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>>>
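The line quoted in step 0 looks like an entry from /proc/self/mountinfo, where the propagation state appears as optional tags such as `shared:94` after the mount options. As a rough sketch for comparing two setups, the function below pulls the mount point and those tags out of one such line. `propagation_tags` is a hypothetical name; the parser assumes the standard field order and also accepts lines that, like the one above, are cut off before the " - " separator.

```python
def propagation_tags(mountinfo_line):
    """Return (mount_point, tags) for one /proc/<pid>/mountinfo line.

    Fields: mount ID, parent ID, major:minor, root, mount point, options,
    then zero or more optional tags, then "-" and filesystem details.
    """
    fields = mountinfo_line.split()
    mount_point = fields[4]
    optional = fields[6:]
    if "-" in optional:                       # drop fs type, source, super options
        optional = optional[:optional.index("-")]
    tags = {}
    for item in optional:
        key, _, value = item.partition(":")   # e.g. "shared:94" -> ("shared", "94")
        tags[key] = value
    return mount_point, tags
```

Applied to the line from step 0 this yields `/mnt` with a `shared` tag for peer group 94. Running it over the mountinfo of the host and of each container would show whether both environments really see the OCFS2 mount with the same propagation type.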
>>>> If possible, could you please provide a shell script to show what you did with your mount point?
>>>>
>>>> Thanks
>>>> Larry
>>>>
>>>
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel at oss.oracle.com
>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>
>



_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel at oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2019-03-26 12:27                                         ` Daniel Sobe
@ 2019-03-26 21:24                                           ` Wengang Wang
  2019-03-27  7:57                                             ` Daniel Sobe
  0 siblings, 1 reply; 32+ messages in thread
From: Wengang Wang @ 2019-03-26 21:24 UTC (permalink / raw)
  To: ocfs2-devel

Thank you, Daniel.

Wengang

On 2019/3/26 5:27, Daniel Sobe wrote:
> Hi Wengang,
>
> Thanks for confirming that this bug is reproducible! For a long time I was under the impression that I was the only one facing this issue.
>
> Unfortunately, I do not know what a "vmcore" is. Let me google it and then check whether I can reproduce the bug again easily and provide what you request.
>
> Regards,
>
> Daniel
>
> -----Original Message-----
> From: ocfs2-devel-bounces at oss.oracle.com <ocfs2-devel-bounces@oss.oracle.com> On Behalf Of Wengang
> Sent: Monday, 18 March 2019 18:46
> To: ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Hi,
>
> I also see this problem on an older kernel version, 4.1.12.xxx.
>
> l_ro_holders is changed in the ocfs2 layer, not the DLM layer. Another thing is that the dentry lock is only used to get notified when a file is deleted remotely, so the code requests a PR lock and then releases it once the PR is granted. I am not sure, but I feel this is not a DLM issue but a memory issue on the ocfs2_lock_res. Do you have a vmcore available for this problem?
>
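The holder bookkeeping described above can be pictured with a small toy model. This is only an illustration and not the kernel code: `LockRes`, `lock` and `unlock` are hypothetical names, only the counter names echo `l_ro_holders`, and a Python exception stands in for the kernel BUG. It shows why a release without a matching successful acquire, or a counter that has been overwritten in memory, would trip the check.

```python
class LockRes:
    """Toy stand-in for a cluster lock resource (not the kernel structure)."""

    def __init__(self):
        self.ro_holders = 0   # local holders at PR (protected read) level
        self.ex_holders = 0   # local holders at EX (exclusive) level

    def _counter(self, level):
        if level == "PR":
            return "ro_holders"
        if level == "EX":
            return "ex_holders"
        raise ValueError(f"unknown lock level: {level}")

    def lock(self, level, grant=True):
        """Count a holder only when the lock request is actually granted."""
        name = self._counter(level)
        if not grant:
            return False                      # failed request: nothing to undo
        setattr(self, name, getattr(self, name) + 1)
        return True

    def unlock(self, level):
        """Drop one holder; an unbalanced unlock is treated as fatal."""
        name = self._counter(level)
        if getattr(self, name) == 0:
            raise RuntimeError(f"unlock of {level} with no holders")
        setattr(self, name, getattr(self, name) - 1)


def probe(res):
    """Request PR and release it right away, but only if it was granted."""
    if res.lock("PR"):
        res.unlock("PR")
```

In this model `probe` always leaves the counters balanced, so the error can only be reached if something else unlocks without a matching grant or if the object holding the counters changes between the two calls.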
> Thanks,
> Wengang
>
>
> On 02/20/2019 12:48 AM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> The issue still happens with 4.19 as well, but it took quite a while to trigger it:
>>
>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24018 Comm: git Not tainted 4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
>> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
>> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
>> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
>> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
>> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
>> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
>> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
>> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
>> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
>> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
>> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
>> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
>> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
>> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
>> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
>> Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
>> Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc258ff8d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>> Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc258ff8e50 RCX: 00007fc2622d80f5
>> Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc258ff8d40 RSI: 00007fc258ff8d40 RDI: 00007fc2300008c0
>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000045 R08: 0000000000000003 R09: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
>> Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000d R14: 0000000000000015 R15: 000055ec17f94d58
>> Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
>> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
>> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe1 ]---
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
>> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#2] SMP PTI
>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24024 Comm: git Tainted: G      D           4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c8177af8 EFLAGS: 00010046
>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fdf3b23418 RCX: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fdf3b23418 RDI: ffff95fdf3b23494
>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: ffff95fe91aa2620 R09: 0000000000000089
>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c8177ae0 R11: ffff95fe6e3efb40 R12: ffff95fdf3b23494
>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
>> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Feb 20 09:37:56 drs1p001 kernel: Call Trace:
>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80
>> [ocfs2] Feb 20 09:37:56 drs1p001 kernel:
>> ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2] Feb 20 09:37:56 drs1p001
>> kernel:  ? d_splice_alias+0x29d/0x3f0 Feb 20 09:37:56 drs1p001 kernel:
>> ocfs2_lookup+0x199/0x2e0 [ocfs2] Feb 20 09:37:56 drs1p001 kernel:
>> __lookup_slow+0x97/0x150 Feb 20 09:37:56 drs1p001 kernel:
>> lookup_slow+0x35/0x50 Feb 20 09:37:56 drs1p001 kernel:
>> walk_component+0x1c6/0x360 Feb 20 09:37:56 drs1p001 kernel:  ?
>> __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2] Feb 20 09:37:56
>> drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80 Feb 20 09:37:56
>> drs1p001 kernel:  path_lookupat+0x67/0x200 Feb 20 09:37:56 drs1p001
>> kernel:  filename_lookup+0xb8/0x1a0 Feb 20 09:37:56 drs1p001 kernel:
>> ? seccomp_run_filters+0x58/0xb0 Feb 20 09:37:56 drs1p001 kernel:  ?
>> __check_object_size+0x9d/0x1a0 Feb 20 09:37:56 drs1p001 kernel:  ?
>> strncpy_from_user+0x48/0x160 Feb 20 09:37:56 drs1p001 kernel:  ?
>> getname_flags+0x6a/0x1e0 Feb 20 09:37:56 drs1p001 kernel:  ?
>> vfs_statx+0x73/0xe0 Feb 20 09:37:56 drs1p001 kernel:
>> vfs_statx+0x73/0xe0 Feb 20 09:37:56 drs1p001 kernel:
>> __do_sys_newlstat+0x39/0x70 Feb 20 09:37:56 drs1p001 kernel:  ?
>> syscall_trace_enter+0x117/0x2c0 Feb 20 09:37:56 drs1p001 kernel:
>> do_syscall_64+0x55/0x110 Feb 20 09:37:56 drs1p001 kernel:
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5 Feb 20
>> 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff
>> ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8
>> 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b
>> 00 f7 d8 64 89 Feb 20 09:37:56 drs1p001 kernel: RSP:
>> 002b:00007fc24d7f9d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006 Feb
>> 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX:
>> 00007fc24d7f9e50 RCX: 00007fc2622d80f5 Feb 20 09:37:56 drs1p001
>> kernel: RDX: 00007fc24d7f9d40 RSI: 00007fc24d7f9d40 RDI:
>> 00007fc2100008c0 Feb 20 09:37:56 drs1p001 kernel: RBP:
>> 0000000000000044 R08: 0000000000000003 R09: 0000000000000000 Feb 20
>> 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246
>> R12: 0000000000000005 Feb 20 09:37:56 drs1p001 kernel: R13:
>> 000000000000000f R14: 000000000000001d R15: 000055ec18015008 Feb 20
>> 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag
>> unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb
>> ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree
>> dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi
>> rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl
>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915
>> irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof
>> sparse_keymap snd_hda_intel dell_wmi_descriptor evdev
>> ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate
>> snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm
>> snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal
>> iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button
>> acpi_pad pcc_cpufreq ip_tables x_tables Feb 20 09:37:56 drs1p001
>> kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb
>> sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd
>> cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd
>> scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan Feb 20
>> 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe2 ]--- Feb 20
>> 09:37:56 drs1p001 kernel: RIP:
>> 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2] Feb 20 09:37:56
>> drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53
>> 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f
>> 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f
>> 1f 44 Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8
>> EFLAGS: 00010046 Feb 20 09:37:56 drs1p001 kernel: RAX:
>> 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000 Feb 20
>> 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618
>> RDI: ffff95fe8cec9694 Feb 20 09:37:56 drs1p001 kernel: RBP:
>> 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153 Feb 20
>> 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b
>> R12: ffff95fe8cec9694 Feb 20 09:37:56 drs1p001 kernel: R13:
>> ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0 Feb 20
>> 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000)
>> GS:ffff95fe91a80000(0000) knlGS:0000000000000000 Feb 20 09:37:56
>> drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Feb
>> 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3:
>> 00000001617fc002 CR4: 00000000003606e0 Feb 20 09:37:56 drs1p001
>> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: DR3:
>> 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: Daniel Sobe
>> Sent: Tuesday, 11 September 2018 13:36
>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>> Subject: RE: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Larry,
>>
>> I tested your script, and indeed it does not provoke the error. Meanwhile I have switched to a newer kernel, which makes the bug harder to provoke; here is the stack trace:
>>
>> Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
>> Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
>> Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI Sep
>> 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted
>> 4.18.0-1-amd64 #1 Debian 4.18.6-1 Sep 11 13:08:51 drs1p002 kernel:
>> Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014 Sep
>> 11 13:08:51 drs1p002 kernel: RIP:
>> 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] Sep 11 13:08:51
>> drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc
>> 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb
>> d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90
>> 0f 1f Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8
>> EFLAGS: 00010046 Sep 11 13:08:51 drs1p002 kernel: RAX:
>> 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100 Sep 11
>> 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000
>> RDI: ffff95cdbd985a94 Sep 11 13:08:51 drs1p002 kernel: RBP:
>> ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47 Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003 Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240 Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000 Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0 Sep 11 13:08:51 drs1p002 kernel: Call Trace:
>> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80
>> [ocfs2] Sep 11 13:08:51 drs1p002 kernel:
>> ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2] Sep 11 13:08:51 drs1p002
>> kernel:  ? d_splice_alias+0x299/0x410 Sep 11 13:08:51 drs1p002 kernel:
>> ocfs2_lookup+0x233/0x2c0 [ocfs2] Sep 11 13:08:51 drs1p002 kernel:
>> __lookup_slow+0x97/0x150 Sep 11 13:08:51 drs1p002 kernel:
>> lookup_slow+0x35/0x50 Sep 11 13:08:51 drs1p002 kernel:
>> walk_component+0x1c4/0x480 Sep 11 13:08:51 drs1p002 kernel:  ?
>> link_path_walk+0x27c/0x510 Sep 11 13:08:51 drs1p002 kernel:  ?
>> path_init+0x177/0x2f0 Sep 11 13:08:51 drs1p002 kernel:
>> path_lookupat+0x84/0x1f0 Sep 11 13:08:51 drs1p002 kernel:
>> filename_lookup+0xb6/0x190 Sep 11 13:08:51 drs1p002 kernel:  ?
>> ocfs2_inode_unlock+0xe4/0xf0 [ocfs2] Sep 11 13:08:51 drs1p002 kernel:
>> ? __check_object_size+0xa7/0x1a0 Sep 11 13:08:51 drs1p002 kernel:  ?
>> strncpy_from_user+0x48/0x160 Sep 11 13:08:51 drs1p002 kernel:  ?
>> getname_flags+0x6a/0x1e0 Sep 11 13:08:51 drs1p002 kernel:  ?
>> vfs_statx+0x73/0xe0 Sep 11 13:08:51 drs1p002 kernel:
>> vfs_statx+0x73/0xe0 Sep 11 13:08:51 drs1p002 kernel:
>> __do_sys_newlstat+0x39/0x70 Sep 11 13:08:51 drs1p002 kernel:
>> do_syscall_64+0x55/0x110 Sep 11 13:08:51 drs1p002 kernel:
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995 Sep 11
>> 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8 ff
>> ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8
>> 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4 0c
>> 00 f7 d8 64 89 Sep 11 13:08:51 drs1p002 kernel: RSP:
>> 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006 Sep
>> 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX:
>> 00007f2434e20390 RCX: 00007f24b6cc5995 Sep 11 13:08:51 drs1p002
>> kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI:
>> 00007f24640dd9d0 Sep 11 13:08:51 drs1p002 kernel: RBP:
>> 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800 Sep 11
>> 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246
>> R12: 00007f24640dd9d0 Sep 11 13:08:51 drs1p002 kernel: R13:
>> 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800 Sep 11
>> 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag
>> unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
>> ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse
>> snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic
>> nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp
>> vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm
>> irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas
>> ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate intel_uncore
>> intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me drm_kms_helper
>> snd_timer snd soundcore pcspkr serio_raw efivars drm mei lpc_ich
>> i2c_algo_bit sg ie31200_edac video pcc_cpufreq button drbd lru_cache
>> libcrc32c parport_pc sunrpc ppdev lp parport efivarfs ip_tables
>> x_tables autofs4 ext4 crc16 Sep 11 13:08:51 drs1p002 kernel:  mbcache
>> jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper
>> aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801
>> libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod
>> usbcore e1000e usb_common thermal Sep 11 13:08:51 drs1p002 kernel:
>> ---[ end trace feba92ba6e432478 ]--- Sep 11 13:08:51 drs1p002 kernel:
>> RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] Sep 11
>> 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e
>> 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2
>> 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00
>> 00 00 90 0f 1f Sep 11 13:08:51 drs1p002 kernel: RSP:
>> 0018:ffffb1248eeb3af8 EFLAGS: 00010046 Sep 11 13:08:51 drs1p002
>> kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX:
>> 0000000000000100 Sep 11 13:08:51 drs1p002 kernel: RDX:
>> 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94 Sep 11
>> 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000
>> R09: 000000000000aa47 Sep 11 13:08:51 drs1p002 kernel: R10:
>> ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003 Sep 11
>> 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000
>> R15: ffffffffc0ab3240 Sep 11 13:08:51 drs1p002 kernel: FS:
>> 00007f2434e21700(0000) GS:ffff95ce9e200000(0000)
>> knlGS:0000000000000000 Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS:
>> 0000 ES: 0000 CR0: 0000000080050033 Sep 11 13:08:51 drs1p002 kernel:
>> CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>>
>>
>> All I can say is that I was using Git heavily when this happened (synchronizing a Git workspace in Eclipse). It took me around 30 minutes to see the bug again.
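
Since the BUG fires at a different dlmglue.c line in each kernel version reported in this thread, it can help to pull just the assertion site out of a saved log when comparing reports. The following is a minimal sketch, assuming the log lines are unwrapped (one message per line) and that the build path contains a `linux-<version>` directory as in the traces above; `bug_sites` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read kernel log lines on stdin and print
# "<kernel source dir> <file>:<line>" for every "kernel BUG at" message.
bug_sites() {
    sed -n 's#.*kernel BUG at .*/\(linux-[^/]*\)/\(.*\)!.*#\1 \2#p'
}
```

It could be fed from a saved log file, for example `bug_sites < kern.log`, where the file name is only a placeholder.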
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: Larry Chen <lchen@suse.com>
>> Sent: Wednesday, 18 July 2018 10:09
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Which stack do you use, dlm or o2cb?
>>
>> I tried to reproduce the bug.
>>
>> I have set up 2 virtual machines that share one block device (a qcow2 file on the host). I am using the dlm stack instead of o2cb, and the kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script.
>>
>> #!/bin/bash
>> # Check out every tag in turn to generate heavy metadata I/O on the volume.
>> for i in $(git tag)
>> do
>>            echo "$i"
>>            git checkout "$i"
>> done
>>
>> The bug could not be reproduced.
>>
>> According to the backtrace, I think the bug is caused by the lock-holding logic.
>>
>> If so, I think the bug will recur even without DRBD, LVM or other components.
>>
>> Regards,
>> Larry
>>
>> On 07/17/2018 04:11 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
>>>
>>> The only additional simplification I can think of is to remove some of the layers on top of the SSD. Currently I have:
>>>
>>> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
>>>
>>> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think either of these makes sense to try?
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Tuesday, 17 July 2018 04:54
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Could you please simplify your environment?
>>> Can I use several virtual machines to reproduce the bug?
>>>
>>> Thanks
>>> Larry
>>>
>>> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>>>> Hi,
>>>>
>>>> the same issue happens with the 4.17.6 kernel from Debian unstable.
>>>>
>>>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>>>
>>>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>>>
>>>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here
>>>> ]------------ Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>>>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>>> Jul
>>>> 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag
>>>> unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
>>>> ocfs2_nodemanager oc Jul 16 13:40:24 drs1p002 kernel:  jbd2
>>>> crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper
>>>> aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci Jul 16
>>>> 13:40:24
>>>> drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted
>>>> 4.17.0-1-amd64 #1 Debian 4.17.6-1 Jul 16 13:40:24 drs1p002 kernel:
>>>> Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>>> Jul
>>>> 16 13:40:24 drs1p002 kernel: RIP:
>>>> 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] Jul 16
>>>> 13:40:24
>>>> drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046 Jul 16
>>>> 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX:
>>>> ffff92559ee9f018
>>>> RCX: 00000000000501e7 Jul 16 13:40:24 drs1p002 kernel: RDX:
>>>> 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094 Jul 16
>>>> 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763 Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003 Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200 Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000 Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0 Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80
>>>> [ocfs2] Jul 16 13:40:24 drs1p002 kernel:
>>>> ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2] Jul 16 13:40:24
>>>> drs1p002
>>>> kernel:  ? d_splice_alias+0x2a5/0x410 Jul 16 13:40:24 drs1p002 kernel:
>>>> ocfs2_lookup+0x233/0x2c0 [ocfs2] Jul 16 13:40:24 drs1p002 kernel:
>>>> __lookup_slow+0x97/0x150 Jul 16 13:40:24 drs1p002 kernel:
>>>> lookup_slow+0x35/0x50 Jul 16 13:40:24 drs1p002 kernel:
>>>> walk_component+0x1c4/0x470 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>> link_path_walk+0x27c/0x510 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>> ktime_get+0x3e/0xa0 Jul 16 13:40:24 drs1p002 kernel:
>>>> path_lookupat+0x84/0x1f0 Jul 16 13:40:24 drs1p002 kernel:
>>>> filename_lookup+0xb6/0x190 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>> ocfs2_inode_unlock+0xe4/0xf0 [ocfs2] Jul 16 13:40:24 drs1p002 kernel:
>>>> ? __check_object_size+0xa7/0x1a0 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>> strncpy_from_user+0x48/0x160 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>> getname_flags+0x6a/0x1e0 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>> vfs_statx+0x73/0xe0 Jul 16 13:40:24 drs1p002 kernel:
>>>> vfs_statx+0x73/0xe0 Jul 16 13:40:24 drs1p002 kernel:
>>>> __do_sys_newlstat+0x39/0x70 Jul 16 13:40:24 drs1p002 kernel:
>>>> do_syscall_64+0x55/0x110 Jul 16 13:40:24 drs1p002 kernel:
>>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995 Jul 16
>>>> 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS:
>>>> 00000246
>>>> ORIG_RAX: 0000000000000006 Jul 16 13:40:24 drs1p002 kernel: RAX:
>>>> ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995 Jul 16
>>>> 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI:
>>>> 00007f0ce8ff8ce0
>>>> RDI: 00007f0cb0000b20 Jul 16 13:40:24 drs1p002 kernel: RBP:
>>>> 0000000000000017 R08: 0000000000000003 R09: 0000000000000000 Jul 16
>>>> 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11:
>>>> 0000000000000246
>>>> R12: 00007f0ce8ff8dc4 Jul 16 13:40:24 drs1p002 kernel: R13:
>>>> 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005 Jul 16
>>>> 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d
>>>> e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c
>>>> 85
>>>> d2 74 c5 e Jul 16 13:40:24 drs1p002 kernel: RIP:
>>>> __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP:
>>>> ffff9e57887dfaf8 Jul 16 13:40:24 drs1p002 kernel: ---[ end trace
>>>> a5a84fa62e77df42 ]---
>>>>
>>>> -----Original Message-----
>>>> From: ocfs2-devel-bounces at oss.oracle.com
>>>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
>>>> Sent: Friday, 13 July 2018 13:56
>>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Larry,
>>>>
>>>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
>>>>
>>>> Again I was using a LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported was while doing "git checkout", triggering a lot of change in workspace files.
>>>>
>>>> Next I will use kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally, I will work on the PC directly, to rule out that the issue is related to namespaces, control groups, or anything else that is only present in a container.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Friday, 13 July 2018 11:49
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Thanks for your effort to reproduce the bug.
>>>> I can confirm that there is more than one bug.
>>>> I'll focus on this interesting issue.
>>>>
>>>>
>>>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>>>
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut
>>>>> here ]------------ Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode:
>>>>> 0000 [#1] SMP PTI Jul 12 15:29:08 drs1p001 kernel: [1300619.423836]
>>>>> Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name:
>>>>> Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016 Jul 12
>>>>> 15:29:08
>>>>> drs1p001 kernel: [1300619.423923] RIP:
>>>>> 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] Jul 12
>>>>> 15:29:08
>>>>> drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS:
>>>>> 00010046 Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX:
>>>>> 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000 Jul 12
>>>>> 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI:
>>>>> ffff9d269d990018 RDI: ffff9d269d990094 Jul 12 15:29:08 drs1p001
>>>>> kernel: [1300619.423931] RBP: 0000000000000003 R08:
>>>>> 000062d940000000
>>>>> R09: 000000000000036a Jul 12 15:29:08 drs1p001 kernel:
>>>>> [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12:
>>>>> ffff9d269d990094 Jul 12 15:29:08 drs1p001 kernel: [1300619.423934]
>>>>> R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940 Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00 Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033 Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0 Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ?
>>>>> ocfs2_dentry_unlock+0x35/0x80 [ocfs2] Jul 12 15:29:08 drs1p001 kernel:
>>>>> [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>>> This is caused by ocfs2_dentry_lock failing.
>>>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
>>>>
>>>> But why it failed still confuses me.
>>>>
>>>>
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]
>>>>> ocfs2_lookup+0x199/0x2e0 [ocfs2] Jul 12 15:29:08 drs1p001 kernel:
>>>>> [1300619.423986]  ? _cond_resched+0x16/0x40 Jul 12 15:29:08
>>>>> drs1p001
>>>>> kernel: [1300619.423989]  lookup_slow+0xa9/0x170 Jul 12 15:29:08
>>>>> drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350 Jul
>>>>> 12
>>>>> 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]
>>>>> path_lookupat+0x73/0x220 Jul 12 15:29:08 drs1p001 kernel:
>>>>> [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260 Jul 12 15:29:08
>>>>> drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0 Jul
>>>>> 12
>>>>> 15:29:08 drs1p001 kernel: [1300619.424003]  ?
>>>>> seccomp_run_filters+0x58/0xb0 Jul 12 15:29:08 drs1p001 kernel:
>>>>> [1300619.424005]  ? __check_object_size+0x98/0x1a0 Jul 12 15:29:08
>>>>> drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>>>> Jul
>>>>> 12 15:29:08 drs1p001 kernel: [1300619.424010]  ?
>>>>> vfs_statx+0x73/0xe0 Jul 12 15:29:08 drs1p001 kernel:
>>>>> [1300619.424012]
>>>>> vfs_statx+0x73/0xe0 Jul 12 15:29:08 drs1p001 kernel:
>>>>> [1300619.424015]
>>>>> C_SYSC_x86_stat64+0x39/0x70 Jul 12 15:29:08 drs1p001 kernel:
>>>>> [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0 Jul 12 15:29:08
>>>>> drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>>>> Jul
>>>>> 12 15:29:08 drs1p001 kernel: [1300619.424022]
>>>>> entry_SYSENTER_compat+0x7f/0x8e Jul 12 15:29:08 drs1p001 kernel:
>>>>> [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b
>>>>> 8b
>>>>> 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb
>>>>> d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
>>>>> 00
>>>>> 00 0f 1f Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP:
>>>>> __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP:
>>>>> ffffb14b4a133b10 Jul 12 15:29:08 drs1p001 kernel: [1300619.424057]
>>>>> ---[ end trace aea789961795b75f ]--- Jul 12 15:29:08 drs1p001 kernel:
>>>>> [1300628.967649] ------------[ cut here ]------------
>>>>>
>>>>> As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report I found (I referenced it when reporting my issue), whose author claimed the issue went away after restricting the system to 1 active CPU core.
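
One way to probe the multicore hypothesis above is to hit the volume with concurrent lstat() calls, the syscall behind most of the traces in this thread (newlstat in the call trace). The following is a minimal sketch, not a confirmed reproducer: `parallel_lstat` is a hypothetical helper name, and GNU findutils and coreutils (`xargs -P`, `stat -c`, `nproc`) are assumed.

```shell
#!/bin/bash
# Hypothetical helper: lstat every entry below a directory using several
# parallel workers, and print how many entries were visited.
#   $1 = directory to walk
#   $2 = number of parallel workers (defaults to the number of online CPUs)
parallel_lstat() {
    local dir=$1 workers=${2:-$(nproc)}
    # stat without -L does not follow symlinks, so each name costs one lstat().
    find "$dir" -print0 \
        | xargs -0 -r -P "$workers" -n 64 stat -c %n \
        | wc -l
}
```

Comparing a run with several workers against a run with a single worker (or one wrapped in `taskset -c 0`) on the same OCFS2 directory would show whether parallelism matters; whether this actually triggers the BUG is untested.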
>>>>>
>>>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>>>
>>>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Friday, 11 May 2018 09:01
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>>>
>>>>>> * Install the LXC packages from the distribution
>>>>>> * run the command "lxc-create -n test1 -t download"
>>>>>> ** first run might prompt you to generate a
>>>>>> ~/.config/lxc/default.conf to define UID mappings
>>>>>> ** in a corporate environment it might be tricky to set the
>>>>>> http_proxy (and maybe even https_proxy) environment variables
>>>>>> correctly
>>>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>>>> * the container downloads to ~/.local/share/lxc/
>>>>>> * adapt the "config" file in that directory to add the shared
>>>>>> ocfs2 mount like in my example below
>>>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>>>> ** if you want to finally enable networking, most distributions
>>>>>> arrange a dedicated bridge (lxcbr0) which you can configure
>>>>>> similar to my example below
>>>>>> ** in my case I had to install cgroup related tools and reboot to
>>>>>> have all cgroups available, and to allow use of lxcbr0 bridge in
>>>>>> /etc/lxc/lxc-usernet
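
The steps above can be sketched as a small script. This is a dry-run sketch under assumptions: `run` and `setup_container` are hypothetical helper names, the `--dist`, `--release` and `--arch` options are those of the LXC download template (they replace the interactive image selection), and by default the commands are only printed rather than executed.

```shell
#!/bin/bash
# Print each command instead of executing it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Create, start and list an unprivileged container.
#   $1 = container name (the steps above use "test1")
setup_container() {
    local name=$1
    run lxc-create -n "$name" -t download -- --dist debian --release jessie --arch amd64
    run lxc-start -d -n "$name"
    run lxc-ls --fancy
}
```

With `DRY_RUN=0 setup_container test1` the commands are executed, after which `lxc-attach -n test1` opens a shell in the container. The shared OCFS2 mount entry and the network settings still have to be added to the container's config file by hand before starting it.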
>>>>>>
>>>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Thursday, 12 April 2018 11:20
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> Quite an interesting issue.
>>>>>>
>>>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>>>
>>>>>> Do you have a script to build up your lxc environment?
>>>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>>>
>>>>>> Thanks,
>>>>>> Larry
>>>>>>
>>>>>>
>>>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> not sure if it helps, but the issue wasn't there with Debian 8 and kernel 3.16,
>>>>>>> though that's a long history. Unfortunately, the only machine where I could try
>>>>>>> to bisect does not run any kernel < 4.16 without other issues.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>> Sent: Donnerstag, 12. April 2018 05:17
>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> Thanks for your report.
>>>>>>> I'll try to reproduce this bug as you did.
>>>>>>>
>>>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Larry
>>>>>>>
>>>>>>>
>>>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>>>> Hi Larry,
>>>>>>>>
>>>>>>>> below is an example config file as I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>>>>
>>>>>>>> Meanwhile I'm checking whether the problem can be reproduced with shared mounts in one namespace, as you suggested. No success so far; I will report once anything happens.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>> ----
>>>>>>>>
>>>>>>>> # Distribution configuration
>>>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>>>> lxc.arch = x86_64
>>>>>>>>
>>>>>>>> # Container specific configuration
>>>>>>>> lxc.id_map = u 0 624288 65536
>>>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>>>
>>>>>>>> lxc.utsname = container1
>>>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>>>
>>>>>>>> lxc.network.type = veth
>>>>>>>> lxc.network.flags = up
>>>>>>>> lxc.network.link = bridge1
>>>>>>>> lxc.network.name = eth0
>>>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>>>
>>>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>>>
>>>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>>>
>>>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>>>
>>>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>>>
>>>>>>>> ----
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>> Sent: Mittwoch, 11. April 2018 13:31
>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>;
>>>>>>>> ocfs2-devel at oss.oracle.com
>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>>>> Hi Larry,
>>>>>>>>>
>>>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>>>
>>>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>>>> I thought there was a way to share a directory between the host and a Docker container, like
>>>>>>>>          docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>>>> That's different from yours.
>>>>>>>>
>>>>>>>> How did you set up your lxc or container?
>>>>>>>>
>>>>>>>> If you could show me the procedure, I'll try to reproduce it.
>>>>>>>>
>>>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points on the local host, will the problem recur?
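The container-free variant asked about here could be sketched as follows. This is an assumption about how such a test might look, not a procedure given in the thread: the source path /storage/ocfs2/sw is taken from the lxc.mount.entry line quoted above, the /mnt/bindN targets and the function name bind_mount_cmds are hypothetical, and the script only prints the commands (they need root to run).

```shell
#!/bin/sh
# Print commands that expose one OCFS2 directory at several mount points on
# the same host, plus one bind mount inside a private mount namespace.
# Hypothetical names: bind_mount_cmds and the /mnt/bindN target directories.
bind_mount_cmds() {
    src="$1"
    count="$2"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "mkdir -p /mnt/bind$i"
        echo "mount --bind $src /mnt/bind$i"
        i=$((i + 1))
    done
    # unshare --mount (util-linux) starts a shell in a new mount namespace,
    # which is roughly what a container runtime does before bind mounting.
    echo "unshare --mount sh -c 'mount --bind $src /mnt/bind1 && ls /mnt/bind1'"
}

bind_mount_cmds /storage/ocfs2/sw 2
```

Printing instead of executing keeps the sketch harmless; the output can be reviewed and then pasted into a root shell on a test machine.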
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Larry
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Daniel
>>>>>>>>>
>>>>> Sorry for this delayed reply.
>>>>>
>>>>> I tried with lxc + ocfs2 in your mount-shared way.
>>>>>
>>>>> But I can not reproduce your bugs.
>>>>>
>>>>> What I use is opensuse tumbleweed.
>>>>>
>>>>> The procedure I used to try to reproduce your bug:
>>>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>>>        mount /dev/xxx /mnt
>>>>>    then it shows
>>>>>        207 65 254:16 / /mnt rw,relatime shared:94
>>>>>    I think this *shared* is what you want. This mount point will be shared among multiple namespaces.
>>>>>
>>>>> 1. Start Virtual Machine Manager.
>>>>> 2. Add a local LXC connection by clicking File -> Add Connection.
>>>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>>>> 3. Select the localhost (LXC) connection and click File -> New Virtual Machine.
>>>>> 4. Activate Application container and click Forward.
>>>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>>>>    Click Forward.
>>>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
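The mount line shown in step 0 has the layout of an entry in /proc/self/mountinfo, where the propagation tag (such as shared:94) sits in the optional fields after the mount options. A minimal sketch for checking this on any mount point follows; the function name propagation_of is hypothetical, and reading mountinfo is an assumption about where that line came from.

```shell
#!/bin/sh
# propagation_of MOUNTPOINT [MOUNTINFO_FILE]
# Print the propagation tags (for example "shared:94") of a mount point.
# In mountinfo, column 5 is the mount point and the optional fields run from
# column 7 up to a lone "-" separator; a mount without tags is private.
# propagation_of is an illustrative name, not an existing tool.
propagation_of() {
    awk -v mp="$1" '
        $5 == mp {
            tags = ""
            for (i = 7; i <= NF && $i != "-"; i++)
                tags = tags (tags == "" ? "" : " ") $i
            print (tags == "" ? "private" : tags)
        }
    ' "${2:-/proc/self/mountinfo}"
}
```

Called as `propagation_of /mnt` on a host with the mount from step 0, it would print `shared:94`.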
>>>>>
>>>>> If possible, could you please provide a shell script that shows what you did with your mount point?
>>>>>
>>>>> Thanks
>>>>> Larry
>>>>>
>>>> _______________________________________________
>>>> Ocfs2-devel mailing list
>>>> Ocfs2-devel at oss.oracle.com
>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>>
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2019-03-26 21:24                                           ` Wengang Wang
@ 2019-03-27  7:57                                             ` Daniel Sobe
  2019-03-27 18:17                                               ` Wengang
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2019-03-27  7:57 UTC (permalink / raw)
  To: ocfs2-devel

In my setup, the system is still responsive (at least for a couple of minutes) after the BUG. Do I understand correctly that you want me to set up "kdump" and provoke a crash manually after this BUG has occurred, in order to obtain an image file containing all kernel memory?

Sorry for the stupid question, but I'm new to this.

Regards,

Daniel
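
For readers new to the topic, one common way to obtain a vmcore on a Debian-style system can be sketched as below. This is an illustration under stated assumptions, not a procedure confirmed in this thread: the package name kdump-tools and the dump directory /var/crash are Debian defaults, and the function names crashkernel_reserved and vmcore_steps are hypothetical. The second function only prints the steps, since the last two commands deliberately crash the machine.

```shell
#!/bin/sh
# Return 0 when the kernel command line reserves memory for a crash kernel,
# which kdump needs in order to capture a vmcore.
# crashkernel_reserved and vmcore_steps are illustrative names.
crashkernel_reserved() {
    grep -q 'crashkernel=' "${1:-/proc/cmdline}"
}

# Print the manual steps; nothing is executed here.
vmcore_steps() {
    cat <<'EOF'
apt-get install kdump-tools
# add crashkernel=<size> to the kernel command line and reboot
kdump-config show
# reproduce the BUG, then force a panic so that kdump writes the vmcore:
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
# alternatively, "sysctl -w kernel.panic_on_oops=1" before reproducing makes
# the BUG itself panic the machine
# after the reboot, look for the dump under /var/crash/
EOF
}
```

Running `crashkernel_reserved && echo ready` before reproducing the BUG shows whether the reservation is in place.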

-----Original Message-----
From: Wengang Wang <wen.gang.wang@oracle.com> 
Sent: Dienstag, 26. März 2019 22:24
To: Daniel Sobe <daniel.sobe@nxp.com>; 'ocfs2-devel at oss.oracle.com' <ocfs2-devel@oss.oracle.com>
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Thank you, Daniel.

Wengang

On 2019/3/26 5:27, Daniel Sobe wrote:
> Hi Wengang,
>
> Thanks for confirming that this bug is reproducible! For a long time I was under the impression that I was the only one facing this issue.
>
> Unfortunately, I do not know what a "vmcore" is. Let me google it and then check whether I can reproduce the bug again easily and provide what you request.
>
> Regards,
>
> Daniel
>
> -----Original Message-----
> From: ocfs2-devel-bounces at oss.oracle.com 
> <ocfs2-devel-bounces@oss.oracle.com> On Behalf Of Wengang
> Sent: Montag, 18. März 2019 18:46
> To: ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Hi,
>
> I also see this problem on an older version, 4.1.12.xxx.
>
> l_ro_holders is changed in the ocfs2 layer, not the DLM layer. Another thing is that the dentry lock is only used to get notified when a file is deleted remotely, so the code requests a PR lock and releases it again once the PR is granted. I am not sure, but I feel this is not a DLM issue but a memory issue on the ocfs2_lock_res. Do you have a vmcore available for this problem?
>
> Thanks,
> Wengang
>
>
> On 02/20/2019 12:48 AM, Daniel Sobe wrote:
>> Hi Larry,
>>
>> The issue still happens with 4.19 as well, but it took quite a while to trigger it:
>>
>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24018 Comm: git Not tainted 4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
>> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
>> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
>> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
>> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
>> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
>> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
>> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
>> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
>> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
>> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
>> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
>> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
>> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
>> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
>> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
>> Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
>> Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc258ff8d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>> Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc258ff8e50 RCX: 00007fc2622d80f5
>> Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc258ff8d40 RSI: 00007fc258ff8d40 RDI: 00007fc2300008c0
>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000045 R08: 0000000000000003 R09: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
>> Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000d R14: 0000000000000015 R15: 000055ec17f94d58
>> Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
>> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
>> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe1 ]---
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
>> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#2] SMP PTI
>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24024 Comm: git Tainted: G      D           4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c8177af8 EFLAGS: 00010046
>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fdf3b23418 RCX: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fdf3b23418 RDI: ffff95fdf3b23494
>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: ffff95fe91aa2620 R09: 0000000000000089
>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c8177ae0 R11: ffff95fe6e3efb40 R12: ffff95fdf3b23494
>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
>> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Feb 20 09:37:56 drs1p001 kernel: Call Trace:
>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x29d/0x3f0
>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
>> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
>> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
>> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
>> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
>> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
>> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
>> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
>> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
>> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
>> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
>> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
>> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
>> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
>> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
>> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
>> Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
>> Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc24d7f9d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>> Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc24d7f9e50 RCX: 00007fc2622d80f5
>> Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc24d7f9d40 RSI: 00007fc24d7f9d40 RDI: 00007fc2100008c0
>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000044 R08: 0000000000000003 R09: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
>> Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000f R14: 000000000000001d R15: 000055ec18015008
>> Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
>> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
>> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe2 ]---
>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
>> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: Daniel Sobe
>> Sent: Dienstag, 11. September 2018 13:36
>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>> Subject: RE: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Larry,
>>
>> I tested your script and indeed it does not provoke the error. Meanwhile I switched to a newer kernel, which makes the error harder to provoke. Here is the stack trace:
>>
>> Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
>> Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
>> Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>> Sep 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted 4.18.0-1-amd64 #1 Debian 4.18.6-1
>> Sep 11 13:08:51 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
>> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
>> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
>> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
>> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
>> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
>> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
>> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
>> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>> Sep 11 13:08:51 drs1p002 kernel: Call Trace:
>> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>> Sep 11 13:08:51 drs1p002 kernel:  ? d_splice_alias+0x299/0x410
>> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>> Sep 11 13:08:51 drs1p002 kernel:  __lookup_slow+0x97/0x150
>> Sep 11 13:08:51 drs1p002 kernel:  lookup_slow+0x35/0x50
>> Sep 11 13:08:51 drs1p002 kernel:  walk_component+0x1c4/0x480
>> Sep 11 13:08:51 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>> Sep 11 13:08:51 drs1p002 kernel:  ? path_init+0x177/0x2f0
>> Sep 11 13:08:51 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>> Sep 11 13:08:51 drs1p002 kernel:  filename_lookup+0xb6/0x190
>> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>> Sep 11 13:08:51 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>> Sep 11 13:08:51 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>> Sep 11 13:08:51 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>> Sep 11 13:08:51 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>> Sep 11 13:08:51 drs1p002 kernel:  vfs_statx+0x73/0xe0
>> Sep 11 13:08:51 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>> Sep 11 13:08:51 drs1p002 kernel:  do_syscall_64+0x55/0x110
>> Sep 11 13:08:51 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995
>> Sep 11 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4 0c 00 f7 d8 64 89
>> Sep 11 13:08:51 drs1p002 kernel: RSP: 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>> Sep 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f2434e20390 RCX: 00007f24b6cc5995
>> Sep 11 13:08:51 drs1p002 kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI: 00007f24640dd9d0
>> Sep 11 13:08:51 drs1p002 kernel: RBP: 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800
>> Sep 11 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246 R12: 00007f24640dd9d0
>> Sep 11 13:08:51 drs1p002 kernel: R13: 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800
>> Sep 11 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me drm_kms_helper snd_timer snd soundcore pcspkr serio_raw efivars drm mei lpc_ich i2c_algo_bit sg ie31200_edac video pcc_cpufreq button drbd lru_cache libcrc32c parport_pc sunrpc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16
>> Sep 11 13:08:51 drs1p002 kernel:  mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801 libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod usbcore e1000e usb_common thermal
>> Sep 11 13:08:51 drs1p002 kernel: ---[ end trace feba92ba6e432478 ]---
>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
>> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
>> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
>> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
>> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
>> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
>> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
>> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
>> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>>
>>
>> All I can say is that I was using Git heavily when this happened (in Eclipse, synchronizing the Git workspace). It took me around 30 minutes to see the bug again.
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: Larry Chen <lchen@suse.com>
>> Sent: Mittwoch, 18. Juli 2018 10:09
>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi Daniel,
>>
>> Which stack do you use? dlm or o2cb??
>>
>> I tried to reproduce the bug.
>>
>> I have set up 2 virtual machines that share one block device (a qcow2 file on the host). I was using the dlm stack instead of o2cb. The kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script.
>>
>> #!/bin/bash
>> # Check out every tag in turn to generate heavy file system activity.
>> for i in $(git tag)
>> do
>>     echo "$i"
>>     git checkout "$i"
>> done
>>
>> Bug could not be reproduced.
>>
>> According to the back trace, I think the bug is caused by the logic of holding a lock.
>>
>> If so, I think the bug will recur even without DRBD, LVM or other components.
>>
>> Regards,
>> Larry
>>
>> On 07/17/2018 04:11 PM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
>>>
>>> The only additional simplification that I can think of are the layers on top of the SSD. Currently I have:
>>>
>>> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
>>>
>>> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>>> -----Original Message-----
>>> From: Larry Chen [mailto:lchen at suse.com]
>>> Sent: Dienstag, 17. Juli 2018 04:54
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Could you please simplify your environment?
>>> Can I use several virtual machines to reproduce the bug??
>>>
>>> Thanks
>>> Larry
>>>
>>> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>>>> Hi,
>>>>
>>>> the same issue happens with 4.17.6 kernel from Debian unstable.
>>>>
>>>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>>>
>>>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>>>
>>>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>>>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>>>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>>> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
>>>> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
>>>> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
>>>> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
>>>> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
>>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
>>>> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
>>>> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
>>>> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
>>>> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
>>>> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
>>>> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>>>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>>>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>>>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>>>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>>>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>>>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>>>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>>>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>>>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>>>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>>>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>>>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>>>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>>>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>>>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
>>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>>>> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
>>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
>>>> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
>>>> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
>>>> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
>>>> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
>>>> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
>>>>
>>>> -----Original Message-----
>>>> From: ocfs2-devel-bounces at oss.oracle.com [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel Sobe
>>>> Sent: Freitag, 13. Juli 2018 13:56
>>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Larry,
>>>>
>>>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
>>>>
>>>> Again I was using a LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported was while doing "git checkout", triggering a lot of change in workspace files.
>>>>
>>>> Next I will be using kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to exclude that the issue is related to namespaces, control groups and whatever else is only present in a container.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Freitag, 13. Juli 2018 11:49
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Thanks for your effort to reproduce the bug.
>>>> I can confirm that there exist more than one bug.
>>>> I'll focus on this interesting issue.
>>>>
>>>>
>>>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> sorry for not responding any earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>>>
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>>> This is caused by ocfs2_dentry_lock failing.
>>>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
>>>>
>>>> But why it failed still confuses me.
>>>>
>>>>
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>>>
>>>>> As this occurred while compiling C code with "-j" I think we were on the wrong track, it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), who claimed the issue went away after he restricted to 1 active CPU core.
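The traces in this thread all reach ocfs2_lookup through a stat-type system call (vfs_statx), and the observation above points at concurrency across CPU cores. A minimal sketch of a load generator along those lines, using only the Python standard library: it walks a directory tree and issues os.lstat() from several threads at once. The function names and the default worker count are made up for illustration, and it is not known whether this load actually triggers the BUG.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def collect_paths(top):
    """Return every directory and file path found below `top`."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def lstat_ok(path):
    """Issue one lstat() call; report whether it succeeded."""
    try:
        os.lstat(path)
        return True
    except OSError:
        return False


def parallel_lstat(top, workers=8, rounds=1):
    """lstat() every path below `top` from `workers` threads, `rounds` times.

    Returns the number of successful lstat() calls.
    """
    paths = collect_paths(top)
    done = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            done += sum(pool.map(lstat_ok, paths))
    return done
```

The intended use would be to point it at a large, freshly checked-out tree on the OCFS2 volume, for example `parallel_lstat("/home/user/repo", workers=16, rounds=10)` (the path is hypothetical). Repeated rounds are likely to be answered from the dentry cache, so they may not reach ocfs2_lookup again unless the cache is dropped in between.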
>>>>>
>>>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>>>
>>>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Freitag, 11. Mai 2018 09:01
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>>>
>>>>>> * Install the LXC packages from the distribution
>>>>>> * run the command "lxc-create -n test1 -t download"
>>>>>> ** first run might prompt you to generate a ~/.config/lxc/default.conf to define UID mappings
>>>>>> ** in a corporate environment it might be tricky to set the http_proxy (and maybe even https_proxy) environment variables correctly
>>>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>>>> * the container downloads to ~/.local/share/lxc/
>>>>>> * adapt the "config" file in that directory to add the shared ocfs2 mount like in my example below
>>>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>>>> ** if you want to finally enable networking, most distributions arrange a dedicated bridge (lxcbr0) which you can configure similar to my example below
>>>>>> ** in my case I had to install cgroup related tools and reboot to have all cgroups available, and to allow use of lxcbr0 bridge in /etc/lxc/lxc-usernet
>>>>>>
>>>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Donnerstag, 12. April 2018 11:20
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> Quite an interesting issue.
>>>>>>
>>>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>>>
>>>>>> Do you have a script to build up your lxc environment?
>>>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>>>
>>>>>> Thanks,
>>>>>> Larry
>>>>>>
>>>>>>
>>>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> not sure if it helps, the issue wasn't there with Debian 8 and kernel 3.16 - but that's a long history. Unfortunately, the only machine where I could try to bisect does not run any kernel < 4.16 without other issues.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>> Sent: Donnerstag, 12. April 2018 05:17
>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> Thanks for your report.
>>>>>>> I'll try to reproduce this bug as you did.
>>>>>>>
>>>>>>> I'm afraid there may be some bugs on the collaboration of cgroups and ocfs2.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Larry
>>>>>>>
>>>>>>>
>>>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>>>> Hi Larry,
>>>>>>>>
>>>>>>>> below is an example config file like I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC) and downloaded a Debian 8 container as user (unprivileged) and adapted the config file. Several of those containers run on one host and share the OCFS2 directory as you can see at the "lxc.mount.entry" line.
>>>>>>>>
>>>>>>>> Meanwhile I'm trying whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success, will report once anything happens.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>> ----
>>>>>>>>
>>>>>>>> # Distribution configuration
>>>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>>>> lxc.arch = x86_64
>>>>>>>>
>>>>>>>> # Container specific configuration
>>>>>>>> lxc.id_map = u 0 624288 65536
>>>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>>>
>>>>>>>> lxc.utsname = container1
>>>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>>>
>>>>>>>> lxc.network.type = veth
>>>>>>>> lxc.network.flags = up
>>>>>>>> lxc.network.link = bridge1
>>>>>>>> lxc.network.name = eth0
>>>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>>>
>>>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>>>
>>>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>>>
>>>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>>>
>>>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>>>
>>>>>>>> ----
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>> Sent: Mittwoch, 11. April 2018 13:31
>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>>>> Hi Larry,
>>>>>>>>>
>>>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>>>
>>>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>>>> I was thinking of the way a directory is shared between host and docker container, like
>>>>>>>>          docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>>>> That's different from yours.
>>>>>>>>
>>>>>>>> How did you setup your lxc or container?
>>>>>>>>
>>>>>>>> If you could, show me the procedure, I'll try to reproduce it.
>>>>>>>>
>>>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points on the local host, will the problem recur?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Larry
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Daniel
>>>>>>>>>
>>>>> Sorry for this delayed reply.
>>>>>
>>>>> I tried with lxc + ocfs2 in your mount-shared way.
>>>>>
>>>>> But I can not reproduce your bugs.
>>>>>
>>>>> What I use is opensuse tumbleweed.
>>>>>
>>>>> The procedure I used to try to reproduce your bug:
>>>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>>>          mount /dev/xxx /mnt
>>>>>    then it shows
>>>>>          207 65 254:16 / /mnt rw,relatime shared:94
>>>>>    I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
>>>>>
>>>>> 1. Start Virtual Machine Manager.
>>>>> 2. Add a local LXC connection by clicking File -> Add Connection.
>>>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>>>> 3. Select the localhost (LXC) connection and click the File -> New Virtual Machine menu.
>>>>> 4. Activate Application container and click Forward.
>>>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>>>>    Click Forward.
>>>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>>>>
>>>>> If possible, could you please provide a shell script to show what you did with your mount point?
>>>>>
>>>>> Thanks
>>>>> Larry
>>>>>
>>>> _______________________________________________
>>>> Ocfs2-devel mailing list
>>>> Ocfs2-devel at oss.oracle.com
>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>>
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2019-03-27  7:57                                             ` Daniel Sobe
@ 2019-03-27 18:17                                               ` Wengang
  2019-04-29 13:47                                                 ` Daniel Sobe
  0 siblings, 1 reply; 32+ messages in thread
From: Wengang @ 2019-03-27 18:17 UTC (permalink / raw)
  To: ocfs2-devel

BUG() would panic the kernel, dump the call trace on the console, and
then usually reboot the machine.

Kexec (when kdump is installed and configured properly) would, when the
production kernel panics, start a new kernel (the dump kernel) from a
reserved block of memory and keep the rest of memory untouched. Once the
dump kernel has booted, it collects the untouched memory and saves it
to a file on disk (usually called "vmcore"). It then reboots into the
production kernel once the collection has finished.
So, yes, the vmcore is an image file with the kernel memory inside.

Here is an example regarding kdump configuration:

https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
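
As a rough aid for the setup described above, the following Python sketch reports whether a machine looks ready to capture a vmcore. It reads three standard Linux interfaces: the crashkernel= parameter in /proc/cmdline, /sys/kernel/kexec_crash_loaded, and the kernel.panic_on_oops sysctl. The function names and the optional root argument (there only so the check can be pointed at a test directory) are assumptions of this sketch. The panic_on_oops check matters here because, with that sysctl at 0, a BUG() in process context typically kills only the offending task and leaves the system running, so no dump would be taken automatically.

```python
from pathlib import Path


def read_text(path):
    """Return the stripped content of `path`, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def kdump_readiness(root="/"):
    """Report the three conditions a vmcore capture after a BUG() depends on.

    `root` is "/" on a real system; it is a parameter only to allow testing.
    """
    root = Path(root)
    cmdline = read_text(root / "proc/cmdline") or ""
    return {
        # Memory for the dump kernel was reserved at boot time.
        "crashkernel_reserved": any(
            arg.startswith("crashkernel=") for arg in cmdline.split()
        ),
        # A dump kernel has been loaded and is ready to be started by kexec.
        "crash_kernel_loaded": read_text(root / "sys/kernel/kexec_crash_loaded") == "1",
        # An oops (which is what BUG() raises) is escalated to a panic.
        "panic_on_oops": read_text(root / "proc/sys/kernel/panic_on_oops") == "1",
    }
```

Calling `kdump_readiness()` without arguments inspects the running system. If only panic_on_oops is reported as False, the system may stay up after the BUG, and a crash would have to be provoked manually (for example through the sysrq trigger) to obtain the vmcore.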

thanks,
wengang
On 03/27/2019 12:57 AM, Daniel Sobe wrote:
> In my setup, the system is still responsive (at least for a couple of minutes) after the BUG. Do I understand it correctly that you want me to setup "kdump" and provoke a crash manually after this BUG occurred, in order to receive an image file with all kernel memory inside?
>
> Sorry for the stupid question, but I'm new to this.
>
> Regards,
>
> Daniel
>
> -----Original Message-----
> From: Wengang Wang <wen.gang.wang@oracle.com>
> Sent: Dienstag, 26. März 2019 22:24
> To: Daniel Sobe <daniel.sobe@nxp.com>; 'ocfs2-devel at oss.oracle.com' <ocfs2-devel@oss.oracle.com>
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Thank you, Daniel.
>
> Wengang
>
> On 2019/3/26 5:27, Daniel Sobe wrote:
>> Hi Wengang,
>>
>> Thanks for confirming that this bug is reproducible! For a long time I was under the impression that I was the only one facing this issue.
>>
>> Unfortunately, I do not know what a "vmcore" is. Let me google it and then check whether I can reproduce the bug again easily and provide what you request.
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: ocfs2-devel-bounces at oss.oracle.com
>> <ocfs2-devel-bounces@oss.oracle.com> On Behalf Of Wengang
>> Sent: Montag, 18. März 2019 18:46
>> To: ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi,
>>
>> I also see this problem on a lower version at 4.1.12.xxx.
>>
>> The l_ro_holders count is changed in the ocfs2 layer, not the DLM layer. Another thing is that the dentry lock is only used to get notified when a file is deleted remotely, so the code requests a PR lock and releases it as soon as the PR is granted. I am not sure, but I feel this is not a DLM issue but a memory issue on the ocfs2_lock_res. Do you have a vmcore available for this problem?
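
To make the holder accounting described above concrete, here is a minimal Python sketch. It is a toy model, not the kernel code: the class, the function names and the exception are invented for illustration, and only the idea (a per-level holder count that must never be decremented at zero) is taken from the discussion. It shows one way such a check can fire, namely an unlock that is not paired with a successful lock; a corrupted counter, as suspected above, would trip the same check.

```python
PR = "PR"  # protected read (shared) level
EX = "EX"  # exclusive level


class HolderUnderflow(Exception):
    """Stands in for the kernel BUG: a holder count would drop below zero."""


class LockRes:
    """Toy lock resource that only tracks how many holders each level has."""

    def __init__(self):
        self.ro_holders = 0  # loosely mirrors l_ro_holders
        self.ex_holders = 0

    def inc_holders(self, level):
        if level == PR:
            self.ro_holders += 1
        elif level == EX:
            self.ex_holders += 1
        else:
            raise ValueError(level)

    def dec_holders(self, level):
        if level == PR:
            if self.ro_holders == 0:
                raise HolderUnderflow("PR unlock without a PR holder")
            self.ro_holders -= 1
        elif level == EX:
            if self.ex_holders == 0:
                raise HolderUnderflow("EX unlock without an EX holder")
            self.ex_holders -= 1
        else:
            raise ValueError(level)


def lookup(lockres, lock_succeeds, unlock_on_failure):
    """Take a PR lock and drop it right away, as described for the dentry lock.

    `unlock_on_failure` selects whether the unlock is still attempted after
    a failed lock attempt (hypothetical flag, for demonstration only).
    """
    if lock_succeeds:
        lockres.inc_holders(PR)
    elif not unlock_on_failure:
        return "lock failed, nothing to release"
    lockres.dec_holders(PR)
    return "locked and released"
```

With a fresh `LockRes()`, `lookup(res, True, True)` leaves the count at zero again, while `lookup(res, False, True)` raises `HolderUnderflow`, which is the toy counterpart of the BUG in the traces.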
>>
>> Thanks,
>> Wengang
>>
>>
>> On 02/20/2019 12:48 AM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> The issue still happens with 4.19 as well, but it took quite a while to trigger it:
>>>
>>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
>>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24018 Comm: git Not tainted 4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
>>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
>>> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
>>> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
>>> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
>>> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
>>> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
>>> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
>>> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
>>> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
>>> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
>>> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
>>> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5 Feb 20
>>> 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8
>>> ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6
>>> b8
>>> 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd
>>> 2b
>>> 00 f7 d8 64 89 Feb 20 09:37:56 drs1p001 kernel: RSP:
>>> 002b:00007fc258ff8d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006 Feb
>>> 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX:
>>> 00007fc258ff8e50 RCX: 00007fc2622d80f5 Feb 20 09:37:56 drs1p001
>>> kernel: RDX: 00007fc258ff8d40 RSI: 00007fc258ff8d40 RDI:
>>> 00007fc2300008c0 Feb 20 09:37:56 drs1p001 kernel: RBP:
>>> 0000000000000045 R08: 0000000000000003 R09: 0000000000000000 Feb 20
>>> 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246
>>> R12: 0000000000000005 Feb 20 09:37:56 drs1p001 kernel: R13:
>>> 000000000000000d R14: 0000000000000015 R15: 000055ec17f94d58 Feb 20
>>> 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag
>>> unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb
>>> ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree
>>> dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi
>>> rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl
>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915
>>> irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof
>>> sparse_keymap snd_hda_intel dell_wmi_descriptor evdev
>>> ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate
>>> snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm
>>> drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit
>>> intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
>>> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
>>> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe1 ]---
>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
>>> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#2] SMP PTI
>>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24024 Comm: git Tainted: G      D           4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c8177af8 EFLAGS: 00010046
>>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fdf3b23418 RCX: 0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fdf3b23418 RDI: ffff95fdf3b23494
>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: ffff95fe91aa2620 R09: 0000000000000089
>>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c8177ae0 R11: ffff95fe6e3efb40 R12: ffff95fdf3b23494
>>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
>>> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Feb 20 09:37:56 drs1p001 kernel: Call Trace:
>>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x29d/0x3f0
>>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
>>> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
>>> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
>>> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
>>> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
>>> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
>>> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
>>> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
>>> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
>>> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
>>> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
>>> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5 Feb 20
>>> 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8
>>> ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6
>>> b8
>>> 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd
>>> 2b
>>> 00 f7 d8 64 89 Feb 20 09:37:56 drs1p001 kernel: RSP:
>>> 002b:00007fc24d7f9d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006 Feb
>>> 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX:
>>> 00007fc24d7f9e50 RCX: 00007fc2622d80f5 Feb 20 09:37:56 drs1p001
>>> kernel: RDX: 00007fc24d7f9d40 RSI: 00007fc24d7f9d40 RDI:
>>> 00007fc2100008c0 Feb 20 09:37:56 drs1p001 kernel: RBP:
>>> 0000000000000044 R08: 0000000000000003 R09: 0000000000000000 Feb 20
>>> 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246
>>> R12: 0000000000000005 Feb 20 09:37:56 drs1p001 kernel: R13:
>>> 000000000000000f R14: 000000000000001d R15: 000055ec18015008 Feb 20
>>> 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag
>>> unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb
>>> ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree
>>> dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi
>>> rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl
>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915
>>> irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof
>>> sparse_keymap snd_hda_intel dell_wmi_descriptor evdev
>>> ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate
>>> snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm
>>> drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit
>>> intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi
>>> video button acpi_pad pcc_cpufreq ip_tables x_tables Feb 20 09:37:56
>>> drs1p001
>>> kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb
>>> sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd
>>> cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse
>>> xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan Feb
>>> 20
>>> 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe2 ]--- Feb 20
>>> 09:37:56 drs1p001 kernel: RIP:
>>> 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2] Feb 20 09:37:56
>>> drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b
>>> 53
>>> 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f
>>> 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00
>>> 0f 1f 44 Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8
>>> EFLAGS: 00010046 Feb 20 09:37:56 drs1p001 kernel: RAX:
>>> 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000 Feb 20
>>> 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618
>>> RDI: ffff95fe8cec9694 Feb 20 09:37:56 drs1p001 kernel: RBP:
>>> 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153 Feb 20
>>> 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b
>>> R12: ffff95fe8cec9694 Feb 20 09:37:56 drs1p001 kernel: R13:
>>> ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0 Feb 20
>>> 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000)
>>> GS:ffff95fe91a80000(0000) knlGS:0000000000000000 Feb 20 09:37:56
>>> drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Feb
>>> 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3:
>>> 00000001617fc002 CR4: 00000000003606e0 Feb 20 09:37:56 drs1p001
>>> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>> 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: DR3:
>>> 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> -----Original Message-----
>>> From: Daniel Sobe
>>> Sent: Tuesday, 11 September 2018 13:36
>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>> Subject: RE: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Larry,
>>>
>>> I tested your script and indeed it does not provoke the error. Meanwhile I have switched to a newer kernel, which makes the error harder to provoke. Here is the stack trace:
>>>
>>> Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
>>> Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
>>> Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>> Sep 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted 4.18.0-1-amd64 #1 Debian 4.18.6-1
>>> Sep 11 13:08:51 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
>>> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
>>> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
>>> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
>>> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
>>> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
>>> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
>>> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
>>> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>>> Sep 11 13:08:51 drs1p002 kernel: Call Trace:
>>> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel:  ? d_splice_alias+0x299/0x410
>>> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel:  __lookup_slow+0x97/0x150
>>> Sep 11 13:08:51 drs1p002 kernel:  lookup_slow+0x35/0x50
>>> Sep 11 13:08:51 drs1p002 kernel:  walk_component+0x1c4/0x480
>>> Sep 11 13:08:51 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>>> Sep 11 13:08:51 drs1p002 kernel:  ? path_init+0x177/0x2f0
>>> Sep 11 13:08:51 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>>> Sep 11 13:08:51 drs1p002 kernel:  filename_lookup+0xb6/0x190
>>> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>>> Sep 11 13:08:51 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>>> Sep 11 13:08:51 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>>> Sep 11 13:08:51 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>>> Sep 11 13:08:51 drs1p002 kernel:  vfs_statx+0x73/0xe0
>>> Sep 11 13:08:51 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>>> Sep 11 13:08:51 drs1p002 kernel:  do_syscall_64+0x55/0x110
>>> Sep 11 13:08:51 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995 Sep 11
>>> 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8
>>> ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6
>>> b8
>>> 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4
>>> 0c
>>> 00 f7 d8 64 89 Sep 11 13:08:51 drs1p002 kernel: RSP:
>>> 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006 Sep
>>> 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX:
>>> 00007f2434e20390 RCX: 00007f24b6cc5995 Sep 11 13:08:51 drs1p002
>>> kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI:
>>> 00007f24640dd9d0 Sep 11 13:08:51 drs1p002 kernel: RBP:
>>> 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800 Sep 11
>>> 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246
>>> R12: 00007f24640dd9d0 Sep 11 13:08:51 drs1p002 kernel: R13:
>>> 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800 Sep 11
>>> 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag
>>> unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
>>> ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse
>>> snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic
>>> nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp
>>> vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm
>>> irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas
>>> ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate
>>> intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me
>>> drm_kms_helper snd_timer snd soundcore pcspkr serio_raw efivars drm
>>> mei lpc_ich i2c_algo_bit sg ie31200_edac video pcc_cpufreq button
>>> drbd lru_cache libcrc32c parport_pc sunrpc ppdev lp parport efivarfs
>>> ip_tables x_tables autofs4 ext4 crc16 Sep 11 13:08:51 drs1p002
>>> kernel:  mbcache
>>> jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper
>>> aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801
>>> libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod
>>> usbcore e1000e usb_common thermal Sep 11 13:08:51 drs1p002 kernel:
>>> ---[ end trace feba92ba6e432478 ]--- Sep 11 13:08:51 drs1p002 kernel:
>>> RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] Sep 11
>>> 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9
>>> 6e
>>> 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2
>>> 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00
>>> 00
>>> 00 00 90 0f 1f Sep 11 13:08:51 drs1p002 kernel: RSP:
>>> 0018:ffffb1248eeb3af8 EFLAGS: 00010046 Sep 11 13:08:51 drs1p002
>>> kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX:
>>> 0000000000000100 Sep 11 13:08:51 drs1p002 kernel: RDX:
>>> 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94 Sep 11
>>> 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000
>>> R09: 000000000000aa47 Sep 11 13:08:51 drs1p002 kernel: R10:
>>> ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003 Sep 11
>>> 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000
>>> R15: ffffffffc0ab3240 Sep 11 13:08:51 drs1p002 kernel: FS:
>>> 00007f2434e21700(0000) GS:ffff95ce9e200000(0000)
>>> knlGS:0000000000000000 Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS:
>>> 0000 ES: 0000 CR0: 0000000080050033 Sep 11 13:08:51 drs1p002 kernel:
>>> CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>>>
>>>
>>> All I can say is that I was using Git heavily when this happened (in Eclipse, synchronizing the Git workspace). It took me around 30 minutes to see the bug again.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> -----Original Message-----
>>> From: Larry Chen <lchen@suse.com>
>>> Sent: Wednesday, 18 July 2018 10:09
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Which stack do you use, dlm or o2cb?
>>>
>>> I tried to reproduce the bug.
>>>
>>> I have set up 2 virtual machines that share one block device (a qcow2 file on the host), using the dlm stack instead of o2cb. The kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script.
>>>
>>> #!/bin/bash
>>> # Check out every tag in turn to generate heavy file churn on the volume.
>>> for i in $(git tag)
>>> do
>>>     echo "$i"
>>>     git checkout "$i"
>>> done
>>>
>>> The bug could not be reproduced.
>>>
>>> Judging from the backtrace, I think the bug is caused by the lock-holding logic.
>>>
>>> If so, I think the bug should recur even without DRBD, LVM or other components.
>>>
>>> Regards,
>>> Larry
>>>
>>> On 07/17/2018 04:11 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
>>>>
>>>> The only further simplification I can think of is to remove some of the layers on top of the SSD. Currently I have:
>>>>
>>>> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
>>>>
>>>> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Tuesday, 17 July 2018 04:54
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Could you please simplify your environment?
>>>> Can I use several virtual machines to reproduce the bug?
>>>>
>>>> Thanks
>>>> Larry
>>>>
>>>> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>>>>> Hi,
>>>>>
>>>>> The same issue happens with the 4.17.6 kernel from Debian unstable.
>>>>>
>>>>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>>>>
>>>>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>>>>
>>>>> After the issue occurs, I have about 2 minutes before the system becomes unusable. Is there anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>>>>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>>>>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>>>> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
>>>>> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
>>>>> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
>>>>> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>>>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
>>>>> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
>>>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
>>>>> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
>>>>> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
>>>>> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
>>>>> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
>>>>> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
>>>>> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>>>>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>>>>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>>>>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>>>>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>>>>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995 Jul 16
>>>>> 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS:
>>>>> 00000246
>>>>> ORIG_RAX: 0000000000000006 Jul 16 13:40:24 drs1p002 kernel: RAX:
>>>>> ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995 Jul 16
>>>>> 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI:
>>>>> 00007f0ce8ff8ce0
>>>>> RDI: 00007f0cb0000b20 Jul 16 13:40:24 drs1p002 kernel: RBP:
>>>>> 0000000000000017 R08: 0000000000000003 R09: 0000000000000000 Jul 16
>>>>> 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11:
>>>>> 0000000000000246
>>>>> R12: 00007f0ce8ff8dc4 Jul 16 13:40:24 drs1p002 kernel: R13:
>>>>> 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005 Jul 16
>>>>> 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d
>>>>> e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53
>>>>> 6c
>>>>> 85
>>>>> d2 74 c5 e Jul 16 13:40:24 drs1p002 kernel: RIP:
>>>>> __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP:
>>>>> ffff9e57887dfaf8 Jul 16 13:40:24 drs1p002 kernel: ---[ end trace
>>>>> a5a84fa62e77df42 ]---
>>>>>
>>>>> -----Original Message-----
>>>>> From: ocfs2-devel-bounces at oss.oracle.com
>>>>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel
>>>>> Sobe
>>>>> Sent: Friday, 13 July 2018 13:56
>>>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Larry,
>>>>>
>>>>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
>>>>>
>>>>> Again I was using an LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" in my OCFS2 home directory. The next crash I reported happened while doing "git checkout", which triggers a lot of changes to workspace files.
>>>>>
>>>>> Next I will use kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to rule out that the issue is related to namespaces, control groups or anything else that is only present in a container.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Friday, 13 July 2018 11:49
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Thanks for your effort to reproduce the bug.
>>>>> I can confirm that there is more than one bug.
>>>>> I'll focus on this interesting issue.
>>>>>
>>>>>
>>>>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> Sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>>>>
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode: 0000 [#1] SMP PTI
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423923] RIP: 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2]
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS: 00010046
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX: 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI: ffff9d269d990018 RDI: ffff9d269d990094
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423931] RBP: 0000000000000003 R08: 000062d940000000 R09: 000000000000036a
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12: ffff9d269d990094
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423934] R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>>>> This is caused by ocfs2_dentry_lock failing.
>>>>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock has failed.
>>>>>
>>>>> But why it failed still confuses me.
>>>>>
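
The fix described above (skip the unlock when the lock call did not succeed) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `dentry_lock_model`, `model_dentry_lock`, `model_dentry_unlock` and `model_attach_lock` are hypothetical names invented for this illustration, and the code is not taken from fs/ocfs2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry lock: it only counts current holders. */
struct dentry_lock_model {
    int holders;
};

/* Hypothetical lock call: fails without taking a hold when should_fail is set. */
static int model_dentry_lock(struct dentry_lock_model *dl, bool should_fail)
{
    assert(dl != NULL);
    if (should_fail)
        return -1;      /* lock not granted: holder count is unchanged */
    dl->holders++;
    return 0;
}

/* Hypothetical unlock call: only valid while a hold is actually present. */
static void model_dentry_unlock(struct dentry_lock_model *dl)
{
    assert(dl != NULL && dl->holders > 0);
    dl->holders--;
}

/* Take the lock briefly and release it only if it was really taken. */
static int model_attach_lock(struct dentry_lock_model *dl, bool should_fail)
{
    int ret = model_dentry_lock(dl, should_fail);

    if (ret < 0)
        return ret;     /* nothing was taken, so nothing is released */

    model_dentry_unlock(dl);
    return 0;
}
```

In this model, calling `model_attach_lock(&dl, true)` returns an error and leaves `dl.holders` at 0 instead of tripping the assertion in the unlock path. It does not explain why the lock call fails in the first place.
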
>>>>>
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>>>>
>>>>>> As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I reported my issue), whose author claimed the issue went away after restricting the system to 1 active CPU core.
>>>>>>
>>>>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>>>>
>>>>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Friday, 11 May 2018 09:01
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>>>>
>>>>>>> * Install the LXC packages from the distribution
>>>>>>> * run the command "lxc-create -n test1 -t download"
>>>>>>> ** first run might prompt you to generate a
>>>>>>> ~/.config/lxc/default.conf to define UID mappings
>>>>>>> ** in a corporate environment it might be tricky to set the
>>>>>>> http_proxy (and maybe even https_proxy) environment variables
>>>>>>> correctly
>>>>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>>>>> * the container downloads to ~/.local/share/lxc/
>>>>>>> * adapt the "config" file in that directory to add the shared
>>>>>>> ocfs2 mount like in my example below
>>>>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>>>>> ** if you want to finally enable networking, most distributions
>>>>>>> arrange a dedicated bridge (lxcbr0) which you can configure
>>>>>>> similar to my example below
>>>>>>> ** in my case I had to install cgroup related tools and reboot to
>>>>>>> have all cgroups available, and to allow use of lxcbr0 bridge in
>>>>>>> /etc/lxc/lxc-usernet
>>>>>>>
>>>>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this will occur.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>> Sent: Thursday, 12 April 2018 11:20
>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> Quite an interesting issue.
>>>>>>>
>>>>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>>>>
>>>>>>> Do you have a script to build up your lxc environment?
>>>>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Larry
>>>>>>>
>>>>>>>
>>>>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>>>>> Hi Larry,
>>>>>>>>
>>>>>>>> not sure if it helps, but the issue wasn't there with Debian 8 and
>>>>>>>> kernel 3.16 - although that's a long history. Unfortunately, the
>>>>>>>> only machine where I could try to bisect does not run any kernel <
>>>>>>>> 4.16 without other issues.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>> Sent: Thursday, 12 April 2018 05:17
>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>;
>>>>>>>> ocfs2-devel at oss.oracle.com
>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> Thanks for your report.
>>>>>>>> I'll try to reproduce this bug as you did.
>>>>>>>>
>>>>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Larry
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>>>>> Hi Larry,
>>>>>>>>>
>>>>>>>>> below is an example config file as I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>>>>>
>>>>>>>>> Meanwhile I'm testing whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success; I will report once anything happens.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Daniel
>>>>>>>>>
>>>>>>>>> ----
>>>>>>>>>
>>>>>>>>> # Distribution configuration
>>>>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>>>>> lxc.arch = x86_64
>>>>>>>>>
>>>>>>>>> # Container specific configuration
>>>>>>>>> lxc.id_map = u 0 624288 65536
>>>>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>>>>
>>>>>>>>> lxc.utsname = container1
>>>>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>>>>
>>>>>>>>> lxc.network.type = veth
>>>>>>>>> lxc.network.flags = up
>>>>>>>>> lxc.network.link = bridge1
>>>>>>>>> lxc.network.name = eth0
>>>>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>>>>
>>>>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>>>>
>>>>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>>>>
>>>>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>>>>
>>>>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>>>>
>>>>>>>>> ----
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>>> Sent: Wednesday, 11 April 2018 13:31
>>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>;
>>>>>>>>> ocfs2-devel at oss.oracle.com
>>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>>>>> Hi Larry,
>>>>>>>>>>
>>>>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>>>>
>>>>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>>>>> I thought there was a way to share a directory between the host and a docker container, like
>>>>>>>>>           docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>>>>> That's different from yours.
>>>>>>>>>
>>>>>>>>> How did you setup your lxc or container?
>>>>>>>>>
>>>>>>>>> If you could, show me the procedure, I'll try to reproduce it.
>>>>>>>>>
>>>>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points on the local host, will the problem recur?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Larry
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Daniel
>>>>>>>>>>
>>>>>> Sorry for this delayed reply.
>>>>>>
>>>>>> I tried lxc + ocfs2 with a shared mount, as in your setup.
>>>>>>
>>>>>> But I cannot reproduce your bug.
>>>>>>
>>>>>> I am using openSUSE Tumbleweed.
>>>>>>
>>>>>> The procedure I used to try to reproduce your bug:
>>>>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>>>>        mount /dev/xxx /mnt
>>>>>>    then it shows
>>>>>>        207 65 254:16 / /mnt rw,relatime shared:94
>>>>>>    I think this *shared* is what you want. This mount point will be shared among multiple namespaces.
>>>>>>
>>>>>> 1. Start Virtual Machine Manager.
>>>>>> 2. Add a local LXC connection by clicking File > Add Connection.
>>>>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>>>>> 3. Select the localhost (LXC) connection and click the File > New Virtual Machine menu.
>>>>>> 4. Activate Application container and click Forward.
>>>>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>>>>>    Click Forward.
>>>>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>>>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
>>>>>>
>>>>>> If possible, could you please provide a shell script showing what you did with your mount point?
>>>>>>
>>>>>> Thanks
>>>>>> Larry
>>>>>>
>>>>> _______________________________________________
>>>>> Ocfs2-devel mailing list
>>>>> Ocfs2-devel at oss.oracle.com
>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2019-03-27 18:17                                               ` Wengang
@ 2019-04-29 13:47                                                 ` Daniel Sobe
  2019-04-29 15:57                                                   ` Wengang Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Sobe @ 2019-04-29 13:47 UTC (permalink / raw)
  To: ocfs2-devel

Hi Wengang,

today I could reproduce the bug once again. It seems that kdump/kexec did not trigger, maybe because the system was still alive for a while after the bug occurred.

What can I do now? This is a really nasty bug that I have been hunting for over a year now, without any clue or progress.

With some luck you can trigger it by heavily using "git filter-branch", but I do not have a reliable procedure; just wildly using a subdirectory filter on the kernel sources doesn't seem to do the trick.

Here is the new log output:

Apr 29 15:26:01 drs1p001 kernel: ------------[ cut here ]------------
Apr 29 15:26:01 drs1p001 kernel: kernel BUG at /build/linux-tpKJY9/linux-4.19.28/fs/ocfs2/dlmglue.c:849!
Apr 29 15:26:01 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
Apr 29 15:26:01 drs1p001 kernel: CPU: 2 PID: 29028 Comm: git Not tainted 4.19.0-0.bpo.4-amd64 #1 Debian 4.19.28-2~bpo9+1
Apr 29 15:26:01 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
Apr 29 15:26:01 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
Apr 29 15:26:01 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 f1 32 5a fb 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
Apr 29 15:26:01 drs1p001 kernel: RSP: 0018:ffffad8d0df07af8 EFLAGS: 00010046
Apr 29 15:26:01 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff8d630e9fd818 RCX: 0000000000000000
Apr 29 15:26:01 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff8d630e9fd818 RDI: ffff8d630e9fd894
Apr 29 15:26:01 drs1p001 kernel: RBP: 0000000000000003 R08: 0000729cc0000000 R09: 00000000000214c0
Apr 29 15:26:01 drs1p001 kernel: R10: ffffad8d0df07ae0 R11: 0000000000000018 R12: ffff8d630e9fd894
Apr 29 15:26:01 drs1p001 kernel: R13: ffff8d6455c8e000 R14: 0000000000000000 R15: ffffffffc0c3c2c0
Apr 29 15:26:01 drs1p001 kernel: FS:  00007f8104e1d700(0000) GS:ffff8d6511b00000(0000) knlGS:0000000000000000
Apr 29 15:26:01 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 29 15:26:01 drs1p001 kernel: CR2: 00007f80d0000010 CR3: 0000000115180005 CR4: 00000000003606e0
Apr 29 15:26:01 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 29 15:26:01 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 29 15:26:01 drs1p001 kernel: Call Trace:
Apr 29 15:26:01 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
Apr 29 15:26:01 drs1p001 kernel:  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
Apr 29 15:26:01 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
Apr 29 15:26:01 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
Apr 29 15:26:01 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
Apr 29 15:26:01 drs1p001 kernel:  __lookup_slow+0x97/0x150
Apr 29 15:26:01 drs1p001 kernel:  lookup_slow+0x35/0x50
Apr 29 15:26:01 drs1p001 kernel:  walk_component+0x1c6/0x360
Apr 29 15:26:01 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
Apr 29 15:26:01 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
Apr 29 15:26:01 drs1p001 kernel:  path_lookupat+0x67/0x200
Apr 29 15:26:01 drs1p001 kernel:  ? ___bpf_prog_run+0xb96/0xf20
Apr 29 15:26:01 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
Apr 29 15:26:01 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
Apr 29 15:26:01 drs1p001 kernel:  ? __check_object_size+0x161/0x1a0
Apr 29 15:26:01 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
Apr 29 15:26:01 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
Apr 29 15:26:01 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
Apr 29 15:26:01 drs1p001 kernel:  vfs_statx+0x73/0xe0
Apr 29 15:26:01 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
Apr 29 15:26:01 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
Apr 29 15:26:01 drs1p001 kernel:  do_syscall_64+0x55/0x110
Apr 29 15:26:01 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 29 15:26:01 drs1p001 kernel: RIP: 0033:0x7f8108f01335
Apr 29 15:26:01 drs1p001 kernel: Code: 69 db 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 31 db 2b 00 f7 d8 64 89
Apr 29 15:26:01 drs1p001 kernel: RSP: 002b:00007f8104e1cd08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 29 15:26:01 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007f8104e1ce50 RCX: 00007f8108f01335
Apr 29 15:26:01 drs1p001 kernel: RDX: 00007f8104e1cd40 RSI: 00007f8104e1cd40 RDI: 00007f80d80008c0
Apr 29 15:26:01 drs1p001 kernel: RBP: 000000000000003f R08: 0000000000000003 R09: 0000000000000000
Apr 29 15:26:01 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
Apr 29 15:26:01 drs1p001 kernel: R13: 0000000000000006 R14: 0000000000000017 R15: 000055d17cc6c568
Apr 29 15:26:01 drs1p001 kernel: Modules linked in: ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs tcp_diag inet_diag unix_diag appletalk psnap ax25 veth bridge stp llc iptable_filter snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic fuse intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp mei_wdt kvm_intel i915 wmi_bmof dell_wmi sparse_keymap dell_smbios dell_wmi_descriptor evdev kvm snd_hda_intel drm_kms_helper irqbypass snd_hda_codec crct10dif_pclmul snd_hda_core crc32_pclmul snd_hwdep drm dcdbas snd_pcm ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf mei_me snd_timer iTCO_wdt snd serio_raw iTCO_vendor_support i2c_algo_bit soundcore sg mei pcspkr intel_pch_thermal wmi video acpi_pad button pcc_cpufreq drbd lru_cache libcrc32c
Apr 29 15:26:01 drs1p001 kernel:  ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci libata xhci_pci xhci_hcd scsi_mod i2c_i801 e1000e usbcore e1000 usb_common thermal fan [last unloaded: configfs]
Apr 29 15:26:01 drs1p001 kernel: ---[ end trace f720d1de63741a88 ]---
Apr 29 15:26:01 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
Apr 29 15:26:01 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 f1 32 5a fb 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
Apr 29 15:26:01 drs1p001 kernel: RSP: 0018:ffffad8d0df07af8 EFLAGS: 00010046
Apr 29 15:26:01 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff8d630e9fd818 RCX: 0000000000000000
Apr 29 15:26:01 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff8d630e9fd818 RDI: ffff8d630e9fd894
Apr 29 15:26:01 drs1p001 kernel: RBP: 0000000000000003 R08: 0000729cc0000000 R09: 00000000000214c0
Apr 29 15:26:01 drs1p001 kernel: R10: ffffad8d0df07ae0 R11: 0000000000000018 R12: ffff8d630e9fd894
Apr 29 15:26:01 drs1p001 kernel: R13: ffff8d6455c8e000 R14: 0000000000000000 R15: ffffffffc0c3c2c0
Apr 29 15:26:01 drs1p001 kernel: FS:  00007f8104e1d700(0000) GS:ffff8d6511b00000(0000) knlGS:0000000000000000
Apr 29 15:26:01 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 29 15:26:01 drs1p001 kernel: CR2: 00007f80d0000010 CR3: 0000000115180005 CR4: 00000000003606e0
Apr 29 15:26:01 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 29 15:26:01 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400



-----Original Message-----
From: Wengang <wen.gang.wang@oracle.com> 
Sent: Wednesday, 27 March 2019 19:18
To: Daniel Sobe <daniel.sobe@nxp.com>; 'ocfs2-devel at oss.oracle.com' <ocfs2-devel@oss.oracle.com>
Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

BUG() would panic the kernel, dumping the call trace on the console, and
then usually reboot the machine.

Kexec (when kdump is installed and configured properly) would, when the
production kernel panics, start a new kernel (the dump kernel) from a
reserved block of memory and keep the other memory untouched. Once the
dump kernel has booted, it collects the contents of the untouched memory
and saves them to a file on disk (usually called "vmcore"). It then
reboots into the production kernel after the collection has finished.
So, yes, a vmcore is an image file with the kernel memory inside.

Here is an example regarding kdump configuration:

https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes

thanks,
wengang
On 03/27/2019 12:57 AM, Daniel Sobe wrote:
> In my setup, the system is still responsive (at least for a couple of minutes) after the BUG. Do I understand correctly that you want me to set up "kdump" and provoke a crash manually after this BUG has occurred, in order to obtain an image file with all kernel memory inside?
>
> Sorry for the stupid question, but I'm new to this.
>
> Regards,
>
> Daniel
>
> -----Original Message-----
> From: Wengang Wang <wen.gang.wang@oracle.com>
> Sent: Tuesday, 26 March 2019 22:24
> To: Daniel Sobe <daniel.sobe@nxp.com>; 'ocfs2-devel at oss.oracle.com' <ocfs2-devel@oss.oracle.com>
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> Thank you, Daniel.
>
> Wengang
>
> On 2019/3/26 5:27, Daniel Sobe wrote:
>> Hi Wengang,
>>
>> Thanks for confirming that this bug is reproducible! For a long time I was under the impression that I was the only one facing this issue.
>>
>> Unfortunately, I do not know what a "vmcore" is. Let me google it and then check whether I can reproduce the bug again easily and provide what you request.
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: ocfs2-devel-bounces at oss.oracle.com
>> <ocfs2-devel-bounces@oss.oracle.com> On Behalf Of Wengang
>> Sent: Monday, 18 March 2019 18:46
>> To: ocfs2-devel at oss.oracle.com
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Hi,
>>
>> I also see this problem on an older version, 4.1.12.xxx.
>>
>> l_ro_holders is changed in the ocfs2 layer, not the DLM layer. Another thing is that the dentry lock is only used to get notified of remote file deletion, so the code requests a PR lock and then releases it once the PR is granted. I am not sure, but I feel this is not a DLM issue but a memory issue on the ocfs2_lock_res. Do you have a vmcore available for this problem?
>>
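
The holder accounting mentioned above (take a PR hold, then drop it again) can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel code: `lock_res_model`, `inc_holders`, `dec_holders` and the `LEVEL_*` values are hypothetical names chosen for the illustration, and where the kernel would stop with a BUG the model just returns `false`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock levels; the values only need to be distinct here. */
enum lock_level { LEVEL_PR, LEVEL_EX };

/* Simplified stand-in for the holder counters of a lock resource. */
struct lock_res_model {
    unsigned int ro_holders;   /* holders of the shared (PR) level */
    unsigned int ex_holders;   /* holders of the exclusive (EX) level */
};

/* Record one more holder at the given level. */
static void inc_holders(struct lock_res_model *res, enum lock_level level)
{
    assert(res != NULL);
    if (level == LEVEL_PR)
        res->ro_holders++;
    else
        res->ex_holders++;
}

/*
 * Drop one holder at the given level. Returns false when the counter is
 * already zero, which is the unbalanced case this model is meant to show.
 */
static bool dec_holders(struct lock_res_model *res, enum lock_level level)
{
    unsigned int *count;

    assert(res != NULL);
    count = (level == LEVEL_PR) ? &res->ro_holders : &res->ex_holders;
    if (*count == 0)
        return false;
    (*count)--;
    return true;
}
```

A balanced `inc_holders`/`dec_holders` pair succeeds. A second `dec_holders` on the same level returns `false`, which corresponds to an unlock without a matching successful lock, or to a counter that was overwritten in memory.
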
>> Thanks,
>> Wengang
>>
>>
>> On 02/20/2019 12:48 AM, Daniel Sobe wrote:
>>> Hi Larry,
>>>
>>> The issue still happens with 4.19 as well, but it took quite a while to trigger it:
>>>
>>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
>>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24018 Comm: git Not tainted 4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
>>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
>>> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
>>> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
>>> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
>>> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
>>> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
>>> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
>>> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
>>> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
>>> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
>>> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
>>> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
>>> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
>>> Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc258ff8d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>>> Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc258ff8e50 RCX: 00007fc2622d80f5
>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc258ff8d40 RSI: 00007fc258ff8d40 RDI: 00007fc2300008c0
>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000045 R08: 0000000000000003 R09: 0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
>>> Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000d R14: 0000000000000015 R15: 000055ec17f94d58
>>> Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
>>> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
>>> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe1 ]---
>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
>>> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at 
/build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#2] SMP PTI
>>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24024 Comm: git Tainted: G      D           4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex
>>> 5040/0R790T, BIOS 1.2.7 01/15/2016 Feb 20 09:37:56 drs1p001 kernel:
>>> RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2] Feb 20
>>> 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de
>>> 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74
>>> c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00
>>> 00 00
>>> 00 00 0f 1f 44 Feb 20 09:37:56 drs1p001 kernel: RSP:
>>> 0018:ffffaa68c8177af8 EFLAGS: 00010046 Feb 20 09:37:56 drs1p001
>>> kernel: RAX: 0000000000000292 RBX: ffff95fdf3b23418 RCX:
>>> 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: RDX:
>>> 0000000000000000 RSI: ffff95fdf3b23418 RDI: ffff95fdf3b23494 Feb 20
>>> 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: ffff95fe91aa2620
>>> R09: 0000000000000089 Feb 20 09:37:56 drs1p001 kernel: R10:
>>> ffffaa68c8177ae0 R11: ffff95fe6e3efb40 R12: ffff95fdf3b23494 Feb 20
>>> 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0 Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000 Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0 Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Feb 20 09:37:56 drs1p001 kernel: Call Trace:
>>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80
>>> [ocfs2] Feb 20 09:37:56 drs1p001 kernel:
>>> ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2] Feb 20 09:37:56 drs1p001
>>> kernel:  ? d_splice_alias+0x29d/0x3f0 Feb 20 09:37:56 drs1p001 kernel:
>>> ocfs2_lookup+0x199/0x2e0 [ocfs2] Feb 20 09:37:56 drs1p001 kernel:
>>> __lookup_slow+0x97/0x150 Feb 20 09:37:56 drs1p001 kernel:
>>> lookup_slow+0x35/0x50 Feb 20 09:37:56 drs1p001 kernel:
>>> walk_component+0x1c6/0x360 Feb 20 09:37:56 drs1p001 kernel:  ?
>>> __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2] Feb 20 09:37:56
>>> drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80 Feb 20 09:37:56
>>> drs1p001 kernel:  path_lookupat+0x67/0x200 Feb 20 09:37:56 drs1p001
>>> kernel:  filename_lookup+0xb8/0x1a0 Feb 20 09:37:56 drs1p001 kernel:
>>> ? seccomp_run_filters+0x58/0xb0 Feb 20 09:37:56 drs1p001 kernel:  ?
>>> __check_object_size+0x9d/0x1a0 Feb 20 09:37:56 drs1p001 kernel:  ?
>>> strncpy_from_user+0x48/0x160 Feb 20 09:37:56 drs1p001 kernel:  ?
>>> getname_flags+0x6a/0x1e0 Feb 20 09:37:56 drs1p001 kernel:  ?
>>> vfs_statx+0x73/0xe0 Feb 20 09:37:56 drs1p001 kernel:
>>> vfs_statx+0x73/0xe0 Feb 20 09:37:56 drs1p001 kernel:
>>> __do_sys_newlstat+0x39/0x70 Feb 20 09:37:56 drs1p001 kernel:  ?
>>> syscall_trace_enter+0x117/0x2c0 Feb 20 09:37:56 drs1p001 kernel:
>>> do_syscall_64+0x55/0x110 Feb 20 09:37:56 drs1p001 kernel:
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5 Feb 20
>>> 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8
>>> ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6
>>> b8
>>> 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd
>>> 2b
>>> 00 f7 d8 64 89 Feb 20 09:37:56 drs1p001 kernel: RSP:
>>> 002b:00007fc24d7f9d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006 Feb
>>> 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX:
>>> 00007fc24d7f9e50 RCX: 00007fc2622d80f5 Feb 20 09:37:56 drs1p001
>>> kernel: RDX: 00007fc24d7f9d40 RSI: 00007fc24d7f9d40 RDI:
>>> 00007fc2100008c0 Feb 20 09:37:56 drs1p001 kernel: RBP:
>>> 0000000000000044 R08: 0000000000000003 R09: 0000000000000000 Feb 20
>>> 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246
>>> R12: 0000000000000005 Feb 20 09:37:56 drs1p001 kernel: R13:
>>> 000000000000000f R14: 000000000000001d R15: 000055ec18015008 Feb 20
>>> 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag
>>> unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb
>>> ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree
>>> dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi
>>> rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl
>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915
>>> irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof
>>> sparse_keymap snd_hda_intel dell_wmi_descriptor evdev
>>> ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate
>>> snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm
>>> drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit
>>> intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi
>>> video button acpi_pad pcc_cpufreq ip_tables x_tables Feb 20 09:37:56
>>> drs1p001
>>> kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb
>>> sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd
>>> cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse
>>> xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan Feb
>>> 20
>>> 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe2 ]--- Feb 20
>>> 09:37:56 drs1p001 kernel: RIP:
>>> 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2] Feb 20 09:37:56
>>> drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b
>>> 53
>>> 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f
>>> 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00
>>> 0f 1f 44 Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8
>>> EFLAGS: 00010046 Feb 20 09:37:56 drs1p001 kernel: RAX:
>>> 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000 Feb 20
>>> 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618
>>> RDI: ffff95fe8cec9694 Feb 20 09:37:56 drs1p001 kernel: RBP:
>>> 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153 Feb 20
>>> 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b
>>> R12: ffff95fe8cec9694 Feb 20 09:37:56 drs1p001 kernel: R13:
>>> ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0 Feb 20
>>> 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000)
>>> GS:ffff95fe91a80000(0000) knlGS:0000000000000000 Feb 20 09:37:56
>>> drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Feb
>>> 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3:
>>> 00000001617fc002 CR4: 00000000003606e0 Feb 20 09:37:56 drs1p001
>>> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>> 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: DR3:
>>> 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> -----Original Message-----
>>> From: Daniel Sobe
>>> Sent: Tuesday, 11 September 2018 13:36
>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>> Subject: RE: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Larry,
>>>
>>> I tested your script and indeed it does not provoke the error. Meanwhile I have been using a newer kernel, which makes the error harder to provoke. Here is the stack trace:
>>>
>>> Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
>>> Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
>>> Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>> Sep 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted 4.18.0-1-amd64 #1 Debian 4.18.6-1
>>> Sep 11 13:08:51 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
>>> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
>>> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
>>> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
>>> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
>>> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
>>> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
>>> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
>>> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>>> Sep 11 13:08:51 drs1p002 kernel: Call Trace:
>>> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel:  ? d_splice_alias+0x299/0x410
>>> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel:  __lookup_slow+0x97/0x150
>>> Sep 11 13:08:51 drs1p002 kernel:  lookup_slow+0x35/0x50
>>> Sep 11 13:08:51 drs1p002 kernel:  walk_component+0x1c4/0x480
>>> Sep 11 13:08:51 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>>> Sep 11 13:08:51 drs1p002 kernel:  ? path_init+0x177/0x2f0
>>> Sep 11 13:08:51 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>>> Sep 11 13:08:51 drs1p002 kernel:  filename_lookup+0xb6/0x190
>>> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>>> Sep 11 13:08:51 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>>> Sep 11 13:08:51 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>>> Sep 11 13:08:51 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>>> Sep 11 13:08:51 drs1p002 kernel:  vfs_statx+0x73/0xe0
>>> Sep 11 13:08:51 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>>> Sep 11 13:08:51 drs1p002 kernel:  do_syscall_64+0x55/0x110
>>> Sep 11 13:08:51 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995
>>> Sep 11 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4 0c 00 f7 d8 64 89
>>> Sep 11 13:08:51 drs1p002 kernel: RSP: 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>>> Sep 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f2434e20390 RCX: 00007f24b6cc5995
>>> Sep 11 13:08:51 drs1p002 kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI: 00007f24640dd9d0
>>> Sep 11 13:08:51 drs1p002 kernel: RBP: 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800
>>> Sep 11 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246 R12: 00007f24640dd9d0
>>> Sep 11 13:08:51 drs1p002 kernel: R13: 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800
>>> Sep 11 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me drm_kms_helper snd_timer snd soundcore pcspkr serio_raw efivars drm mei lpc_ich i2c_algo_bit sg ie31200_edac video pcc_cpufreq button drbd lru_cache libcrc32c parport_pc sunrpc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16
>>> Sep 11 13:08:51 drs1p002 kernel:  mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801 libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod usbcore e1000e usb_common thermal
>>> Sep 11 13:08:51 drs1p002 kernel: ---[ end trace feba92ba6e432478 ]---
>>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>>> Sep 11 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f
>>> Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8 EFLAGS: 00010046
>>> Sep 11 13:08:51 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100
>>> Sep 11 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94
>>> Sep 11 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47
>>> Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003
>>> Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240
>>> Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000
>>> Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>>>
>>>
>>> All I can say is that I was using Git heavily when this happened (in Eclipse, synchronizing the Git workspace). It took me around 30 minutes to see the bug again.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> -----Original Message-----
>>> From: Larry Chen <lchen@suse.com>
>>> Sent: Wednesday, 18 July 2018 10:09
>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi Daniel,
>>>
>>> Which stack do you use? dlm or o2cb??
>>>
>>> I tried to reproduce the bug.
>>>
>>> I have set up 2 virtual machines that share one block device (a qcow2 file on the host), using the dlm stack instead of o2cb. The kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script.
>>>
>>> #!/bin/bash
>>> # Check out every tag in turn to generate heavy file-system activity.
>>> for i in $(git tag)
>>> do
>>>             echo "$i"
>>>             git checkout "$i"
>>> done
>>>
>>> The bug could not be reproduced.
>>>
>>> According to the back trace, I think the bug is caused by the logic of holding a lock.
>>>
>>> If possible, I think the bug will recur, even without DRBD, LVM or other components.
>>>
>>> Regards,
>>> Larry
>>>
>>> On 07/17/2018 04:11 PM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
>>>>
>>>> The only additional simplification that I can think of is to remove the layers on top of the SSD. Currently I have:
>>>>
>>>> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
>>>>
>>>> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>> Sent: Tuesday, 17 July 2018 04:54
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Could you please simplify your environment?
>>>> Can I use several virtual machines to reproduce the bug??
>>>>
>>>> Thanks
>>>> Larry
>>>>
>>>> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>>>>> Hi,
>>>>>
>>>>> the same issue happens with 4.17.6 kernel from Debian unstable.
>>>>>
>>>>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>>>>
>>>>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>>>>
>>>>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>>>>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>>>>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>>>> Jul 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager oc
>>>>> Jul 16 13:40:24 drs1p002 kernel:  jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci
>>>>> Jul 16 13:40:24 drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted 4.17.0-1-amd64 #1 Debian 4.17.6-1
>>>>> Jul 16 13:40:24 drs1p002 kernel: Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2]
>>>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046
>>>>> Jul 16 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX: ffff92559ee9f018 RCX: 00000000000501e7
>>>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094
>>>>> Jul 16 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763
>>>>> Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003
>>>>> Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200
>>>>> Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000
>>>>> Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0
>>>>> Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>>>>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>>>>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>>>>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>>>>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>>>>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>>>>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995
>>>>> Jul 16 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>>>>> Jul 16 13:40:24 drs1p002 kernel: RAX: ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995
>>>>> Jul 16 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI: 00007f0ce8ff8ce0 RDI: 00007f0cb0000b20
>>>>> Jul 16 13:40:24 drs1p002 kernel: RBP: 0000000000000017 R08: 0000000000000003 R09: 0000000000000000
>>>>> Jul 16 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0ce8ff8dc4
>>>>> Jul 16 13:40:24 drs1p002 kernel: R13: 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005
>>>>> Jul 16 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5 e
>>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP: ffff9e57887dfaf8
>>>>> Jul 16 13:40:24 drs1p002 kernel: ---[ end trace a5a84fa62e77df42 ]---
>>>>>
>>>>> -----Original Message-----
>>>>> From: ocfs2-devel-bounces at oss.oracle.com
>>>>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel
>>>>> Sobe
>>>>> Sent: Friday, 13 July 2018 13:56
>>>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Larry,
>>>>>
>>>>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system, and set up a 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes, as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
>>>>>
>>>>> Again I was using a LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" on my OCFS2 home directory. The next crash I reported was while doing "git checkout", triggering a lot of change in workspace files.
>>>>>
>>>>> Next I will be using kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to exclude that the issue is related to namespaces, control groups and whatever else is only present in a container.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Friday, 13 July 2018 11:49
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Thanks for your effort to reproduce the bug.
>>>>> I can confirm that there exists more than one bug.
>>>>> I'll focus on this interesting issue.
>>>>>
>>>>>
>>>>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>>>>> Hi Larry,
>>>>>>
>>>>>> sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>>>>
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[
>>>>>> cut here ]------------ Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode:
>>>>>> 0000 [#1] SMP PTI Jul 12 15:29:08 drs1p001 kernel:
>>>>>> [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name:
>>>>>> Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016 Jul 12
>>>>>> 15:29:08
>>>>>> drs1p001 kernel: [1300619.423923] RIP:
>>>>>> 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] Jul 12
>>>>>> 15:29:08
>>>>>> drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS:
>>>>>> 00010046 Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX:
>>>>>> 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000 Jul
>>>>>> 12
>>>>>> 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI:
>>>>>> ffff9d269d990018 RDI: ffff9d269d990094 Jul 12 15:29:08 drs1p001
>>>>>> kernel: [1300619.423931] RBP: 0000000000000003 R08:
>>>>>> 000062d940000000
>>>>>> R09: 000000000000036a Jul 12 15:29:08 drs1p001 kernel:
>>>>>> [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12:
>>>>>> ffff9d269d990094 Jul 12 15:29:08 drs1p001 kernel: [1300619.423934]
>>>>>> R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940 Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00 Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033 Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0 Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ?
>>>>>> ocfs2_dentry_unlock+0x35/0x80 [ocfs2] Jul 12 15:29:08 drs1p001 kernel:
>>>>>> [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>>>> This is caused by ocfs2_dentry_lock failing.
>>>>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
>>>>>
>>>>> But why it failed still confuses me.
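[Editorial sketch] The failure mode described above (an unlock issued for a lock that was never successfully taken) can be modelled in user space. This is only an illustration of the hypothesis, not the kernel code and not the eventual fix: every name here (`lockres_model`, `model_lock`, `model_unlock`, `attach_guarded`, `attach_unguarded`, the `MODEL_*` codes) is hypothetical, and the model returns an error code at the point where, under this hypothesis, the kernel would hit its BUG.

```c
#include <stdbool.h>

/* Hypothetical user-space model of per-level holder counting. */
enum lock_level { LEVEL_PR, LEVEL_EX };

enum { MODEL_OK = 0, MODEL_LOCK_FAILED = -1, MODEL_HOLDER_UNDERFLOW = -2 };

struct lockres_model {
    unsigned int ro_holders; /* holders at the shared (PR) level */
    unsigned int ex_holders; /* holders at the exclusive (EX) level */
};

/* Take a hold. On (simulated) failure the counters are left untouched. */
static int model_lock(struct lockres_model *res, enum lock_level level,
                      bool simulate_failure)
{
    if (simulate_failure)
        return MODEL_LOCK_FAILED;
    if (level == LEVEL_EX)
        res->ex_holders++;
    else
        res->ro_holders++;
    return MODEL_OK;
}

/* Drop a hold. Dropping a hold that does not exist is reported as an
 * underflow; this is the condition the model treats as the crash. */
static int model_unlock(struct lockres_model *res, enum lock_level level)
{
    unsigned int *holders =
        (level == LEVEL_EX) ? &res->ex_holders : &res->ro_holders;

    if (*holders == 0)
        return MODEL_HOLDER_UNDERFLOW;
    (*holders)--;
    return MODEL_OK;
}

/* Guarded caller: unlock only when the lock was actually taken. */
static int attach_guarded(struct lockres_model *res, bool simulate_failure)
{
    int ret = model_lock(res, LEVEL_PR, simulate_failure);

    if (ret != MODEL_OK)
        return ret; /* nothing was taken, so nothing to release */
    return model_unlock(res, LEVEL_PR);
}

/* Unguarded caller: unlocks even if the lock attempt failed. */
static int attach_unguarded(struct lockres_model *res, bool simulate_failure)
{
    (void)model_lock(res, LEVEL_PR, simulate_failure);
    return model_unlock(res, LEVEL_PR);
}
```

In this model a failed lock followed by an unconditional unlock yields `MODEL_HOLDER_UNDERFLOW`, while the guarded variant simply propagates the lock failure and leaves the counters balanced.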
>>>>>
>>>>>
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>>>>
>>>>>> As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when reporting my issue), whose author claimed the issue went away after he restricted the system to 1 active CPU core.
>>>>>>
>>>>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>>>>
>>>>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Friday, 11 May 2018 09:01
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>>>>
>>>>>>> * Install the LXC packages from the distribution
>>>>>>> * run the command "lxc-create -n test1 -t download"
>>>>>>> ** first run might prompt you to generate a
>>>>>>> ~/.config/lxc/default.conf to define UID mappings
>>>>>>> ** in a corporate environment it might be tricky to set the
>>>>>>> http_proxy (and maybe even https_proxy) environment variables
>>>>>>> correctly
>>>>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>>>>> * the container downloads to ~/.local/share/lxc/
>>>>>>> * adapt the "config" file in that directory to add the shared
>>>>>>> ocfs2 mount like in my example below
>>>>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>>>>> ** if you want to finally enable networking, most distributions
>>>>>>> arrange a dedicated bridge (lxcbr0) which you can configure
>>>>>>> similar to my example below
>>>>>>> ** in my case I had to install cgroup related tools and reboot to
>>>>>>> have all cgroups available, and to allow use of lxcbr0 bridge in
>>>>>>> /etc/lxc/lxc-usernet
>>>>>>>
>>>>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which this occurs.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>> Sent: Thursday, 12 April 2018 11:20
>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> Quite an interesting issue.
>>>>>>>
>>>>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>>>>
>>>>>>> Do you have a script to build up your lxc environment?
>>>>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Larry
>>>>>>>
>>>>>>>
>>>>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>>>>> Hi Larry,
>>>>>>>>
>>>>>>>> not sure if it helps: the issue wasn't there with Debian 8 and
>>>>>>>> kernel 3.16, but that's a long history. Unfortunately, the only
>>>>>>>> machine where I could try to bisect does not run any kernel <
>>>>>>>> 4.16 without other issues.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>> Sent: Thursday, 12 April 2018 05:17
>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>;
>>>>>>>> ocfs2-devel at oss.oracle.com
>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> Thanks for your report.
>>>>>>>> I'll try to reproduce this bug as you did.
>>>>>>>>
>>>>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Larry
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>>>>> Hi Larry,
>>>>>>>>>
>>>>>>>>> below is an example config file of the kind I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>>>>>
>>>>>>>>> Meanwhile I'm testing whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success; I will report once anything happens.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Daniel
>>>>>>>>>
>>>>>>>>> ----
>>>>>>>>>
>>>>>>>>> # Distribution configuration
>>>>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>>>>> lxc.arch = x86_64
>>>>>>>>>
>>>>>>>>> # Container specific configuration
>>>>>>>>> lxc.id_map = u 0 624288 65536
>>>>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>>>>
>>>>>>>>> lxc.utsname = container1
>>>>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>>>>
>>>>>>>>> lxc.network.type = veth
>>>>>>>>> lxc.network.flags = up
>>>>>>>>> lxc.network.link = bridge1
>>>>>>>>> lxc.network.name = eth0
>>>>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>>>>
>>>>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>>>>
>>>>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>>>>
>>>>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>>>>
>>>>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>>>>
>>>>>>>>> ----
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>>> Sent: Wednesday, 11 April 2018 13:31
>>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>;
>>>>>>>>> ocfs2-devel at oss.oracle.com
>>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>>>>> Hi Larry,
>>>>>>>>>>
>>>>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>>>>
>>>>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>>>>> I was thinking of the way a directory can be shared between the host and a docker container, like
>>>>>>>>>     docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>>>>> That's different from yours.
>>>>>>>>>
>>>>>>>>> How did you set up your lxc or container?
>>>>>>>>>
>>>>>>>>> If you could show me the procedure, I'll try to reproduce it.
>>>>>>>>>
>>>>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points on the local host, will the problem recur?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Larry
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Daniel
>>>>>>>>>>
>>>>>> Sorry for this delayed reply.
>>>>>>
>>>>>> I tried with lxc + ocfs2 in your mount-shared way.
>>>>>>
>>>>>> But I can not reproduce your bugs.
>>>>>>
>>>>>> What I use is openSUSE Tumbleweed.
>>>>>>
>>>>>> The procedure I used to try to reproduce your bug:
>>>>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>>>>        mount /dev/xxx /mnt
>>>>>>    The mount then shows up as
>>>>>>        207 65 254:16 / /mnt rw,relatime shared:94
>>>>>>    I think this *shared* is what you want. This mount point will be shared across multiple namespaces.
>>>>>>
>>>>>> 1. Start Virtual Machine Manager.
>>>>>> 2. Add a local LXC connection by clicking File > Add Connection.
>>>>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>>>>> 3. Select the localhost (LXC) connection and click the File > New Virtual Machine menu.
>>>>>> 4. Activate Application container and click Forward.
>>>>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine for creating a first container.
>>>>>>    Click Forward.
>>>>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>>>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
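
The "shared:94" tag in step 0 is the mount's propagation state. As a rough sketch, assuming the line quoted there comes from /proc/self/mountinfo (the mail does not say so explicitly), the optional propagation fields of such a line can be extracted as follows. The function name is made up for illustration.

```python
def propagation_tags(mountinfo_line):
    """Return the optional fields of one mountinfo line as a dict.

    Fields 1-6 are fixed (mount ID, parent ID, major:minor, root,
    mount point, mount options); the optional "tag[:value]" fields
    follow and end at a lone "-" separator, if one is present.
    """
    parts = mountinfo_line.split()
    tags = {}
    for field in parts[6:]:
        if field == "-":
            break  # everything after "-" is fs type, source, super options
        name, _, value = field.partition(":")
        tags[name] = value  # tags without a value map to an empty string
    return tags


# The (truncated) line from step 0 carries a single "shared" tag.
print(propagation_tags("207 65 254:16 / /mnt rw,relatime shared:94"))
```

A mount that is private in a given namespace would have no "shared" key, so comparing this output between the host and each container is one way to confirm that the containers really see the same peer group.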
>>>>>>
>>>>>> If possible, could you please provide a shell script showing what you did with your mount point?
>>>>>>
>>>>>> Thanks
>>>>>> Larry
>>>>>>
>>>>> _______________________________________________
>>>>> Ocfs2-devel mailing list
>>>>> Ocfs2-devel at oss.oracle.com
>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] OCFS2 BUG with 2 different kernels
  2019-04-29 13:47                                                 ` Daniel Sobe
@ 2019-04-29 15:57                                                   ` Wengang Wang
  2019-05-03  7:10                                                     ` [Ocfs2-devel] [EXT] " Daniel Sobe
  0 siblings, 1 reply; 32+ messages in thread
From: Wengang Wang @ 2019-04-29 15:57 UTC (permalink / raw)
  To: ocfs2-devel

Hi Daniel,

No worries, I was lucky to get a vmcore for this issue. A patch is
under testing. I will post an update here shortly.

Thanks,
Wengang

On 2019/4/29 6:47, Daniel Sobe wrote:
> Hi Wengang,
>
> today I could reproduce the bug once again. It seems that kdump/kexec did not trigger, maybe because the system was still alive for a while after the bug occurred.
>
> What can I do now? This is a really nasty bug that I have been hunting for over a year now, without any clue or progress.
>
> With luck you can trigger it by heavily using "git filter-branch", but I do not have a reliable procedure; just wildly using a subdirectory filter on the kernel sources doesn't seem to do the trick.
>
> Here is the new log output:
>
> Apr 29 15:26:01 drs1p001 kernel: ------------[ cut here ]------------
> Apr 29 15:26:01 drs1p001 kernel: kernel BUG at /build/linux-tpKJY9/linux-4.19.28/fs/ocfs2/dlmglue.c:849!
> Apr 29 15:26:01 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
> Apr 29 15:26:01 drs1p001 kernel: CPU: 2 PID: 29028 Comm: git Not tainted 4.19.0-0.bpo.4-amd64 #1 Debian 4.19.28-2~bpo9+1
> Apr 29 15:26:01 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Apr 29 15:26:01 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 f1 32 5a fb 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Apr 29 15:26:01 drs1p001 kernel: RSP: 0018:ffffad8d0df07af8 EFLAGS: 00010046
> Apr 29 15:26:01 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff8d630e9fd818 RCX: 0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff8d630e9fd818 RDI: ffff8d630e9fd894
> Apr 29 15:26:01 drs1p001 kernel: RBP: 0000000000000003 R08: 0000729cc0000000 R09: 00000000000214c0
> Apr 29 15:26:01 drs1p001 kernel: R10: ffffad8d0df07ae0 R11: 0000000000000018 R12: ffff8d630e9fd894
> Apr 29 15:26:01 drs1p001 kernel: R13: ffff8d6455c8e000 R14: 0000000000000000 R15: ffffffffc0c3c2c0
> Apr 29 15:26:01 drs1p001 kernel: FS:  00007f8104e1d700(0000) GS:ffff8d6511b00000(0000) knlGS:0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 29 15:26:01 drs1p001 kernel: CR2: 00007f80d0000010 CR3: 0000000115180005 CR4: 00000000003606e0
> Apr 29 15:26:01 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Apr 29 15:26:01 drs1p001 kernel: Call Trace:
> Apr 29 15:26:01 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel:  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
> Apr 29 15:26:01 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel:  __lookup_slow+0x97/0x150
> Apr 29 15:26:01 drs1p001 kernel:  lookup_slow+0x35/0x50
> Apr 29 15:26:01 drs1p001 kernel:  walk_component+0x1c6/0x360
> Apr 29 15:26:01 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
> Apr 29 15:26:01 drs1p001 kernel:  path_lookupat+0x67/0x200
> Apr 29 15:26:01 drs1p001 kernel:  ? ___bpf_prog_run+0xb96/0xf20
> Apr 29 15:26:01 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
> Apr 29 15:26:01 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
> Apr 29 15:26:01 drs1p001 kernel:  ? __check_object_size+0x161/0x1a0
> Apr 29 15:26:01 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
> Apr 29 15:26:01 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
> Apr 29 15:26:01 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
> Apr 29 15:26:01 drs1p001 kernel:  vfs_statx+0x73/0xe0
> Apr 29 15:26:01 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
> Apr 29 15:26:01 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
> Apr 29 15:26:01 drs1p001 kernel:  do_syscall_64+0x55/0x110
> Apr 29 15:26:01 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Apr 29 15:26:01 drs1p001 kernel: RIP: 0033:0x7f8108f01335
> Apr 29 15:26:01 drs1p001 kernel: Code: 69 db 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 31 db 2b 00 f7 d8 64 89
> Apr 29 15:26:01 drs1p001 kernel: RSP: 002b:00007f8104e1cd08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Apr 29 15:26:01 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007f8104e1ce50 RCX: 00007f8108f01335
> Apr 29 15:26:01 drs1p001 kernel: RDX: 00007f8104e1cd40 RSI: 00007f8104e1cd40 RDI: 00007f80d80008c0
> Apr 29 15:26:01 drs1p001 kernel: RBP: 000000000000003f R08: 0000000000000003 R09: 0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> Apr 29 15:26:01 drs1p001 kernel: R13: 0000000000000006 R14: 0000000000000017 R15: 000055d17cc6c568
> Apr 29 15:26:01 drs1p001 kernel: Modules linked in: ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs tcp_diag inet_diag unix_diag appletalk psnap ax25 veth bridge stp llc iptable_filter snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic fuse intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp mei_wdt kvm_intel i915 wmi_bmof dell_wmi sparse_keymap dell_smbios dell_wmi_descriptor evdev kvm snd_hda_intel drm_kms_helper irqbypass snd_hda_codec crct10dif_pclmul snd_hda_core crc32_pclmul snd_hwdep drm dcdbas snd_pcm ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf mei_me snd_timer iTCO_wdt snd serio_raw iTCO_vendor_support i2c_algo_bit soundcore sg mei pcspkr intel_pch_thermal wmi video acpi_pad button pcc_cpufreq drbd lru_cache libcrc32c
> Apr 29 15:26:01 drs1p001 kernel:  ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci libata xhci_pci xhci_hcd scsi_mod i2c_i801 e1000e usbcore e1000 usb_common thermal fan [last unloaded: configfs]
> Apr 29 15:26:01 drs1p001 kernel: ---[ end trace f720d1de63741a88 ]---
> Apr 29 15:26:01 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 f1 32 5a fb 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Apr 29 15:26:01 drs1p001 kernel: RSP: 0018:ffffad8d0df07af8 EFLAGS: 00010046
> Apr 29 15:26:01 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff8d630e9fd818 RCX: 0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff8d630e9fd818 RDI: ffff8d630e9fd894
> Apr 29 15:26:01 drs1p001 kernel: RBP: 0000000000000003 R08: 0000729cc0000000 R09: 00000000000214c0
> Apr 29 15:26:01 drs1p001 kernel: R10: ffffad8d0df07ae0 R11: 0000000000000018 R12: ffff8d630e9fd894
> Apr 29 15:26:01 drs1p001 kernel: R13: ffff8d6455c8e000 R14: 0000000000000000 R15: ffffffffc0c3c2c0
> Apr 29 15:26:01 drs1p001 kernel: FS:  00007f8104e1d700(0000) GS:ffff8d6511b00000(0000) knlGS:0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 29 15:26:01 drs1p001 kernel: CR2: 00007f80d0000010 CR3: 0000000115180005 CR4: 00000000003606e0
> Apr 29 15:26:01 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>
>
>
> -----Original Message-----
> From: Wengang <wen.gang.wang@oracle.com>
> Sent: Wednesday, 27 March 2019 19:18
> To: Daniel Sobe <daniel.sobe@nxp.com>; 'ocfs2-devel at oss.oracle.com' <ocfs2-devel@oss.oracle.com>
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> BUG() would panic the kernel, dump the call trace on the console, and
> then usually reboot the machine.
>
> Kexec (when kdump is installed and configured properly) would, when the
> production kernel panics, start a new kernel (the dump kernel) from a
> reserved block of memory and leave the rest of memory untouched. Once the
> dump kernel has booted, it collects the contents of that untouched
> memory and saves them to a file on disk (usually called "vmcore"),
> then reboots into the production kernel once the collection has finished.
> So, yes, a vmcore is an image file with the kernel memory inside.
>
> Here is an example regarding kdump configuration:
>
> https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
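
Because the machine was reported to stay alive for a while after the BUG, it may be worth checking whether an oops escalates to a panic at all: kdump only runs on a panic, and an oops becomes one only when kernel.panic_on_oops is set. The sketch below is a minimal user-space check under that assumption, not a substitute for the distribution's kdump tooling. The function names are hypothetical; the three files it reads are standard Linux interfaces.

```python
from pathlib import Path


def kdump_readiness(cmdline, crash_loaded, panic_on_oops):
    """Summarise three preconditions for getting a vmcore out of a BUG()."""
    return {
        # Memory for the dump kernel must be reserved on the kernel command line.
        "memory_reserved": any(arg.startswith("crashkernel=") for arg in cmdline.split()),
        # The dump kernel must have been loaded via kexec.
        "dump_kernel_loaded": crash_loaded.strip() == "1",
        # An oops (which is what BUG() raises) must be configured to panic.
        "oops_triggers_panic": panic_on_oops.strip() == "1",
    }


def read_live_state():
    """Read the same three values from the running system (Linux only)."""
    return kdump_readiness(
        Path("/proc/cmdline").read_text(),
        Path("/sys/kernel/kexec_crash_loaded").read_text(),
        Path("/proc/sys/kernel/panic_on_oops").read_text(),
    )
```

If "oops_triggers_panic" comes back False, the behaviour described in this thread (BUG logged, system still responsive, no vmcore) is what one would expect.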
>
> thanks,
> wengang
> On 03/27/2019 12:57 AM, Daniel Sobe wrote:
>> In my setup, the system is still responsive (at least for a couple of minutes) after the BUG. Do I understand it correctly that you want me to setup "kdump" and provoke a crash manually after this BUG occurred, in order to receive an image file with all kernel memory inside?
>>
>> Sorry for the stupid question, but I'm new to this.
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: Wengang Wang <wen.gang.wang@oracle.com>
>> Sent: Tuesday, 26 March 2019 22:24
>> To: Daniel Sobe <daniel.sobe@nxp.com>; 'ocfs2-devel at oss.oracle.com' <ocfs2-devel@oss.oracle.com>
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Thank you, Daniel.
>>
>> Wengang
>>
>> On 2019/3/26 5:27, Daniel Sobe wrote:
>>> Hi Wengang,
>>>
>>> Thanks for confirming that this bug is reproducible! For a long time I was under the impression that I was the only one facing this issue.
>>>
>>> Unfortunately, I do not know what a "vmcore" is. Let me google it and then check whether I can reproduce the bug again easily and provide what you request.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> -----Original Message-----
>>> From: ocfs2-devel-bounces at oss.oracle.com
>>> <ocfs2-devel-bounces@oss.oracle.com> On Behalf Of Wengang
>>> Sent: Monday, 18 March 2019 18:46
>>> To: ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi,
>>>
>>> I also see this problem on an older version, 4.1.12.xxx.
>>>
>>> The l_ro_holders is changed in the ocfs2 layer, not the DLM layer. Another thing is that the dentry lock is only used to get notified when a file is deleted remotely. So the code requests a PR lock and then releases it after the PR is granted. I am not sure, but I feel this is not a DLM issue but a memory issue on the ocfs2_lock_res. Do you have a vmcore available for this problem?
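
A toy user-space model of the pattern described above may help to picture it: take a PR (read-only) holder, then drop it once the lock is granted, with an assertion standing in for the kernel BUG that fires when a holder is dropped that was never taken. All names here are hypothetical and this is not the ocfs2 code. It only shows why an unlock without a matching successful lock trips the check.

```python
class LockRes:
    """Stand-in for a lock resource; tracks only the read-only holder count."""

    def __init__(self):
        self.ro_holders = 0


def cluster_lock(res, fail=False):
    """Take a read-only holder; return 0 on success, a negative value on failure."""
    if fail:
        return -5  # illustrative error value; no holder is taken
    res.ro_holders += 1
    return 0


def cluster_unlock(res):
    """Drop a read-only holder; dropping one that was never taken is a bug."""
    assert res.ro_holders > 0, "unlock without matching lock"
    res.ro_holders -= 1


def attach_lock(res, fail=False):
    """Request the lock and release it again, but only if it was granted."""
    ret = cluster_lock(res, fail)
    if ret == 0:
        cluster_unlock(res)
    return ret
```

In this model the count can only reach the assertion at zero if something other than a balanced lock/unlock pair changes it, for example two users racing on the same resource or the structure being overwritten. That is consistent with the suspicion above that the lock resource itself is being corrupted.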
>>>
>>> Thanks,
>>> Wengang
>>>
>>>
>>> On 02/20/2019 12:48 AM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> The issue still happens with 4.19 as well, but it took quite a while to trigger it:
>>>>
>>>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>>>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>>>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
>>>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24018 Comm: git Not tainted 4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>>>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>>>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>>>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>>>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>>>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
>>>> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
>>>> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
>>>> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
>>>> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
>>>> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
>>>> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
>>>> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
>>>> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
>>>> Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
>>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc258ff8d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>>>> Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc258ff8e50 RCX: 00007fc2622d80f5
>>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc258ff8d40 RSI: 00007fc258ff8d40 RDI: 00007fc2300008c0
>>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000045 R08: 0000000000000003 R09: 0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
>>>> Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000d R14: 0000000000000015 R15: 000055ec17f94d58
>>>> Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
>>>> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
>>>> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe1 ]---
>>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>>>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>>>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>>>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>>>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
>>>> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>>>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>>>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#2] SMP PTI
>>>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24024 Comm: git Tainted: G      D           4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>>>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex
>>>> 5040/0R790T, BIOS 1.2.7 01/15/2016 Feb 20 09:37:56 drs1p001 kernel:
>>>> RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2] Feb 20
>>>> 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de
>>>> 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74
>>>> c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00
>>>> 00 00
>>>> 00 00 0f 1f 44 Feb 20 09:37:56 drs1p001 kernel: RSP:
>>>> 0018:ffffaa68c8177af8 EFLAGS: 00010046 Feb 20 09:37:56 drs1p001
>>>> kernel: RAX: 0000000000000292 RBX: ffff95fdf3b23418 RCX:
>>>> 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: RDX:
>>>> 0000000000000000 RSI: ffff95fdf3b23418 RDI: ffff95fdf3b23494 Feb 20
>>>> 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: ffff95fe91aa2620
>>>> R09: 0000000000000089 Feb 20 09:37:56 drs1p001 kernel: R10:
>>>> ffffaa68c8177ae0 R11: ffff95fe6e3efb40 R12: ffff95fdf3b23494 Feb 20
>>>> 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0 Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000 Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0 Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Feb 20 09:37:56 drs1p001 kernel: Call Trace:
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80
>>>> [ocfs2] Feb 20 09:37:56 drs1p001 kernel:
>>>> ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2] Feb 20 09:37:56 drs1p001
>>>> kernel:  ? d_splice_alias+0x29d/0x3f0 Feb 20 09:37:56 drs1p001 kernel:
>>>> ocfs2_lookup+0x199/0x2e0 [ocfs2] Feb 20 09:37:56 drs1p001 kernel:
>>>> __lookup_slow+0x97/0x150 Feb 20 09:37:56 drs1p001 kernel:
>>>> lookup_slow+0x35/0x50 Feb 20 09:37:56 drs1p001 kernel:
>>>> walk_component+0x1c6/0x360 Feb 20 09:37:56 drs1p001 kernel:  ?
>>>> __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2] Feb 20 09:37:56
>>>> drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80 Feb 20 09:37:56
>>>> drs1p001 kernel:  path_lookupat+0x67/0x200 Feb 20 09:37:56 drs1p001
>>>> kernel:  filename_lookup+0xb8/0x1a0 Feb 20 09:37:56 drs1p001 kernel:
>>>> ? seccomp_run_filters+0x58/0xb0 Feb 20 09:37:56 drs1p001 kernel:  ?
>>>> __check_object_size+0x9d/0x1a0 Feb 20 09:37:56 drs1p001 kernel:  ?
>>>> strncpy_from_user+0x48/0x160 Feb 20 09:37:56 drs1p001 kernel:  ?
>>>> getname_flags+0x6a/0x1e0 Feb 20 09:37:56 drs1p001 kernel:  ?
>>>> vfs_statx+0x73/0xe0 Feb 20 09:37:56 drs1p001 kernel:
>>>> vfs_statx+0x73/0xe0 Feb 20 09:37:56 drs1p001 kernel:
>>>> __do_sys_newlstat+0x39/0x70 Feb 20 09:37:56 drs1p001 kernel:  ?
>>>> syscall_trace_enter+0x117/0x2c0 Feb 20 09:37:56 drs1p001 kernel:
>>>> do_syscall_64+0x55/0x110 Feb 20 09:37:56 drs1p001 kernel:
>>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5 Feb 20
>>>> 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8
>>>> ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6
>>>> b8
>>>> 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd
>>>> 2b
>>>> 00 f7 d8 64 89 Feb 20 09:37:56 drs1p001 kernel: RSP:
>>>> 002b:00007fc24d7f9d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006 Feb
>>>> 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX:
>>>> 00007fc24d7f9e50 RCX: 00007fc2622d80f5 Feb 20 09:37:56 drs1p001
>>>> kernel: RDX: 00007fc24d7f9d40 RSI: 00007fc24d7f9d40 RDI:
>>>> 00007fc2100008c0 Feb 20 09:37:56 drs1p001 kernel: RBP:
>>>> 0000000000000044 R08: 0000000000000003 R09: 0000000000000000 Feb 20
>>>> 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246
>>>> R12: 0000000000000005 Feb 20 09:37:56 drs1p001 kernel: R13:
>>>> 000000000000000f R14: 000000000000001d R15: 000055ec18015008 Feb 20
>>>> 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag
>>>> unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb
>>>> ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree
>>>> dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi
>>>> rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl
>>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915
>>>> irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof
>>>> sparse_keymap snd_hda_intel dell_wmi_descriptor evdev
>>>> ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate
>>>> snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm
>>>> drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit
>>>> intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi
>>>> video button acpi_pad pcc_cpufreq ip_tables x_tables Feb 20 09:37:56
>>>> drs1p001
>>>> kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb
>>>> sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd
>>>> cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse
>>>> xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan Feb
>>>> 20
>>>> 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe2 ]--- Feb 20
>>>> 09:37:56 drs1p001 kernel: RIP:
>>>> 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2] Feb 20 09:37:56
>>>> drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b
>>>> 53
>>>> 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f
>>>> 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00
>>>> 0f 1f 44 Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8
>>>> EFLAGS: 00010046 Feb 20 09:37:56 drs1p001 kernel: RAX:
>>>> 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000 Feb 20
>>>> 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618
>>>> RDI: ffff95fe8cec9694 Feb 20 09:37:56 drs1p001 kernel: RBP:
>>>> 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153 Feb 20
>>>> 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b
>>>> R12: ffff95fe8cec9694 Feb 20 09:37:56 drs1p001 kernel: R13:
>>>> ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0 Feb 20
>>>> 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000)
>>>> GS:ffff95fe91a80000(0000) knlGS:0000000000000000 Feb 20 09:37:56
>>>> drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> Feb
>>>> 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3:
>>>> 00000001617fc002 CR4: 00000000003606e0 Feb 20 09:37:56 drs1p001
>>>> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: DR3:
>>>> 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>> -----Original Message-----
>>>> From: Daniel Sobe
>>>> Sent: Tuesday, 11 September 2018 13:36
>>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: RE: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Larry,
>>>>
>>>> I tested your script, and indeed it does not provoke the error. Meanwhile I have switched to a newer kernel, which makes the bug harder to provoke. Here is the stack trace:
>>>>
>>>> Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
>>>> Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
>>>> Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>>> Sep
>>>> 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted
>>>> 4.18.0-1-amd64 #1 Debian 4.18.6-1 Sep 11 13:08:51 drs1p002 kernel:
>>>> Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>>> Sep
>>>> 11 13:08:51 drs1p002 kernel: RIP:
>>>> 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] Sep 11 13:08:51
>>>> drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50
>>>> cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5
>>>> eb
>>>> d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00
>>>> 90 0f 1f Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8
>>>> EFLAGS: 00010046 Sep 11 13:08:51 drs1p002 kernel: RAX:
>>>> 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100 Sep 11
>>>> 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000
>>>> RDI: ffff95cdbd985a94 Sep 11 13:08:51 drs1p002 kernel: RBP:
>>>> ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47 Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003 Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240 Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000 Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0 Sep 11 13:08:51 drs1p002 kernel: Call Trace:
>>>> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80
>>>> [ocfs2] Sep 11 13:08:51 drs1p002 kernel:
>>>> ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2] Sep 11 13:08:51 drs1p002
>>>> kernel:  ? d_splice_alias+0x299/0x410 Sep 11 13:08:51 drs1p002 kernel:
>>>> ocfs2_lookup+0x233/0x2c0 [ocfs2] Sep 11 13:08:51 drs1p002 kernel:
>>>> __lookup_slow+0x97/0x150 Sep 11 13:08:51 drs1p002 kernel:
>>>> lookup_slow+0x35/0x50 Sep 11 13:08:51 drs1p002 kernel:
>>>> walk_component+0x1c4/0x480 Sep 11 13:08:51 drs1p002 kernel:  ?
>>>> link_path_walk+0x27c/0x510 Sep 11 13:08:51 drs1p002 kernel:  ?
>>>> path_init+0x177/0x2f0 Sep 11 13:08:51 drs1p002 kernel:
>>>> path_lookupat+0x84/0x1f0 Sep 11 13:08:51 drs1p002 kernel:
>>>> filename_lookup+0xb6/0x190 Sep 11 13:08:51 drs1p002 kernel:  ?
>>>> ocfs2_inode_unlock+0xe4/0xf0 [ocfs2] Sep 11 13:08:51 drs1p002 kernel:
>>>> ? __check_object_size+0xa7/0x1a0 Sep 11 13:08:51 drs1p002 kernel:  ?
>>>> strncpy_from_user+0x48/0x160 Sep 11 13:08:51 drs1p002 kernel:  ?
>>>> getname_flags+0x6a/0x1e0 Sep 11 13:08:51 drs1p002 kernel:  ?
>>>> vfs_statx+0x73/0xe0 Sep 11 13:08:51 drs1p002 kernel:
>>>> vfs_statx+0x73/0xe0 Sep 11 13:08:51 drs1p002 kernel:
>>>> __do_sys_newlstat+0x39/0x70 Sep 11 13:08:51 drs1p002 kernel:
>>>> do_syscall_64+0x55/0x110 Sep 11 13:08:51 drs1p002 kernel:
>>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995 Sep 11
>>>> 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8
>>>> ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6
>>>> b8
>>>> 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4
>>>> 0c
>>>> 00 f7 d8 64 89 Sep 11 13:08:51 drs1p002 kernel: RSP:
>>>> 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006 Sep
>>>> 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX:
>>>> 00007f2434e20390 RCX: 00007f24b6cc5995 Sep 11 13:08:51 drs1p002
>>>> kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI:
>>>> 00007f24640dd9d0 Sep 11 13:08:51 drs1p002 kernel: RBP:
>>>> 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800 Sep 11
>>>> 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246
>>>> R12: 00007f24640dd9d0 Sep 11 13:08:51 drs1p002 kernel: R13:
>>>> 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800 Sep 11
>>>> 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag
>>>> unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
>>>> ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse
>>>> snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic
>>>> nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp
>>>> vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm
>>>> irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas
>>>> ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate
>>>> intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me
>>>> drm_kms_helper snd_timer snd soundcore pcspkr serio_raw efivars drm
>>>> mei lpc_ich i2c_algo_bit sg ie31200_edac video pcc_cpufreq button
>>>> drbd lru_cache libcrc32c parport_pc sunrpc ppdev lp parport efivarfs
>>>> ip_tables x_tables autofs4 ext4 crc16 Sep 11 13:08:51 drs1p002
>>>> kernel:  mbcache
>>>> jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper
>>>> aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801
>>>> libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod
>>>> usbcore e1000e usb_common thermal Sep 11 13:08:51 drs1p002 kernel:
>>>> ---[ end trace feba92ba6e432478 ]--- Sep 11 13:08:51 drs1p002 kernel:
>>>> RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] Sep 11
>>>> 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9
>>>> 6e
>>>> 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2
>>>> 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00
>>>> 00
>>>> 00 00 90 0f 1f Sep 11 13:08:51 drs1p002 kernel: RSP:
>>>> 0018:ffffb1248eeb3af8 EFLAGS: 00010046 Sep 11 13:08:51 drs1p002
>>>> kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX:
>>>> 0000000000000100 Sep 11 13:08:51 drs1p002 kernel: RDX:
>>>> 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94 Sep 11
>>>> 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000
>>>> R09: 000000000000aa47 Sep 11 13:08:51 drs1p002 kernel: R10:
>>>> ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003 Sep 11
>>>> 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000
>>>> R15: ffffffffc0ab3240 Sep 11 13:08:51 drs1p002 kernel: FS:
>>>> 00007f2434e21700(0000) GS:ffff95ce9e200000(0000)
>>>> knlGS:0000000000000000 Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS:
>>>> 0000 ES: 0000 CR0: 0000000080050033 Sep 11 13:08:51 drs1p002 kernel:
>>>> CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>>>>
>>>>
>>>> All I can say is that I was using Git heavily when this happened (in Eclipse, synchronizing a Git workspace). It took around 30 minutes to hit the bug again.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen <lchen@suse.com>
>>>> Sent: Wednesday, 18 July 2018 10:09
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Which stack do you use, dlm or o2cb?
>>>>
>>>> I tried to reproduce the bug.
>>>>
>>>> I have set up 2 virtual machines that share one block device (a qcow2 file on the host), using the dlm stack instead of o2cb. The kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script:
>>>>
>>>> #!/bin/bash
>>>> # Check out every tag in turn to generate heavy file churn on the volume.
>>>> for i in $(git tag)
>>>> do
>>>>     echo "$i"
>>>>     git checkout "$i"
>>>> done
>>>>
>>>> Bug could not be reproduced.
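
The stack traces in this thread all reach the BUG through an lstat lookup (__do_sys_newlstat, then ocfs2_lookup, then ocfs2_dentry_attach_lock), so a reproducer might stress that path from several processes at once instead of running one sequential checkout loop. A minimal sketch, assuming GNU find, xargs and stat; the function name parallel_stat and its parameters are hypothetical, and nothing in the thread confirms that this triggers the bug:

```shell
#!/bin/bash
# Hypothetical helper, not from the thread: issue lstat-style lookups on
# every path below a directory from several processes concurrently.
#   $1  directory to walk (for example a git work tree on the OCFS2 volume)
#   $2  number of concurrent stat processes (default 4)
#   $3  number of passes over the tree (default 1)
parallel_stat() {
    ps_dir=$1
    ps_workers=${2:-4}
    ps_passes=${3:-1}
    ps_pass=0
    while [ "$ps_pass" -lt "$ps_passes" ]; do
        # find prints NUL-separated paths; xargs hands them to up to
        # $ps_workers stat processes in batches of 64. GNU stat does not
        # follow symlinks unless -L is given, so each path is lstat()ed.
        find "$ps_dir" -print0 |
            xargs -0 -r -P "$ps_workers" -n 64 stat --format=%n > /dev/null || return 1
        ps_pass=$((ps_pass + 1))
    done
}
```

It could be called as `parallel_stat /home/user/repo 8 10` (path hypothetical). Names that are already cached are normally resolved without calling into the filesystem's lookup routine, so a loop like this probably exercises OCFS2 lookups mainly on the first pass unless files are created and removed in between, as "git checkout" does.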
>>>>
>>>> According to the backtrace, I think the bug is caused by the lock-holding logic.
>>>>
>>>> If so, I think the bug will recur even without DRBD, LVM or other components.
>>>>
>>>> Regards,
>>>> Larry
>>>>
>>>> On 07/17/2018 04:11 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
>>>>>
>>>>> The only additional simplification I can think of is removing some of the layers on top of the SSD. Currently I have:
>>>>>
>>>>> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
>>>>>
>>>>> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Tuesday, 17 July 2018 04:54
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Could you please simplify your environment?
>>>>> Can I use several virtual machines to reproduce the bug??
>>>>>
>>>>> Thanks
>>>>> Larry
>>>>>
>>>>> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> the same issue happens with 4.17.6 kernel from Debian unstable.
>>>>>>
>>>>>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>>>>>
>>>>>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>>>>>
>>>>>> After the issue occurs, I have about 2 minutes before the system becomes unusable. Is there anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here
>>>>>> ]------------ Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>>>>>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>>>>> Jul
>>>>>> 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag
>>>>>> unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
>>>>>> ocfs2_nodemanager oc Jul 16 13:40:24 drs1p002 kernel:  jbd2
>>>>>> crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper
>>>>>> aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci Jul 16
>>>>>> 13:40:24
>>>>>> drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted
>>>>>> 4.17.0-1-amd64 #1 Debian 4.17.6-1 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>>>>> Jul
>>>>>> 16 13:40:24 drs1p002 kernel: RIP:
>>>>>> 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] Jul 16
>>>>>> 13:40:24
>>>>>> drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX:
>>>>>> ffff92559ee9f018
>>>>>> RCX: 00000000000501e7 Jul 16 13:40:24 drs1p002 kernel: RDX:
>>>>>> 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763 Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003 Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200 Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000 Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0 Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80
>>>>>> [ocfs2] Jul 16 13:40:24 drs1p002 kernel:
>>>>>> ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2] Jul 16 13:40:24
>>>>>> drs1p002
>>>>>> kernel:  ? d_splice_alias+0x2a5/0x410 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> ocfs2_lookup+0x233/0x2c0 [ocfs2] Jul 16 13:40:24 drs1p002 kernel:
>>>>>> __lookup_slow+0x97/0x150 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> lookup_slow+0x35/0x50 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> walk_component+0x1c4/0x470 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>>>> link_path_walk+0x27c/0x510 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>>>> ktime_get+0x3e/0xa0 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> path_lookupat+0x84/0x1f0 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> filename_lookup+0xb6/0x190 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>>>> ocfs2_inode_unlock+0xe4/0xf0 [ocfs2] Jul 16 13:40:24 drs1p002 kernel:
>>>>>> ? __check_object_size+0xa7/0x1a0 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>>>> strncpy_from_user+0x48/0x160 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>>>> getname_flags+0x6a/0x1e0 Jul 16 13:40:24 drs1p002 kernel:  ?
>>>>>> vfs_statx+0x73/0xe0 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> vfs_statx+0x73/0xe0 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> __do_sys_newlstat+0x39/0x70 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> do_syscall_64+0x55/0x110 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS:
>>>>>> 00000246
>>>>>> ORIG_RAX: 0000000000000006 Jul 16 13:40:24 drs1p002 kernel: RAX:
>>>>>> ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI:
>>>>>> 00007f0ce8ff8ce0
>>>>>> RDI: 00007f0cb0000b20 Jul 16 13:40:24 drs1p002 kernel: RBP:
>>>>>> 0000000000000017 R08: 0000000000000003 R09: 0000000000000000 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11:
>>>>>> 0000000000000246
>>>>>> R12: 00007f0ce8ff8dc4 Jul 16 13:40:24 drs1p002 kernel: R13:
>>>>>> 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d
>>>>>> e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53
>>>>>> 6c
>>>>>> 85
>>>>>> d2 74 c5 e Jul 16 13:40:24 drs1p002 kernel: RIP:
>>>>>> __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP:
>>>>>> ffff9e57887dfaf8 Jul 16 13:40:24 drs1p002 kernel: ---[ end trace
>>>>>> a5a84fa62e77df42 ]---
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ocfs2-devel-bounces at oss.oracle.com
>>>>>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel
>>>>>> Sobe
>>>>>> Sent: Friday, 13 July 2018 13:56
>>>>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Larry,
>>>>>>
>>>>>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system and set up 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8, as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not do anything explicitly).
>>>>>>
>>>>>> Again I was using an LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" in my OCFS2 home directory. The next crash I reported happened while doing "git checkout", which triggers a lot of changes to workspace files.
>>>>>>
>>>>>> Next I will use kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally, I will work on the PC directly, to rule out that the issue is related to namespaces, control groups or anything else that is only present in a container.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Friday, 13 July 2018 11:49
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> Thanks for your effort to reproduce the bug.
>>>>>> I can confirm that there is more than one bug.
>>>>>> I'll focus on this interesting issue.
>>>>>>
>>>>>>
>>>>>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> Sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>>>>>
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[
>>>>>>> cut here ]------------ Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode:
>>>>>>> 0000 [#1] SMP PTI Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name:
>>>>>>> Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016 Jul 12
>>>>>>> 15:29:08
>>>>>>> drs1p001 kernel: [1300619.423923] RIP:
>>>>>>> 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] Jul 12
>>>>>>> 15:29:08
>>>>>>> drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS:
>>>>>>> 00010046 Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX:
>>>>>>> 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000 Jul
>>>>>>> 12
>>>>>>> 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI:
>>>>>>> ffff9d269d990018 RDI: ffff9d269d990094 Jul 12 15:29:08 drs1p001
>>>>>>> kernel: [1300619.423931] RBP: 0000000000000003 R08:
>>>>>>> 000062d940000000
>>>>>>> R09: 000000000000036a Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12:
>>>>>>> ffff9d269d990094 Jul 12 15:29:08 drs1p001 kernel: [1300619.423934]
>>>>>>> R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940 Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00 Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033 Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0 Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ?
>>>>>>> ocfs2_dentry_unlock+0x35/0x80 [ocfs2] Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>>>>> This is caused by ocfs2_dentry_lock failing.
>>>>>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock fails.
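
The scenario described here can be illustrated with a toy model. This is a sketch in shell, not kernel code: the function names and the holders counter are hypothetical stand-ins for per-lock holder accounting, and the idea that the BUG fires because an unlock runs while no holder is recorded is an assumption drawn from the explanation above, not something the thread confirms.

```shell
#!/bin/bash
# Toy model with hypothetical names; this is not the ocfs2 implementation.
holders=0

# Take the lock. Passing "fail" simulates a failed lock request, in which
# case no holder is recorded.
model_lock() {
    if [ "${1:-}" = "fail" ]; then
        return 1
    fi
    holders=$((holders + 1))
}

# Drop the lock. Dropping a lock that nobody holds stands in for the BUG.
model_unlock() {
    if [ "$holders" -eq 0 ]; then
        echo "BUG: unlock without holder"
        return 1
    fi
    holders=$((holders - 1))
}

# Buggy caller: unlocks even though the lock attempt may have failed.
attach_buggy() {
    model_lock "${1:-}"
    model_unlock
}

# Fixed caller: only unlocks after a successful lock.
attach_fixed() {
    if model_lock "${1:-}"; then
        model_unlock
    else
        return 1
    fi
}
```

In this model, `attach_buggy fail` prints the BUG message, while `attach_fixed fail` returns an error without touching the counter.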
>>>>>>
>>>>>> But why it failed still confuses me.
>>>>>>
>>>>>>
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]
>>>>>>> ocfs2_lookup+0x199/0x2e0 [ocfs2] Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.423986]  ? _cond_resched+0x16/0x40 Jul 12 15:29:08
>>>>>>> drs1p001
>>>>>>> kernel: [1300619.423989]  lookup_slow+0xa9/0x170 Jul 12 15:29:08
>>>>>>> drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350 Jul
>>>>>>> 12
>>>>>>> 15:29:08 drs1p001 kernel: [1300619.423993]  ?
>>>>>>> path_init+0x1bd/0x300 Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.423995]
>>>>>>> path_lookupat+0x73/0x220 Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260 Jul 12 15:29:08
>>>>>>> drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0 Jul
>>>>>>> 12
>>>>>>> 15:29:08 drs1p001 kernel: [1300619.424003]  ?
>>>>>>> seccomp_run_filters+0x58/0xb0 Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.424005]  ? __check_object_size+0x98/0x1a0 Jul 12 15:29:08
>>>>>>> drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>>>>>> Jul
>>>>>>> 12 15:29:08 drs1p001 kernel: [1300619.424010]  ?
>>>>>>> vfs_statx+0x73/0xe0 Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.424012]
>>>>>>> vfs_statx+0x73/0xe0 Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.424015]
>>>>>>> C_SYSC_x86_stat64+0x39/0x70 Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0 Jul 12
>>>>>>> 15:29:08
>>>>>>> drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>>>>>> Jul
>>>>>>> 12 15:29:08 drs1p001 kernel: [1300619.424022]
>>>>>>> entry_SYSENTER_compat+0x7f/0x8e Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f
>>>>>>> 0b 8b
>>>>>>> 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb
>>>>>>> d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00
>>>>>>> 00
>>>>>>> 00
>>>>>>> 00 0f 1f Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP:
>>>>>>> __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP:
>>>>>>> ffffb14b4a133b10 Jul 12 15:29:08 drs1p001 kernel: [1300619.424057]
>>>>>>> ---[ end trace aea789961795b75f ]--- Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300628.967649] ------------[ cut here ]------------
>>>>>>>
>>>>>>> As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when I was reporting my issue), whose author claimed the issue went away after he restricted the system to 1 active CPU core.
>>>>>>>
>>>>>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>>>>>
>>>>>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>> Sent: Friday, 11 May 2018 09:01
>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>>>>>> Hi Larry,
>>>>>>>>
>>>>>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>>>>>
>>>>>>>> * Install the LXC packages from the distribution
>>>>>>>> * run the command "lxc-create -n test1 -t download"
>>>>>>>> ** first run might prompt you to generate a
>>>>>>>> ~/.config/lxc/default.conf to define UID mappings
>>>>>>>> ** in a corporate environment it might be tricky to set the
>>>>>>>> http_proxy (and maybe even https_proxy) environment variables
>>>>>>>> correctly
>>>>>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>>>>>> * the container downloads to ~/.local/share/lxc/
>>>>>>>> * adapt the "config" file in that directory to add the shared
>>>>>>>> ocfs2 mount like in my example below
>>>>>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>>>>>> ** if you want to finally enable networking, most distributions
>>>>>>>> arrange a dedicated bridge (lxcbr0) which you can configure
>>>>>>>> similar to my example below
>>>>>>>> ** in my case I had to install cgroup related tools and reboot to
>>>>>>>> have all cgroups available, and to allow use of lxcbr0 bridge in
>>>>>>>> /etc/lxc/lxc-usernet
>>>>>>>>
>>>>>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. I don't know the conditions under which this will occur, unfortunately.
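A minimal sketch of what "access from within several containers" could look like, assuming containers named test1 and test2 and assuming the bind mount appears as /sw inside each container (both are illustrative, not taken from the setup above). The helper only prints one lxc-attach command per container so the commands can be reviewed before being run; it is not a confirmed reproducer.

```shell
# Hypothetical helper: print (do not run) one background reader per container.
# $1 is the path of the shared mount inside the containers, the remaining
# arguments are container names. Paths with spaces are not handled.
reader_cmds() {
    path=$1
    shift
    for name in "$@"; do
        # Each reader walks the shared tree and reads every regular file.
        printf 'lxc-attach -n %s -- sh -c "find %s -type f -exec cat {} + >/dev/null" &\n' "$name" "$path"
    done
    # Wait for all background readers to finish.
    echo wait
}

# Example: review the generated commands, then pipe them to sh to run them.
#   reader_cmds /sw test1 test2
#   reader_cmds /sw test1 test2 | sh
```

Printing the commands first keeps the sketch harmless on machines without LXC.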
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>> Sent: Donnerstag, 12. April 2018 11:20
>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> Quite an interesting issue.
>>>>>>>>
>>>>>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>>>>>
>>>>>>>> Do you have a script to build up your lxc environment?
>>>>>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Larry
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>>>>>> Hi Larry,
>>>>>>>>>
>>>>>>>>> not sure if it helps, the issue wasn't there with Debian 8 and
>>>>>>>>> kernel 3.16 - but that's a long history. Unfortunately, the only
>>>>>>>>> machine where I could try to bisect does not run any kernel <
>>>>>>>>> 4.16 without other issues.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Daniel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>>> Sent: Donnerstag, 12. April 2018 05:17
>>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>;
>>>>>>>>> ocfs2-devel at oss.oracle.com
>>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>>
>>>>>>>>> Hi Daniel,
>>>>>>>>>
>>>>>>>>> Thanks for your report.
>>>>>>>>> I'll try to reproduce this bug as you did.
>>>>>>>>>
>>>>>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Larry
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>>>>>> Hi Larry,
>>>>>>>>>>
>>>>>>>>>> below is an example config file as I use it for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>>>>>>
>>>>>>>>>> Meanwhile I'm checking whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far without success; I will report once anything happens.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Daniel
>>>>>>>>>>
>>>>>>>>>> ----
>>>>>>>>>>
>>>>>>>>>> # Distribution configuration
>>>>>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>>>>>> lxc.arch = x86_64
>>>>>>>>>>
>>>>>>>>>> # Container specific configuration
>>>>>>>>>> lxc.id_map = u 0 624288 65536
>>>>>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>>>>>
>>>>>>>>>> lxc.utsname = container1
>>>>>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>>>>>
>>>>>>>>>> lxc.network.type = veth
>>>>>>>>>> lxc.network.flags = up
>>>>>>>>>> lxc.network.link = bridge1
>>>>>>>>>> lxc.network.name = eth0
>>>>>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>>>>>
>>>>>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>>>>>
>>>>>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>>>>>
>>>>>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>>>>>
>>>>>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>>>>>
>>>>>>>>>> ----
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>>>> Sent: Mittwoch, 11. April 2018 13:31
>>>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>;
>>>>>>>>>> ocfs2-devel at oss.oracle.com
>>>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>>>>>> Hi Larry,
>>>>>>>>>>>
>>>>>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>>>>>
>>>>>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>>>>>> I was thinking of the way a directory is shared between the host and a Docker container, like
>>>>>>>>>>            docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>>>>>> That's different from yours.
>>>>>>>>>>
>>>>>>>>>> How did you set up your lxc or container?
>>>>>>>>>>
>>>>>>>>>> If you could show me the procedure, I'll try to reproduce it.
>>>>>>>>>>
>>>>>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points on the local host, will the problem recur?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Larry
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Daniel
>>>>>>>>>>>
>>>>>>> Sorry for this delayed reply.
>>>>>>>
>>>>>>> I tried with lxc + ocfs2 in your mount-shared way.
>>>>>>>
>>>>>>> But I can not reproduce your bugs.
>>>>>>>
>>>>>>> What I use is opensuse tumbleweed.
>>>>>>>
>>>>>>> The procedure I used to try to reproduce your bugs:
>>>>>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>>>>>            mount /dev/xxx /mnt
>>>>>>>    The mount then shows up as
>>>>>>>            207 65 254:16 / /mnt rw,relatime shared:94
>>>>>>>    I think this *shared* is what you want. And this mount point will be shared within multiple namespaces.
>>>>>>>
>>>>>>> 1. Start Virtual Machine Manager.
>>>>>>> 2. Add a local LXC connection by clicking File -> Add Connection.
>>>>>>>    Select LXC (Linux Containers) as the hypervisor and click Connect.
>>>>>>> 3. Select the localhost (LXC) connection and click File -> New Virtual Machine.
>>>>>>> 4. Activate Application container and click Forward.
>>>>>>>    Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>>>>>>    Click Forward.
>>>>>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>>>>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>>>>>    Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
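The "shared:94" tag quoted in step 0 is one of the optional fields of a /proc/self/mountinfo line: the fields after the mount options and before the "-" separator. A minimal sketch for extracting it follows; the function name is made up for illustration, and the sample line is the (truncated) one from step 0.

```shell
# Print the optional propagation fields (e.g. "shared:94") of one
# /proc/self/mountinfo line. Fields 1-6 are fixed; the optional fields
# start at field 7 and end at the "-" separator (or at end of line if
# the line was truncated, as in the quoted example).
propagation_of() {
    printf '%s\n' "$1" |
        awk '{ for (i = 7; i <= NF && $i != "-"; i++) print $i }'
}

# Example with the line quoted in step 0:
#   propagation_of '207 65 254:16 / /mnt rw,relatime shared:94'
# Example on a live system, for the mount point /mnt:
#   grep ' /mnt ' /proc/self/mountinfo | while read -r line; do
#       propagation_of "$line"
#   done
```

On systems with util-linux, "findmnt -o TARGET,PROPAGATION /mnt" reports the same property in readable form. A private mount prints nothing here because it has no optional fields.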
>>>>>>>
>>>>>>> If possible, could you please provide a shell script to show what you did with your mount point?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Larry
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Ocfs2-devel mailing list
>>>>>> Ocfs2-devel at oss.oracle.com
>>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>>>>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Ocfs2-devel] [EXT] Re:  OCFS2 BUG with 2 different kernels
  2019-04-29 15:57                                                   ` Wengang Wang
@ 2019-05-03  7:10                                                     ` Daniel Sobe
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Sobe @ 2019-05-03  7:10 UTC (permalink / raw)
  To: ocfs2-devel

Hi Wengang,

that is great news! I'd like to test the patch on my environment as well, can you make it available somehow?

Regards,

Daniel

-----Original Message-----
From: Wengang Wang <wen.gang.wang@oracle.com> 
Sent: Montag, 29. April 2019 17:58
To: Daniel Sobe <daniel.sobe@nxp.com>; 'ocfs2-devel at oss.oracle.com' <ocfs2-devel@oss.oracle.com>
Subject: [EXT] Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels

Caution: EXT Email

Hi Daniel,

No worries, I was lucky to get a vmcore for this issue.  A patch is
under testing. I will post an update here shortly.

Thanks,
Wengang

On 2019/4/29 6:47, Daniel Sobe wrote:
> Hi Wengang,
>
> today I could reproduce the bug once again. Seems like kdump/kexec does not trigger, maybe because the system was still alive a while after the bug occurred.
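One possible explanation, not confirmed anywhere in this thread: on Linux a BUG() produces an oops, and an oops only becomes a panic (which is what hands control to the kdump kernel) when the kernel.panic_on_oops sysctl is 1. With the value 0 the offending task is killed and the system keeps running, which would match a machine that stays alive for a while. A minimal sketch for checking the setting; the function name is illustrative.

```shell
# Succeed if the given panic_on_oops value means "an oops becomes a panic".
# With no argument, read the live value from /proc/sys/kernel/panic_on_oops.
oops_will_panic() {
    val=${1:-$(cat /proc/sys/kernel/panic_on_oops 2>/dev/null)}
    [ "$val" = "1" ]
}

# Example:
#   oops_will_panic && echo "next BUG() will panic" || echo "oops only"
# To make the next BUG() panic, as root (not persistent across reboots):
#   sysctl -w kernel.panic_on_oops=1
```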
>
> What can I do now? This is a really nasty bug that I have been hunting for over a year now, without any clue or progress.
>
> With luck you can trigger it by heavily using "git filter-branch", but I do not have a reliable procedure; just wildly running a subdirectory filter on the kernel sources doesn't seem to do the trick.
>
> Here is the new log output:
>
> Apr 29 15:26:01 drs1p001 kernel: ------------[ cut here ]------------
> Apr 29 15:26:01 drs1p001 kernel: kernel BUG at /build/linux-tpKJY9/linux-4.19.28/fs/ocfs2/dlmglue.c:849!
> Apr 29 15:26:01 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
> Apr 29 15:26:01 drs1p001 kernel: CPU: 2 PID: 29028 Comm: git Not tainted 4.19.0-0.bpo.4-amd64 #1 Debian 4.19.28-2~bpo9+1
> Apr 29 15:26:01 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
> Apr 29 15:26:01 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 f1 32 5a fb 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Apr 29 15:26:01 drs1p001 kernel: RSP: 0018:ffffad8d0df07af8 EFLAGS: 00010046
> Apr 29 15:26:01 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff8d630e9fd818 RCX: 0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff8d630e9fd818 RDI: ffff8d630e9fd894
> Apr 29 15:26:01 drs1p001 kernel: RBP: 0000000000000003 R08: 0000729cc0000000 R09: 00000000000214c0
> Apr 29 15:26:01 drs1p001 kernel: R10: ffffad8d0df07ae0 R11: 0000000000000018 R12: ffff8d630e9fd894
> Apr 29 15:26:01 drs1p001 kernel: R13: ffff8d6455c8e000 R14: 0000000000000000 R15: ffffffffc0c3c2c0
> Apr 29 15:26:01 drs1p001 kernel: FS:  00007f8104e1d700(0000) GS:ffff8d6511b00000(0000) knlGS:0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 29 15:26:01 drs1p001 kernel: CR2: 00007f80d0000010 CR3: 0000000115180005 CR4: 00000000003606e0
> Apr 29 15:26:01 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Apr 29 15:26:01 drs1p001 kernel: Call Trace:
> Apr 29 15:26:01 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel:  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
> Apr 29 15:26:01 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel:  __lookup_slow+0x97/0x150
> Apr 29 15:26:01 drs1p001 kernel:  lookup_slow+0x35/0x50
> Apr 29 15:26:01 drs1p001 kernel:  walk_component+0x1c6/0x360
> Apr 29 15:26:01 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
> Apr 29 15:26:01 drs1p001 kernel:  path_lookupat+0x67/0x200
> Apr 29 15:26:01 drs1p001 kernel:  ? ___bpf_prog_run+0xb96/0xf20
> Apr 29 15:26:01 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
> Apr 29 15:26:01 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
> Apr 29 15:26:01 drs1p001 kernel:  ? __check_object_size+0x161/0x1a0
> Apr 29 15:26:01 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
> Apr 29 15:26:01 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
> Apr 29 15:26:01 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
> Apr 29 15:26:01 drs1p001 kernel:  vfs_statx+0x73/0xe0
> Apr 29 15:26:01 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
> Apr 29 15:26:01 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
> Apr 29 15:26:01 drs1p001 kernel:  do_syscall_64+0x55/0x110
> Apr 29 15:26:01 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Apr 29 15:26:01 drs1p001 kernel: RIP: 0033:0x7f8108f01335
> Apr 29 15:26:01 drs1p001 kernel: Code: 69 db 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 31 db 2b 00 f7 d8 64 89
> Apr 29 15:26:01 drs1p001 kernel: RSP: 002b:00007f8104e1cd08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> Apr 29 15:26:01 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007f8104e1ce50 RCX: 00007f8108f01335
> Apr 29 15:26:01 drs1p001 kernel: RDX: 00007f8104e1cd40 RSI: 00007f8104e1cd40 RDI: 00007f80d80008c0
> Apr 29 15:26:01 drs1p001 kernel: RBP: 000000000000003f R08: 0000000000000003 R09: 0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
> Apr 29 15:26:01 drs1p001 kernel: R13: 0000000000000006 R14: 0000000000000017 R15: 000055d17cc6c568
> Apr 29 15:26:01 drs1p001 kernel: Modules linked in: ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs tcp_diag inet_diag unix_diag appletalk psnap ax25 veth bridge stp llc iptable_filter snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic fuse intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp mei_wdt kvm_intel i915 wmi_bmof dell_wmi sparse_keymap dell_smbios dell_wmi_descriptor evdev kvm snd_hda_intel drm_kms_helper irqbypass snd_hda_codec crct10dif_pclmul snd_hda_core crc32_pclmul snd_hwdep drm dcdbas snd_pcm ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf mei_me snd_timer iTCO_wdt snd serio_raw iTCO_vendor_support i2c_algo_bit soundcore sg mei pcspkr intel_pch_thermal wmi video acpi_pad button pcc_cpufreq drbd lru_cache libcrc32c
> Apr 29 15:26:01 drs1p001 kernel:  ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci libata xhci_pci xhci_hcd scsi_mod i2c_i801 e1000e usbcore e1000 usb_common thermal fan [last unloaded: configfs]
> Apr 29 15:26:01 drs1p001 kernel: ---[ end trace f720d1de63741a88 ]---
> Apr 29 15:26:01 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
> Apr 29 15:26:01 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 f1 32 5a fb 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> Apr 29 15:26:01 drs1p001 kernel: RSP: 0018:ffffad8d0df07af8 EFLAGS: 00010046
> Apr 29 15:26:01 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff8d630e9fd818 RCX: 0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff8d630e9fd818 RDI: ffff8d630e9fd894
> Apr 29 15:26:01 drs1p001 kernel: RBP: 0000000000000003 R08: 0000729cc0000000 R09: 00000000000214c0
> Apr 29 15:26:01 drs1p001 kernel: R10: ffffad8d0df07ae0 R11: 0000000000000018 R12: ffff8d630e9fd894
> Apr 29 15:26:01 drs1p001 kernel: R13: ffff8d6455c8e000 R14: 0000000000000000 R15: ffffffffc0c3c2c0
> Apr 29 15:26:01 drs1p001 kernel: FS:  00007f8104e1d700(0000) GS:ffff8d6511b00000(0000) knlGS:0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 29 15:26:01 drs1p001 kernel: CR2: 00007f80d0000010 CR3: 0000000115180005 CR4: 00000000003606e0
> Apr 29 15:26:01 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 29 15:26:01 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
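In the "Code:" lines above, the byte in angle brackets marks the instruction pointer at the time of the trap. Here it is 0f followed by 0b, which is the x86 ud2 instruction that BUG() compiles to, consistent with the "invalid opcode" and "kernel BUG at" lines. A minimal sketch for pulling those two bytes out of a Code: line; the function name is made up for illustration.

```shell
# Print the marked byte and the byte after it from an oops "Code:" line,
# e.g. "0f 0b" (ud2) for a BUG() on x86.
trapping_bytes() {
    printf '%s\n' "$1" |
        sed -n 's/.*<\([0-9a-f][0-9a-f]\)> \([0-9a-f][0-9a-f]\).*/\1 \2/p'
}

# Example:
#   trapping_bytes 'Code: eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44'
# For a full disassembly, the kernel source tree ships scripts/decodecode:
#   ./scripts/decodecode < oops.txt
```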
>
>
>
> -----Original Message-----
> From: Wengang <wen.gang.wang@oracle.com>
> Sent: Mittwoch, 27. März 2019 19:18
> To: Daniel Sobe <daniel.sobe@nxp.com>; 'ocfs2-devel at oss.oracle.com' <ocfs2-devel@oss.oracle.com>
> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>
> BUG() would panic the kernel, dumping the call trace on the console, and
> then usually reboot the machine.
>
> Kexec (when kdump is installed and configured properly) would, when the
> production kernel panics, start a new kernel (the dump kernel) from a
> reserved block of memory and keep the rest of memory untouched. Once the
> dump kernel has booted, it collects the untouched memory and saves it to
> a file on disk (usually called "vmcore"), then reboots into the
> production kernel again after the collection has finished.
> So, yes, a vmcore is an image file with the kernel memory inside.
>
> Here is an example regarding kdump configuration:
>
> https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
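A minimal sketch of two checks that usually tell whether kdump is ready to capture a vmcore: whether the running kernel was booted with a crashkernel= reservation, and whether a crash kernel is currently loaded. The function names are illustrative; the two files read are standard kernel interfaces, but distribution tooling (for example the kdump-tools package on Debian) may report readiness in its own way.

```shell
# Succeed if a kernel command line reserves memory for a crash kernel.
# With no argument, check the running kernel's /proc/cmdline.
has_crashkernel() {
    case " ${1:-$(cat /proc/cmdline)} " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if a crash (dump) kernel is currently loaded via kexec.
crash_kernel_loaded() {
    [ "$(cat /sys/kernel/kexec_crash_loaded 2>/dev/null)" = "1" ]
}

# Example:
#   has_crashkernel && crash_kernel_loaded && echo "kdump looks ready"
# A crash can be provoked manually, as root, to test the setup. This
# panics the machine immediately, so only do it on a test system:
#   echo c > /proc/sysrq-trigger
```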
>
> thanks,
> wengang
> On 03/27/2019 12:57 AM, Daniel Sobe wrote:
>> In my setup, the system is still responsive (at least for a couple of minutes) after the BUG. Do I understand correctly that you want me to set up "kdump" and provoke a crash manually after this BUG has occurred, in order to obtain an image file with all kernel memory inside?
>>
>> Sorry for the stupid question, but I'm new to this.
>>
>> Regards,
>>
>> Daniel
>>
>> -----Original Message-----
>> From: Wengang Wang <wen.gang.wang@oracle.com>
>> Sent: Dienstag, 26. März 2019 22:24
>> To: Daniel Sobe <daniel.sobe@nxp.com>; 'ocfs2-devel at oss.oracle.com' <ocfs2-devel@oss.oracle.com>
>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>
>> Thank you, Daniel.
>>
>> Wengang
>>
>> On 2019/3/26 5:27, Daniel Sobe wrote:
>>> Hi Wengang,
>>>
>>> Thanks for confirming that this bug is reproducible! For a long time I was under the impression that I was the only one facing this issue.
>>>
>>> Unfortunately, I do not know what a "vmcore" is. Let me google it and then check whether I can reproduce the bug again easily and provide what you request.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>> -----Original Message-----
>>> From: ocfs2-devel-bounces at oss.oracle.com
>>> <ocfs2-devel-bounces@oss.oracle.com> On Behalf Of Wengang
>>> Sent: Montag, 18. März 2019 18:46
>>> To: ocfs2-devel at oss.oracle.com
>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>
>>> Hi,
>>>
>>> I also see this problem on an older version, 4.1.12.xxx.
>>>
>>> The l_ro_holders is changed in the ocfs2 layer, not the DLM layer. Another thing is that the dentry lock is only used to get notified when the file is deleted remotely. So the code requests a PR lock and then releases it once PR is granted.  I am not sure, but I feel this is not a DLM issue, but a memory issue on the ocfs2_lock_res.  Do you have a vmcore available for this problem?
>>>
>>> Thanks,
>>> Wengang
>>>
>>>
>>> On 02/20/2019 12:48 AM, Daniel Sobe wrote:
>>>> Hi Larry,
>>>>
>>>> The issue still happens with 4.19 as well, but it took quite a while to trigger it:
>>>>
>>>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>>>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>>>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#1] SMP PTI
>>>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24018 Comm: git Not tainted 4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>>>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016
>>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>>>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>>>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>>>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>>>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x139/0x3f0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_permission+0x79/0xe0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
>>>> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
>>>> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
>>>> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
>>>> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
>>>> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
>>>> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
>>>> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
>>>> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5
>>>> Feb 20 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd 2b 00 f7 d8 64 89
>>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 002b:00007fc258ff8d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
>>>> Feb 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX: 00007fc258ff8e50 RCX: 00007fc2622d80f5
>>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 00007fc258ff8d40 RSI: 00007fc258ff8d40 RDI: 00007fc2300008c0
>>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000045 R08: 0000000000000003 R09: 0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
>>>> Feb 20 09:37:56 drs1p001 kernel: R13: 000000000000000d R14: 0000000000000015 R15: 000055ec17f94d58
>>>> Feb 20 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof sparse_keymap snd_hda_intel dell_wmi_descriptor evdev ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi video button acpi_pad pcc_cpufreq ip_tables x_tables
>>>> Feb 20 09:37:56 drs1p001 kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan
>>>> Feb 20 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe1 ]---
>>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
>>>> Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8 EFLAGS: 00010046
>>>> Feb 20 09:37:56 drs1p001 kernel: RAX: 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618 RDI: ffff95fe8cec9694
>>>> Feb 20 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153
>>>> Feb 20 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b R12: ffff95fe8cec9694
>>>> Feb 20 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0
>>>> Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc258ff9700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0
>>>> Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> Feb 20 09:37:56 drs1p001 kernel: ------------[ cut here ]------------
>>>> Feb 20 09:37:56 drs1p001 kernel: kernel BUG at /build/linux-Ut6wTa/linux-4.19.12/fs/ocfs2/dlmglue.c:849!
>>>> Feb 20 09:37:56 drs1p001 kernel: invalid opcode: 0000 [#2] SMP PTI
>>>> Feb 20 09:37:56 drs1p001 kernel: CPU: 1 PID: 24024 Comm: git Tainted: G      D           4.19.0-0.bpo.1-amd64 #1 Debian 4.19.12-1~bpo9+1
>>>> Feb 20 09:37:56 drs1p001 kernel: Hardware name: Dell Inc. OptiPlex
>>>> 5040/0R790T, BIOS 1.2.7 01/15/2016 Feb 20 09:37:56 drs1p001 kernel:
>>>> RIP: 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2] Feb 20
>>>> 09:37:56 drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de
>>>> 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74
>>>> c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00
>>>> 00 00
>>>> 00 00 0f 1f 44 Feb 20 09:37:56 drs1p001 kernel: RSP:
>>>> 0018:ffffaa68c8177af8 EFLAGS: 00010046 Feb 20 09:37:56 drs1p001
>>>> kernel: RAX: 0000000000000292 RBX: ffff95fdf3b23418 RCX:
>>>> 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: RDX:
>>>> 0000000000000000 RSI: ffff95fdf3b23418 RDI: ffff95fdf3b23494 Feb 20
>>>> 09:37:56 drs1p001 kernel: RBP: 0000000000000003 R08: ffff95fe91aa2620
>>>> R09: 0000000000000089 Feb 20 09:37:56 drs1p001 kernel: R10:
>>>> ffffaa68c8177ae0 R11: ffff95fe6e3efb40 R12: ffff95fdf3b23494 Feb 20
>>>> 09:37:56 drs1p001 kernel: R13: ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0 Feb 20 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000) GS:ffff95fe91a80000(0000) knlGS:0000000000000000 Feb 20 09:37:56 drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Feb 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3: 00000001617fc002 CR4: 00000000003606e0 Feb 20 09:37:56 drs1p001 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Feb 20 09:37:56 drs1p001 kernel: Call Trace:
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? d_splice_alias+0x29d/0x3f0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel:  __lookup_slow+0x97/0x150
>>>> Feb 20 09:37:56 drs1p001 kernel:  lookup_slow+0x35/0x50
>>>> Feb 20 09:37:56 drs1p001 kernel:  walk_component+0x1c6/0x360
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? __ocfs2_cluster_lock.isra.37+0x62d/0x7b0 [ocfs2]
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? __aa_path_perm.part.6+0x6b/0x80
>>>> Feb 20 09:37:56 drs1p001 kernel:  path_lookupat+0x67/0x200
>>>> Feb 20 09:37:56 drs1p001 kernel:  filename_lookup+0xb8/0x1a0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? seccomp_run_filters+0x58/0xb0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? __check_object_size+0x9d/0x1a0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? strncpy_from_user+0x48/0x160
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? getname_flags+0x6a/0x1e0
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? vfs_statx+0x73/0xe0
>>>> Feb 20 09:37:56 drs1p001 kernel:  vfs_statx+0x73/0xe0
>>>> Feb 20 09:37:56 drs1p001 kernel:  __do_sys_newlstat+0x39/0x70
>>>> Feb 20 09:37:56 drs1p001 kernel:  ? syscall_trace_enter+0x117/0x2c0
>>>> Feb 20 09:37:56 drs1p001 kernel:  do_syscall_64+0x55/0x110
>>>> Feb 20 09:37:56 drs1p001 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> Feb 20 09:37:56 drs1p001 kernel: RIP: 0033:0x7fc2622d80f5 Feb 20
>>>> 09:37:56 drs1p001 kernel: Code: a9 dd 2b 00 64 c7 00 16 00 00 00 b8
>>>> ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6
>>>> b8
>>>> 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 71 dd
>>>> 2b
>>>> 00 f7 d8 64 89 Feb 20 09:37:56 drs1p001 kernel: RSP:
>>>> 002b:00007fc24d7f9d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006 Feb
>>>> 20 09:37:56 drs1p001 kernel: RAX: ffffffffffffffda RBX:
>>>> 00007fc24d7f9e50 RCX: 00007fc2622d80f5 Feb 20 09:37:56 drs1p001
>>>> kernel: RDX: 00007fc24d7f9d40 RSI: 00007fc24d7f9d40 RDI:
>>>> 00007fc2100008c0 Feb 20 09:37:56 drs1p001 kernel: RBP:
>>>> 0000000000000044 R08: 0000000000000003 R09: 0000000000000000 Feb 20
>>>> 09:37:56 drs1p001 kernel: R10: 0000000000000000 R11: 0000000000000246
>>>> R12: 0000000000000005 Feb 20 09:37:56 drs1p001 kernel: R13:
>>>> 000000000000000f R14: 000000000000001d R15: 000055ec18015008 Feb 20
>>>> 09:37:56 drs1p001 kernel: Modules linked in: tcp_diag inet_diag
>>>> unix_diag appletalk psnap ax25 veth fuse ocfs2_dlmfs ocfs2_stack_o2cb
>>>> ocfs2_dlm ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue quota_tree
>>>> dm_mod drbd lru_cache libcrc32c bridge stp llc snd_hda_codec_hdmi
>>>> rfkill snd_hda_codec_realtek snd_hda_codec_generic intel_rapl
>>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915
>>>> irqbypass crct10dif_pclmul crc32_pclmul dell_wmi dell_smbios wmi_bmof
>>>> sparse_keymap snd_hda_intel dell_wmi_descriptor evdev
>>>> ghash_clmulni_intel snd_hda_codec drm_kms_helper intel_cstate
>>>> snd_hda_core intel_uncore snd_hwdep dcdbas intel_rapl_perf snd_pcm
>>>> drm snd_timer snd mei_me soundcore pcspkr i2c_algo_bit
>>>> intel_pch_thermal iTCO_wdt mei serio_raw iTCO_vendor_support sg wmi
>>>> video button acpi_pad pcc_cpufreq ip_tables x_tables Feb 20 09:37:56
>>>> drs1p001
>>>> kernel:  autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb
>>>> sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd
>>>> cryptd ahci glue_helper libahci e1000 libata xhci_pci psmouse
>>>> xhci_hcd scsi_mod e1000e usbcore i2c_i801 usb_common thermal fan Feb
>>>> 20
>>>> 09:37:56 drs1p001 kernel: ---[ end trace b0fe45be8de9bbe2 ]--- Feb 20
>>>> 09:37:56 drs1p001 kernel: RIP:
>>>> 0010:__ocfs2_cluster_unlock.isra.38+0x9d/0xb0 [ocfs2] Feb 20 09:37:56
>>>> drs1p001 kernel: Code: c6 5b 5d 41 5c 41 5d e9 41 0d ec de 0f 0b 8b
>>>> 53
>>>> 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f
>>>> 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00
>>>> 0f 1f 44 Feb 20 09:37:56 drs1p001 kernel: RSP: 0018:ffffaa68c813faf8
>>>> EFLAGS: 00010046 Feb 20 09:37:56 drs1p001 kernel: RAX:
>>>> 0000000000000292 RBX: ffff95fe8cec9618 RCX: 0000000000000000 Feb 20
>>>> 09:37:56 drs1p001 kernel: RDX: 0000000000000000 RSI: ffff95fe8cec9618
>>>> RDI: ffff95fe8cec9694 Feb 20 09:37:56 drs1p001 kernel: RBP:
>>>> 0000000000000003 R08: 00006a0340000000 R09: 0000000000000153 Feb 20
>>>> 09:37:56 drs1p001 kernel: R10: ffffaa68c813fae0 R11: 000000000000000b
>>>> R12: ffff95fe8cec9694 Feb 20 09:37:56 drs1p001 kernel: R13:
>>>> ffff95fe8a876000 R14: 0000000000000000 R15: ffffffffc0f122c0 Feb 20
>>>> 09:37:56 drs1p001 kernel: FS:  00007fc24d7fa700(0000)
>>>> GS:ffff95fe91a80000(0000) knlGS:0000000000000000 Feb 20 09:37:56
>>>> drs1p001 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> Feb
>>>> 20 09:37:56 drs1p001 kernel: CR2: 00007fc224000010 CR3:
>>>> 00000001617fc002 CR4: 00000000003606e0 Feb 20 09:37:56 drs1p001
>>>> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000 Feb 20 09:37:56 drs1p001 kernel: DR3:
>>>> 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>> -----Original Message-----
>>>> From: Daniel Sobe
>>>> Sent: Tuesday, 11 September 2018 13:36
>>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: RE: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Larry,
>>>>
>>>> I tested your script and indeed it does not provoke the error. Meanwhile I have switched to a newer kernel, which makes the bug harder to provoke. Here is the stack trace:
>>>>
>>>> Sep 11 13:08:51 drs1p002 kernel: ------------[ cut here ]------------
>>>> Sep 11 13:08:51 drs1p002 kernel: kernel BUG at /build/linux-hJelb7/linux-4.18.6/fs/ocfs2/dlmglue.c:847!
>>>> Sep 11 13:08:51 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>>> Sep
>>>> 11 13:08:51 drs1p002 kernel: CPU: 0 PID: 21443 Comm: java Not tainted
>>>> 4.18.0-1-amd64 #1 Debian 4.18.6-1 Sep 11 13:08:51 drs1p002 kernel:
>>>> Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>>> Sep
>>>> 11 13:08:51 drs1p002 kernel: RIP:
>>>> 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] Sep 11 13:08:51
>>>> drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9 6e 12 50
>>>> cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2 74 c5
>>>> eb
>>>> d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00
>>>> 90 0f 1f Sep 11 13:08:51 drs1p002 kernel: RSP: 0018:ffffb1248eeb3af8
>>>> EFLAGS: 00010046 Sep 11 13:08:51 drs1p002 kernel: RAX:
>>>> 0000000000000292 RBX: ffff95cdbd985a18 RCX: 0000000000000100 Sep 11
>>>> 13:08:51 drs1p002 kernel: RDX: 0000000000000000 RSI: 0000000000000000
>>>> RDI: ffff95cdbd985a94 Sep 11 13:08:51 drs1p002 kernel: RBP:
>>>> ffff95cdbd985a94 R08: 0000000000000000 R09: 000000000000aa47 Sep 11 13:08:51 drs1p002 kernel: R10: ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003 Sep 11 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000 R15: ffffffffc0ab3240 Sep 11 13:08:51 drs1p002 kernel: FS:  00007f2434e21700(0000) GS:ffff95ce9e200000(0000) knlGS:0000000000000000 Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 11 13:08:51 drs1p002 kernel: CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0 Sep 11 13:08:51 drs1p002 kernel: Call Trace:
>>>> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>>>> Sep 11 13:08:51 drs1p002 kernel:  ? d_splice_alias+0x299/0x410
>>>> Sep 11 13:08:51 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>>>> Sep 11 13:08:51 drs1p002 kernel:  __lookup_slow+0x97/0x150
>>>> Sep 11 13:08:51 drs1p002 kernel:  lookup_slow+0x35/0x50
>>>> Sep 11 13:08:51 drs1p002 kernel:  walk_component+0x1c4/0x480
>>>> Sep 11 13:08:51 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>>>> Sep 11 13:08:51 drs1p002 kernel:  ? path_init+0x177/0x2f0
>>>> Sep 11 13:08:51 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>>>> Sep 11 13:08:51 drs1p002 kernel:  filename_lookup+0xb6/0x190
>>>> Sep 11 13:08:51 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>>>> Sep 11 13:08:51 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>>>> Sep 11 13:08:51 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>>>> Sep 11 13:08:51 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>>>> Sep 11 13:08:51 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>>>> Sep 11 13:08:51 drs1p002 kernel:  vfs_statx+0x73/0xe0
>>>> Sep 11 13:08:51 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>>>> Sep 11 13:08:51 drs1p002 kernel:  do_syscall_64+0x55/0x110
>>>> Sep 11 13:08:51 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> Sep 11 13:08:51 drs1p002 kernel: RIP: 0033:0x7f24b6cc5995 Sep 11
>>>> 13:08:51 drs1p002 kernel: Code: f9 e4 0c 00 64 c7 00 16 00 00 00 b8
>>>> ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6
>>>> b8
>>>> 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 c1 e4
>>>> 0c
>>>> 00 f7 d8 64 89 Sep 11 13:08:51 drs1p002 kernel: RSP:
>>>> 002b:00007f2434e20388 EFLAGS: 00000246 ORIG_RAX: 0000000000000006 Sep
>>>> 11 13:08:51 drs1p002 kernel: RAX: ffffffffffffffda RBX:
>>>> 00007f2434e20390 RCX: 00007f24b6cc5995 Sep 11 13:08:51 drs1p002
>>>> kernel: RDX: 00007f2434e20390 RSI: 00007f2434e20390 RDI:
>>>> 00007f24640dd9d0 Sep 11 13:08:51 drs1p002 kernel: RBP:
>>>> 00007f2434e20450 R08: 0000000000000000 R09: 0000000000000800 Sep 11
>>>> 13:08:51 drs1p002 kernel: R10: 00007f24a2bcec15 R11: 0000000000000246
>>>> R12: 00007f24640dd9d0 Sep 11 13:08:51 drs1p002 kernel: R13:
>>>> 00007f24181d29e0 R14: 00007f2434e20468 R15: 00007f24181d2800 Sep 11
>>>> 13:08:51 drs1p002 kernel: Modules linked in: tcp_diag inet_diag
>>>> unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
>>>> ocfs2_nodemanager ocfs2_stackglue configfs iptable_filter fuse
>>>> snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic
>>>> nls_ascii nls_cp437 intel_rapl x86_pkg_temp_thermal intel_powerclamp
>>>> vfat coretemp fat kvm_intel iTCO_wdt iTCO_vendor_support evdev kvm
>>>> irqbypass crct10dif_pclmul crc32_pclmul i915 snd_hda_intel dcdbas
>>>> ghash_clmulni_intel efi_pstore snd_hda_codec intel_cstate
>>>> intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm mei_me
>>>> drm_kms_helper snd_timer snd soundcore pcspkr serio_raw efivars drm
>>>> mei lpc_ich i2c_algo_bit sg ie31200_edac video pcc_cpufreq button
>>>> drbd lru_cache libcrc32c parport_pc sunrpc ppdev lp parport efivarfs
>>>> ip_tables x_tables autofs4 ext4 crc16 Sep 11 13:08:51 drs1p002
>>>> kernel:  mbcache
>>>> jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper
>>>> aes_x86_64 dm_mod sr_mod cdrom sd_mod crc32c_intel ahci i2c_i801
>>>> libahci xhci_pci ehci_pci libata xhci_hcd ehci_hcd psmouse scsi_mod
>>>> usbcore e1000e usb_common thermal Sep 11 13:08:51 drs1p002 kernel:
>>>> ---[ end trace feba92ba6e432478 ]--- Sep 11 13:08:51 drs1p002 kernel:
>>>> RIP: 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] Sep 11
>>>> 13:08:51 drs1p002 kernel: Code: 89 ef 48 89 c6 5b 5d 41 5c 41 5d e9
>>>> 6e
>>>> 12 50 cc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53 6c 85 d2
>>>> 74 c5 eb d3 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 2e 0f 1f 84 00 00
>>>> 00
>>>> 00 00 90 0f 1f Sep 11 13:08:51 drs1p002 kernel: RSP:
>>>> 0018:ffffb1248eeb3af8 EFLAGS: 00010046 Sep 11 13:08:51 drs1p002
>>>> kernel: RAX: 0000000000000292 RBX: ffff95cdbd985a18 RCX:
>>>> 0000000000000100 Sep 11 13:08:51 drs1p002 kernel: RDX:
>>>> 0000000000000000 RSI: 0000000000000000 RDI: ffff95cdbd985a94 Sep 11
>>>> 13:08:51 drs1p002 kernel: RBP: ffff95cdbd985a94 R08: 0000000000000000
>>>> R09: 000000000000aa47 Sep 11 13:08:51 drs1p002 kernel: R10:
>>>> ffffb1248eeb3ae0 R11: 0000000000000002 R12: 0000000000000003 Sep 11
>>>> 13:08:51 drs1p002 kernel: R13: ffff95ce87dfe000 R14: 0000000000000000
>>>> R15: ffffffffc0ab3240 Sep 11 13:08:51 drs1p002 kernel: FS:
>>>> 00007f2434e21700(0000) GS:ffff95ce9e200000(0000)
>>>> knlGS:0000000000000000 Sep 11 13:08:51 drs1p002 kernel: CS:  0010 DS:
>>>> 0000 ES: 0000 CR0: 0000000080050033 Sep 11 13:08:51 drs1p002 kernel:
>>>> CR2: 00007f01eaa48000 CR3: 000000003dd86001 CR4: 00000000001606f0
>>>>
>>>>
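The kernel logs quoted in this thread were re-wrapped by the mail client, so a single syslog record is often spread over several quoted lines. A minimal sketch for splitting such text back into one record per line follows. The function name unwrap and both regular expressions are illustrative assumptions; the sketch assumes every record starts with "Mon DD HH:MM:SS host kernel:", and it collapses runs of whitespace, so the indentation of call-trace entries is lost.

```python
import re

# Quote markers such as ">>>> " at the start of a mail line.
QUOTE = re.compile(r"^(?:>\s?)+")

# Zero-width match in front of a syslog record start, e.g.
# "Sep 11 13:08:51 drs1p002 kernel:".
STAMP = re.compile(
    r"(?=\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\s+"
    r"\d{2}:\d{2}:\d{2}\s+\S+\s+kernel:)"
)


def unwrap(lines):
    """Join mail-wrapped log lines and split them again at each timestamp."""
    # Drop the quote prefix of every line, then glue everything together.
    text = " ".join(QUOTE.sub("", line).strip() for line in lines)
    # Collapse whitespace so a timestamp broken across lines matches again.
    text = re.sub(r"\s+", " ", text)
    # Splitting on an empty (lookahead) match needs Python 3.7 or newer.
    return [rec.strip() for rec in STAMP.split(text) if rec.strip()]
```

Feeding the quoted lines of one trace to unwrap() and printing the result line by line gives a log that tools such as grep can handle again.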
>>>> All I can say is that I was using Git heavily when this happened (in Eclipse, synchronizing the Git workspace). It took me around 30 minutes to see the bug again.
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>> -----Original Message-----
>>>> From: Larry Chen <lchen@suse.com>
>>>> Sent: Wednesday, 18 July 2018 10:09
>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>
>>>> Hi Daniel,
>>>>
>>>> Which cluster stack do you use, dlm or o2cb?
>>>>
>>>> I tried to reproduce the bug.
>>>>
>>>> I set up 2 virtual machines that share one block device (a qcow2 file on the host). I was using the dlm stack instead of o2cb, and the kernel version is 4.12.14. I cloned the Linux kernel tree from GitHub and executed the following shell script.
>>>>
>>>> #!/bin/bash
>>>> # Check out every tag of the kernel tree in turn to generate heavy I/O.
>>>> for i in $(git tag)
>>>> do
>>>>     echo "$i"
>>>>     git checkout "$i"
>>>> done
>>>>
>>>> The bug could not be reproduced.
>>>>
>>>> According to the backtrace, I think the bug is caused by the logic for holding a lock.
>>>>
>>>> If so, I think the bug will recur even without DRBD, LVM or other components.
>>>>
>>>> Regards,
>>>> Larry
>>>>
>>>> On 07/17/2018 04:11 PM, Daniel Sobe wrote:
>>>>> Hi Larry,
>>>>>
>>>>> I think that with the most recent crash, I have a pretty simple environment already. All it takes is an OCFS2 formatted /home volume and a GIT repository on that volume, which generates a lot of disk IO upon "git checkout" to switch branches. VMs or containers are no longer involved.
>>>>>
>>>>> The only additional simplification that I can think of concerns the layers on top of the SSD. Currently I have:
>>>>>
>>>>> SSD partition --> LVM2 --> LVM volumes --> DRBD --> OCFS2
>>>>>
>>>>> I can easily remove the DRBD layer. Removing LVM will be more difficult, but possible. Do you think any of these make sense to try?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Daniel
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>> Sent: Tuesday, 17 July 2018 04:54
>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> Could you please simplify your environment?
>>>>> Can I use several virtual machines to reproduce the bug?
>>>>>
>>>>> Thanks
>>>>> Larry
>>>>>
>>>>> On 07/16/2018 07:49 PM, Daniel Sobe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> the same issue happens with the 4.17.6 kernel from Debian unstable.
>>>>>>
>>>>>> This time no namespaces were involved, so it is now confirmed that the issue is not related to namespaces, containers and such.
>>>>>>
>>>>>> All I did was to again run "git checkout" on a git repository that is placed on an OCFS2 volume.
>>>>>>
>>>>>> After the issue occurs, I have ~ 2 mins before the system becomes unusable. Anything I can do during that time to aid debugging? I don't know what else to try to help fix this issue.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>>
>>>>>> Jul 16 13:40:24 drs1p002 kernel: ------------[ cut here ]------------
>>>>>> Jul 16 13:40:24 drs1p002 kernel: kernel BUG at /build/linux-fVnMBb/linux-4.17.6/fs/ocfs2/dlmglue.c:848!
>>>>>> Jul 16 13:40:24 drs1p002 kernel: invalid opcode: 0000 [#1] SMP PTI
>>>>>> Jul
>>>>>> 16 13:40:24 drs1p002 kernel: Modules linked in: tcp_diag inet_diag
>>>>>> unix_diag ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
>>>>>> ocfs2_nodemanager oc Jul 16 13:40:24 drs1p002 kernel:  jbd2
>>>>>> crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper
>>>>>> aes_x86_64 dm_mod sr_mod cdrom sd_mod i2c_i801 ahci libahci Jul 16
>>>>>> 13:40:24
>>>>>> drs1p002 kernel: CPU: 1 PID: 22459 Comm: git Not tainted
>>>>>> 4.17.0-1-amd64 #1 Debian 4.17.6-1 Jul 16 13:40:24 drs1p002 kernel:
>>>>>> Hardware name: Dell Inc. OptiPlex 7010/0WR7PY, BIOS A18 04/30/2014
>>>>>> Jul
>>>>>> 16 13:40:24 drs1p002 kernel: RIP:
>>>>>> 0010:__ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] Jul 16
>>>>>> 13:40:24
>>>>>> drs1p002 kernel: RSP: 0018:ffff9e57887dfaf8 EFLAGS: 00010046 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: RAX: 0000000000000292 RBX:
>>>>>> ffff92559ee9f018
>>>>>> RCX: 00000000000501e7 Jul 16 13:40:24 drs1p002 kernel: RDX:
>>>>>> 0000000000000000 RSI: ffff92559ee9f018 RDI: ffff92559ee9f094 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: RBP: ffff92559ee9f094 R08: 0000000000000000 R09: 0000000000008763 Jul 16 13:40:24 drs1p002 kernel: R10: ffff9e57887dfae0 R11: 0000000000000010 R12: 0000000000000003 Jul 16 13:40:24 drs1p002 kernel: R13: ffff9256127d6000 R14: 0000000000000000 R15: ffffffffc0d35200 Jul 16 13:40:24 drs1p002 kernel: FS:  00007f0ce8ff9700(0000) GS:ffff92561e280000(0000) knlGS:0000000000000000 Jul 16 13:40:24 drs1p002 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 16 13:40:24 drs1p002 kernel: CR2: 00007f0cac000010 CR3: 000000009ef52006 CR4: 00000000001606e0 Jul 16 13:40:24 drs1p002 kernel: Call Trace:
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_dentry_attach_lock+0x245/0x420 [ocfs2]
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? d_splice_alias+0x2a5/0x410
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ocfs2_lookup+0x233/0x2c0 [ocfs2]
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  __lookup_slow+0x97/0x150
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  lookup_slow+0x35/0x50
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  walk_component+0x1c4/0x470
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? link_path_walk+0x27c/0x510
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ktime_get+0x3e/0xa0
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  path_lookupat+0x84/0x1f0
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  filename_lookup+0xb6/0x190
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? ocfs2_inode_unlock+0xe4/0xf0 [ocfs2]
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? __check_object_size+0xa7/0x1a0
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? strncpy_from_user+0x48/0x160
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? getname_flags+0x6a/0x1e0
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  ? vfs_statx+0x73/0xe0
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  vfs_statx+0x73/0xe0
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  __do_sys_newlstat+0x39/0x70
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  do_syscall_64+0x55/0x110
>>>>>> Jul 16 13:40:24 drs1p002 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>> Jul 16 13:40:24 drs1p002 kernel: RIP: 0033:0x7f0cf43ac995 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: RSP: 002b:00007f0ce8ff8cb8 EFLAGS:
>>>>>> 00000246
>>>>>> ORIG_RAX: 0000000000000006 Jul 16 13:40:24 drs1p002 kernel: RAX:
>>>>>> ffffffffffffffda RBX: 00007f0ce8ff8df0 RCX: 00007f0cf43ac995 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: RDX: 00007f0ce8ff8ce0 RSI:
>>>>>> 00007f0ce8ff8ce0
>>>>>> RDI: 00007f0cb0000b20 Jul 16 13:40:24 drs1p002 kernel: RBP:
>>>>>> 0000000000000017 R08: 0000000000000003 R09: 0000000000000000 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: R10: 0000000000000000 R11:
>>>>>> 0000000000000246
>>>>>> R12: 00007f0ce8ff8dc4 Jul 16 13:40:24 drs1p002 kernel: R13:
>>>>>> 0000000000000008 R14: 00005573fd0aa758 R15: 0000000000000005 Jul 16
>>>>>> 13:40:24 drs1p002 kernel: Code: 48 89 ef 48 89 c6 5b 5d 41 5c 41 5d
>>>>>> e9 2e 3c a6 dc 8b 53 68 85 d2 74 13 83 ea 01 89 53 68 eb b1 8b 53
>>>>>> 6c
>>>>>> 85
>>>>>> d2 74 c5 e Jul 16 13:40:24 drs1p002 kernel: RIP:
>>>>>> __ocfs2_cluster_unlock.isra.39+0x9c/0xb0 [ocfs2] RSP:
>>>>>> ffff9e57887dfaf8 Jul 16 13:40:24 drs1p002 kernel: ---[ end trace
>>>>>> a5a84fa62e77df42 ]---
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ocfs2-devel-bounces at oss.oracle.com
>>>>>> [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Daniel
>>>>>> Sobe
>>>>>> Sent: Friday, 13 July 2018 13:56
>>>>>> To: Larry Chen <lchen@suse.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Larry,
>>>>>>
>>>>>> I'm running a playground with 3 Dell PCs with Intel CPUs, standard consumer hardware. All 3 disks are SSDs and partitioned with LVM. I have added 2 logical volumes on each system and set up 3-way replication using DRBD (on a separate local network). I'm still using DRBD 8 as it is shipped with Debian 9. 2 of those PCs are set up for the "stacked primary" volumes, on which I have created the OCFS2 volumes as a cluster of 2 nodes, using the same private network as DRBD does. Heartbeat is local (I guess, since I did not change the default and did not configure anything explicitly).
>>>>>>
>>>>>> Again I was using an LXC container for remote X via X2go. Inside the X session I opened a terminal and was compiling some code with "make -j" in my OCFS2 home directory. The next crash I reported happened while doing "git checkout", which changes a lot of workspace files.
>>>>>>
>>>>>> Next I will use kernel 4.17.6, as it was recently packaged for Debian unstable. Additionally I will work on the PC directly, to rule out that the issue is related to namespaces, control groups or anything else that is only present in a container.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>> Sent: Friday, 13 July 2018 11:49
>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> Thanks for your effort to reproduce the bug.
>>>>>> I can confirm that there is more than one bug.
>>>>>> I'll focus on this interesting issue.
>>>>>>
>>>>>>
>>>>>> On 07/12/2018 10:24 PM, Daniel Sobe wrote:
>>>>>>> Hi Larry,
>>>>>>>
>>>>>>> sorry for not responding earlier. It took me quite a while to reproduce the issue on a "playground" installation. Here's today's kernel BUG log:
>>>>>>>
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423826] ------------[ cut here ]------------
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423827] kernel BUG at /build/linux-6BBPzq/linux-4.16.5/fs/ocfs2/dlmglue.c:848!
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423835] invalid opcode:
>>>>>>> 0000 [#1] SMP PTI Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.423836] Modules linked in: btrfs zstd_compress zstd_decompress xxhash xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs tcp_diag inet_diag unix_diag appletalk ax25 ipx(C) p8023 p8022 psnap veth ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc iptable_filter fuse snd_hda_codec_hdmi rfkill intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_intel dell_wmi dell_smbios sparse_keymap irqbypass snd_hda_codec wmi_bmof dell_wmi_descriptor crct10dif_pclmul evdev crc32_pclmul i915 dcdbas snd_hda_core ghash_clmulni_intel intel_cstate snd_hwdep drm_kms_helper snd_pcm intel_uncore intel_rapl_perf snd_timer drm snd serio_raw pcspkr mei_me iTCO_wdt i2c_algo_bit Jul 12 15:29:08 drs1p001 kernel: [1300619.423870]  soundcore iTCO_vendor_support mei shpchp sg intel_pch_thermal wmi video acpi_pad button drbd lru_cache libcrc32c ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb dm_mod sr_mod cdrom sd_mod crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse ahci libahci xhci_pci libata e1000e xhci_hcd i2c_i801 e1000 scsi_mod usbcore usb_common fan thermal [last unloaded: configfs]
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423892] CPU: 2 PID: 13603 Comm: cc1 Tainted: G         C       4.16.0-0.bpo.1-amd64 #1 Debian 4.16.5-1~bpo9+1
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423894] Hardware name:
>>>>>>> Dell Inc. OptiPlex 5040/0R790T, BIOS 1.2.7 01/15/2016 Jul 12
>>>>>>> 15:29:08
>>>>>>> drs1p001 kernel: [1300619.423923] RIP:
>>>>>>> 0010:__ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] Jul 12
>>>>>>> 15:29:08
>>>>>>> drs1p001 kernel: [1300619.423925] RSP: 0018:ffffb14b4a133b10 EFLAGS:
>>>>>>> 00010046 Jul 12 15:29:08 drs1p001 kernel: [1300619.423927] RAX:
>>>>>>> 0000000000000282 RBX: ffff9d269d990018 RCX: 0000000000000000 Jul
>>>>>>> 12
>>>>>>> 15:29:08 drs1p001 kernel: [1300619.423929] RDX: 0000000000000000 RSI:
>>>>>>> ffff9d269d990018 RDI: ffff9d269d990094 Jul 12 15:29:08 drs1p001
>>>>>>> kernel: [1300619.423931] RBP: 0000000000000003 R08:
>>>>>>> 000062d940000000
>>>>>>> R09: 000000000000036a Jul 12 15:29:08 drs1p001 kernel:
>>>>>>> [1300619.423933] R10: ffffb14b4a133af8 R11: 0000000000000068 R12:
>>>>>>> ffff9d269d990094 Jul 12 15:29:08 drs1p001 kernel: [1300619.423934]
>>>>>>> R13: ffff9d2882baa000 R14: 0000000000000000 R15: ffffffffc0bf3940 Jul 12 15:29:08 drs1p001 kernel: [1300619.423936] FS:  0000000000000000(0000) GS:ffff9d2899d00000(0063) knlGS:00000000f7c99d00 Jul 12 15:29:08 drs1p001 kernel: [1300619.423938] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033 Jul 12 15:29:08 drs1p001 kernel: [1300619.423940] CR2: 00007ff9c7f3e8dc CR3: 00000001725f0002 CR4: 00000000003606e0 Jul 12 15:29:08 drs1p001 kernel: [1300619.423942] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 12 15:29:08 drs1p001 kernel: [1300619.423944] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Jul 12 15:29:08 drs1p001 kernel: [1300619.423945] Call Trace:
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423958]  ? ocfs2_dentry_unlock+0x35/0x80 [ocfs2]
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423969]  ocfs2_dentry_attach_lock+0x2cb/0x420 [ocfs2]
>>>>>> This BUG is caused by ocfs2_dentry_lock failing.
>>>>>> I'll fix it by preventing ocfs2 from calling ocfs2_dentry_unlock when ocfs2_dentry_lock has failed.
>>>>>>
>>>>>> But why it failed still confuses me.
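The failure path described here (an unlock issued although the matching lock was never granted) can be illustrated with a toy model. This is a sketch in Python, not kernel code: LockRes, attach_buggy and attach_fixed are hypothetical names, and the assertion only stands in for the holder-count check that is assumed to fire in __ocfs2_cluster_unlock.

```python
class LockRes:
    """Toy stand-in for a cluster lock resource that counts its holders."""

    def __init__(self):
        self.holders = 0

    def lock(self, fail=False):
        # Return 0 on success or a negative errno-style value on failure.
        if fail:
            return -1
        self.holders += 1
        return 0

    def unlock(self):
        # Plays the role of a BUG_ON(): dropping a hold that was never
        # taken is treated as fatal.
        if self.holders == 0:
            raise AssertionError("unlock without a matching lock")
        self.holders -= 1


def attach_buggy(res, fail):
    """Unlock unconditionally, even if taking the lock failed."""
    res.lock(fail=fail)  # return value ignored
    res.unlock()         # trips the assertion when lock() failed


def attach_fixed(res, fail):
    """Only unlock when the lock was actually taken."""
    status = res.lock(fail=fail)
    if status < 0:
        return status    # nothing was taken, so there is nothing to drop
    res.unlock()
    return 0
```

With fail=True, attach_buggy raises the assertion while attach_fixed simply returns the error. The model does not explain why the lock request fails in the first place.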
>>>>>>
>>>>>>
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423981]  ocfs2_lookup+0x199/0x2e0 [ocfs2]
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423986]  ? _cond_resched+0x16/0x40
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423989]  lookup_slow+0xa9/0x170
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423991]  walk_component+0x1c6/0x350
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423993]  ? path_init+0x1bd/0x300
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423995]  path_lookupat+0x73/0x220
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.423998]  ? ___bpf_prog_run+0xba7/0x1260
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424000]  filename_lookup+0xb8/0x1a0
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424003]  ? seccomp_run_filters+0x58/0xb0
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424005]  ? __check_object_size+0x98/0x1a0
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424008]  ? strncpy_from_user+0x48/0x160
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424010]  ? vfs_statx+0x73/0xe0
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424012]  vfs_statx+0x73/0xe0
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424015]  C_SYSC_x86_stat64+0x39/0x70
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424018]  ? syscall_trace_enter+0x117/0x2c0
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424020]  do_fast_syscall_32+0xab/0x1f0
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424022]  entry_SYSENTER_compat+0x7f/0x8e
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424025] Code: 89 c6 5b 5d 41 5c 41 5d e9 a1 77 78 db 0f 0b 8b 53 68 85 d2 74 15 83 ea 01 89 53 68 eb af 8b 53 6c 85 d2 74 c3 eb d1 0f 0b 0f 0b <0f> 0b 0f 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424055] RIP: __ocfs2_cluster_unlock.isra.36+0x9d/0xb0 [ocfs2] RSP: ffffb14b4a133b10
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300619.424057] ---[ end trace aea789961795b75f ]---
>>>>>>> Jul 12 15:29:08 drs1p001 kernel: [1300628.967649] ------------[ cut here ]------------
>>>>>>>
>>>>>>> As this occurred while compiling C code with "-j", I think we were on the wrong track: it is not about mount sharing, but rather a multicore issue. That would be in line with the other report that I found (I referenced it when reporting my issue), whose author claimed the issue went away after restricting the system to 1 active CPU core.
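If the multicore theory holds, a load that issues many lstat() calls in parallel might be a more direct way to exercise the lookup path seen in the traces than a sequential checkout loop. The following is a minimal sketch under that assumption; it has not been shown to trigger the BUG, and the names collect_paths and parallel_lstat as well as the worker and round counts are made up for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def collect_paths(root):
    """Return every file and directory path below root."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def parallel_lstat(root, workers=8, rounds=1):
    """lstat() every path below root from several threads.

    Returns the number of lstat() calls that completed.
    """
    paths = collect_paths(root)
    done = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            # os.lstat does not follow symlinks, like the newlstat
            # syscall that shows up in the call traces.
            for _ in pool.map(os.lstat, paths):
                done += 1
    return done
```

A call such as parallel_lstat("/home/user/repo", workers=os.cpu_count() or 2, rounds=10) would run it against a working tree on the OCFS2 volume (the path is a placeholder). Repeated rounds are likely to be answered from the dentry cache, so mainly the first, cold pass can be expected to reach ocfs2_lookup.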
>>>>>>>
>>>>>>> Unfortunately I could not do much with the machine afterwards. Probably the OCFS2 mechanism to reboot the node if the local heartbeat isn't updated anymore kicked in, so there was no way I could have SSHed in and run some debugging.
>>>>>>>
>>>>>>> I have now updated to the kernel Debian package of 4.16.16 backported for Debian 9. I guess I will hit the bug again and let you know.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>> Sent: Friday, 11 May 2018 09:01
>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>; ocfs2-devel at oss.oracle.com
>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> On 04/12/2018 08:20 PM, Daniel Sobe wrote:
>>>>>>>> Hi Larry,
>>>>>>>>
>>>>>>>> this is, in a nutshell, what I do to create a LXC container as "ordinary user":
>>>>>>>>
>>>>>>>> * Install the LXC packages from the distribution
>>>>>>>> * run the command "lxc-create -n test1 -t download"
>>>>>>>> ** first run might prompt you to generate a
>>>>>>>> ~/.config/lxc/default.conf to define UID mappings
>>>>>>>> ** in a corporate environment it might be tricky to set the
>>>>>>>> http_proxy (and maybe even https_proxy) environment variables
>>>>>>>> correctly
>>>>>>>> ** once the list of images is shown, select for instance "debian" "jessie" "amd64"
>>>>>>>> * the container downloads to ~/.local/share/lxc/
>>>>>>>> * adapt the "config" file in that directory to add the shared
>>>>>>>> ocfs2 mount like in my example below
>>>>>>>> * if you're lucky, then "lxc-start -d -n test1" already works, which you can confirm by "lxc-ls --fancy", and attach to the container with "lxc-attach -n test1"
>>>>>>>> ** if you want to finally enable networking, most distributions
>>>>>>>> arrange a dedicated bridge (lxcbr0) which you can configure
>>>>>>>> similar to my example below
>>>>>>>> ** in my case I had to install cgroup related tools and reboot to
>>>>>>>> have all cgroups available, and to allow use of lxcbr0 bridge in
>>>>>>>> /etc/lxc/lxc-usernet
>>>>>>>>
>>>>>>>> Now if you access the mount-shared OCFS2 file system from within several containers, the bug will (hopefully) trigger on your side as well. Unfortunately, I don't know the conditions under which it occurs.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>> Sent: Thursday, 12 April 2018 11:20
>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>
>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> Quite an interesting issue.
>>>>>>>>
>>>>>>>> I'm not familiar with lxc tools, so it may take some time to reproduce it.
>>>>>>>>
>>>>>>>> Do you have a script to build up your lxc environment?
>>>>>>>> Because I want to make sure that my environment is quite the same as yours.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Larry
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/12/2018 03:45 PM, Daniel Sobe wrote:
>>>>>>>>> Hi Larry,
>>>>>>>>>
>>>>>>>>> not sure if it helps: the issue wasn't there with Debian 8 and
>>>>>>>>> kernel 3.16, but that's a long history. Unfortunately, the only
>>>>>>>>> machine where I could try to bisect does not run any kernel <
>>>>>>>>> 4.16 without other issues.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Daniel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>>> Sent: Thursday, 12 April 2018 05:17
>>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>;
>>>>>>>>> ocfs2-devel at oss.oracle.com
>>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>>
>>>>>>>>> Hi Daniel,
>>>>>>>>>
>>>>>>>>> Thanks for your report.
>>>>>>>>> I'll try to reproduce this bug as you did.
>>>>>>>>>
>>>>>>>>> I'm afraid there may be some bugs in the interaction between cgroups and ocfs2.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Larry
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 04/11/2018 08:24 PM, Daniel Sobe wrote:
>>>>>>>>>> Hi Larry,
>>>>>>>>>>
>>>>>>>>>> below is an example config file like the one I use for LXC containers. I followed the instructions (https://wiki.debian.org/LXC), downloaded a Debian 8 container as an unprivileged user, and adapted the config file. Several of those containers run on one host and share the OCFS2 directory, as you can see in the "lxc.mount.entry" line.
>>>>>>>>>>
>>>>>>>>>> Meanwhile I'm checking whether the problem can be reproduced with shared mounts in one namespace, as you suggested. So far with no success; I will report once anything happens.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Daniel
>>>>>>>>>>
>>>>>>>>>> ----
>>>>>>>>>>
>>>>>>>>>> # Distribution configuration
>>>>>>>>>> lxc.include = /usr/share/lxc/config/debian.common.conf
>>>>>>>>>> lxc.include = /usr/share/lxc/config/debian.userns.conf
>>>>>>>>>> lxc.arch = x86_64
>>>>>>>>>>
>>>>>>>>>> # Container specific configuration
>>>>>>>>>> lxc.id_map = u 0 624288 65536
>>>>>>>>>> lxc.id_map = g 0 624288 65536
>>>>>>>>>>
>>>>>>>>>> lxc.utsname = container1
>>>>>>>>>> lxc.rootfs = /storage/uvirtuals/unpriv/container1/rootfs
>>>>>>>>>>
>>>>>>>>>> lxc.network.type = veth
>>>>>>>>>> lxc.network.flags = up
>>>>>>>>>> lxc.network.link = bridge1
>>>>>>>>>> lxc.network.name = eth0
>>>>>>>>>> lxc.network.veth.pair = aabbccddeeff
>>>>>>>>>> lxc.network.ipv4 = XX.XX.XX.XX/YY
>>>>>>>>>> lxc.network.ipv4.gateway = ZZ.ZZ.ZZ.ZZ
>>>>>>>>>>
>>>>>>>>>> lxc.cgroup.cpuset.cpus = 63-86
>>>>>>>>>>
>>>>>>>>>> lxc.mount.entry = /storage/ocfs2/sw            sw            none bind 0 0
>>>>>>>>>>
>>>>>>>>>> lxc.cgroup.memory.limit_in_bytes       = 240G
>>>>>>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 240G
>>>>>>>>>>
>>>>>>>>>> lxc.include = /usr/share/lxc/config/common.conf.d/00-lxcfs.conf
>>>>>>>>>>
>>>>>>>>>> ----
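A minimal sketch of the "adapt the config file" step, assuming the only change needed is a bind-mount line of the same shape as the `lxc.mount.entry` line in the config above. The helper name `add_bind_mount` is hypothetical; it works on the config text only and does not touch any file.

```python
def add_bind_mount(config_text, host_dir, target):
    """Append an lxc.mount.entry bind-mount line unless it is already present."""
    entry = f"lxc.mount.entry = {host_dir} {target} none bind 0 0"
    wanted = entry.split()
    for line in config_text.splitlines():
        # Compare token by token so column-aligned spacing does not matter.
        if line.split() == wanted:
            return config_text  # already configured, leave the text alone
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + entry + "\n"


if __name__ == "__main__":
    # Paths below mirror the example config; adjust them for another setup.
    print(add_bind_mount("lxc.utsname = container1\n", "/storage/ocfs2/sw", "sw"), end="")
```

Comparing whitespace-split tokens means a line that is already present with aligned columns, as in the config above, is recognised and not duplicated.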
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Larry Chen [mailto:lchen at suse.com]
>>>>>>>>>> Sent: Wednesday, 11 April 2018 13:31
>>>>>>>>>> To: Daniel Sobe <daniel.sobe@nxp.com>;
>>>>>>>>>> ocfs2-devel at oss.oracle.com
>>>>>>>>>> Subject: Re: [Ocfs2-devel] OCFS2 BUG with 2 different kernels
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 04/11/2018 07:17 PM, Daniel Sobe wrote:
>>>>>>>>>>> Hi Larry,
>>>>>>>>>>>
>>>>>>>>>>> this is what I was doing. The 2nd node, while being "declared" in the cluster.conf, does not exist yet, and thus everything was happening on one node only.
>>>>>>>>>>>
>>>>>>>>>>> I do not know in detail how LXC does the mount sharing, but I assume it simply calls "mount --bind /original/mount/point /new/mount/point" in a separate namespace (or, somehow unshares the mount from the original namespace afterwards).
>>>>>>>>>> I thought there was a way to share a directory between the host and a docker container, like
>>>>>>>>>>               docker run -v /host/directory:/container/directory -other -options image_name command_to_run
>>>>>>>>>> That's different from yours.
>>>>>>>>>>
>>>>>>>>>> How did you set up your lxc or container?
>>>>>>>>>>
>>>>>>>>>> If you could show me the procedure, I'll try to reproduce it.
>>>>>>>>>>
>>>>>>>>>> And by the way, if you get rid of lxc and just mount ocfs2 on several different mount points of the local host, will the problem recur?
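A minimal sketch of such a test without LXC, assuming the same OCFS2 volume is already reachable at several local mount points (for instance through `mount --bind`). The function names and the example paths are hypothetical. It only reads, matching the read-only usage described in the report, and it uses threads in one process, which is not equivalent to access from separate mount namespaces.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_tree(root):
    """Read every regular file below root and return the number of bytes read."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    # Read in 1 MiB chunks so large files do not fill memory.
                    while chunk := fh.read(1 << 20):
                        total += len(chunk)
            except OSError:
                continue  # skip files that vanish or cannot be read
    return total


def read_mount_points(mount_points):
    """Read all given mount points in parallel; return bytes read per mount point."""
    workers = max(1, len(mount_points))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(mount_points, pool.map(read_tree, mount_points)))


if __name__ == "__main__":
    # Hypothetical bind mounts of the same OCFS2 file system.
    print(read_mount_points(["/mnt/a", "/mnt/b"]))
```

Running it in a loop while watching the kernel log would show whether plain concurrent reads through several mount points are enough to hit the BUG.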
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Larry
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Daniel
>>>>>>>>>>>
>>>>>>> Sorry for this delayed reply.
>>>>>>>
>>>>>>> I tried with lxc + ocfs2 in your mount-shared way.
>>>>>>>
>>>>>>> But I cannot reproduce your bug.
>>>>>>>
>>>>>>> I am using openSUSE Tumbleweed.
>>>>>>>
>>>>>>> The procedure I used to try to reproduce your bug:
>>>>>>> 0. Set up the HA cluster stack and mount the ocfs2 fs on the host's /mnt with the command
>>>>>>>            mount /dev/xxx /mnt
>>>>>>>            then it shows
>>>>>>>            207 65 254:16 / /mnt rw,relatime shared:94
>>>>>>>            I think this *shared* is what you want. This mount point will be shared among multiple namespaces.
>>>>>>>
>>>>>>> 1. Start Virtual Machine Manager.
>>>>>>> 2. Add a local LXC connection by clicking File > Add Connection.
>>>>>>>            Select LXC (Linux Containers) as the hypervisor and click Connect.
>>>>>>> 3. Select the localhost (LXC) connection and click the File > New Virtual Machine menu.
>>>>>>> 4. Activate Application container and click Forward.
>>>>>>>            Set the path to the application to be launched. As an example, the field is filled with /bin/sh, which is fine to create a first container.
>>>>>>> Click Forward.
>>>>>>> 5. Choose the maximum amount of memory and CPUs to allocate to the container. Click Forward.
>>>>>>> 6. Type in a name for the container. This name will be used for all virsh commands on the container.
>>>>>>>            Click Advanced options. Select the network to connect the container to and click Finish. The container will be created and started. A console will be opened automatically.
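The `shared:94` field quoted in step 0 can be checked programmatically. A minimal sketch, assuming the quoted line is in the format of /proc/self/mountinfo (it looks like a mountinfo line cut off before the " - " separator); the function names are hypothetical.

```python
def parse_mountinfo_line(line):
    """Split one mountinfo-style line into mount point, options and propagation tags."""
    fields = line.split()
    # Fixed leading fields: mount ID, parent ID, major:minor, root,
    # mount point, per-mount options.
    _mount_id, _parent_id, _dev, _root, mount_point, options = fields[:6]
    # Optional fields follow until a lone "-" separator. The line quoted in
    # step 0 has no separator, so everything that is left counts as optional.
    rest = fields[6:]
    if "-" in rest:
        rest = rest[:rest.index("-")]
    propagation = {}
    for tag in rest:
        key, _, value = tag.partition(":")
        propagation[key] = value  # e.g. {"shared": "94"}
    return {
        "mount_point": mount_point,
        "options": options.split(","),
        "propagation": propagation,
    }


def is_shared(line):
    """True if the mount belongs to a shared peer group."""
    return "shared" in parse_mountinfo_line(line)["propagation"]


if __name__ == "__main__":
    print(parse_mountinfo_line("207 65 254:16 / /mnt rw,relatime shared:94"))
```

Running `is_shared` over every line of /proc/self/mountinfo on the host and inside a container gives a quick way to compare mount propagation between the two environments.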
>>>>>>>
>>>>>>> If possible, could you please provide a shell script that shows what you did with your mount point?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Larry
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Ocfs2-devel mailing list
>>>>>> Ocfs2-devel at oss.oracle.com
>>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>>>>
>


end of thread, other threads:[~2019-05-03  7:10 UTC | newest]

Thread overview: 32+ messages
     [not found] <HE1PR0401MB25389DFE7BFEC86453C3CB21EDBE0@HE1PR0401MB2538.eurprd04.prod.outlook.com>
2018-04-11  9:45 ` [Ocfs2-devel] OCFS2 BUG with 2 different kernels Daniel Sobe
2018-04-11 10:43   ` Larry Chen
2018-04-11 11:17     ` Daniel Sobe
2018-04-11 11:31       ` Larry Chen
2018-04-11 12:24         ` Daniel Sobe
2018-04-12  3:17           ` Larry Chen
2018-04-12  7:45             ` Daniel Sobe
     [not found]               ` <8c66f7cd-2de3-ad9e-8ec5-dc6ab934f16a@suse.com>
     [not found]                 ` <HE1PR0401MB2538AE0195A130AD527C2B31EDBC0@HE1PR0401MB2538.eurprd04.prod.outlook.com>
2018-05-11  7:01                   ` Larry Chen
2018-07-12 14:24                     ` Daniel Sobe
2018-07-13  9:35                       ` Daniel Sobe
2018-07-13  9:51                         ` Larry Chen
2018-07-13  9:48                       ` Larry Chen
2018-07-13 10:06                         ` Larry Chen
2018-07-13 11:55                         ` Daniel Sobe
2018-07-16 11:49                           ` Daniel Sobe
2018-07-17  2:54                             ` Larry Chen
2018-07-17  8:11                               ` Daniel Sobe
2018-07-18  8:08                                 ` Larry Chen
2018-07-19 12:36                                   ` Daniel Sobe
2018-09-11 11:35                                   ` Daniel Sobe
2018-09-12  7:03                                     ` Larry Chen
2019-02-20  8:48                                     ` Daniel Sobe
2019-03-18 17:45                                       ` Wengang
2019-03-26 12:27                                         ` Daniel Sobe
2019-03-26 21:24                                           ` Wengang Wang
2019-03-27  7:57                                             ` Daniel Sobe
2019-03-27 18:17                                               ` Wengang
2019-04-29 13:47                                                 ` Daniel Sobe
2019-04-29 15:57                                                   ` Wengang Wang
2019-05-03  7:10                                                     ` [Ocfs2-devel] [EXT] " Daniel Sobe
2018-04-13  1:54   ` [Ocfs2-devel] " Changwei Ge
2018-04-13  7:10     ` Daniel Sobe
