* [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
@ 2023-06-19  8:29 bugzilla-daemon
  2023-06-20 15:10 ` Christian Theune
                   ` (26 more replies)
  0 siblings, 27 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-06-19  8:29 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

            Bug ID: 217572
           Summary: Initial blocked tasks causing deterioration over hours
                    until (nearly) complete system lockup and data loss
                    with PostgreSQL 13
           Product: File System
           Version: 2.5
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: high
          Priority: P3
         Component: XFS
          Assignee: filesystem_xfs@kernel-bugs.kernel.org
          Reporter: ct@flyingcircus.io
        Regression: No

Last Friday we experienced the following hung-task messages with PostgreSQL
while performing our nightly backup using pg_dump. Normally this takes at most
a few minutes, during which IO is stressed. This time it caused high SYS CPU
time, went on for almost 20 minutes, and caused the PostgreSQL dump to fail
with inconsistent data.
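
For context, the backup step described above would look roughly like the
following; this is a minimal sketch, and the database name, output path, and
flags are illustrative assumptions, not taken from the report.

```shell
# Hedged sketch of a nightly pg_dump backup step; the database name "mydb",
# the output directory, and all flags are assumptions for illustration.
BACKUP_DIR="${BACKUP_DIR:-/tmp/pg-backup}"
mkdir -p "$BACKUP_DIR"
OUT="$BACKUP_DIR/db-$(date +%F).dump"

if command -v pg_dump >/dev/null 2>&1; then
    # --format=custom writes a compressed archive restorable with pg_restore;
    # this is the kind of sustained buffered IO the report describes.
    pg_dump --format=custom --file="$OUT" mydb
else
    # Keep the sketch runnable on machines without PostgreSQL installed.
    echo "pg_dump not available; placeholder only" > "$OUT"
fi
echo "wrote $OUT"
```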

Around 3:50 AM we got this:

[330289.821046] INFO: task .postgres-wrapp:11884 blocked for more than 122
seconds.
[330289.821830]       Not tainted 6.1.31 #1-NixOS
[330289.822285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[330289.823098] task:.postgres-wrapp state:D stack:0     pid:11884 ppid:11881 
flags:0x00000002
[330289.823911] Call Trace:
[330289.824221]  <TASK>
[330289.824451]  __schedule+0x35d/0x1370
[330289.824858]  ? mntput_no_expire+0x4a/0x250
[330289.825307]  schedule+0x5d/0xe0
[330289.825630]  rwsem_down_write_slowpath+0x34e/0x730
[330289.826128]  xfs_ilock+0xeb/0xf0 [xfs]
[330289.826599]  xfs_file_buffered_write+0x119/0x300 [xfs]
[330289.827212]  ? selinux_file_permission+0x10b/0x150
[330289.827683]  vfs_write+0x244/0x400
[330289.828049]  __x64_sys_pwrite64+0x94/0xc0
[330289.828459]  do_syscall_64+0x3a/0x90
[330289.828801]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[330289.829302] RIP: 0033:0x7ff8de90e7f7
[330289.829681] RSP: 002b:00007fff52069b08 EFLAGS: 00000202 ORIG_RAX:
0000000000000012
[330289.830408] RAX: ffffffffffffffda RBX: 0000562bb434f510 RCX:
00007ff8de90e7f7
[330289.831073] RDX: 0000000000002000 RSI: 00007ff888b52e80 RDI:
000000000000003b
[330289.831762] RBP: 00000000000021b0 R08: 000000000a000010 R09:
0000000000000040
[330289.832440] R10: 0000000004ed8000 R11: 0000000000000202 R12:
0000000000002000
[330289.833130] R13: 0000000004ed8000 R14: 00007ff8de8176c8 R15:
0000562bb434af75
[330289.833803]  </TASK>
[330289.834064] INFO: task .postgres-wrapp:1245532 blocked for more than 122
seconds.
[330289.834771]       Not tainted 6.1.31 #1-NixOS
[330289.835209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[330289.835926] task:.postgres-wrapp state:D stack:0     pid:1245532 ppid:11881
 flags:0x00000002
[330289.836752] Call Trace:
[330289.837010]  <TASK>
[330289.837258]  __schedule+0x35d/0x1370
[330289.837622]  ? page_add_file_rmap+0xba/0x2f0
[330289.838047]  ? do_set_pte+0x174/0x1c0
[330289.838420]  ? unix_stream_read_generic+0x223/0xa60
[330289.838887]  schedule+0x5d/0xe0
[330289.839265]  schedule_preempt_disabled+0x14/0x30
[330289.839758]  rwsem_down_read_slowpath+0x29e/0x4f0
[330289.840521]  down_read+0x47/0xb0
[330289.840853]  xfs_ilock+0x79/0xf0 [xfs]
[330289.841346]  xfs_file_buffered_read+0x44/0xd0 [xfs]
[330289.841945]  xfs_file_read_iter+0x6a/0xd0 [xfs]
[330289.842496]  vfs_read+0x23c/0x310
[330289.842845]  __x64_sys_pread64+0x94/0xc0
[330289.843303]  do_syscall_64+0x3a/0x90
[330289.843857]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[330289.844420] RIP: 0033:0x7ff8de90e747
[330289.844898] RSP: 002b:00007fff5206aae8 EFLAGS: 00000202 ORIG_RAX:
0000000000000011
[330289.845705] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007ff8de90e747
[330289.846488] RDX: 0000000000002000 RSI: 00007ff8896e0e80 RDI:
0000000000000010
[330289.847133] RBP: 0000000000000001 R08: 000000000a00000d R09:
0000000000000000
[330289.847627] R10: 0000000018fb8000 R11: 0000000000000202 R12:
00007ff8cdd91278
[330289.848113] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
0000562bb5e89e70
[330289.848631]  </TASK>


This keeps going on for a few minutes until:

[330584.618978] rcu: INFO: rcu_preempt self-detected stall on CPU
[330584.619413] rcu:    1-....: (20999 ticks this GP)
idle=1204/1/0x4000000000000000 softirq=53073547/53073547 fqs=5231
[330584.620052]         (t=21000 jiffies g=142859597 q=21168 ncpus=3)
[330584.620409] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
#1-NixOS
[330584.620880] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[330584.621588] RIP: 0010:xas_load+0xb/0x40
[330584.621844] Code: 04 48 8b 44 c2 08 c3 cc cc cc cc 48 8b 07 48 8b 40 08 c3
cc cc cc cc 0f 1f 84 00 00 00 00 00 e8 3b ff ff ff 48 89 c2 83 e2 03 <48> 83 fa
02 75 08 48 3d 00 10 00 00 77 05 c3 cc cc cc cc 0f b6 4f
[330584.622996] RSP: 0018:ffffb427c3387bf8 EFLAGS: 00000202
[330584.623337] RAX: ffff988876d3c002 RBX: ffffb427c3387d70 RCX:
0000000000000002
[330584.623787] RDX: 0000000000000002 RSI: ffff98871f9d2920 RDI:
ffffb427c3387c00
[330584.624239] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
ffffda1444482000
[330584.624686] R10: 0000000000000001 R11: 0000000000000001 R12:
000000000000f161
[330584.625136] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
ffffb427c3387e90
[330584.625591] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
knlGS:0000000000000000
[330584.626097] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[330584.626467] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
00000000000006e0
[330584.626954] Call Trace:
[330584.627125]  <IRQ>
[330584.627281]  ? rcu_dump_cpu_stacks+0xc8/0x100
[330584.627567]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
[330584.627882]  ? sched_slice+0x87/0x140
[330584.628126]  ? timekeeping_update+0xdd/0x130
[330584.628414]  ? __cgroup_account_cputime_field+0x5b/0xa0
[330584.628751]  ? update_process_times+0x77/0xb0
[330584.629036]  ? update_wall_time+0xc/0x20
[330584.629300]  ? tick_sched_handle+0x34/0x50
[330584.629564]  ? tick_sched_timer+0x6f/0x80
[330584.629823]  ? tick_sched_do_timer+0xa0/0xa0
[330584.630103]  ? __hrtimer_run_queues+0x112/0x2b0
[330584.630404]  ? hrtimer_interrupt+0xfe/0x220
[330584.630678]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
[330584.631016]  ? sysvec_apic_timer_interrupt+0x99/0xc0
[330584.631339]  </IRQ>
[330584.631485]  <TASK>
[330584.631631]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[330584.631974]  ? xas_load+0xb/0x40
[330584.632195]  ? xas_load+0x30/0x40
[330584.632415]  filemap_get_read_batch+0x16e/0x250
[330584.632710]  filemap_get_pages+0xa9/0x630
[330584.632972]  ? memcg_check_events+0xda/0x210
[330584.633259]  ? free_unref_page_commit+0x7c/0x170
[330584.633560]  ? _raw_spin_unlock_irqrestore+0x23/0x40
[330584.633880]  ? free_unref_page+0x1ac/0x220
[330584.634146]  filemap_read+0xd2/0x340
[330584.634389]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
[330584.634778]  xfs_file_read_iter+0x6a/0xd0 [xfs]
[330584.635123]  vfs_read+0x23c/0x310
[330584.635354]  __x64_sys_pread64+0x94/0xc0
[330584.635609]  do_syscall_64+0x3a/0x90
[330584.635846]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[330584.636174] RIP: 0033:0x7ff8de90e747
[330584.636437] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
[330584.637582] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
0000000000000011
[330584.638056] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007ff8de90e747
[330584.638506] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
0000000000000230
[330584.638954] RBP: 0000000000000001 R08: 000000000a00000d R09:
0000000000000000
[330584.639411] R10: 000000000f160000 R11: 0000000000000202 R12:
00007ff8cdd92688
[330584.639862] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
0000562bb5fe04d0
[330584.640313]  </TASK>
[330584.721995] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks:
{ 1-.... } 21015 jiffies s: 7297 root: 0x2/.
[330584.722930] rcu: blocking rcu_node structures (internal RCU debug):
[330584.723428] Sending NMI from CPU 0 to CPUs 1:
[330584.723763] NMI backtrace for cpu 1
[330584.723769] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
#1-NixOS
[330584.723772] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[330584.723775] RIP: 0010:xas_load+0x29/0x40
[330584.723783] Code: 00 e8 3b ff ff ff 48 89 c2 83 e2 03 48 83 fa 02 75 08 48
3d 00 10 00 00 77 05 c3 cc cc cc cc 0f b6 4f 10 48 8d 70 fe 38 48 fe <72> ee e8
20 fe ff ff 80 3e 00 75 d0 c3 cc cc cc cc 66 0f 1f 44 00
[330584.723785] RSP: 0018:ffffb427c3387bf8 EFLAGS: 00000246
[330584.723787] RAX: ffff98871f8dbdaa RBX: ffffb427c3387d70 RCX:
0000000000000000
[330584.723788] RDX: 0000000000000002 RSI: ffff98871f8dbda8 RDI:
ffffb427c3387c00
[330584.723789] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
ffffda1444482000
[330584.723790] R10: 0000000000000001 R11: 0000000000000001 R12:
000000000000f161
[330584.723791] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
ffffb427c3387e90
[330584.723793] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
knlGS:0000000000000000
[330584.723794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[330584.723795] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
00000000000006e0
[330584.723798] Call Trace:
[330584.723801]  <NMI>
[330584.723805]  ? nmi_cpu_backtrace.cold+0x1b/0x76
[330584.723810]  ? nmi_cpu_backtrace_handler+0xd/0x20
[330584.723814]  ? nmi_handle+0x5d/0x120
[330584.723817]  ? xas_load+0x29/0x40
[330584.723818]  ? default_do_nmi+0x69/0x170
[330584.723821]  ? exc_nmi+0x13c/0x170
[330584.723823]  ? end_repeat_nmi+0x16/0x67
[330584.723828]  ? xas_load+0x29/0x40
[330584.723830]  ? xas_load+0x29/0x40
[330584.723832]  ? xas_load+0x29/0x40
[330584.723834]  </NMI>
[330584.723834]  <TASK>
[330584.723835]  filemap_get_read_batch+0x16e/0x250
[330584.723840]  filemap_get_pages+0xa9/0x630
[330584.723842]  ? memcg_check_events+0xda/0x210
[330584.723845]  ? free_unref_page_commit+0x7c/0x170
[330584.723849]  ? _raw_spin_unlock_irqrestore+0x23/0x40
[330584.723851]  ? free_unref_page+0x1ac/0x220
[330584.723852]  filemap_read+0xd2/0x340
[330584.723857]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
[330584.723935]  xfs_file_read_iter+0x6a/0xd0 [xfs]
[330584.723990]  vfs_read+0x23c/0x310
[330584.723995]  __x64_sys_pread64+0x94/0xc0
[330584.723997]  do_syscall_64+0x3a/0x90
[330584.724000]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[330584.724003] RIP: 0033:0x7ff8de90e747
[330584.724019] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
[330584.724020] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
0000000000000011
[330584.724021] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007ff8de90e747
[330584.724022] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
0000000000000230
[330584.724023] RBP: 0000000000000001 R08: 000000000a00000d R09:
0000000000000000
[330584.724023] R10: 000000000f160000 R11: 0000000000000202 R12:
00007ff8cdd92688
[330584.724024] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
0000562bb5fe04d0
[330584.724026]  </TASK>


This keeps repeating until around 4:03, at which point systemd-journal gets
core-dumped:

[331277.846966] rcu: INFO: rcu_preempt self-detected stall on CPU
[331277.847413] rcu:    1-....: (713858 ticks this GP)
idle=1204/1/0x4000000000000000 softirq=53073547/53073547 fqs=311996
[331277.848088]         (t=714253 jiffies g=142859597 q=1821602 ncpus=3)
[331277.848462] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
#1-NixOS
[331277.848941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[331277.849667] RIP: 0010:xas_descend+0x22/0x70
[331277.849952] Code: cc cc cc cc cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83
e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 <48> 83 f9
02 75 08 48 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc
[331277.851120] RSP: 0018:ffffb427c3387bf0 EFLAGS: 00000202
[331277.851468] RAX: ffff988876d3c002 RBX: ffffb427c3387d70 RCX:
0000000000000002
[331277.851928] RDX: 000000000000000f RSI: ffff98871f9d2920 RDI:
ffffb427c3387c00
[331277.852386] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
ffffda1444482000
[331277.852847] R10: 0000000000000001 R11: 0000000000000001 R12:
000000000000f161
[331277.853303] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
ffffb427c3387e90
[331277.853766] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
knlGS:0000000000000000
[331277.854278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[331277.854653] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
00000000000006e0
[331277.855111] Call Trace:
[331277.855284]  <IRQ>
[331277.855433]  ? rcu_dump_cpu_stacks+0xc8/0x100
[331277.855730]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
[331277.856049]  ? sched_slice+0x87/0x140
[331277.856294]  ? timekeeping_update+0xdd/0x130
[331277.856587]  ? __cgroup_account_cputime_field+0x5b/0xa0
[331277.856930]  ? update_process_times+0x77/0xb0
[331277.857222]  ? update_wall_time+0xc/0x20
[331277.857482]  ? tick_sched_handle+0x34/0x50
[331277.857758]  ? tick_sched_timer+0x6f/0x80
[331277.858024]  ? tick_sched_do_timer+0xa0/0xa0
[331277.858306]  ? __hrtimer_run_queues+0x112/0x2b0
[331277.858613]  ? hrtimer_interrupt+0xfe/0x220
[331277.858891]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
[331277.859235]  ? sysvec_apic_timer_interrupt+0x99/0xc0
[331277.859558]  </IRQ>
[331277.859710]  <TASK>
[331277.859857]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[331277.860206]  ? xas_descend+0x22/0x70
[331277.860446]  xas_load+0x30/0x40
[331277.860665]  filemap_get_read_batch+0x16e/0x250
[331277.860967]  filemap_get_pages+0xa9/0x630
[331277.861233]  ? memcg_check_events+0xda/0x210
[331277.861517]  ? free_unref_page_commit+0x7c/0x170
[331277.861834]  ? _raw_spin_unlock_irqrestore+0x23/0x40
[331277.862158]  ? free_unref_page+0x1ac/0x220
[331277.862427]  filemap_read+0xd2/0x340
[331277.862677]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
[331277.863072]  xfs_file_read_iter+0x6a/0xd0 [xfs]
[331277.863424]  vfs_read+0x23c/0x310
[331277.863657]  __x64_sys_pread64+0x94/0xc0
[331277.863917]  do_syscall_64+0x3a/0x90
[331277.864159]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[331277.864488] RIP: 0033:0x7ff8de90e747
[331277.864752] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
[331277.865917] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
0000000000000011
[331277.866395] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007ff8de90e747
[331277.866848] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
0000000000000230
[331277.867298] RBP: 0000000000000001 R08: 000000000a00000d R09:
0000000000000000
[331277.867750] R10: 000000000f160000 R11: 0000000000000202 R12:
00007ff8cdd92688
[331277.868199] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
0000562bb5fe04d0
[331277.868658]  </TASK>
Fri Jun 16 04:03:23 AM CEST 2023 -- SERIAL CONSOLE IS LIVE --
[331289.987247] systemd[1]: Starting Serial console liveness marker...
[331289.992645] systemd[1]: nscd.service: Deactivated successfully.
[331290.011838] systemd[1]: nscd.service: Consumed 3min 16.251s CPU time,
received 12.0M IP traffic, sent 8.2M IP traffic.
[331290.020289] systemd[1]: serial-console-liveness.service: Deactivated
successfully.
[331290.035818] systemd[1]: Finished Serial console liveness marker.
[331290.068776] systemd[1]: Started Logrotate Service.
[331290.069700] systemd[1]: systemd-journald.service: Watchdog timeout (limit
3min)!
[331290.070475] systemd[1]: systemd-journald.service: Killing process 528
(systemd-journal) with signal SIGABRT.
[331290.096246] systemd[1]: logrotate.service: Deactivated successfully.
[331290.285285] systemd-coredump[1267441]: Process 528 (systemd-journal) of
user 0 dumped core.
[331290.286002] systemd-coredump[1267441]: Coredump diverted to
/var/lib/systemd/coredump/core.systemd-journal.0.39df468c96764e8c8fffc53a0d0a47d1.528.1686881003000000.zst
[331290.287045] systemd-coredump[1267441]: Module libaudit.so.1 without
build-id.
[331290.287574] systemd-coredump[1267441]: Module libnftnl.so.11 without
build-id.
[331290.288111] systemd-coredump[1267441]: Module libmnl.so.0 without build-id.
[331290.288587] systemd-coredump[1267441]: Module libgpg-error.so.0 without
build-id.
[331290.289076] systemd-coredump[1267441]: Module libattr.so.1 without
build-id.
[331290.289541] systemd-coredump[1267441]: Module libzstd.so.1 without
build-id.
[331290.289991] systemd-coredump[1267441]: Module liblzma.so.5 without
build-id.
[331290.290446] systemd-coredump[1267441]: Module libseccomp.so.2 without
build-id.
[331290.290940] systemd-coredump[1267441]: Module libpam.so.0 without build-id.
[331290.291419] systemd-coredump[1267441]: Module liblz4.so.1 without build-id.
[331290.291893] systemd-coredump[1267441]: Module libkmod.so.2 without
build-id.
[331290.292370] systemd-coredump[1267441]: Module libip4tc.so.2 without
build-id.
[331290.295165] systemd-coredump[1267441]: Module libgcrypt.so.20 without
build-id.
[331290.295678] systemd-coredump[1267441]: Module libcrypt.so.2 without
build-id.
[331290.296160] systemd-coredump[1267441]: Module libcap.so.2 without build-id.
[331290.296708] systemd-coredump[1267441]: Module libacl.so.1 without build-id.
[331290.297206] systemd-coredump[1267441]: Module libsystemd-shared-253.so
without build-id.
[331290.297768] systemd-coredump[1267441]: Module systemd-journald without
build-id.
[331290.298273] systemd-coredump[1267441]: Stack trace of thread 528:
[331290.298714] systemd-coredump[1267441]: #0  0x00007f3e96431de8
check_object_header (libsystemd-shared-253.so + 0x231de8)
[331290.299435] systemd-coredump[1267441]: #1  0x00007f3e964330d2
journal_file_move_to_object (libsystemd-shared-253.so + 0x2330d2)
[331290.300199] systemd-coredump[1267441]: #2  0x00007f3e96434a71
journal_file_find_data_object_with_hash (libsystemd-shared-253.so + 0x234a71)
[331290.302207] systemd-coredump[1267441]: #3  0x00007f3e96434d3b
journal_file_append_data (libsystemd-shared-253.so + 0x234d3b)
[331290.302983] systemd-coredump[1267441]: #4  0x00007f3e96437243
journal_file_append_entry (libsystemd-shared-253.so + 0x237243)
[331290.303767] systemd-coredump[1267441]: #5  0x000056536a4a7cd5
server_dispatch_message_real (systemd-journald + 0x10cd5)
[331290.304504] systemd-coredump[1267441]: #6  0x000056536a4a134b
dev_kmsg_record (systemd-journald + 0xa34b)
[331290.305195] systemd-coredump[1267441]: #7  0x000056536a4a182b
server_read_dev_kmsg (systemd-journald + 0xa82b)
[331290.305922] systemd-coredump[1267441]: #8  0x00007f3e9645c140
source_dispatch (libsystemd-shared-253.so + 0x25c140)
[331290.306640] systemd-coredump[1267441]: #9  0x00007f3e9645c42d
sd_event_dispatch (libsystemd-shared-253.so + 0x25c42d)
[331290.307350] systemd-coredump[1267441]: #10 0x00007f3e9645cb48 sd_event_run
(libsystemd-shared-253.so + 0x25cb48)
[331290.308043] systemd-coredump[1267441]: #11 0x000056536a4a0568 main
(systemd-journald + 0x9568)
[331290.308652] systemd-coredump[1267441]: #12 0x00007f3e9603dace
__libc_start_call_main (libc.so.6 + 0x23ace)
[331290.309292] systemd-coredump[1267441]: #13 0x00007f3e9603db89
__libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x23b89)
[331290.310048] systemd-coredump[1267441]: #14 0x000056536a4a0835 _start
(systemd-journald + 0x9835)
[331290.310692] systemd-coredump[1267441]: ELF object binary architecture: AMD
x86-64


Things then went back to normal until around 20:03 (about 14 hours later):

[389370.873026] rcu: INFO: rcu_preempt self-detected stall on CPU
[389370.873481] rcu:    0-....: (21000 ticks this GP)
idle=91fc/1/0x4000000000000000 softirq=85252827/85252827 fqs=4704
[389370.874126]         (t=21002 jiffies g=167843445 q=13889 ncpus=3)
[389370.874500] CPU: 0 PID: 2202919 Comm: .postgres-wrapp Not tainted 6.1.31
#1-NixOS
[389370.874979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[389370.875710] RIP: 0010:xas_descend+0x26/0x70
[389370.875991] Code: cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0
48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 <75> 08 48
3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc 48 c1 e8 02
[389370.877164] RSP: 0018:ffffb427c4917bf0 EFLAGS: 00000246
[389370.877512] RAX: ffff98871f8dbdaa RBX: ffffb427c4917d70 RCX:
0000000000000002
[389370.877967] RDX: 0000000000000005 RSI: ffff988876d3c000 RDI:
ffffb427c4917c00
[389370.878426] RBP: 000000000000f177 R08: ffffb427c4917e68 R09:
ffff988846485d38
[389370.878880] R10: ffffb427c4917e60 R11: ffff988846485d38 R12:
000000000000f177
[389370.879341] R13: ffff988827b4ae00 R14: 000000000000f176 R15:
ffffb427c4917e90
[389370.879798] FS:  00007ff8de817800(0000) GS:ffff98887ac00000(0000)
knlGS:0000000000000000
[389370.880308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[389370.880683] CR2: 00007ff881c8c000 CR3: 000000010dfea000 CR4:
00000000000006f0
[389370.881139] Call Trace:
[389370.881309]  <IRQ>
[389370.881463]  ? rcu_dump_cpu_stacks+0xc8/0x100
[389370.881755]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
[389370.882077]  ? sched_slice+0x87/0x140
[389370.882325]  ? perf_event_task_tick+0x64/0x370
[389370.882629]  ? __cgroup_account_cputime_field+0x5b/0xa0
[389370.882968]  ? update_process_times+0x77/0xb0
[389370.883258]  ? tick_sched_handle+0x34/0x50
[389370.883538]  ? tick_sched_timer+0x6f/0x80
[389370.883802]  ? tick_sched_do_timer+0xa0/0xa0
[389370.884084]  ? __hrtimer_run_queues+0x112/0x2b0
[389370.884388]  ? hrtimer_interrupt+0xfe/0x220
[389370.884666]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
[389370.885006]  ? sysvec_apic_timer_interrupt+0x99/0xc0
[389370.885338]  </IRQ>
[389370.885488]  <TASK>
[389370.885636]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[389370.885983]  ? xas_descend+0x26/0x70
[389370.886224]  xas_load+0x30/0x40
[389370.886456]  filemap_get_read_batch+0x16e/0x250
[389370.886758]  filemap_get_pages+0xa9/0x630
[389370.887026]  ? atime_needs_update+0x104/0x180
[389370.887317]  ? touch_atime+0x46/0x1f0
[389370.887573]  filemap_read+0xd2/0x340
[389370.887817]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
[389370.888210]  xfs_file_read_iter+0x6a/0xd0 [xfs]
[389370.888571]  vfs_read+0x23c/0x310
[389370.888797]  __x64_sys_pread64+0x94/0xc0
[389370.889058]  do_syscall_64+0x3a/0x90
[389370.889299]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[389370.889634] RIP: 0033:0x7ff8de90e747
[389370.889900] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
[389370.891063] RSP: 002b:00007fff5206a848 EFLAGS: 00000202 ORIG_RAX:
0000000000000011
[389370.891549] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007ff8de90e747
[389370.892003] RDX: 0000000000002000 RSI: 00007ff881c8ce80 RDI:
00000000000001c7
[389370.892466] RBP: 0000000000000001 R08: 000000000a00000d R09:
0000000000000000
[389370.892923] R10: 000000000f176000 R11: 0000000000000202 R12:
00007ff8cdd91278
[389370.893386] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
0000562bb5e55828
[389370.893843]  </TASK>


At this point we had to restart the VM externally because it no longer
responded properly.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.


* Re: [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
@ 2023-06-20 15:10 ` Christian Theune
  2023-06-20 15:11   ` Christian Theune
  2023-06-20 15:10 ` [Bug 217572] " bugzilla-daemon
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 32+ messages in thread
From: Christian Theune @ 2023-06-20 15:10 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: linux-xfs

Hi,

is there anything I can do to raise interest on this one? :)

> On 19. Jun 2023, at 10:29, bugzilla-daemon@kernel.org wrote:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=217572
> 
> [...]
> [330584.723787] RAX: ffff98871f8dbdaa RBX: ffffb427c3387d70 RCX:
> 0000000000000000
> [330584.723788] RDX: 0000000000000002 RSI: ffff98871f8dbda8 RDI:
> ffffb427c3387c00
> [330584.723789] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
> ffffda1444482000
> [330584.723790] R10: 0000000000000001 R11: 0000000000000001 R12:
> 000000000000f161
> [330584.723791] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
> ffffb427c3387e90
> [330584.723793] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
> knlGS:0000000000000000
> [330584.723794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [330584.723795] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
> 00000000000006e0
> [330584.723798] Call Trace:
> [330584.723801]  <NMI>
> [330584.723805]  ? nmi_cpu_backtrace.cold+0x1b/0x76
> [330584.723810]  ? nmi_cpu_backtrace_handler+0xd/0x20
> [330584.723814]  ? nmi_handle+0x5d/0x120
> [330584.723817]  ? xas_load+0x29/0x40
> [330584.723818]  ? default_do_nmi+0x69/0x170
> [330584.723821]  ? exc_nmi+0x13c/0x170
> [330584.723823]  ? end_repeat_nmi+0x16/0x67
> [330584.723828]  ? xas_load+0x29/0x40
> [330584.723830]  ? xas_load+0x29/0x40
> [330584.723832]  ? xas_load+0x29/0x40
> [330584.723834]  </NMI>
> [330584.723834]  <TASK>
> [330584.723835]  filemap_get_read_batch+0x16e/0x250
> [330584.723840]  filemap_get_pages+0xa9/0x630
> [330584.723842]  ? memcg_check_events+0xda/0x210
> [330584.723845]  ? free_unref_page_commit+0x7c/0x170
> [330584.723849]  ? _raw_spin_unlock_irqrestore+0x23/0x40
> [330584.723851]  ? free_unref_page+0x1ac/0x220
> [330584.723852]  filemap_read+0xd2/0x340
> [330584.723857]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
> [330584.723935]  xfs_file_read_iter+0x6a/0xd0 [xfs]
> [330584.723990]  vfs_read+0x23c/0x310
> [330584.723995]  __x64_sys_pread64+0x94/0xc0
> [330584.723997]  do_syscall_64+0x3a/0x90
> [330584.724000]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [330584.724003] RIP: 0033:0x7ff8de90e747
> [330584.724019] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
> [330584.724020] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000011
> [330584.724021] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007ff8de90e747
> [330584.724022] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
> 0000000000000230
> [330584.724023] RBP: 0000000000000001 R08: 000000000a00000d R09:
> 0000000000000000
> [330584.724023] R10: 000000000f160000 R11: 0000000000000202 R12:
> 00007ff8cdd92688
> [330584.724024] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
> 0000562bb5fe04d0
> [330584.724026]  </TASK>
> 
> 
> This keeps repeating until around 4:03, at which point systemd-journald gets
> coredumped:
> 
> [331277.846966] rcu: INFO: rcu_preempt self-detected stall on CPU
> [331277.847413] rcu:    1-....: (713858 ticks this GP)
> idle=1204/1/0x4000000000000000 softirq=53073547/53073547 fqs=311996
> [331277.848088]         (t=714253 jiffies g=142859597 q=1821602 ncpus=3)
> [331277.848462] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
> #1-NixOS
> [331277.848941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [331277.849667] RIP: 0010:xas_descend+0x22/0x70
> [331277.849952] Code: cc cc cc cc cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83
> e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 <48> 83 f9
> 02 75 08 48 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc
> [331277.851120] RSP: 0018:ffffb427c3387bf0 EFLAGS: 00000202
> [331277.851468] RAX: ffff988876d3c002 RBX: ffffb427c3387d70 RCX:
> 0000000000000002
> [331277.851928] RDX: 000000000000000f RSI: ffff98871f9d2920 RDI:
> ffffb427c3387c00
> [331277.852386] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
> ffffda1444482000
> [331277.852847] R10: 0000000000000001 R11: 0000000000000001 R12:
> 000000000000f161
> [331277.853303] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
> ffffb427c3387e90
> [331277.853766] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
> knlGS:0000000000000000
> [331277.854278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [331277.854653] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
> 00000000000006e0
> [331277.855111] Call Trace:
> [331277.855284]  <IRQ>
> [331277.855433]  ? rcu_dump_cpu_stacks+0xc8/0x100
> [331277.855730]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
> [331277.856049]  ? sched_slice+0x87/0x140
> [331277.856294]  ? timekeeping_update+0xdd/0x130
> [331277.856587]  ? __cgroup_account_cputime_field+0x5b/0xa0
> [331277.856930]  ? update_process_times+0x77/0xb0
> [331277.857222]  ? update_wall_time+0xc/0x20
> [331277.857482]  ? tick_sched_handle+0x34/0x50
> [331277.857758]  ? tick_sched_timer+0x6f/0x80
> [331277.858024]  ? tick_sched_do_timer+0xa0/0xa0
> [331277.858306]  ? __hrtimer_run_queues+0x112/0x2b0
> [331277.858613]  ? hrtimer_interrupt+0xfe/0x220
> [331277.858891]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
> [331277.859235]  ? sysvec_apic_timer_interrupt+0x99/0xc0
> [331277.859558]  </IRQ>
> [331277.859710]  <TASK>
> [331277.859857]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [331277.860206]  ? xas_descend+0x22/0x70
> [331277.860446]  xas_load+0x30/0x40
> [331277.860665]  filemap_get_read_batch+0x16e/0x250
> [331277.860967]  filemap_get_pages+0xa9/0x630
> [331277.861233]  ? memcg_check_events+0xda/0x210
> [331277.861517]  ? free_unref_page_commit+0x7c/0x170
> [331277.861834]  ? _raw_spin_unlock_irqrestore+0x23/0x40
> [331277.862158]  ? free_unref_page+0x1ac/0x220
> [331277.862427]  filemap_read+0xd2/0x340
> [331277.862677]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
> [331277.863072]  xfs_file_read_iter+0x6a/0xd0 [xfs]
> [331277.863424]  vfs_read+0x23c/0x310
> [331277.863657]  __x64_sys_pread64+0x94/0xc0
> [331277.863917]  do_syscall_64+0x3a/0x90
> [331277.864159]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [331277.864488] RIP: 0033:0x7ff8de90e747
> [331277.864752] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
> [331277.865917] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000011
> [331277.866395] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007ff8de90e747
> [331277.866848] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
> 0000000000000230
> [331277.867298] RBP: 0000000000000001 R08: 000000000a00000d R09:
> 0000000000000000
> [331277.867750] R10: 000000000f160000 R11: 0000000000000202 R12:
> 00007ff8cdd92688
> [331277.868199] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
> 0000562bb5fe04d0
> [331277.868658]  </TASK>
> Fri Jun 16 04:03:23 AM CEST 2023 -- SERIAL CONSOLE IS LIVE --
> [331289.987247] systemd[1]: Starting Serial console liveness marker...
> [331289.992645] systemd[1]: nscd.service: Deactivated successfully.
> [331290.011838] systemd[1]: nscd.service: Consumed 3min 16.251s CPU time,
> received 12.0M IP traffic, sent 8.2M IP traffic.
> [331290.020289] systemd[1]: serial-console-liveness.service: Deactivated
> successfully.
> [331290.035818] systemd[1]: Finished Serial console liveness marker.
> [331290.068776] systemd[1]: Started Logrotate Service.
> [331290.069700] systemd[1]: systemd-journald.service: Watchdog timeout (limit
> 3min)!
> [331290.070475] systemd[1]: systemd-journald.service: Killing process 528
> (systemd-journal) with signal SIGABRT.
> [331290.096246] systemd[1]: logrotate.service: Deactivated successfully.
> [331290.285285] systemd-coredump[1267441]: Process 528 (systemd-journal) of
> user 0 dumped core.
> [331290.286002] systemd-coredump[1267441]: Coredump diverted to
> /var/lib/systemd/coredump/core.systemd-journal.0.39df468c96764e8c8fffc53a0d0a47d1.528.1686881003000000.zst
> [331290.287045] systemd-coredump[1267441]: Module libaudit.so.1 without
> build-id.
> [331290.287574] systemd-coredump[1267441]: Module libnftnl.so.11 without
> build-id.
> [331290.288111] systemd-coredump[1267441]: Module libmnl.so.0 without build-id.
> [331290.288587] systemd-coredump[1267441]: Module libgpg-error.so.0 without
> build-id.
> [331290.289076] systemd-coredump[1267441]: Module libattr.so.1 without
> build-id.
> [331290.289541] systemd-coredump[1267441]: Module libzstd.so.1 without
> build-id.
> [331290.289991] systemd-coredump[1267441]: Module liblzma.so.5 without
> build-id.
> [331290.290446] systemd-coredump[1267441]: Module libseccomp.so.2 without
> build-id.
> [331290.290940] systemd-coredump[1267441]: Module libpam.so.0 without build-id.
> [331290.291419] systemd-coredump[1267441]: Module liblz4.so.1 without build-id.
> [331290.291893] systemd-coredump[1267441]: Module libkmod.so.2 without
> build-id.
> [331290.292370] systemd-coredump[1267441]: Module libip4tc.so.2 without
> build-id.
> [331290.295165] systemd-coredump[1267441]: Module libgcrypt.so.20 without
> build-id.
> [331290.295678] systemd-coredump[1267441]: Module libcrypt.so.2 without
> build-id.
> [331290.296160] systemd-coredump[1267441]: Module libcap.so.2 without build-id.
> [331290.296708] systemd-coredump[1267441]: Module libacl.so.1 without build-id.
> [331290.297206] systemd-coredump[1267441]: Module libsystemd-shared-253.so
> without build-id.
> [331290.297768] systemd-coredump[1267441]: Module systemd-journald without
> build-id.
> [331290.298273] systemd-coredump[1267441]: Stack trace of thread 528:
> [331290.298714] systemd-coredump[1267441]: #0  0x00007f3e96431de8
> check_object_header (libsystemd-shared-253.so + 0x231de8)
> [331290.299435] systemd-coredump[1267441]: #1  0x00007f3e964330d2
> journal_file_move_to_object (libsystemd-shared-253.so + 0x2330d2)
> [331290.300199] systemd-coredump[1267441]: #2  0x00007f3e96434a71
> journal_file_find_data_object_with_hash (libsystemd-shared-253.so + 0x234a71)
> [331290.302207] systemd-coredump[1267441]: #3  0x00007f3e96434d3b
> journal_file_append_data (libsystemd-shared-253.so + 0x234d3b)
> [331290.302983] systemd-coredump[1267441]: #4  0x00007f3e96437243
> journal_file_append_entry (libsystemd-shared-253.so + 0x237243)
> [331290.303767] systemd-coredump[1267441]: #5  0x000056536a4a7cd5
> server_dispatch_message_real (systemd-journald + 0x10cd5)
> [331290.304504] systemd-coredump[1267441]: #6  0x000056536a4a134b
> dev_kmsg_record (systemd-journald + 0xa34b)
> [331290.305195] systemd-coredump[1267441]: #7  0x000056536a4a182b
> server_read_dev_kmsg (systemd-journald + 0xa82b)
> [331290.305922] systemd-coredump[1267441]: #8  0x00007f3e9645c140
> source_dispatch (libsystemd-shared-253.so + 0x25c140)
> [331290.306640] systemd-coredump[1267441]: #9  0x00007f3e9645c42d
> sd_event_dispatch (libsystemd-shared-253.so + 0x25c42d)
> [331290.307350] systemd-coredump[1267441]: #10 0x00007f3e9645cb48 sd_event_run
> (libsystemd-shared-253.so + 0x25cb48)
> [331290.308043] systemd-coredump[1267441]: #11 0x000056536a4a0568 main
> (systemd-journald + 0x9568)
> [331290.308652] systemd-coredump[1267441]: #12 0x00007f3e9603dace
> __libc_start_call_main (libc.so.6 + 0x23ace)
> [331290.309292] systemd-coredump[1267441]: #13 0x00007f3e9603db89
> __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x23b89)
> [331290.310048] systemd-coredump[1267441]: #14 0x000056536a4a0835 _start
> (systemd-journald + 0x9835)
> [331290.310692] systemd-coredump[1267441]: ELF object binary architecture: AMD
> x86-64
> 
> 
> Things went back to normal until around 20:03 (around 14 hours later):
> 
> [389370.873026] rcu: INFO: rcu_preempt self-detected stall on CPU
> [389370.873481] rcu:    0-....: (21000 ticks this GP)
> idle=91fc/1/0x4000000000000000 softirq=85252827/85252827 fqs=4704
> [389370.874126]         (t=21002 jiffies g=167843445 q=13889 ncpus=3)
> [389370.874500] CPU: 0 PID: 2202919 Comm: .postgres-wrapp Not tainted 6.1.31
> #1-NixOS
> [389370.874979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [389370.875710] RIP: 0010:xas_descend+0x26/0x70
> [389370.875991] Code: cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0
> 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 <75> 08 48
> 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc 48 c1 e8 02
> [389370.877164] RSP: 0018:ffffb427c4917bf0 EFLAGS: 00000246
> [389370.877512] RAX: ffff98871f8dbdaa RBX: ffffb427c4917d70 RCX:
> 0000000000000002
> [389370.877967] RDX: 0000000000000005 RSI: ffff988876d3c000 RDI:
> ffffb427c4917c00
> [389370.878426] RBP: 000000000000f177 R08: ffffb427c4917e68 R09:
> ffff988846485d38
> [389370.878880] R10: ffffb427c4917e60 R11: ffff988846485d38 R12:
> 000000000000f177
> [389370.879341] R13: ffff988827b4ae00 R14: 000000000000f176 R15:
> ffffb427c4917e90
> [389370.879798] FS:  00007ff8de817800(0000) GS:ffff98887ac00000(0000)
> knlGS:0000000000000000
> [389370.880308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [389370.880683] CR2: 00007ff881c8c000 CR3: 000000010dfea000 CR4:
> 00000000000006f0
> [389370.881139] Call Trace:
> [389370.881309]  <IRQ>
> [389370.881463]  ? rcu_dump_cpu_stacks+0xc8/0x100
> [389370.881755]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
> [389370.882077]  ? sched_slice+0x87/0x140
> [389370.882325]  ? perf_event_task_tick+0x64/0x370
> [389370.882629]  ? __cgroup_account_cputime_field+0x5b/0xa0
> [389370.882968]  ? update_process_times+0x77/0xb0
> [389370.883258]  ? tick_sched_handle+0x34/0x50
> [389370.883538]  ? tick_sched_timer+0x6f/0x80
> [389370.883802]  ? tick_sched_do_timer+0xa0/0xa0
> [389370.884084]  ? __hrtimer_run_queues+0x112/0x2b0
> [389370.884388]  ? hrtimer_interrupt+0xfe/0x220
> [389370.884666]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
> [389370.885006]  ? sysvec_apic_timer_interrupt+0x99/0xc0
> [389370.885338]  </IRQ>
> [389370.885488]  <TASK>
> [389370.885636]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [389370.885983]  ? xas_descend+0x26/0x70
> [389370.886224]  xas_load+0x30/0x40
> [389370.886456]  filemap_get_read_batch+0x16e/0x250
> [389370.886758]  filemap_get_pages+0xa9/0x630
> [389370.887026]  ? atime_needs_update+0x104/0x180
> [389370.887317]  ? touch_atime+0x46/0x1f0
> [389370.887573]  filemap_read+0xd2/0x340
> [389370.887817]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
> [389370.888210]  xfs_file_read_iter+0x6a/0xd0 [xfs]
> [389370.888571]  vfs_read+0x23c/0x310
> [389370.888797]  __x64_sys_pread64+0x94/0xc0
> [389370.889058]  do_syscall_64+0x3a/0x90
> [389370.889299]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [389370.889634] RIP: 0033:0x7ff8de90e747
> [389370.889900] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
> [389370.891063] RSP: 002b:00007fff5206a848 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000011
> [389370.891549] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007ff8de90e747
> [389370.892003] RDX: 0000000000002000 RSI: 00007ff881c8ce80 RDI:
> 00000000000001c7
> [389370.892466] RBP: 0000000000000001 R08: 000000000a00000d R09:
> 0000000000000000
> [389370.892923] R10: 000000000f176000 R11: 0000000000000202 R12:
> 00007ff8cdd91278
> [389370.893386] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
> 0000562bb5e55828
> [389370.893843]  </TASK>
> 
> 
> At this point we had to restart the VM externally because it no longer
> responded properly.
> 
> -- 
> You may reply to this email to add a comment.
> 
> You are receiving this mail because:
> You are watching the assignee of the bug.


Kind regards,
Christian Theune

-- 
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Germany
HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
  2023-06-20 15:10 ` Christian Theune
@ 2023-06-20 15:10 ` bugzilla-daemon
  2023-06-20 15:13 ` bugzilla-daemon
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-06-20 15:10 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #1 from Christian Theune (ct@flyingcircus.io) ---
Hi,

is there anything I can do to raise interest in this one? :)

> On 19. Jun 2023, at 10:29, bugzilla-daemon@kernel.org wrote:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=217572
> 
>            Bug ID: 217572
>           Summary: Initial blocked tasks causing deterioration over hours
>                    until (nearly) complete system lockup and data loss
>                    with PostgreSQL 13
>           Product: File System
>           Version: 2.5
>          Hardware: All
>                OS: Linux
>            Status: NEW
>          Severity: high
>          Priority: P3
>         Component: XFS
>          Assignee: filesystem_xfs@kernel-bugs.kernel.org
>          Reporter: ct@flyingcircus.io
>        Regression: No
> 
> Last Friday we experienced the following hung task messages with PostgreSQL
> while performing our nightly backup using pg_dump. Normally this takes at most
> a few minutes with IO being stressed. This time it caused high SYS CPU time,
> went on for almost 20 minutes and caused the PostgreSQL dump to fail with
> inconsistent data.
> 
> Around 3:50 AM we got this:
> 
> [330289.821046] INFO: task .postgres-wrapp:11884 blocked for more than 122
> seconds.
> [330289.821830]       Not tainted 6.1.31 #1-NixOS
> [330289.822285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [330289.823098] task:.postgres-wrapp state:D stack:0     pid:11884 ppid:11881 
> flags:0x00000002
> [330289.823911] Call Trace:
> [330289.824221]  <TASK>
> [330289.824451]  __schedule+0x35d/0x1370
> [330289.824858]  ? mntput_no_expire+0x4a/0x250
> [330289.825307]  schedule+0x5d/0xe0
> [330289.825630]  rwsem_down_write_slowpath+0x34e/0x730
> [330289.826128]  xfs_ilock+0xeb/0xf0 [xfs]
> [330289.826599]  xfs_file_buffered_write+0x119/0x300 [xfs]
> [330289.827212]  ? selinux_file_permission+0x10b/0x150
> [330289.827683]  vfs_write+0x244/0x400
> [330289.828049]  __x64_sys_pwrite64+0x94/0xc0
> [330289.828459]  do_syscall_64+0x3a/0x90
> [330289.828801]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [330289.829302] RIP: 0033:0x7ff8de90e7f7
> [330289.829681] RSP: 002b:00007fff52069b08 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000012
> [330289.830408] RAX: ffffffffffffffda RBX: 0000562bb434f510 RCX:
> 00007ff8de90e7f7
> [330289.831073] RDX: 0000000000002000 RSI: 00007ff888b52e80 RDI:
> 000000000000003b
> [330289.831762] RBP: 00000000000021b0 R08: 000000000a000010 R09:
> 0000000000000040
> [330289.832440] R10: 0000000004ed8000 R11: 0000000000000202 R12:
> 0000000000002000
> [330289.833130] R13: 0000000004ed8000 R14: 00007ff8de8176c8 R15:
> 0000562bb434af75
> [330289.833803]  </TASK>
> [330289.834064] INFO: task .postgres-wrapp:1245532 blocked for more than 122
> seconds.
> [330289.834771]       Not tainted 6.1.31 #1-NixOS
> [330289.835209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [330289.835926] task:.postgres-wrapp state:D stack:0     pid:1245532 ppid:11881
> flags:0x00000002
> [330289.836752] Call Trace:
> [330289.837010]  <TASK>
> [330289.837258]  __schedule+0x35d/0x1370
> [330289.837622]  ? page_add_file_rmap+0xba/0x2f0
> [330289.838047]  ? do_set_pte+0x174/0x1c0
> [330289.838420]  ? unix_stream_read_generic+0x223/0xa60
> [330289.838887]  schedule+0x5d/0xe0
> [330289.839265]  schedule_preempt_disabled+0x14/0x30
> [330289.839758]  rwsem_down_read_slowpath+0x29e/0x4f0
> [330289.840521]  down_read+0x47/0xb0
> [330289.840853]  xfs_ilock+0x79/0xf0 [xfs]
> [330289.841346]  xfs_file_buffered_read+0x44/0xd0 [xfs]
> [330289.841945]  xfs_file_read_iter+0x6a/0xd0 [xfs]
> [330289.842496]  vfs_read+0x23c/0x310
> [330289.842845]  __x64_sys_pread64+0x94/0xc0
> [330289.843303]  do_syscall_64+0x3a/0x90
> [330289.843857]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [330289.844420] RIP: 0033:0x7ff8de90e747
> [330289.844898] RSP: 002b:00007fff5206aae8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000011
> [330289.845705] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007ff8de90e747
> [330289.846488] RDX: 0000000000002000 RSI: 00007ff8896e0e80 RDI:
> 0000000000000010
> [330289.847133] RBP: 0000000000000001 R08: 000000000a00000d R09:
> 0000000000000000
> [330289.847627] R10: 0000000018fb8000 R11: 0000000000000202 R12:
> 00007ff8cdd91278
> [330289.848113] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
> 0000562bb5e89e70
> [330289.848631]  </TASK>
> 
> 
> This keeps going on for a few minutes until:
> 
> [330584.618978] rcu: INFO: rcu_preempt self-detected stall on CPU
> [330584.619413] rcu:    1-....: (20999 ticks this GP)
> idle=1204/1/0x4000000000000000 softirq=53073547/53073547 fqs=5231
> [330584.620052]         (t=21000 jiffies g=142859597 q=21168 ncpus=3)
> [330584.620409] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
> #1-NixOS
> [330584.620880] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [330584.621588] RIP: 0010:xas_load+0xb/0x40
> [330584.621844] Code: 04 48 8b 44 c2 08 c3 cc cc cc cc 48 8b 07 48 8b 40 08 c3
> cc cc cc cc 0f 1f 84 00 00 00 00 00 e8 3b ff ff ff 48 89 c2 83 e2 03 <48> 83 fa
> 02 75 08 48 3d 00 10 00 00 77 05 c3 cc cc cc cc 0f b6 4f
> [330584.622996] RSP: 0018:ffffb427c3387bf8 EFLAGS: 00000202
> [330584.623337] RAX: ffff988876d3c002 RBX: ffffb427c3387d70 RCX:
> 0000000000000002
> [330584.623787] RDX: 0000000000000002 RSI: ffff98871f9d2920 RDI:
> ffffb427c3387c00
> [330584.624239] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
> ffffda1444482000
> [330584.624686] R10: 0000000000000001 R11: 0000000000000001 R12:
> 000000000000f161
> [330584.625136] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
> ffffb427c3387e90
> [330584.625591] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
> knlGS:0000000000000000
> [330584.626097] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [330584.626467] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
> 00000000000006e0
> [330584.626954] Call Trace:
> [330584.627125]  <IRQ>
> [330584.627281]  ? rcu_dump_cpu_stacks+0xc8/0x100
> [330584.627567]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
> [330584.627882]  ? sched_slice+0x87/0x140
> [330584.628126]  ? timekeeping_update+0xdd/0x130
> [330584.628414]  ? __cgroup_account_cputime_field+0x5b/0xa0
> [330584.628751]  ? update_process_times+0x77/0xb0
> [330584.629036]  ? update_wall_time+0xc/0x20
> [330584.629300]  ? tick_sched_handle+0x34/0x50
> [330584.629564]  ? tick_sched_timer+0x6f/0x80
> [330584.629823]  ? tick_sched_do_timer+0xa0/0xa0
> [330584.630103]  ? __hrtimer_run_queues+0x112/0x2b0
> [330584.630404]  ? hrtimer_interrupt+0xfe/0x220
> [330584.630678]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
> [330584.631016]  ? sysvec_apic_timer_interrupt+0x99/0xc0
> [330584.631339]  </IRQ>
> [330584.631485]  <TASK>
> [330584.631631]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [330584.631974]  ? xas_load+0xb/0x40
> [330584.632195]  ? xas_load+0x30/0x40
> [330584.632415]  filemap_get_read_batch+0x16e/0x250
> [330584.632710]  filemap_get_pages+0xa9/0x630
> [330584.632972]  ? memcg_check_events+0xda/0x210
> [330584.633259]  ? free_unref_page_commit+0x7c/0x170
> [330584.633560]  ? _raw_spin_unlock_irqrestore+0x23/0x40
> [330584.633880]  ? free_unref_page+0x1ac/0x220
> [330584.634146]  filemap_read+0xd2/0x340
> [330584.634389]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
> [330584.634778]  xfs_file_read_iter+0x6a/0xd0 [xfs]
> [330584.635123]  vfs_read+0x23c/0x310
> [330584.635354]  __x64_sys_pread64+0x94/0xc0
> [330584.635609]  do_syscall_64+0x3a/0x90
> [330584.635846]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [330584.636174] RIP: 0033:0x7ff8de90e747
> [330584.636437] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
> [330584.637582] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000011
> [330584.638056] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007ff8de90e747
> [330584.638506] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
> 0000000000000230
> [330584.638954] RBP: 0000000000000001 R08: 000000000a00000d R09:
> 0000000000000000
> [330584.639411] R10: 000000000f160000 R11: 0000000000000202 R12:
> 00007ff8cdd92688
> [330584.639862] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
> 0000562bb5fe04d0
> [330584.640313]  </TASK>
> [330584.721995] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks:
> { 1-.... } 21015 jiffies s: 7297 root: 0x2/.
> [330584.722930] rcu: blocking rcu_node structures (internal RCU debug):
> [330584.723428] Sending NMI from CPU 0 to CPUs 1:
> [330584.723763] NMI backtrace for cpu 1
> [330584.723769] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
> #1-NixOS
> [330584.723772] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [330584.723775] RIP: 0010:xas_load+0x29/0x40
> [330584.723783] Code: 00 e8 3b ff ff ff 48 89 c2 83 e2 03 48 83 fa 02 75 08 48
> 3d 00 10 00 00 77 05 c3 cc cc cc cc 0f b6 4f 10 48 8d 70 fe 38 48 fe <72> ee e8
> 20 fe ff ff 80 3e 00 75 d0 c3 cc cc cc cc 66 0f 1f 44 00
> [330584.723785] RSP: 0018:ffffb427c3387bf8 EFLAGS: 00000246
> [330584.723787] RAX: ffff98871f8dbdaa RBX: ffffb427c3387d70 RCX:
> 0000000000000000
> [330584.723788] RDX: 0000000000000002 RSI: ffff98871f8dbda8 RDI:
> ffffb427c3387c00
> [330584.723789] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
> ffffda1444482000
> [330584.723790] R10: 0000000000000001 R11: 0000000000000001 R12:
> 000000000000f161
> [330584.723791] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
> ffffb427c3387e90
> [330584.723793] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
> knlGS:0000000000000000
> [330584.723794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [330584.723795] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
> 00000000000006e0
> [330584.723798] Call Trace:
> [330584.723801]  <NMI>
> [330584.723805]  ? nmi_cpu_backtrace.cold+0x1b/0x76
> [330584.723810]  ? nmi_cpu_backtrace_handler+0xd/0x20
> [330584.723814]  ? nmi_handle+0x5d/0x120
> [330584.723817]  ? xas_load+0x29/0x40
> [330584.723818]  ? default_do_nmi+0x69/0x170
> [330584.723821]  ? exc_nmi+0x13c/0x170
> [330584.723823]  ? end_repeat_nmi+0x16/0x67
> [330584.723828]  ? xas_load+0x29/0x40
> [330584.723830]  ? xas_load+0x29/0x40
> [330584.723832]  ? xas_load+0x29/0x40
> [330584.723834]  </NMI>
> [330584.723834]  <TASK>
> [330584.723835]  filemap_get_read_batch+0x16e/0x250
> [330584.723840]  filemap_get_pages+0xa9/0x630
> [330584.723842]  ? memcg_check_events+0xda/0x210
> [330584.723845]  ? free_unref_page_commit+0x7c/0x170
> [330584.723849]  ? _raw_spin_unlock_irqrestore+0x23/0x40
> [330584.723851]  ? free_unref_page+0x1ac/0x220
> [330584.723852]  filemap_read+0xd2/0x340
> [330584.723857]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
> [330584.723935]  xfs_file_read_iter+0x6a/0xd0 [xfs]
> [330584.723990]  vfs_read+0x23c/0x310
> [330584.723995]  __x64_sys_pread64+0x94/0xc0
> [330584.723997]  do_syscall_64+0x3a/0x90
> [330584.724000]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [330584.724003] RIP: 0033:0x7ff8de90e747
> [330584.724019] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
> [330584.724020] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000011
> [330584.724021] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007ff8de90e747
> [330584.724022] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
> 0000000000000230
> [330584.724023] RBP: 0000000000000001 R08: 000000000a00000d R09:
> 0000000000000000
> [330584.724023] R10: 000000000f160000 R11: 0000000000000202 R12:
> 00007ff8cdd92688
> [330584.724024] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
> 0000562bb5fe04d0
> [330584.724026]  </TASK>
> 
> 
> This keeps repeating until around 4:03, at which point systemd-journald gets
> coredumped:
> 
> [331277.846966] rcu: INFO: rcu_preempt self-detected stall on CPU
> [331277.847413] rcu:    1-....: (713858 ticks this GP)
> idle=1204/1/0x4000000000000000 softirq=53073547/53073547 fqs=311996
> [331277.848088]         (t=714253 jiffies g=142859597 q=1821602 ncpus=3)
> [331277.848462] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
> #1-NixOS
> [331277.848941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [331277.849667] RIP: 0010:xas_descend+0x22/0x70
> [331277.849952] Code: cc cc cc cc cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83
> e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 <48> 83 f9
> 02 75 08 48 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc
> [331277.851120] RSP: 0018:ffffb427c3387bf0 EFLAGS: 00000202
> [331277.851468] RAX: ffff988876d3c002 RBX: ffffb427c3387d70 RCX:
> 0000000000000002
> [331277.851928] RDX: 000000000000000f RSI: ffff98871f9d2920 RDI:
> ffffb427c3387c00
> [331277.852386] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
> ffffda1444482000
> [331277.852847] R10: 0000000000000001 R11: 0000000000000001 R12:
> 000000000000f161
> [331277.853303] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
> ffffb427c3387e90
> [331277.853766] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
> knlGS:0000000000000000
> [331277.854278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [331277.854653] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
> 00000000000006e0
> [331277.855111] Call Trace:
> [331277.855284]  <IRQ>
> [331277.855433]  ? rcu_dump_cpu_stacks+0xc8/0x100
> [331277.855730]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
> [331277.856049]  ? sched_slice+0x87/0x140
> [331277.856294]  ? timekeeping_update+0xdd/0x130
> [331277.856587]  ? __cgroup_account_cputime_field+0x5b/0xa0
> [331277.856930]  ? update_process_times+0x77/0xb0
> [331277.857222]  ? update_wall_time+0xc/0x20
> [331277.857482]  ? tick_sched_handle+0x34/0x50
> [331277.857758]  ? tick_sched_timer+0x6f/0x80
> [331277.858024]  ? tick_sched_do_timer+0xa0/0xa0
> [331277.858306]  ? __hrtimer_run_queues+0x112/0x2b0
> [331277.858613]  ? hrtimer_interrupt+0xfe/0x220
> [331277.858891]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
> [331277.859235]  ? sysvec_apic_timer_interrupt+0x99/0xc0
> [331277.859558]  </IRQ>
> [331277.859710]  <TASK>
> [331277.859857]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [331277.860206]  ? xas_descend+0x22/0x70
> [331277.860446]  xas_load+0x30/0x40
> [331277.860665]  filemap_get_read_batch+0x16e/0x250
> [331277.860967]  filemap_get_pages+0xa9/0x630
> [331277.861233]  ? memcg_check_events+0xda/0x210
> [331277.861517]  ? free_unref_page_commit+0x7c/0x170
> [331277.861834]  ? _raw_spin_unlock_irqrestore+0x23/0x40
> [331277.862158]  ? free_unref_page+0x1ac/0x220
> [331277.862427]  filemap_read+0xd2/0x340
> [331277.862677]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
> [331277.863072]  xfs_file_read_iter+0x6a/0xd0 [xfs]
> [331277.863424]  vfs_read+0x23c/0x310
> [331277.863657]  __x64_sys_pread64+0x94/0xc0
> [331277.863917]  do_syscall_64+0x3a/0x90
> [331277.864159]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [331277.864488] RIP: 0033:0x7ff8de90e747
> [331277.864752] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
> [331277.865917] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000011
> [331277.866395] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007ff8de90e747
> [331277.866848] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
> 0000000000000230
> [331277.867298] RBP: 0000000000000001 R08: 000000000a00000d R09:
> 0000000000000000
> [331277.867750] R10: 000000000f160000 R11: 0000000000000202 R12:
> 00007ff8cdd92688
> [331277.868199] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
> 0000562bb5fe04d0
> [331277.868658]  </TASK>
> Fri Jun 16 04:03:23 AM CEST 2023 -- SERIAL CONSOLE IS LIVE --
> [331289.987247] systemd[1]: Starting Serial console liveness marker...
> [331289.992645] systemd[1]: nscd.service: Deactivated successfully.
> [331290.011838] systemd[1]: nscd.service: Consumed 3min 16.251s CPU time,
> received 12.0M IP traffic, sent 8.2M IP traffic.
> [331290.020289] systemd[1]: serial-console-liveness.service: Deactivated
> successfully.
> [331290.035818] systemd[1]: Finished Serial console liveness marker.
> [331290.068776] systemd[1]: Started Logrotate Service.
> [331290.069700] systemd[1]: systemd-journald.service: Watchdog timeout (limit
> 3min)!
> [331290.070475] systemd[1]: systemd-journald.service: Killing process 528
> (systemd-journal) with signal SIGABRT.
> [331290.096246] systemd[1]: logrotate.service: Deactivated successfully.
> [331290.285285] systemd-coredump[1267441]: Process 528 (systemd-journal) of
> user 0 dumped core.
> [331290.286002] systemd-coredump[1267441]: Coredump diverted to
> /var/lib/systemd/coredump/core.systemd-journal.0.39df468c96764e8c8fffc53a0d0a47d1.528.1686881003000000.zst
> [331290.287045] systemd-coredump[1267441]: Module libaudit.so.1 without
> build-id.
> [331290.287574] systemd-coredump[1267441]: Module libnftnl.so.11 without
> build-id.
> [331290.288111] systemd-coredump[1267441]: Module libmnl.so.0 without
> build-id.
> [331290.288587] systemd-coredump[1267441]: Module libgpg-error.so.0 without
> build-id.
> [331290.289076] systemd-coredump[1267441]: Module libattr.so.1 without
> build-id.
> [331290.289541] systemd-coredump[1267441]: Module libzstd.so.1 without
> build-id.
> [331290.289991] systemd-coredump[1267441]: Module liblzma.so.5 without
> build-id.
> [331290.290446] systemd-coredump[1267441]: Module libseccomp.so.2 without
> build-id.
> [331290.290940] systemd-coredump[1267441]: Module libpam.so.0 without
> build-id.
> [331290.291419] systemd-coredump[1267441]: Module liblz4.so.1 without
> build-id.
> [331290.291893] systemd-coredump[1267441]: Module libkmod.so.2 without
> build-id.
> [331290.292370] systemd-coredump[1267441]: Module libip4tc.so.2 without
> build-id.
> [331290.295165] systemd-coredump[1267441]: Module libgcrypt.so.20 without
> build-id.
> [331290.295678] systemd-coredump[1267441]: Module libcrypt.so.2 without
> build-id.
> [331290.296160] systemd-coredump[1267441]: Module libcap.so.2 without
> build-id.
> [331290.296708] systemd-coredump[1267441]: Module libacl.so.1 without
> build-id.
> [331290.297206] systemd-coredump[1267441]: Module libsystemd-shared-253.so
> without build-id.
> [331290.297768] systemd-coredump[1267441]: Module systemd-journald without
> build-id.
> [331290.298273] systemd-coredump[1267441]: Stack trace of thread 528:
> [331290.298714] systemd-coredump[1267441]: #0  0x00007f3e96431de8
> check_object_header (libsystemd-shared-253.so + 0x231de8)
> [331290.299435] systemd-coredump[1267441]: #1  0x00007f3e964330d2
> journal_file_move_to_object (libsystemd-shared-253.so + 0x2330d2)
> [331290.300199] systemd-coredump[1267441]: #2  0x00007f3e96434a71
> journal_file_find_data_object_with_hash (libsystemd-shared-253.so + 0x234a71)
> [331290.302207] systemd-coredump[1267441]: #3  0x00007f3e96434d3b
> journal_file_append_data (libsystemd-shared-253.so + 0x234d3b)
> [331290.302983] systemd-coredump[1267441]: #4  0x00007f3e96437243
> journal_file_append_entry (libsystemd-shared-253.so + 0x237243)
> [331290.303767] systemd-coredump[1267441]: #5  0x000056536a4a7cd5
> server_dispatch_message_real (systemd-journald + 0x10cd5)
> [331290.304504] systemd-coredump[1267441]: #6  0x000056536a4a134b
> dev_kmsg_record (systemd-journald + 0xa34b)
> [331290.305195] systemd-coredump[1267441]: #7  0x000056536a4a182b
> server_read_dev_kmsg (systemd-journald + 0xa82b)
> [331290.305922] systemd-coredump[1267441]: #8  0x00007f3e9645c140
> source_dispatch (libsystemd-shared-253.so + 0x25c140)
> [331290.306640] systemd-coredump[1267441]: #9  0x00007f3e9645c42d
> sd_event_dispatch (libsystemd-shared-253.so + 0x25c42d)
> [331290.307350] systemd-coredump[1267441]: #10 0x00007f3e9645cb48 sd_event_run
> (libsystemd-shared-253.so + 0x25cb48)
> [331290.308043] systemd-coredump[1267441]: #11 0x000056536a4a0568 main
> (systemd-journald + 0x9568)
> [331290.308652] systemd-coredump[1267441]: #12 0x00007f3e9603dace
> __libc_start_call_main (libc.so.6 + 0x23ace)
> [331290.309292] systemd-coredump[1267441]: #13 0x00007f3e9603db89
> __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x23b89)
> [331290.310048] systemd-coredump[1267441]: #14 0x000056536a4a0835 _start
> (systemd-journald + 0x9835)
> [331290.310692] systemd-coredump[1267441]: ELF object binary architecture: AMD x86-64
> 
> 
> Things went back to normal until around 20:03 (around 14 hours later):
> 
> [389370.873026] rcu: INFO: rcu_preempt self-detected stall on CPU
> [389370.873481] rcu:    0-....: (21000 ticks this GP)
> idle=91fc/1/0x4000000000000000 softirq=85252827/85252827 fqs=4704
> [389370.874126]         (t=21002 jiffies g=167843445 q=13889 ncpus=3)
> [389370.874500] CPU: 0 PID: 2202919 Comm: .postgres-wrapp Not tainted 6.1.31
> #1-NixOS
> [389370.874979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [389370.875710] RIP: 0010:xas_descend+0x26/0x70
> [389370.875991] Code: cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0
> 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 <75> 08 48
> 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc 48 c1 e8 02
> [389370.877164] RSP: 0018:ffffb427c4917bf0 EFLAGS: 00000246
> [389370.877512] RAX: ffff98871f8dbdaa RBX: ffffb427c4917d70 RCX:
> 0000000000000002
> [389370.877967] RDX: 0000000000000005 RSI: ffff988876d3c000 RDI:
> ffffb427c4917c00
> [389370.878426] RBP: 000000000000f177 R08: ffffb427c4917e68 R09:
> ffff988846485d38
> [389370.878880] R10: ffffb427c4917e60 R11: ffff988846485d38 R12:
> 000000000000f177
> [389370.879341] R13: ffff988827b4ae00 R14: 000000000000f176 R15:
> ffffb427c4917e90
> [389370.879798] FS:  00007ff8de817800(0000) GS:ffff98887ac00000(0000)
> knlGS:0000000000000000
> [389370.880308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [389370.880683] CR2: 00007ff881c8c000 CR3: 000000010dfea000 CR4:
> 00000000000006f0
> [389370.881139] Call Trace:
> [389370.881309]  <IRQ>
> [389370.881463]  ? rcu_dump_cpu_stacks+0xc8/0x100
> [389370.881755]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
> [389370.882077]  ? sched_slice+0x87/0x140
> [389370.882325]  ? perf_event_task_tick+0x64/0x370
> [389370.882629]  ? __cgroup_account_cputime_field+0x5b/0xa0
> [389370.882968]  ? update_process_times+0x77/0xb0
> [389370.883258]  ? tick_sched_handle+0x34/0x50
> [389370.883538]  ? tick_sched_timer+0x6f/0x80
> [389370.883802]  ? tick_sched_do_timer+0xa0/0xa0
> [389370.884084]  ? __hrtimer_run_queues+0x112/0x2b0
> [389370.884388]  ? hrtimer_interrupt+0xfe/0x220
> [389370.884666]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
> [389370.885006]  ? sysvec_apic_timer_interrupt+0x99/0xc0
> [389370.885338]  </IRQ>
> [389370.885488]  <TASK>
> [389370.885636]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [389370.885983]  ? xas_descend+0x26/0x70
> [389370.886224]  xas_load+0x30/0x40
> [389370.886456]  filemap_get_read_batch+0x16e/0x250
> [389370.886758]  filemap_get_pages+0xa9/0x630
> [389370.887026]  ? atime_needs_update+0x104/0x180
> [389370.887317]  ? touch_atime+0x46/0x1f0
> [389370.887573]  filemap_read+0xd2/0x340
> [389370.887817]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
> [389370.888210]  xfs_file_read_iter+0x6a/0xd0 [xfs]
> [389370.888571]  vfs_read+0x23c/0x310
> [389370.888797]  __x64_sys_pread64+0x94/0xc0
> [389370.889058]  do_syscall_64+0x3a/0x90
> [389370.889299]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [389370.889634] RIP: 0033:0x7ff8de90e747
> [389370.889900] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
> [389370.891063] RSP: 002b:00007fff5206a848 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000011
> [389370.891549] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
> 00007ff8de90e747
> [389370.892003] RDX: 0000000000002000 RSI: 00007ff881c8ce80 RDI:
> 00000000000001c7
> [389370.892466] RBP: 0000000000000001 R08: 000000000a00000d R09:
> 0000000000000000
> [389370.892923] R10: 000000000f176000 R11: 0000000000000202 R12:
> 00007ff8cdd91278
> [389370.893386] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
> 0000562bb5e55828
> [389370.893843]  </TASK>
> 
> 
> At this point we had to restart the VM externally because it no longer
> responded properly.
> 
> -- 
> You may reply to this email to add a comment.
> 
> You are receiving this mail because:
> You are watching the assignee of the bug.


Kind regards,
Christian Theune


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-20 15:10 ` Christian Theune
@ 2023-06-20 15:11   ` Christian Theune
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Theune @ 2023-06-20 15:11 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: linux-xfs

Argh, sorry. My brain is a bit fried today and I thought, for some reason, that I had filed the bug reports last week rather than yesterday. I didn't want to appear pushy, so sorry for the early follow-up.

*ducks*

> On 20. Jun 2023, at 17:10, Christian Theune <ct@flyingcircus.io> wrote:
> 
> Hi,
> 
> is there anything I can do to raise interest on this one? :)
> 
>> On 19. Jun 2023, at 10:29, bugzilla-daemon@kernel.org wrote:
>> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=217572
>> 
>>           Bug ID: 217572
>>          Summary: Initial blocked tasks causing deterioration over hours
>>                   until (nearly) complete system lockup and data loss
>>                   with PostgreSQL 13
>>          Product: File System
>>          Version: 2.5
>>         Hardware: All
>>               OS: Linux
>>           Status: NEW
>>         Severity: high
>>         Priority: P3
>>        Component: XFS
>>         Assignee: filesystem_xfs@kernel-bugs.kernel.org
>>         Reporter: ct@flyingcircus.io
>>       Regression: No
>> 
>> Last Friday we experienced the following hung task messages with PostgreSQL
>> while performing our nightly backup using pg_dump. Normally this takes at most
>> a few minutes with IO being stressed. This time it caused high SYS CPU time,
>> went on for almost 20 minutes and caused the PostgreSQL dump to fail with
>> inconsistent data.
>> 
>> Around 3:50 AM we got this:
>> 
>> [330289.821046] INFO: task .postgres-wrapp:11884 blocked for more than 122
>> seconds.
>> [330289.821830]       Not tainted 6.1.31 #1-NixOS
>> [330289.822285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [330289.823098] task:.postgres-wrapp state:D stack:0     pid:11884 ppid:11881 
>> flags:0x00000002
>> [330289.823911] Call Trace:
>> [330289.824221]  <TASK>
>> [330289.824451]  __schedule+0x35d/0x1370
>> [330289.824858]  ? mntput_no_expire+0x4a/0x250
>> [330289.825307]  schedule+0x5d/0xe0
>> [330289.825630]  rwsem_down_write_slowpath+0x34e/0x730
>> [330289.826128]  xfs_ilock+0xeb/0xf0 [xfs]
>> [330289.826599]  xfs_file_buffered_write+0x119/0x300 [xfs]
>> [330289.827212]  ? selinux_file_permission+0x10b/0x150
>> [330289.827683]  vfs_write+0x244/0x400
>> [330289.828049]  __x64_sys_pwrite64+0x94/0xc0
>> [330289.828459]  do_syscall_64+0x3a/0x90
>> [330289.828801]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [330289.829302] RIP: 0033:0x7ff8de90e7f7
>> [330289.829681] RSP: 002b:00007fff52069b08 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000012
>> [330289.830408] RAX: ffffffffffffffda RBX: 0000562bb434f510 RCX:
>> 00007ff8de90e7f7
>> [330289.831073] RDX: 0000000000002000 RSI: 00007ff888b52e80 RDI:
>> 000000000000003b
>> [330289.831762] RBP: 00000000000021b0 R08: 000000000a000010 R09:
>> 0000000000000040
>> [330289.832440] R10: 0000000004ed8000 R11: 0000000000000202 R12:
>> 0000000000002000
>> [330289.833130] R13: 0000000004ed8000 R14: 00007ff8de8176c8 R15:
>> 0000562bb434af75
>> [330289.833803]  </TASK>
>> [330289.834064] INFO: task .postgres-wrapp:1245532 blocked for more than 122
>> seconds.
>> [330289.834771]       Not tainted 6.1.31 #1-NixOS
>> [330289.835209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [330289.835926] task:.postgres-wrapp state:D stack:0     pid:1245532 ppid:11881
>> flags:0x00000002
>> [330289.836752] Call Trace:
>> [330289.837010]  <TASK>
>> [330289.837258]  __schedule+0x35d/0x1370
>> [330289.837622]  ? page_add_file_rmap+0xba/0x2f0
>> [330289.838047]  ? do_set_pte+0x174/0x1c0
>> [330289.838420]  ? unix_stream_read_generic+0x223/0xa60
>> [330289.838887]  schedule+0x5d/0xe0
>> [330289.839265]  schedule_preempt_disabled+0x14/0x30
>> [330289.839758]  rwsem_down_read_slowpath+0x29e/0x4f0
>> [330289.840521]  down_read+0x47/0xb0
>> [330289.840853]  xfs_ilock+0x79/0xf0 [xfs]
>> [330289.841346]  xfs_file_buffered_read+0x44/0xd0 [xfs]
>> [330289.841945]  xfs_file_read_iter+0x6a/0xd0 [xfs]
>> [330289.842496]  vfs_read+0x23c/0x310
>> [330289.842845]  __x64_sys_pread64+0x94/0xc0
>> [330289.843303]  do_syscall_64+0x3a/0x90
>> [330289.843857]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [330289.844420] RIP: 0033:0x7ff8de90e747
>> [330289.844898] RSP: 002b:00007fff5206aae8 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000011
>> [330289.845705] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007ff8de90e747
>> [330289.846488] RDX: 0000000000002000 RSI: 00007ff8896e0e80 RDI:
>> 0000000000000010
>> [330289.847133] RBP: 0000000000000001 R08: 000000000a00000d R09:
>> 0000000000000000
>> [330289.847627] R10: 0000000018fb8000 R11: 0000000000000202 R12:
>> 00007ff8cdd91278
>> [330289.848113] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
>> 0000562bb5e89e70
>> [330289.848631]  </TASK>
>> 
>> 
>> This keeps going on for a few minutes until:
>> 
>> [330584.618978] rcu: INFO: rcu_preempt self-detected stall on CPU
>> [330584.619413] rcu:    1-....: (20999 ticks this GP)
>> idle=1204/1/0x4000000000000000 softirq=53073547/53073547 fqs=5231
>> [330584.620052]         (t=21000 jiffies g=142859597 q=21168 ncpus=3)
>> [330584.620409] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
>> #1-NixOS
>> [330584.620880] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
>> [330584.621588] RIP: 0010:xas_load+0xb/0x40
>> [330584.621844] Code: 04 48 8b 44 c2 08 c3 cc cc cc cc 48 8b 07 48 8b 40 08 c3
>> cc cc cc cc 0f 1f 84 00 00 00 00 00 e8 3b ff ff ff 48 89 c2 83 e2 03 <48> 83 fa
>> 02 75 08 48 3d 00 10 00 00 77 05 c3 cc cc cc cc 0f b6 4f
>> [330584.622996] RSP: 0018:ffffb427c3387bf8 EFLAGS: 00000202
>> [330584.623337] RAX: ffff988876d3c002 RBX: ffffb427c3387d70 RCX:
>> 0000000000000002
>> [330584.623787] RDX: 0000000000000002 RSI: ffff98871f9d2920 RDI:
>> ffffb427c3387c00
>> [330584.624239] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
>> ffffda1444482000
>> [330584.624686] R10: 0000000000000001 R11: 0000000000000001 R12:
>> 000000000000f161
>> [330584.625136] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
>> ffffb427c3387e90
>> [330584.625591] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
>> knlGS:0000000000000000
>> [330584.626097] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [330584.626467] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
>> 00000000000006e0
>> [330584.626954] Call Trace:
>> [330584.627125]  <IRQ>
>> [330584.627281]  ? rcu_dump_cpu_stacks+0xc8/0x100
>> [330584.627567]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
>> [330584.627882]  ? sched_slice+0x87/0x140
>> [330584.628126]  ? timekeeping_update+0xdd/0x130
>> [330584.628414]  ? __cgroup_account_cputime_field+0x5b/0xa0
>> [330584.628751]  ? update_process_times+0x77/0xb0
>> [330584.629036]  ? update_wall_time+0xc/0x20
>> [330584.629300]  ? tick_sched_handle+0x34/0x50
>> [330584.629564]  ? tick_sched_timer+0x6f/0x80
>> [330584.629823]  ? tick_sched_do_timer+0xa0/0xa0
>> [330584.630103]  ? __hrtimer_run_queues+0x112/0x2b0
>> [330584.630404]  ? hrtimer_interrupt+0xfe/0x220
>> [330584.630678]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
>> [330584.631016]  ? sysvec_apic_timer_interrupt+0x99/0xc0
>> [330584.631339]  </IRQ>
>> [330584.631485]  <TASK>
>> [330584.631631]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>> [330584.631974]  ? xas_load+0xb/0x40
>> [330584.632195]  ? xas_load+0x30/0x40
>> [330584.632415]  filemap_get_read_batch+0x16e/0x250
>> [330584.632710]  filemap_get_pages+0xa9/0x630
>> [330584.632972]  ? memcg_check_events+0xda/0x210
>> [330584.633259]  ? free_unref_page_commit+0x7c/0x170
>> [330584.633560]  ? _raw_spin_unlock_irqrestore+0x23/0x40
>> [330584.633880]  ? free_unref_page+0x1ac/0x220
>> [330584.634146]  filemap_read+0xd2/0x340
>> [330584.634389]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
>> [330584.634778]  xfs_file_read_iter+0x6a/0xd0 [xfs]
>> [330584.635123]  vfs_read+0x23c/0x310
>> [330584.635354]  __x64_sys_pread64+0x94/0xc0
>> [330584.635609]  do_syscall_64+0x3a/0x90
>> [330584.635846]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [330584.636174] RIP: 0033:0x7ff8de90e747
>> [330584.636437] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
>> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
>> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
>> [330584.637582] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000011
>> [330584.638056] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007ff8de90e747
>> [330584.638506] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
>> 0000000000000230
>> [330584.638954] RBP: 0000000000000001 R08: 000000000a00000d R09:
>> 0000000000000000
>> [330584.639411] R10: 000000000f160000 R11: 0000000000000202 R12:
>> 00007ff8cdd92688
>> [330584.639862] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
>> 0000562bb5fe04d0
>> [330584.640313]  </TASK>
>> [330584.721995] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks:
>> { 1-.... } 21015 jiffies s: 7297 root: 0x2/.
>> [330584.722930] rcu: blocking rcu_node structures (internal RCU debug):
>> [330584.723428] Sending NMI from CPU 0 to CPUs 1:
>> [330584.723763] NMI backtrace for cpu 1
>> [330584.723769] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
>> #1-NixOS
>> [330584.723772] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
>> [330584.723775] RIP: 0010:xas_load+0x29/0x40
>> [330584.723783] Code: 00 e8 3b ff ff ff 48 89 c2 83 e2 03 48 83 fa 02 75 08 48
>> 3d 00 10 00 00 77 05 c3 cc cc cc cc 0f b6 4f 10 48 8d 70 fe 38 48 fe <72> ee e8
>> 20 fe ff ff 80 3e 00 75 d0 c3 cc cc cc cc 66 0f 1f 44 00
>> [330584.723785] RSP: 0018:ffffb427c3387bf8 EFLAGS: 00000246
>> [330584.723787] RAX: ffff98871f8dbdaa RBX: ffffb427c3387d70 RCX:
>> 0000000000000000
>> [330584.723788] RDX: 0000000000000002 RSI: ffff98871f8dbda8 RDI:
>> ffffb427c3387c00
>> [330584.723789] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
>> ffffda1444482000
>> [330584.723790] R10: 0000000000000001 R11: 0000000000000001 R12:
>> 000000000000f161
>> [330584.723791] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
>> ffffb427c3387e90
>> [330584.723793] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
>> knlGS:0000000000000000
>> [330584.723794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [330584.723795] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
>> 00000000000006e0
>> [330584.723798] Call Trace:
>> [330584.723801]  <NMI>
>> [330584.723805]  ? nmi_cpu_backtrace.cold+0x1b/0x76
>> [330584.723810]  ? nmi_cpu_backtrace_handler+0xd/0x20
>> [330584.723814]  ? nmi_handle+0x5d/0x120
>> [330584.723817]  ? xas_load+0x29/0x40
>> [330584.723818]  ? default_do_nmi+0x69/0x170
>> [330584.723821]  ? exc_nmi+0x13c/0x170
>> [330584.723823]  ? end_repeat_nmi+0x16/0x67
>> [330584.723828]  ? xas_load+0x29/0x40
>> [330584.723830]  ? xas_load+0x29/0x40
>> [330584.723832]  ? xas_load+0x29/0x40
>> [330584.723834]  </NMI>
>> [330584.723834]  <TASK>
>> [330584.723835]  filemap_get_read_batch+0x16e/0x250
>> [330584.723840]  filemap_get_pages+0xa9/0x630
>> [330584.723842]  ? memcg_check_events+0xda/0x210
>> [330584.723845]  ? free_unref_page_commit+0x7c/0x170
>> [330584.723849]  ? _raw_spin_unlock_irqrestore+0x23/0x40
>> [330584.723851]  ? free_unref_page+0x1ac/0x220
>> [330584.723852]  filemap_read+0xd2/0x340
>> [330584.723857]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
>> [330584.723935]  xfs_file_read_iter+0x6a/0xd0 [xfs]
>> [330584.723990]  vfs_read+0x23c/0x310
>> [330584.723995]  __x64_sys_pread64+0x94/0xc0
>> [330584.723997]  do_syscall_64+0x3a/0x90
>> [330584.724000]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [330584.724003] RIP: 0033:0x7ff8de90e747
>> [330584.724019] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
>> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
>> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
>> [330584.724020] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000011
>> [330584.724021] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007ff8de90e747
>> [330584.724022] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
>> 0000000000000230
>> [330584.724023] RBP: 0000000000000001 R08: 000000000a00000d R09:
>> 0000000000000000
>> [330584.724023] R10: 000000000f160000 R11: 0000000000000202 R12:
>> 00007ff8cdd92688
>> [330584.724024] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
>> 0000562bb5fe04d0
>> [330584.724026]  </TASK>
>> 
>> 
>> This keeps repeating until around 4:03, which ends with systemd-journal getting
>> coredumped:
>> 
>> [331277.846966] rcu: INFO: rcu_preempt self-detected stall on CPU
>> [331277.847413] rcu:    1-....: (713858 ticks this GP)
>> idle=1204/1/0x4000000000000000 softirq=53073547/53073547 fqs=311996
>> [331277.848088]         (t=714253 jiffies g=142859597 q=1821602 ncpus=3)
>> [331277.848462] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
>> #1-NixOS
>> [331277.848941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
>> [331277.849667] RIP: 0010:xas_descend+0x22/0x70
>> [331277.849952] Code: cc cc cc cc cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83
>> e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 <48> 83 f9
>> 02 75 08 48 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc
>> [331277.851120] RSP: 0018:ffffb427c3387bf0 EFLAGS: 00000202
>> [331277.851468] RAX: ffff988876d3c002 RBX: ffffb427c3387d70 RCX:
>> 0000000000000002
>> [331277.851928] RDX: 000000000000000f RSI: ffff98871f9d2920 RDI:
>> ffffb427c3387c00
>> [331277.852386] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
>> ffffda1444482000
>> [331277.852847] R10: 0000000000000001 R11: 0000000000000001 R12:
>> 000000000000f161
>> [331277.853303] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
>> ffffb427c3387e90
>> [331277.853766] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
>> knlGS:0000000000000000
>> [331277.854278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [331277.854653] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
>> 00000000000006e0
>> [331277.855111] Call Trace:
>> [331277.855284]  <IRQ>
>> [331277.855433]  ? rcu_dump_cpu_stacks+0xc8/0x100
>> [331277.855730]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
>> [331277.856049]  ? sched_slice+0x87/0x140
>> [331277.856294]  ? timekeeping_update+0xdd/0x130
>> [331277.856587]  ? __cgroup_account_cputime_field+0x5b/0xa0
>> [331277.856930]  ? update_process_times+0x77/0xb0
>> [331277.857222]  ? update_wall_time+0xc/0x20
>> [331277.857482]  ? tick_sched_handle+0x34/0x50
>> [331277.857758]  ? tick_sched_timer+0x6f/0x80
>> [331277.858024]  ? tick_sched_do_timer+0xa0/0xa0
>> [331277.858306]  ? __hrtimer_run_queues+0x112/0x2b0
>> [331277.858613]  ? hrtimer_interrupt+0xfe/0x220
>> [331277.858891]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
>> [331277.859235]  ? sysvec_apic_timer_interrupt+0x99/0xc0
>> [331277.859558]  </IRQ>
>> [331277.859710]  <TASK>
>> [331277.859857]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>> [331277.860206]  ? xas_descend+0x22/0x70
>> [331277.860446]  xas_load+0x30/0x40
>> [331277.860665]  filemap_get_read_batch+0x16e/0x250
>> [331277.860967]  filemap_get_pages+0xa9/0x630
>> [331277.861233]  ? memcg_check_events+0xda/0x210
>> [331277.861517]  ? free_unref_page_commit+0x7c/0x170
>> [331277.861834]  ? _raw_spin_unlock_irqrestore+0x23/0x40
>> [331277.862158]  ? free_unref_page+0x1ac/0x220
>> [331277.862427]  filemap_read+0xd2/0x340
>> [331277.862677]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
>> [331277.863072]  xfs_file_read_iter+0x6a/0xd0 [xfs]
>> [331277.863424]  vfs_read+0x23c/0x310
>> [331277.863657]  __x64_sys_pread64+0x94/0xc0
>> [331277.863917]  do_syscall_64+0x3a/0x90
>> [331277.864159]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [331277.864488] RIP: 0033:0x7ff8de90e747
>> [331277.864752] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
>> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
>> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
>> [331277.865917] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000011
>> [331277.866395] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007ff8de90e747
>> [331277.866848] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
>> 0000000000000230
>> [331277.867298] RBP: 0000000000000001 R08: 000000000a00000d R09:
>> 0000000000000000
>> [331277.867750] R10: 000000000f160000 R11: 0000000000000202 R12:
>> 00007ff8cdd92688
>> [331277.868199] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
>> 0000562bb5fe04d0
>> [331277.868658]  </TASK>
>> Fri Jun 16 04:03:23 AM CEST 2023 -- SERIAL CONSOLE IS LIVE --
>> [331289.987247] systemd[1]: Starting Serial console liveness marker...
>> [331289.992645] systemd[1]: nscd.service: Deactivated successfully.
>> [331290.011838] systemd[1]: nscd.service: Consumed 3min 16.251s CPU time,
>> received 12.0M IP traffic, sent 8.2M IP traffic.
>> [331290.020289] systemd[1]: serial-console-liveness.service: Deactivated
>> successfully.
>> [331290.035818] systemd[1]: Finished Serial console liveness marker.
>> [331290.068776] systemd[1]: Started Logrotate Service.
>> [331290.069700] systemd[1]: systemd-journald.service: Watchdog timeout (limit
>> 3min)!
>> [331290.070475] systemd[1]: systemd-journald.service: Killing process 528
>> (systemd-journal) with signal SIGABRT.
>> [331290.096246] systemd[1]: logrotate.service: Deactivated successfully.
>> [331290.285285] systemd-coredump[1267441]: Process 528 (systemd-journal) of
>> user 0 dumped core.
>> [331290.286002] systemd-coredump[1267441]: Coredump diverted to
>> /var/lib/systemd/coredump/core.systemd-journal.0.39df468c96764e8c8fffc53a0d0a47d1.528.1686881003000000.zst
>> [331290.287045] systemd-coredump[1267441]: Module libaudit.so.1 without
>> build-id.
>> [331290.287574] systemd-coredump[1267441]: Module libnftnl.so.11 without
>> build-id.
>> [331290.288111] systemd-coredump[1267441]: Module libmnl.so.0 without build-id.
>> [331290.288587] systemd-coredump[1267441]: Module libgpg-error.so.0 without
>> build-id.
>> [331290.289076] systemd-coredump[1267441]: Module libattr.so.1 without
>> build-id.
>> [331290.289541] systemd-coredump[1267441]: Module libzstd.so.1 without
>> build-id.
>> [331290.289991] systemd-coredump[1267441]: Module liblzma.so.5 without
>> build-id.
>> [331290.290446] systemd-coredump[1267441]: Module libseccomp.so.2 without
>> build-id.
>> [331290.290940] systemd-coredump[1267441]: Module libpam.so.0 without build-id.
>> [331290.291419] systemd-coredump[1267441]: Module liblz4.so.1 without build-id.
>> [331290.291893] systemd-coredump[1267441]: Module libkmod.so.2 without
>> build-id.
>> [331290.292370] systemd-coredump[1267441]: Module libip4tc.so.2 without
>> build-id.
>> [331290.295165] systemd-coredump[1267441]: Module libgcrypt.so.20 without
>> build-id.
>> [331290.295678] systemd-coredump[1267441]: Module libcrypt.so.2 without
>> build-id.
>> [331290.296160] systemd-coredump[1267441]: Module libcap.so.2 without build-id.
>> [331290.296708] systemd-coredump[1267441]: Module libacl.so.1 without build-id.
>> [331290.297206] systemd-coredump[1267441]: Module libsystemd-shared-253.so
>> without build-id.
>> [331290.297768] systemd-coredump[1267441]: Module systemd-journald without
>> build-id.
>> [331290.298273] systemd-coredump[1267441]: Stack trace of thread 528:
>> [331290.298714] systemd-coredump[1267441]: #0  0x00007f3e96431de8
>> check_object_header (libsystemd-shared-253.so + 0x231de8)
>> [331290.299435] systemd-coredump[1267441]: #1  0x00007f3e964330d2
>> journal_file_move_to_object (libsystemd-shared-253.so + 0x2330d2)
>> [331290.300199] systemd-coredump[1267441]: #2  0x00007f3e96434a71
>> journal_file_find_data_object_with_hash (libsystemd-shared-253.so + 0x234a71)
>> [331290.302207] systemd-coredump[1267441]: #3  0x00007f3e96434d3b
>> journal_file_append_data (libsystemd-shared-253.so + 0x234d3b)
>> [331290.302983] systemd-coredump[1267441]: #4  0x00007f3e96437243
>> journal_file_append_entry (libsystemd-shared-253.so + 0x237243)
>> [331290.303767] systemd-coredump[1267441]: #5  0x000056536a4a7cd5
>> server_dispatch_message_real (systemd-journald + 0x10cd5)
>> [331290.304504] systemd-coredump[1267441]: #6  0x000056536a4a134b
>> dev_kmsg_record (systemd-journald + 0xa34b)
>> [331290.305195] systemd-coredump[1267441]: #7  0x000056536a4a182b
>> server_read_dev_kmsg (systemd-journald + 0xa82b)
>> [331290.305922] systemd-coredump[1267441]: #8  0x00007f3e9645c140
>> source_dispatch (libsystemd-shared-253.so + 0x25c140)
>> [331290.306640] systemd-coredump[1267441]: #9  0x00007f3e9645c42d
>> sd_event_dispatch (libsystemd-shared-253.so + 0x25c42d)
>> [331290.307350] systemd-coredump[1267441]: #10 0x00007f3e9645cb48 sd_event_run
>> (libsystemd-shared-253.so + 0x25cb48)
>> [331290.308043] systemd-coredump[1267441]: #11 0x000056536a4a0568 main
>> (systemd-journald + 0x9568)
>> [331290.308652] systemd-coredump[1267441]: #12 0x00007f3e9603dace
>> __libc_start_call_main (libc.so.6 + 0x23ace)
>> [331290.309292] systemd-coredump[1267441]: #13 0x00007f3e9603db89
>> __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x23b89)
>> [331290.310048] systemd-coredump[1267441]: #14 0x000056536a4a0835 _start
>> (systemd-journald + 0x9835)
>> [331290.310692] systemd-coredump[1267441]: ELF object binary architecture: AMD x86-64
>> 
>> 
>> Things then went back to normal until around 20:03 (around 14 hours later):
>> 
>> [389370.873026] rcu: INFO: rcu_preempt self-detected stall on CPU
>> [389370.873481] rcu:    0-....: (21000 ticks this GP)
>> idle=91fc/1/0x4000000000000000 softirq=85252827/85252827 fqs=4704
>> [389370.874126]         (t=21002 jiffies g=167843445 q=13889 ncpus=3)
>> [389370.874500] CPU: 0 PID: 2202919 Comm: .postgres-wrapp Not tainted 6.1.31
>> #1-NixOS
>> [389370.874979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
>> [389370.875710] RIP: 0010:xas_descend+0x26/0x70
>> [389370.875991] Code: cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0
>> 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 <75> 08 48
>> 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc 48 c1 e8 02
>> [389370.877164] RSP: 0018:ffffb427c4917bf0 EFLAGS: 00000246
>> [389370.877512] RAX: ffff98871f8dbdaa RBX: ffffb427c4917d70 RCX:
>> 0000000000000002
>> [389370.877967] RDX: 0000000000000005 RSI: ffff988876d3c000 RDI:
>> ffffb427c4917c00
>> [389370.878426] RBP: 000000000000f177 R08: ffffb427c4917e68 R09:
>> ffff988846485d38
>> [389370.878880] R10: ffffb427c4917e60 R11: ffff988846485d38 R12:
>> 000000000000f177
>> [389370.879341] R13: ffff988827b4ae00 R14: 000000000000f176 R15:
>> ffffb427c4917e90
>> [389370.879798] FS:  00007ff8de817800(0000) GS:ffff98887ac00000(0000)
>> knlGS:0000000000000000
>> [389370.880308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [389370.880683] CR2: 00007ff881c8c000 CR3: 000000010dfea000 CR4:
>> 00000000000006f0
>> [389370.881139] Call Trace:
>> [389370.881309]  <IRQ>
>> [389370.881463]  ? rcu_dump_cpu_stacks+0xc8/0x100
>> [389370.881755]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
>> [389370.882077]  ? sched_slice+0x87/0x140
>> [389370.882325]  ? perf_event_task_tick+0x64/0x370
>> [389370.882629]  ? __cgroup_account_cputime_field+0x5b/0xa0
>> [389370.882968]  ? update_process_times+0x77/0xb0
>> [389370.883258]  ? tick_sched_handle+0x34/0x50
>> [389370.883538]  ? tick_sched_timer+0x6f/0x80
>> [389370.883802]  ? tick_sched_do_timer+0xa0/0xa0
>> [389370.884084]  ? __hrtimer_run_queues+0x112/0x2b0
>> [389370.884388]  ? hrtimer_interrupt+0xfe/0x220
>> [389370.884666]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
>> [389370.885006]  ? sysvec_apic_timer_interrupt+0x99/0xc0
>> [389370.885338]  </IRQ>
>> [389370.885488]  <TASK>
>> [389370.885636]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>> [389370.885983]  ? xas_descend+0x26/0x70
>> [389370.886224]  xas_load+0x30/0x40
>> [389370.886456]  filemap_get_read_batch+0x16e/0x250
>> [389370.886758]  filemap_get_pages+0xa9/0x630
>> [389370.887026]  ? atime_needs_update+0x104/0x180
>> [389370.887317]  ? touch_atime+0x46/0x1f0
>> [389370.887573]  filemap_read+0xd2/0x340
>> [389370.887817]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
>> [389370.888210]  xfs_file_read_iter+0x6a/0xd0 [xfs]
>> [389370.888571]  vfs_read+0x23c/0x310
>> [389370.888797]  __x64_sys_pread64+0x94/0xc0
>> [389370.889058]  do_syscall_64+0x3a/0x90
>> [389370.889299]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [389370.889634] RIP: 0033:0x7ff8de90e747
>> [389370.889900] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90
>> f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00
>> f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
>> [389370.891063] RSP: 002b:00007fff5206a848 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000011
>> [389370.891549] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007ff8de90e747
>> [389370.892003] RDX: 0000000000002000 RSI: 00007ff881c8ce80 RDI:
>> 00000000000001c7
>> [389370.892466] RBP: 0000000000000001 R08: 000000000a00000d R09:
>> 0000000000000000
>> [389370.892923] R10: 000000000f176000 R11: 0000000000000202 R12:
>> 00007ff8cdd91278
>> [389370.893386] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
>> 0000562bb5e55828
>> [389370.893843]  </TASK>
>> 
>> 
>> At this point we had to restart the VM externally because it no longer
>> responded properly.
>> 
>> -- 
>> You may reply to this email to add a comment.
>> 
>> You are receiving this mail because:
>> You are watching the assignee of the bug.
> 
> 
> Kind regards,
> Christian Theune
> 
> -- 
> Christian Theune · ct@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick
> 

Kind regards,
Christian Theune

-- 
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Managing directors: Christian Theune, Christian Zagrodnick


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
  2023-06-20 15:10 ` Christian Theune
  2023-06-20 15:10 ` [Bug 217572] " bugzilla-daemon
@ 2023-06-20 15:13 ` bugzilla-daemon
  2023-06-20 15:21 ` bugzilla-daemon
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-06-20 15:13 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #2 from Christian Theune (ct@flyingcircus.io) ---
Argh, sorry. My brain is a bit fried today and I somehow thought I had filed the
bug report last week rather than yesterday. I didn't want to appear pushy, so
apologies for the early follow-up.

*ducks*

> On 20. Jun 2023, at 17:10, Christian Theune <ct@flyingcircus.io> wrote:
> 
> Hi,
> 
> is there anything I can do to raise interest on this one? :)
> 
>> On 19. Jun 2023, at 10:29, bugzilla-daemon@kernel.org wrote:
>> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=217572
>> 
>>           Bug ID: 217572
>>          Summary: Initial blocked tasks causing deterioration over hours
>>                   until (nearly) complete system lockup and data loss
>>                   with PostgreSQL 13
>>          Product: File System
>>          Version: 2.5
>>         Hardware: All
>>               OS: Linux
>>           Status: NEW
>>         Severity: high
>>         Priority: P3
>>        Component: XFS
>>         Assignee: filesystem_xfs@kernel-bugs.kernel.org
>>         Reporter: ct@flyingcircus.io
>>       Regression: No
>> 
>> Last Friday we experienced the following hung task messages with PostgreSQL
>> while performing our nightly backup using pg_dump. Normally this takes at most
>> a few minutes with IO being stressed.
>> went on for almost 20 minutes and caused the PostgreSQL dump to fail with
>> inconsistent data.
>> 
>> Around 3:50 AM we got this:
>> 
>> [330289.821046] INFO: task .postgres-wrapp:11884 blocked for more than 122
>> seconds.
>> [330289.821830]       Not tainted 6.1.31 #1-NixOS
>> [330289.822285] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [330289.823098] task:.postgres-wrapp state:D stack:0     pid:11884 ppid:11881 flags:0x00000002
>> [330289.823911] Call Trace:
>> [330289.824221]  <TASK>
>> [330289.824451]  __schedule+0x35d/0x1370
>> [330289.824858]  ? mntput_no_expire+0x4a/0x250
>> [330289.825307]  schedule+0x5d/0xe0
>> [330289.825630]  rwsem_down_write_slowpath+0x34e/0x730
>> [330289.826128]  xfs_ilock+0xeb/0xf0 [xfs]
>> [330289.826599]  xfs_file_buffered_write+0x119/0x300 [xfs]
>> [330289.827212]  ? selinux_file_permission+0x10b/0x150
>> [330289.827683]  vfs_write+0x244/0x400
>> [330289.828049]  __x64_sys_pwrite64+0x94/0xc0
>> [330289.828459]  do_syscall_64+0x3a/0x90
>> [330289.828801]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [330289.829302] RIP: 0033:0x7ff8de90e7f7
>> [330289.829681] RSP: 002b:00007fff52069b08 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000012
>> [330289.830408] RAX: ffffffffffffffda RBX: 0000562bb434f510 RCX:
>> 00007ff8de90e7f7
>> [330289.831073] RDX: 0000000000002000 RSI: 00007ff888b52e80 RDI:
>> 000000000000003b
>> [330289.831762] RBP: 00000000000021b0 R08: 000000000a000010 R09:
>> 0000000000000040
>> [330289.832440] R10: 0000000004ed8000 R11: 0000000000000202 R12:
>> 0000000000002000
>> [330289.833130] R13: 0000000004ed8000 R14: 00007ff8de8176c8 R15:
>> 0000562bb434af75
>> [330289.833803]  </TASK>
>> [330289.834064] INFO: task .postgres-wrapp:1245532 blocked for more than 122
>> seconds.
>> [330289.834771]       Not tainted 6.1.31 #1-NixOS
>> [330289.835209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [330289.835926] task:.postgres-wrapp state:D stack:0     pid:1245532 ppid:11881 flags:0x00000002
>> [330289.836752] Call Trace:
>> [330289.837010]  <TASK>
>> [330289.837258]  __schedule+0x35d/0x1370
>> [330289.837622]  ? page_add_file_rmap+0xba/0x2f0
>> [330289.838047]  ? do_set_pte+0x174/0x1c0
>> [330289.838420]  ? unix_stream_read_generic+0x223/0xa60
>> [330289.838887]  schedule+0x5d/0xe0
>> [330289.839265]  schedule_preempt_disabled+0x14/0x30
>> [330289.839758]  rwsem_down_read_slowpath+0x29e/0x4f0
>> [330289.840521]  down_read+0x47/0xb0
>> [330289.840853]  xfs_ilock+0x79/0xf0 [xfs]
>> [330289.841346]  xfs_file_buffered_read+0x44/0xd0 [xfs]
>> [330289.841945]  xfs_file_read_iter+0x6a/0xd0 [xfs]
>> [330289.842496]  vfs_read+0x23c/0x310
>> [330289.842845]  __x64_sys_pread64+0x94/0xc0
>> [330289.843303]  do_syscall_64+0x3a/0x90
>> [330289.843857]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [330289.844420] RIP: 0033:0x7ff8de90e747
>> [330289.844898] RSP: 002b:00007fff5206aae8 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000011
>> [330289.845705] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007ff8de90e747
>> [330289.846488] RDX: 0000000000002000 RSI: 00007ff8896e0e80 RDI:
>> 0000000000000010
>> [330289.847133] RBP: 0000000000000001 R08: 000000000a00000d R09:
>> 0000000000000000
>> [330289.847627] R10: 0000000018fb8000 R11: 0000000000000202 R12:
>> 00007ff8cdd91278
>> [330289.848113] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
>> 0000562bb5e89e70
>> [330289.848631]  </TASK>
>> 
>> 
>> This keeps going on for a few minutes until:
>> 
>> [330584.618978] rcu: INFO: rcu_preempt self-detected stall on CPU
>> [330584.619413] rcu:    1-....: (20999 ticks this GP)
>> idle=1204/1/0x4000000000000000 softirq=53073547/53073547 fqs=5231
>> [330584.620052]         (t=21000 jiffies g=142859597 q=21168 ncpus=3)
>> [330584.620409] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
>> #1-NixOS
>> [330584.620880] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
>> [330584.621588] RIP: 0010:xas_load+0xb/0x40
>> [330584.621844] Code: 04 48 8b 44 c2 08 c3 cc cc cc cc 48 8b 07 48 8b 40 08 c3 cc cc cc cc 0f 1f 84 00 00 00 00 00 e8 3b ff ff ff 48 89 c2 83 e2 03 <48> 83 fa 02 75 08 48 3d 00 10 00 00 77 05 c3 cc cc cc cc 0f b6 4f
>> [330584.622996] RSP: 0018:ffffb427c3387bf8 EFLAGS: 00000202
>> [330584.623337] RAX: ffff988876d3c002 RBX: ffffb427c3387d70 RCX:
>> 0000000000000002
>> [330584.623787] RDX: 0000000000000002 RSI: ffff98871f9d2920 RDI:
>> ffffb427c3387c00
>> [330584.624239] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
>> ffffda1444482000
>> [330584.624686] R10: 0000000000000001 R11: 0000000000000001 R12:
>> 000000000000f161
>> [330584.625136] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
>> ffffb427c3387e90
>> [330584.625591] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
>> knlGS:0000000000000000
>> [330584.626097] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [330584.626467] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
>> 00000000000006e0
>> [330584.626954] Call Trace:
>> [330584.627125]  <IRQ>
>> [330584.627281]  ? rcu_dump_cpu_stacks+0xc8/0x100
>> [330584.627567]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
>> [330584.627882]  ? sched_slice+0x87/0x140
>> [330584.628126]  ? timekeeping_update+0xdd/0x130
>> [330584.628414]  ? __cgroup_account_cputime_field+0x5b/0xa0
>> [330584.628751]  ? update_process_times+0x77/0xb0
>> [330584.629036]  ? update_wall_time+0xc/0x20
>> [330584.629300]  ? tick_sched_handle+0x34/0x50
>> [330584.629564]  ? tick_sched_timer+0x6f/0x80
>> [330584.629823]  ? tick_sched_do_timer+0xa0/0xa0
>> [330584.630103]  ? __hrtimer_run_queues+0x112/0x2b0
>> [330584.630404]  ? hrtimer_interrupt+0xfe/0x220
>> [330584.630678]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
>> [330584.631016]  ? sysvec_apic_timer_interrupt+0x99/0xc0
>> [330584.631339]  </IRQ>
>> [330584.631485]  <TASK>
>> [330584.631631]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>> [330584.631974]  ? xas_load+0xb/0x40
>> [330584.632195]  ? xas_load+0x30/0x40
>> [330584.632415]  filemap_get_read_batch+0x16e/0x250
>> [330584.632710]  filemap_get_pages+0xa9/0x630
>> [330584.632972]  ? memcg_check_events+0xda/0x210
>> [330584.633259]  ? free_unref_page_commit+0x7c/0x170
>> [330584.633560]  ? _raw_spin_unlock_irqrestore+0x23/0x40
>> [330584.633880]  ? free_unref_page+0x1ac/0x220
>> [330584.634146]  filemap_read+0xd2/0x340
>> [330584.634389]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
>> [330584.634778]  xfs_file_read_iter+0x6a/0xd0 [xfs]
>> [330584.635123]  vfs_read+0x23c/0x310
>> [330584.635354]  __x64_sys_pread64+0x94/0xc0
>> [330584.635609]  do_syscall_64+0x3a/0x90
>> [330584.635846]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [330584.636174] RIP: 0033:0x7ff8de90e747
>> [330584.636437] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90 f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
>> [330584.637582] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000011
>> [330584.638056] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007ff8de90e747
>> [330584.638506] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
>> 0000000000000230
>> [330584.638954] RBP: 0000000000000001 R08: 000000000a00000d R09:
>> 0000000000000000
>> [330584.639411] R10: 000000000f160000 R11: 0000000000000202 R12:
>> 00007ff8cdd92688
>> [330584.639862] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
>> 0000562bb5fe04d0
>> [330584.640313]  </TASK>
>> [330584.721995] rcu: INFO: rcu_preempt detected expedited stalls on
>> CPUs/tasks:
>> { 1-.... } 21015 jiffies s: 7297 root: 0x2/.
>> [330584.722930] rcu: blocking rcu_node structures (internal RCU debug):
>> [330584.723428] Sending NMI from CPU 0 to CPUs 1:
>> [330584.723763] NMI backtrace for cpu 1
>> [330584.723769] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
>> #1-NixOS
>> [330584.723772] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
>> [330584.723775] RIP: 0010:xas_load+0x29/0x40
>> [330584.723783] Code: 00 e8 3b ff ff ff 48 89 c2 83 e2 03 48 83 fa 02 75 08 48 3d 00 10 00 00 77 05 c3 cc cc cc cc 0f b6 4f 10 48 8d 70 fe 38 48 fe <72> ee e8 20 fe ff ff 80 3e 00 75 d0 c3 cc cc cc cc 66 0f 1f 44 00
>> [330584.723785] RSP: 0018:ffffb427c3387bf8 EFLAGS: 00000246
>> [330584.723787] RAX: ffff98871f8dbdaa RBX: ffffb427c3387d70 RCX:
>> 0000000000000000
>> [330584.723788] RDX: 0000000000000002 RSI: ffff98871f8dbda8 RDI:
>> ffffb427c3387c00
>> [330584.723789] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
>> ffffda1444482000
>> [330584.723790] R10: 0000000000000001 R11: 0000000000000001 R12:
>> 000000000000f161
>> [330584.723791] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
>> ffffb427c3387e90
>> [330584.723793] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
>> knlGS:0000000000000000
>> [330584.723794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [330584.723795] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
>> 00000000000006e0
>> [330584.723798] Call Trace:
>> [330584.723801]  <NMI>
>> [330584.723805]  ? nmi_cpu_backtrace.cold+0x1b/0x76
>> [330584.723810]  ? nmi_cpu_backtrace_handler+0xd/0x20
>> [330584.723814]  ? nmi_handle+0x5d/0x120
>> [330584.723817]  ? xas_load+0x29/0x40
>> [330584.723818]  ? default_do_nmi+0x69/0x170
>> [330584.723821]  ? exc_nmi+0x13c/0x170
>> [330584.723823]  ? end_repeat_nmi+0x16/0x67
>> [330584.723828]  ? xas_load+0x29/0x40
>> [330584.723830]  ? xas_load+0x29/0x40
>> [330584.723832]  ? xas_load+0x29/0x40
>> [330584.723834]  </NMI>
>> [330584.723834]  <TASK>
>> [330584.723835]  filemap_get_read_batch+0x16e/0x250
>> [330584.723840]  filemap_get_pages+0xa9/0x630
>> [330584.723842]  ? memcg_check_events+0xda/0x210
>> [330584.723845]  ? free_unref_page_commit+0x7c/0x170
>> [330584.723849]  ? _raw_spin_unlock_irqrestore+0x23/0x40
>> [330584.723851]  ? free_unref_page+0x1ac/0x220
>> [330584.723852]  filemap_read+0xd2/0x340
>> [330584.723857]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
>> [330584.723935]  xfs_file_read_iter+0x6a/0xd0 [xfs]
>> [330584.723990]  vfs_read+0x23c/0x310
>> [330584.723995]  __x64_sys_pread64+0x94/0xc0
>> [330584.723997]  do_syscall_64+0x3a/0x90
>> [330584.724000]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [330584.724003] RIP: 0033:0x7ff8de90e747
>> [330584.724019] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90 f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
>> [330584.724020] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000011
>> [330584.724021] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007ff8de90e747
>> [330584.724022] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
>> 0000000000000230
>> [330584.724023] RBP: 0000000000000001 R08: 000000000a00000d R09:
>> 0000000000000000
>> [330584.724023] R10: 000000000f160000 R11: 0000000000000202 R12:
>> 00007ff8cdd92688
>> [330584.724024] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
>> 0000562bb5fe04d0
>> [330584.724026]  </TASK>
>> 
>> 
>> This keeps repeating until around 4:03, which ends with systemd-journald
>> getting core-dumped:
>> 
>> [331277.846966] rcu: INFO: rcu_preempt self-detected stall on CPU
>> [331277.847413] rcu:    1-....: (713858 ticks this GP)
>> idle=1204/1/0x4000000000000000 softirq=53073547/53073547 fqs=311996
>> [331277.848088]         (t=714253 jiffies g=142859597 q=1821602 ncpus=3)
>> [331277.848462] CPU: 1 PID: 1246481 Comm: .postgres-wrapp Not tainted 6.1.31
>> #1-NixOS
>> [331277.848941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
>> [331277.849667] RIP: 0010:xas_descend+0x22/0x70
>> [331277.849952] Code: cc cc cc cc cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 <48> 83 f9 02 75 08 48 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc
>> [331277.851120] RSP: 0018:ffffb427c3387bf0 EFLAGS: 00000202
>> [331277.851468] RAX: ffff988876d3c002 RBX: ffffb427c3387d70 RCX:
>> 0000000000000002
>> [331277.851928] RDX: 000000000000000f RSI: ffff98871f9d2920 RDI:
>> ffffb427c3387c00
>> [331277.852386] RBP: 000000000000f161 R08: ffffb427c3387e68 R09:
>> ffffda1444482000
>> [331277.852847] R10: 0000000000000001 R11: 0000000000000001 R12:
>> 000000000000f161
>> [331277.853303] R13: ffff9887ad3c2700 R14: 000000000000f160 R15:
>> ffffb427c3387e90
>> [331277.853766] FS:  00007ff8de817800(0000) GS:ffff98887ac80000(0000)
>> knlGS:0000000000000000
>> [331277.854278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [331277.854653] CR2: 0000562bb434e2d8 CR3: 0000000104bd6000 CR4:
>> 00000000000006e0
>> [331277.855111] Call Trace:
>> [331277.855284]  <IRQ>
>> [331277.855433]  ? rcu_dump_cpu_stacks+0xc8/0x100
>> [331277.855730]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
>> [331277.856049]  ? sched_slice+0x87/0x140
>> [331277.856294]  ? timekeeping_update+0xdd/0x130
>> [331277.856587]  ? __cgroup_account_cputime_field+0x5b/0xa0
>> [331277.856930]  ? update_process_times+0x77/0xb0
>> [331277.857222]  ? update_wall_time+0xc/0x20
>> [331277.857482]  ? tick_sched_handle+0x34/0x50
>> [331277.857758]  ? tick_sched_timer+0x6f/0x80
>> [331277.858024]  ? tick_sched_do_timer+0xa0/0xa0
>> [331277.858306]  ? __hrtimer_run_queues+0x112/0x2b0
>> [331277.858613]  ? hrtimer_interrupt+0xfe/0x220
>> [331277.858891]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
>> [331277.859235]  ? sysvec_apic_timer_interrupt+0x99/0xc0
>> [331277.859558]  </IRQ>
>> [331277.859710]  <TASK>
>> [331277.859857]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>> [331277.860206]  ? xas_descend+0x22/0x70
>> [331277.860446]  xas_load+0x30/0x40
>> [331277.860665]  filemap_get_read_batch+0x16e/0x250
>> [331277.860967]  filemap_get_pages+0xa9/0x630
>> [331277.861233]  ? memcg_check_events+0xda/0x210
>> [331277.861517]  ? free_unref_page_commit+0x7c/0x170
>> [331277.861834]  ? _raw_spin_unlock_irqrestore+0x23/0x40
>> [331277.862158]  ? free_unref_page+0x1ac/0x220
>> [331277.862427]  filemap_read+0xd2/0x340
>> [331277.862677]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
>> [331277.863072]  xfs_file_read_iter+0x6a/0xd0 [xfs]
>> [331277.863424]  vfs_read+0x23c/0x310
>> [331277.863657]  __x64_sys_pread64+0x94/0xc0
>> [331277.863917]  do_syscall_64+0x3a/0x90
>> [331277.864159]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [331277.864488] RIP: 0033:0x7ff8de90e747
>> [331277.864752] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90 f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
>> [331277.865917] RSP: 002b:00007fff5206aaa8 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000011
>> [331277.866395] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007ff8de90e747
>> [331277.866848] RDX: 0000000000002000 RSI: 00007ff888b68e80 RDI:
>> 0000000000000230
>> [331277.867298] RBP: 0000000000000001 R08: 000000000a00000d R09:
>> 0000000000000000
>> [331277.867750] R10: 000000000f160000 R11: 0000000000000202 R12:
>> 00007ff8cdd92688
>> [331277.868199] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
>> 0000562bb5fe04d0
>> [331277.868658]  </TASK>
>> Fri Jun 16 04:03:23 AM CEST 2023 -- SERIAL CONSOLE IS LIVE --
>> [331289.987247] systemd[1]: Starting Serial console liveness marker...
>> [331289.992645] systemd[1]: nscd.service: Deactivated successfully.
>> [331290.011838] systemd[1]: nscd.service: Consumed 3min 16.251s CPU time,
>> received 12.0M IP traffic, sent 8.2M IP traffic.
>> [331290.020289] systemd[1]: serial-console-liveness.service: Deactivated
>> successfully.
>> [331290.035818] systemd[1]: Finished Serial console liveness marker.
>> [331290.068776] systemd[1]: Started Logrotate Service.
>> [331290.069700] systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
>> [331290.070475] systemd[1]: systemd-journald.service: Killing process 528
>> (systemd-journal) with signal SIGABRT.
>> [331290.096246] systemd[1]: logrotate.service: Deactivated successfully.
>> [331290.285285] systemd-coredump[1267441]: Process 528 (systemd-journal) of
>> user 0 dumped core.
>> [331290.286002] systemd-coredump[1267441]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.39df468c96764e8c8fffc53a0d0a47d1.528.1686881003000000.zst
>> [331290.287045] systemd-coredump[1267441]: Module libaudit.so.1 without
>> build-id.
>> [331290.287574] systemd-coredump[1267441]: Module libnftnl.so.11 without
>> build-id.
>> [331290.288111] systemd-coredump[1267441]: Module libmnl.so.0 without
>> build-id.
>> [331290.288587] systemd-coredump[1267441]: Module libgpg-error.so.0 without
>> build-id.
>> [331290.289076] systemd-coredump[1267441]: Module libattr.so.1 without
>> build-id.
>> [331290.289541] systemd-coredump[1267441]: Module libzstd.so.1 without
>> build-id.
>> [331290.289991] systemd-coredump[1267441]: Module liblzma.so.5 without
>> build-id.
>> [331290.290446] systemd-coredump[1267441]: Module libseccomp.so.2 without
>> build-id.
>> [331290.290940] systemd-coredump[1267441]: Module libpam.so.0 without
>> build-id.
>> [331290.291419] systemd-coredump[1267441]: Module liblz4.so.1 without
>> build-id.
>> [331290.291893] systemd-coredump[1267441]: Module libkmod.so.2 without
>> build-id.
>> [331290.292370] systemd-coredump[1267441]: Module libip4tc.so.2 without
>> build-id.
>> [331290.295165] systemd-coredump[1267441]: Module libgcrypt.so.20 without
>> build-id.
>> [331290.295678] systemd-coredump[1267441]: Module libcrypt.so.2 without
>> build-id.
>> [331290.296160] systemd-coredump[1267441]: Module libcap.so.2 without
>> build-id.
>> [331290.296708] systemd-coredump[1267441]: Module libacl.so.1 without
>> build-id.
>> [331290.297206] systemd-coredump[1267441]: Module libsystemd-shared-253.so
>> without build-id.
>> [331290.297768] systemd-coredump[1267441]: Module systemd-journald without
>> build-id.
>> [331290.298273] systemd-coredump[1267441]: Stack trace of thread 528:
>> [331290.298714] systemd-coredump[1267441]: #0  0x00007f3e96431de8
>> check_object_header (libsystemd-shared-253.so + 0x231de8)
>> [331290.299435] systemd-coredump[1267441]: #1  0x00007f3e964330d2
>> journal_file_move_to_object (libsystemd-shared-253.so + 0x2330d2)
>> [331290.300199] systemd-coredump[1267441]: #2  0x00007f3e96434a71
>> journal_file_find_data_object_with_hash (libsystemd-shared-253.so + 0x234a71)
>> [331290.302207] systemd-coredump[1267441]: #3  0x00007f3e96434d3b
>> journal_file_append_data (libsystemd-shared-253.so + 0x234d3b)
>> [331290.302983] systemd-coredump[1267441]: #4  0x00007f3e96437243
>> journal_file_append_entry (libsystemd-shared-253.so + 0x237243)
>> [331290.303767] systemd-coredump[1267441]: #5  0x000056536a4a7cd5
>> server_dispatch_message_real (systemd-journald + 0x10cd5)
>> [331290.304504] systemd-coredump[1267441]: #6  0x000056536a4a134b
>> dev_kmsg_record (systemd-journald + 0xa34b)
>> [331290.305195] systemd-coredump[1267441]: #7  0x000056536a4a182b
>> server_read_dev_kmsg (systemd-journald + 0xa82b)
>> [331290.305922] systemd-coredump[1267441]: #8  0x00007f3e9645c140
>> source_dispatch (libsystemd-shared-253.so + 0x25c140)
>> [331290.306640] systemd-coredump[1267441]: #9  0x00007f3e9645c42d
>> sd_event_dispatch (libsystemd-shared-253.so + 0x25c42d)
>> [331290.307350] systemd-coredump[1267441]: #10 0x00007f3e9645cb48 sd_event_run
>> (libsystemd-shared-253.so + 0x25cb48)
>> [331290.308043] systemd-coredump[1267441]: #11 0x000056536a4a0568 main
>> (systemd-journald + 0x9568)
>> [331290.308652] systemd-coredump[1267441]: #12 0x00007f3e9603dace
>> __libc_start_call_main (libc.so.6 + 0x23ace)
>> [331290.309292] systemd-coredump[1267441]: #13 0x00007f3e9603db89
>> __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x23b89)
>> [331290.310048] systemd-coredump[1267441]: #14 0x000056536a4a0835 _start
>> (systemd-journald + 0x9835)
>> [331290.310692] systemd-coredump[1267441]: ELF object binary architecture: AMD x86-64
>> 
>> 
>> Things then went back to normal until around 20:03 (around 14 hours later):
>> 
>> [389370.873026] rcu: INFO: rcu_preempt self-detected stall on CPU
>> [389370.873481] rcu:    0-....: (21000 ticks this GP)
>> idle=91fc/1/0x4000000000000000 softirq=85252827/85252827 fqs=4704
>> [389370.874126]         (t=21002 jiffies g=167843445 q=13889 ncpus=3)
>> [389370.874500] CPU: 0 PID: 2202919 Comm: .postgres-wrapp Not tainted 6.1.31
>> #1-NixOS
>> [389370.874979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
>> [389370.875710] RIP: 0010:xas_descend+0x26/0x70
>> [389370.875991] Code: cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 <75> 08 48 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc 48 c1 e8 02
>> [389370.877164] RSP: 0018:ffffb427c4917bf0 EFLAGS: 00000246
>> [389370.877512] RAX: ffff98871f8dbdaa RBX: ffffb427c4917d70 RCX:
>> 0000000000000002
>> [389370.877967] RDX: 0000000000000005 RSI: ffff988876d3c000 RDI:
>> ffffb427c4917c00
>> [389370.878426] RBP: 000000000000f177 R08: ffffb427c4917e68 R09:
>> ffff988846485d38
>> [389370.878880] R10: ffffb427c4917e60 R11: ffff988846485d38 R12:
>> 000000000000f177
>> [389370.879341] R13: ffff988827b4ae00 R14: 000000000000f176 R15:
>> ffffb427c4917e90
>> [389370.879798] FS:  00007ff8de817800(0000) GS:ffff98887ac00000(0000)
>> knlGS:0000000000000000
>> [389370.880308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [389370.880683] CR2: 00007ff881c8c000 CR3: 000000010dfea000 CR4:
>> 00000000000006f0
>> [389370.881139] Call Trace:
>> [389370.881309]  <IRQ>
>> [389370.881463]  ? rcu_dump_cpu_stacks+0xc8/0x100
>> [389370.881755]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
>> [389370.882077]  ? sched_slice+0x87/0x140
>> [389370.882325]  ? perf_event_task_tick+0x64/0x370
>> [389370.882629]  ? __cgroup_account_cputime_field+0x5b/0xa0
>> [389370.882968]  ? update_process_times+0x77/0xb0
>> [389370.883258]  ? tick_sched_handle+0x34/0x50
>> [389370.883538]  ? tick_sched_timer+0x6f/0x80
>> [389370.883802]  ? tick_sched_do_timer+0xa0/0xa0
>> [389370.884084]  ? __hrtimer_run_queues+0x112/0x2b0
>> [389370.884388]  ? hrtimer_interrupt+0xfe/0x220
>> [389370.884666]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
>> [389370.885006]  ? sysvec_apic_timer_interrupt+0x99/0xc0
>> [389370.885338]  </IRQ>
>> [389370.885488]  <TASK>
>> [389370.885636]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>> [389370.885983]  ? xas_descend+0x26/0x70
>> [389370.886224]  xas_load+0x30/0x40
>> [389370.886456]  filemap_get_read_batch+0x16e/0x250
>> [389370.886758]  filemap_get_pages+0xa9/0x630
>> [389370.887026]  ? atime_needs_update+0x104/0x180
>> [389370.887317]  ? touch_atime+0x46/0x1f0
>> [389370.887573]  filemap_read+0xd2/0x340
>> [389370.887817]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
>> [389370.888210]  xfs_file_read_iter+0x6a/0xd0 [xfs]
>> [389370.888571]  vfs_read+0x23c/0x310
>> [389370.888797]  __x64_sys_pread64+0x94/0xc0
>> [389370.889058]  do_syscall_64+0x3a/0x90
>> [389370.889299]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> [389370.889634] RIP: 0033:0x7ff8de90e747
>> [389370.889900] Code: 48 e8 9d dc f2 ff 41 b8 02 00 00 00 e9 38 f6 ff ff 66 90 f3 0f 1e fa 80 3d bd ac 0e 00 00 49 89 ca 74 10 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 c3 48 83 ec 28 48 89 54 24 10 48 89 74 24
>> [389370.891063] RSP: 002b:00007fff5206a848 EFLAGS: 00000202 ORIG_RAX:
>> 0000000000000011
>> [389370.891549] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007ff8de90e747
>> [389370.892003] RDX: 0000000000002000 RSI: 00007ff881c8ce80 RDI:
>> 00000000000001c7
>> [389370.892466] RBP: 0000000000000001 R08: 000000000a00000d R09:
>> 0000000000000000
>> [389370.892923] R10: 000000000f176000 R11: 0000000000000202 R12:
>> 00007ff8cdd91278
>> [389370.893386] R13: 0000562bb434af75 R14: 0000562bb434f510 R15:
>> 0000562bb5e55828
>> [389370.893843]  </TASK>
>> 
>> 
>> At this point we have to restart the VM externally because it didn't respond
>> properly any longer.
>> 
> 
> 
> Kind regards,
> Christian Theune
> 
> -- 
> Christian Theune · ct@flyingcircus.io · +49 345 219401 0
> Flying Circus Internet Operations GmbH · https://flyingcircus.io
> Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
> HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian
> Zagrodnick
> 

Kind regards,
Christian Theune

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (2 preceding siblings ...)
  2023-06-20 15:13 ` bugzilla-daemon
@ 2023-06-20 15:21 ` bugzilla-daemon
  2023-06-20 17:26 ` bugzilla-daemon
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-06-20 15:21 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

Christian Kujau (kernel@nerdbynature.de) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kernel@nerdbynature.de

--- Comment #3 from Christian Kujau (kernel@nerdbynature.de) ---
So, these "blocked for more than" warnings are somewhat documented in
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#hung-task-timeout-secs
and they can be disabled.
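For reference, the hung-task knobs live under /proc/sys/kernel and can be set
persistently via sysctl; the values below are illustrative examples, not
recommendations:

```
# /etc/sysctl.d/90-hung-task.conf (example values)
kernel.hung_task_timeout_secs = 240   # default is 120; 0 disables the check
kernel.hung_task_warnings = -1        # -1 means never stop emitting warnings
```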

These "rcu_preempt self-detected stall on CPU" messages may not be related to
XFS at all, see e.g. https://stackoverflow.com/a/35403677. Maybe the system is
just too busy while doing file system operations, hence "xfs" showing up in
these backtraces.

Does this happen when dumping smaller databases too? Did this happen with
earlier kernels? With newer ones?


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (3 preceding siblings ...)
  2023-06-20 15:21 ` bugzilla-daemon
@ 2023-06-20 17:26 ` bugzilla-daemon
  2023-07-03 14:10 ` bugzilla-daemon
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-06-20 17:26 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #4 from Christian Theune (ct@flyingcircus.io) ---
I've only seen this once, so there's no indication yet regarding older or newer
kernels, or smaller versus larger databases. Fortunately, I was able to repair
the PostgreSQL database with a VACUUM FULL on the affected table, after which
the dump worked fine again.
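The repair step above can be run from the shell roughly like this (a sketch;
the database and table names are hypothetical placeholders):

```shell
# Hypothetical names; VACUUM FULL rewrites the table and takes an
# ACCESS EXCLUSIVE lock, so run it in a maintenance window.
DB="mydb"
TABLE="mytable"
if command -v psql >/dev/null 2>&1; then
    psql -d "$DB" -c "VACUUM FULL VERBOSE ${TABLE};"
else
    echo "psql not available here; command shown for illustration only"
fi
```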

In the past, hanging typically indicated a network storage issue, so I'm aware
of the multiple causes hung tasks can have; I still appreciate the link. :)

At the time of the hung tasks, I can see almost no IO (but also no IO pressure),
and about 60% (of 3 CPUs) is reported as SYSTEM time.

Something that made me suspect XFS was that we ended up with inconsistent data
within PostgreSQL, which I haven't seen in a decade.

Nevertheless, it appears this might be an MM issue, as I stumbled over this
inquiry which also mentions a 6.1 kernel:
https://www.spinics.net/lists/kernel/msg4783004.html

I'm trying to get in touch with Daniel to see whether his analysis went
anywhere ...


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (4 preceding siblings ...)
  2023-06-20 17:26 ` bugzilla-daemon
@ 2023-07-03 14:10 ` bugzilla-daemon
  2023-07-03 19:56 ` bugzilla-daemon
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-07-03 14:10 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

Ivan Mironov (mironov.ivan@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mironov.ivan@gmail.com

--- Comment #5 from Ivan Mironov (mironov.ivan@gmail.com) ---
It looks like I had similar issues on Fedora with 6.3.* kernels, XFS and
RocksDB: https://bugzilla.redhat.com/show_bug.cgi?id=2213967


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (5 preceding siblings ...)
  2023-07-03 14:10 ` bugzilla-daemon
@ 2023-07-03 19:56 ` bugzilla-daemon
  2023-07-03 22:30   ` Dave Chinner
  2023-07-03 22:30 ` bugzilla-daemon
                   ` (19 subsequent siblings)
  26 siblings, 1 reply; 32+ messages in thread
From: bugzilla-daemon @ 2023-07-03 19:56 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #6 from Christian Theune (ct@flyingcircus.io) ---
Daniel pointed me to this patch they're considering as a valid fix:
https://lore.kernel.org/linux-fsdevel/20221129001632.GX3600936@dread.disaster.area/


* Re: [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-07-03 19:56 ` bugzilla-daemon
@ 2023-07-03 22:30   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2023-07-03 22:30 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: linux-xfs

On Mon, Jul 03, 2023 at 07:56:36PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217572
> 
> --- Comment #6 from Christian Theune (ct@flyingcircus.io) ---
> Daniel pointed me to this patch they're considering as a valid fix:
> https://lore.kernel.org/linux-fsdevel/20221129001632.GX3600936@dread.disaster.area/

No, that has nothing to do with the problem you are seeing on 6.1.31
kernels. That was a fix for a regression introduced in 6.3-rc1, and
hence does not exist in 6.1.y kernels.

The problem you are tripping over appears to be a livelock in the
page cache iterator infrastructure, not an issue with the filesystem
itself. This has been seen occasionally (maybe once every couple of
months of testing across the entire dev community) during testing
since large folios were enabled in the page cache, but nobody has
been able to reproduce it reliably enough to be able to isolate the
root cause and fix it yet.

If you can reproduce it reliably and quickly, then putting together
a recipe that we can use to trigger it would be a great help.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (7 preceding siblings ...)
  2023-07-03 22:30 ` bugzilla-daemon
@ 2023-07-04  4:22 ` bugzilla-daemon
  2023-07-05 22:07 ` bugzilla-daemon
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-07-04  4:22 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #8 from Christian Theune (ct@flyingcircus.io) ---
Ah, great, thanks! Superficially it looked related, but I saw that the patch
never made it to 6.1.

I have only seen it once so far, too. If I can figure out anything that makes
reproduction more likely, I'll make sure to let you know.

As Daniel isn't on CC on this, I'll let him know as well.


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (8 preceding siblings ...)
  2023-07-04  4:22 ` bugzilla-daemon
@ 2023-07-05 22:07 ` bugzilla-daemon
  2023-09-28 12:39 ` bugzilla-daemon
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-07-05 22:07 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #9 from Sam James (sam@gentoo.org) ---
Any chance this one is related to
https://bugzilla.kernel.org/show_bug.cgi?id=216646? It sounds similar in terms
of symptoms (the eventual lockup & stalled tasks) at least.


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (9 preceding siblings ...)
  2023-07-05 22:07 ` bugzilla-daemon
@ 2023-09-28 12:39 ` bugzilla-daemon
  2023-09-28 22:44   ` Dave Chinner
  2023-09-28 13:06 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 32+ messages in thread
From: bugzilla-daemon @ 2023-09-28 12:39 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #10 from Christian Theune (ct@flyingcircus.io) ---
@Dave: I've seen this a second time today. It again ended up being a COPY
statement with PostgreSQL, and I'm taking a consistent snapshot of the full
disk right now.

If you can point me to anything I can do to extract data from the filesystem,
I can boot this on an isolated machine and poke at it. I'd also be open to
looking at it together if you (or anyone) wants to poke around.


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (10 preceding siblings ...)
  2023-09-28 12:39 ` bugzilla-daemon
@ 2023-09-28 13:06 ` bugzilla-daemon
  2023-09-28 22:44 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-09-28 13:06 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #11 from Christian Theune (ct@flyingcircus.io) ---
OK, so the issue didn't directly appear again after a reboot, but PostgreSQL
needed vacuuming (and luckily didn't lose data again).

I can provoke some issues on the machine once booted, but they might be
after-effects in the filesystem due to the suspected memory bug Dave mentioned.


* Re: [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-09-28 12:39 ` bugzilla-daemon
@ 2023-09-28 22:44   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2023-09-28 22:44 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: linux-xfs

On Thu, Sep 28, 2023 at 12:39:05PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217572
> 
> --- Comment #10 from Christian Theune (ct@flyingcircus.io) ---
> @Dave I've seen this a second time now today. It also ended up being a copy
> statement with postgresql and I'm going to take a consistent snapshot of the
> full disk right now.

Have you tried seeing if commit cbc02854331e ("XArray: Do not return
sibling entries from xa_load()") fixes the problem? That has been
marked for stable, so I'd expect that it eventually makes it back to
6.1.y or whatever LTS kernel you are running now....

As for debug, this isn't an XFS problem - it's a page cache problem
when high-order folios are in use (which XFS uses) - so you need to
be looking at in-memory page cache state that is being tripped over
first....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (12 preceding siblings ...)
  2023-09-28 22:44 ` bugzilla-daemon
@ 2023-09-29  4:54 ` bugzilla-daemon
  2023-09-29  5:01 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-09-29  4:54 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #13 from Christian Theune (ct@flyingcircus.io) ---
Hi,

sorry for picking up on the XFS end again – that was a brain fart on my side
... ;)

Unfortunately, I still can't (reliably) trigger the issue yet, but I've seen it
twice in 3 months now.

I'll check the commit you mentioned. According to the LKML, it has only made it
into the unreleased 6.6 series. I could backport it to 6.1 or ask a maintainer,
or it might be on its way to 6.5 already?

I'm undecided what the best option for trying the commit is, because both
machines that were affected are in production use with customers.


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (13 preceding siblings ...)
  2023-09-29  4:54 ` bugzilla-daemon
@ 2023-09-29  5:01 ` bugzilla-daemon
  2023-10-05 14:31 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-09-29  5:01 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #14 from Christian Theune (ct@flyingcircus.io) ---
Forget my last question! I went through the changelogs more closely (my search
engine didn't properly match the commit globally) and found that it made its
way into 6.1.53.

Which means this should show up soon in our environment.
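To double-check which stable tags contain the fix, one can ask git directly in
a kernel checkout (a sketch; assumes a linux.git clone, with its path supplied
via the hypothetical LINUX_SRC variable):

```shell
# Commit id from comment #12; list the first v6.1.y tag that contains it.
COMMIT=cbc02854331e
if git -C "${LINUX_SRC:-.}" cat-file -e "${COMMIT}^{commit}" 2>/dev/null; then
    git -C "${LINUX_SRC:-.}" tag --contains "$COMMIT" \
        | grep '^v6\.1\.' | sort -V | head -n 1
else
    echo "no linux.git checkout found; set LINUX_SRC to its path"
fi
```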


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (14 preceding siblings ...)
  2023-09-29  5:01 ` bugzilla-daemon
@ 2023-10-05 14:31 ` bugzilla-daemon
  2023-10-08 17:35 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-10-05 14:31 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

Thomas Walker (thomas.walker@twosigma.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thomas.walker@twosigma.com

--- Comment #15 from Thomas Walker (thomas.walker@twosigma.com) ---
We seem to be hitting this with some regularity, also on 6.1(.44), including
after applying the proposed patch (cbc02854331e, "XArray: Do not return
sibling entries from xa_load()").

Curious whether this has actually solved the problem for anyone else?
(This issue should also probably be moved from XFS, which we now know is just
a victim, to the memory management subsystem to get proper visibility.)


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (15 preceding siblings ...)
  2023-10-05 14:31 ` bugzilla-daemon
@ 2023-10-08 17:35 ` bugzilla-daemon
  2023-10-08 22:13 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-10-08 17:35 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #16 from Ivan Mironov (mironov.ivan@gmail.com) ---
It looks like this is still happening with 6.5.5, which includes cbc02854331e
("XArray: Do not return sibling entries from xa_load()").

[    0.000000] Linux version 6.5.5-200.fc38.x86_64
(mockbuild@d4d01d62c9c942e59de1ef4aa94df5a2) (gcc (GCC) 13.2.1 20230728 (Red
Hat 13.2.1-1), GNU ld version 2.39-9.fc38) #1 SMP PREEMPT_DYNAMIC Sun Sep 24
15:52:44 UTC 2023
[    0.000000] Command line: BOOT_IMAGE=(md/boot)/vmlinuz-6.5.5-200.fc38.x86_64
root=/dev/mapper/vg--bmsolv-root ro
rd.md.uuid=216337a3:789c28b0:81fbad29:6f190e56 rd.lvm.lv=vg-bmsolv/root
rd.md.uuid=252001b9:2095e731:f1dd5baa:8b672d56 clocksource=tsc tsc=reliable
amd_pstate=active rhgb quiet
...
[146990.211120] watchdog: BUG: soft lockup - CPU#20 stuck for 26s!
[solAcctHashVer:3040]
[146990.211140] Modules linked in: nft_masq nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel
nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill
tcp_bbr ip_set nf_tables nfnetlink tun nct6775 nct6775_core hwmon_vid ipmi_ssif
vfat fat intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_intel
snd_intel_dspcfg snd_intel_sdw_acpi kvm snd_hda_codec snd_hda_core snd_hwdep
snd_pcm snd_timer irqbypass snd acpi_ipmi cdc_ether rapl wmi_bmof usbnet
ipmi_si soundcore i2c_piix4 k10temp mii ipmi_devintf ipmi_msghandler joydev
fuse loop xfs raid1 igb dca ast i2c_algo_bit crct10dif_pclmul nvme crc32_pclmul
crc32c_intel polyval_clmulni polyval_generic nvme_core ghash_clmulni_intel ccp
sha512_ssse3 wmi sp5100_tco nvme_common
[146990.211199] CPU: 20 PID: 3040 Comm: solAcctHashVer Not tainted
6.5.5-200.fc38.x86_64 #1
[146990.211201] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./X570D4U, BIOS P1.20 05/19/2021
[146990.211203] RIP: 0010:xas_load+0x11/0x50
[146990.211208] Code: 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 f3 0f 1e fa 55 53 48 89 fb e8 22 ff ff ff 48 89 c2 <83> e2 03
48 83 fa 02 75 08 48 3d 00 10 00 00 77 07 5b 5d e9 77 77
[146990.211210] RSP: 0018:ffffa80983a7bbb0 EFLAGS: 00000297
[146990.211211] RAX: ffff95d60e099ff2 RBX: ffffa80983a7bbc8 RCX:
000000000000000c
[146990.211213] RDX: ffff95d60e099ff2 RSI: ffff95d4c5326da0 RDI:
ffffa80983a7bbc8
[146990.211214] RBP: 0000000000013e21 R08: 0000000000000000 R09:
0000000000002000
[146990.211215] R10: 0000000000000000 R11: ffff95d6a03d6538 R12:
0000000000013e21
[146990.211216] R13: ffff95e11e2d1400 R14: 0000000000013e20 R15:
ffffa80983a7be60
[146990.211217] FS:  00007fd0d7fff6c0(0000) GS:ffff95f47ef00000(0000)
knlGS:0000000000000000
[146990.211218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[146990.211219] CR2: 00007fc0fbb12000 CR3: 000000010e2f2000 CR4:
0000000000750ee0
[146990.211221] PKRU: 55555554
[146990.211222] Call Trace:
[146990.211223]  <IRQ>
[146990.211227]  ? watchdog_timer_fn+0x1b8/0x220
[146990.211230]  ? __pfx_watchdog_timer_fn+0x10/0x10
[146990.211232]  ? __hrtimer_run_queues+0x112/0x2b0
[146990.211236]  ? hrtimer_interrupt+0xf8/0x230
[146990.211239]  ? __sysvec_apic_timer_interrupt+0x61/0x130
[146990.211242]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[146990.211245]  </IRQ>
[146990.211245]  <TASK>
[146990.211246]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[146990.211252]  ? xas_load+0x11/0x50
[146990.211254]  filemap_get_read_batch+0x16e/0x250
[146990.211258]  filemap_get_pages+0xa6/0x630
[146990.211263]  filemap_read+0xd9/0x350
[146990.211270]  xfs_file_buffered_read+0x52/0xd0 [xfs]
[146990.211353]  xfs_file_read_iter+0x77/0xe0 [xfs]
[146990.211423]  vfs_read+0x201/0x350
[146990.211429]  ksys_read+0x6f/0xf0
[146990.211433]  do_syscall_64+0x60/0x90
[146990.211437]  ? srso_alias_return_thunk+0x5/0x7f
[146990.211441]  ? exc_page_fault+0x7f/0x180
[146990.211444]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[146990.211448] RIP: 0033:0x7ffb383230ea
[146990.211465] Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8
e8 e8 79 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00
f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 7a f8 ff 48 8b
[146990.211467] RSP: 002b:00007fd0d7ff80b0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[146990.211469] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007ffb383230ea
[146990.211471] RDX: 0000000000002000 RSI: 00007fd28b056000 RDI:
000000000000084c
[146990.211472] RBP: 00007fd0d7ff80d0 R08: 0000000000000000 R09:
00000000000000cd
[146990.211474] R10: 00000000d40f7fa1 R11: 0000000000000246 R12:
00007fd28b056000
[146990.211475] R13: 00007fd0d7ffc3e0 R14: 0000000000002000 R15:
0000000000002000
[146990.211480]  </TASK>
[147038.206219] watchdog: BUG: soft lockup - CPU#2 stuck for 27s!
[solAcctHashVer:3040]
[147038.206244] Modules linked in: nft_masq nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel
nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill
tcp_bbr ip_set nf_tables nfnetlink tun nct6775 nct6775_core hwmon_vid ipmi_ssif
vfat fat intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_intel
snd_intel_dspcfg snd_intel_sdw_acpi kvm snd_hda_codec snd_hda_core snd_hwdep
snd_pcm snd_timer irqbypass snd acpi_ipmi cdc_ether rapl wmi_bmof usbnet
ipmi_si soundcore i2c_piix4 k10temp mii ipmi_devintf ipmi_msghandler joydev
fuse loop xfs raid1 igb dca ast i2c_algo_bit crct10dif_pclmul nvme crc32_pclmul
crc32c_intel polyval_clmulni polyval_generic nvme_core ghash_clmulni_intel ccp
sha512_ssse3 wmi sp5100_tco nvme_common
[147038.206306] CPU: 2 PID: 3040 Comm: solAcctHashVer Tainted: G             L 
   6.5.5-200.fc38.x86_64 #1
[147038.206309] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./X570D4U, BIOS P1.20 05/19/2021
[147038.206310] RIP: 0010:xas_load+0x45/0x50
[147038.206315] Code: 3d 00 10 00 00 77 07 5b 5d e9 77 77 02 00 0f b6 4b 10 48
8d 68 fe 38 48 fe 72 ec 48 89 ee 48 89 df e8 af fd ff ff 80 7d 00 00 <75> c7 eb
d9 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
[147038.206317] RSP: 0018:ffffa80983a7bbe0 EFLAGS: 00000206
[147038.206318] RAX: ffff95f326a6f47a RBX: ffffa80983a7bc20 RCX:
0000000000000002
[147038.206320] RDX: 0000000000000013 RSI: ffff95d60e099ff0 RDI:
ffffa80983a7bc20
[147038.206321] RBP: ffff95d60e099ff0 R08: ffffffffffffffc0 R09:
0000000000000000
[147038.206322] R10: 0000000000000001 R11: 0000000000000002 R12:
0000000000013e20
[147038.206323] R13: ffffa80983a7bcb8 R14: ffff95d6a03d66b0 R15:
000000000000000d
[147038.206324] FS:  00007fd0d7fff6c0(0000) GS:ffff95f47ea80000(0000)
knlGS:0000000000000000
[147038.206326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[147038.206327] CR2: 00007fc5390d8980 CR3: 000000010e2f2000 CR4:
0000000000750ee0
[147038.206328] PKRU: 55555554
[147038.206329] Call Trace:
[147038.206331]  <IRQ>
[147038.206335]  ? watchdog_timer_fn+0x1b8/0x220
[147038.206338]  ? __pfx_watchdog_timer_fn+0x10/0x10
[147038.206340]  ? __hrtimer_run_queues+0x112/0x2b0
[147038.206344]  ? hrtimer_interrupt+0xf8/0x230
[147038.206347]  ? __sysvec_apic_timer_interrupt+0x61/0x130
[147038.206350]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[147038.206353]  </IRQ>
[147038.206353]  <TASK>
[147038.206354]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[147038.206361]  ? xas_load+0x45/0x50
[147038.206363]  xas_find+0x162/0x1c0
[147038.206365]  find_lock_entries+0x84/0x270
[147038.206371]  truncate_inode_pages_range+0xd7/0x400
[147038.206379]  evict+0x1b0/0x1d0
[147038.206383]  do_unlinkat+0x177/0x320
[147038.206387]  __x64_sys_unlink+0x42/0x70
[147038.206390]  do_syscall_64+0x60/0x90
[147038.206393]  ? srso_alias_return_thunk+0x5/0x7f
[147038.206395]  ? ksys_write+0xd8/0xf0
[147038.206399]  ? srso_alias_return_thunk+0x5/0x7f
[147038.206401]  ? syscall_exit_to_user_mode+0x2b/0x40
[147038.206403]  ? srso_alias_return_thunk+0x5/0x7f
[147038.206405]  ? do_syscall_64+0x6c/0x90
[147038.206407]  ? srso_alias_return_thunk+0x5/0x7f
[147038.206409]  ? irqtime_account_irq+0x40/0xc0
[147038.206412]  ? srso_alias_return_thunk+0x5/0x7f
[147038.206414]  ? __irq_exit_rcu+0x4b/0xc0
[147038.206417]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[147038.206419] RIP: 0033:0x7ffb38324aab
[147038.206436] Code: f0 ff ff 73 01 c3 48 8b 0d 5a 13 0d 00 f7 d8 64 89 01 48
83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 57 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 29 13 0d 00 f7 d8
[147038.206438] RSP: 002b:00007fd0d7ffb8e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000057
[147038.206439] RAX: ffffffffffffffda RBX: 000000000000003f RCX:
00007ffb38324aab
[147038.206441] RDX: 000000000000003f RSI: 00007fd0d7ffb8f0 RDI:
00007fd0d7ffb8f0
[147038.206442] RBP: 00007fd0d7ffca40 R08: 0000000000000000 R09:
0000000000000030
[147038.206442] R10: fefefefefefefeff R11: 0000000000000246 R12:
00005642797b0f00
[147038.206443] R13: 00007fd0d7ffc3e0 R14: 000000000000084c R15:
0000000000000000
[147038.206447]  </TASK>
[147066.205694] watchdog: BUG: soft lockup - CPU#2 stuck for 53s!
[solAcctHashVer:3040]
[147066.205714] Modules linked in: nft_masq nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel
nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill
tcp_bbr ip_set nf_tables nfnetlink tun nct6775 nct6775_core hwmon_vid ipmi_ssif
vfat fat intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_intel
snd_intel_dspcfg snd_intel_sdw_acpi kvm snd_hda_codec snd_hda_core snd_hwdep
snd_pcm snd_timer irqbypass snd acpi_ipmi cdc_ether rapl wmi_bmof usbnet
ipmi_si soundcore i2c_piix4 k10temp mii ipmi_devintf ipmi_msghandler joydev
fuse loop xfs raid1 igb dca ast i2c_algo_bit crct10dif_pclmul nvme crc32_pclmul
crc32c_intel polyval_clmulni polyval_generic nvme_core ghash_clmulni_intel ccp
sha512_ssse3 wmi sp5100_tco nvme_common
[147066.205773] CPU: 2 PID: 3040 Comm: solAcctHashVer Tainted: G             L 
   6.5.5-200.fc38.x86_64 #1
[147066.205775] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./X570D4U, BIOS P1.20 05/19/2021
[147066.205776] RIP: 0010:xas_load+0x45/0x50
[147066.205782] Code: 3d 00 10 00 00 77 07 5b 5d e9 77 77 02 00 0f b6 4b 10 48
8d 68 fe 38 48 fe 72 ec 48 89 ee 48 89 df e8 af fd ff ff 80 7d 00 00 <75> c7 eb
d9 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
[147066.205783] RSP: 0018:ffffa80983a7bbe0 EFLAGS: 00000206
[147066.205785] RAX: ffff95d4c5326da2 RBX: ffffa80983a7bc20 RCX:
0000000000000002
[147066.205786] RDX: 0000000000000038 RSI: ffff95f326a6f478 RDI:
ffffa80983a7bc20
[147066.205787] RBP: ffff95f326a6f478 R08: ffffffffffffffc0 R09:
0000000000000000
[147066.205788] R10: 0000000000000001 R11: 0000000000000002 R12:
0000000000013e20
[147066.205789] R13: ffffa80983a7bcb8 R14: ffff95d6a03d66b0 R15:
000000000000000d
[147066.205790] FS:  00007fd0d7fff6c0(0000) GS:ffff95f47ea80000(0000)
knlGS:0000000000000000
[147066.205792] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[147066.205793] CR2: 00007fc5390d8980 CR3: 000000010e2f2000 CR4:
0000000000750ee0
[147066.205794] PKRU: 55555554
[147066.205795] Call Trace:
[147066.205797]  <IRQ>
[147066.205801]  ? watchdog_timer_fn+0x1b8/0x220
[147066.205804]  ? __pfx_watchdog_timer_fn+0x10/0x10
[147066.205806]  ? __hrtimer_run_queues+0x112/0x2b0
[147066.205811]  ? hrtimer_interrupt+0xf8/0x230
[147066.205813]  ? __sysvec_apic_timer_interrupt+0x61/0x130
[147066.205816]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[147066.205819]  </IRQ>
[147066.205820]  <TASK>
[147066.205821]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[147066.205827]  ? xas_load+0x45/0x50
[147066.205829]  xas_find+0x162/0x1c0
[147066.205832]  find_lock_entries+0x84/0x270
[147066.205837]  truncate_inode_pages_range+0xd7/0x400
[147066.205846]  evict+0x1b0/0x1d0
[147066.205849]  do_unlinkat+0x177/0x320
[147066.205854]  __x64_sys_unlink+0x42/0x70
[147066.205856]  do_syscall_64+0x60/0x90
[147066.205859]  ? srso_alias_return_thunk+0x5/0x7f
[147066.205862]  ? ksys_write+0xd8/0xf0
[147066.205866]  ? srso_alias_return_thunk+0x5/0x7f
[147066.205868]  ? syscall_exit_to_user_mode+0x2b/0x40
[147066.205870]  ? srso_alias_return_thunk+0x5/0x7f
[147066.205872]  ? do_syscall_64+0x6c/0x90
[147066.205874]  ? srso_alias_return_thunk+0x5/0x7f
[147066.205876]  ? irqtime_account_irq+0x40/0xc0
[147066.205879]  ? srso_alias_return_thunk+0x5/0x7f
[147066.205880]  ? __irq_exit_rcu+0x4b/0xc0
[147066.205884]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[147066.205887] RIP: 0033:0x7ffb38324aab
[147066.205903] Code: f0 ff ff 73 01 c3 48 8b 0d 5a 13 0d 00 f7 d8 64 89 01 48
83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 57 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 29 13 0d 00 f7 d8
[147066.205905] RSP: 002b:00007fd0d7ffb8e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000057
[147066.205907] RAX: ffffffffffffffda RBX: 000000000000003f RCX:
00007ffb38324aab
[147066.205908] RDX: 000000000000003f RSI: 00007fd0d7ffb8f0 RDI:
00007fd0d7ffb8f0
[147066.205909] RBP: 00007fd0d7ffca40 R08: 0000000000000000 R09:
0000000000000030
[147066.205910] R10: fefefefefefefeff R11: 0000000000000246 R12:
00005642797b0f00
[147066.205911] R13: 00007fd0d7ffc3e0 R14: 000000000000084c R15:
0000000000000000
[147066.205914]  </TASK>
[147070.806580] rcu: INFO: rcu_preempt self-detected stall on CPU
[147070.806586] rcu:    2-....: (59999 ticks this GP)
idle=35c4/1/0x4000000000000000 softirq=43126317/43126325 fqs=14456
[147070.806590] rcu:    (t=60001 jiffies g=64450853 q=40369 ncpus=32)
[147070.806593] CPU: 2 PID: 3040 Comm: solAcctHashVer Tainted: G             L 
   6.5.5-200.fc38.x86_64 #1
[147070.806596] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./X570D4U, BIOS P1.20 05/19/2021
[147070.806597] RIP: 0010:xas_start+0x1e/0xc0
[147070.806603] Code: 90 90 90 90 90 90 90 90 90 90 90 90 48 8b 57 18 48 89 d0
83 e0 03 74 5c 48 81 fa 05 c0 ff ff 76 06 48 83 f8 02 74 46 48 8b 07 <48> 8b 57
08 48 8b 40 08 48 89 c1 83 e1 03 48 83 f9 02 75 08 48 3d
[147070.806604] RSP: 0018:ffffa80983a7bbd8 EFLAGS: 00000213
[147070.806607] RAX: ffff95d6a03d66b8 RBX: ffffa80983a7bc20 RCX:
0000000000000000
[147070.806608] RDX: 0000000000000003 RSI: 0000000000000003 RDI:
ffffa80983a7bc20
[147070.806609] RBP: fffffffffffffffe R08: ffffffffffffffc0 R09:
0000000000000000
[147070.806610] R10: 0000000000000001 R11: 0000000000000002 R12:
0000000000013e20
[147070.806611] R13: ffffa80983a7bcb8 R14: ffff95d6a03d66b0 R15:
000000000000000d
[147070.806612] FS:  00007fd0d7fff6c0(0000) GS:ffff95f47ea80000(0000)
knlGS:0000000000000000
[147070.806614] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[147070.806615] CR2: 00007fc5390d8980 CR3: 000000010e2f2000 CR4:
0000000000750ee0
[147070.806617] PKRU: 55555554
[147070.806618] Call Trace:
[147070.806619]  <IRQ>
[147070.806623]  ? rcu_dump_cpu_stacks+0xc4/0x100
[147070.806628]  ? rcu_sched_clock_irq+0x4f2/0x1170
[147070.806631]  ? srso_alias_return_thunk+0x5/0x7f
[147070.806634]  ? task_tick_fair+0x2fc/0x3f0
[147070.806637]  ? srso_alias_return_thunk+0x5/0x7f
[147070.806639]  ? srso_alias_return_thunk+0x5/0x7f
[147070.806641]  ? trigger_load_balance+0x73/0x390
[147070.806644]  ? srso_alias_return_thunk+0x5/0x7f
[147070.806647]  ? update_process_times+0x74/0xb0
[147070.806650]  ? tick_sched_handle+0x21/0x60
[147070.806653]  ? tick_sched_timer+0x6f/0x90
[147070.806654]  ? __pfx_tick_sched_timer+0x10/0x10
[147070.806656]  ? __hrtimer_run_queues+0x112/0x2b0
[147070.806660]  ? hrtimer_interrupt+0xf8/0x230
[147070.806662]  ? __sysvec_apic_timer_interrupt+0x61/0x130
[147070.806665]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[147070.806668]  </IRQ>
[147070.806669]  <TASK>
[147070.806670]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[147070.806675]  ? xas_start+0x1e/0xc0
[147070.806677]  ? srso_alias_return_thunk+0x5/0x7f
[147070.806680]  xas_load+0xe/0x50
[147070.806682]  xas_find+0x162/0x1c0
[147070.806684]  find_lock_entries+0x84/0x270
[147070.806689]  truncate_inode_pages_range+0xd7/0x400
[147070.806699]  evict+0x1b0/0x1d0
[147070.806703]  do_unlinkat+0x177/0x320
[147070.806708]  __x64_sys_unlink+0x42/0x70
[147070.806711]  do_syscall_64+0x60/0x90
[147070.806714]  ? srso_alias_return_thunk+0x5/0x7f
[147070.806716]  ? ksys_write+0xd8/0xf0
[147070.806719]  ? srso_alias_return_thunk+0x5/0x7f
[147070.806721]  ? syscall_exit_to_user_mode+0x2b/0x40
[147070.806724]  ? srso_alias_return_thunk+0x5/0x7f
[147070.806726]  ? do_syscall_64+0x6c/0x90
[147070.806728]  ? srso_alias_return_thunk+0x5/0x7f
[147070.806730]  ? irqtime_account_irq+0x40/0xc0
[147070.806732]  ? srso_alias_return_thunk+0x5/0x7f
[147070.806734]  ? __irq_exit_rcu+0x4b/0xc0
[147070.806738]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[147070.806740] RIP: 0033:0x7ffb38324aab
[147070.806756] Code: f0 ff ff 73 01 c3 48 8b 0d 5a 13 0d 00 f7 d8 64 89 01 48
83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 57 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 29 13 0d 00 f7 d8
[147070.806758] RSP: 002b:00007fd0d7ffb8e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000057
[147070.806760] RAX: ffffffffffffffda RBX: 000000000000003f RCX:
00007ffb38324aab
[147070.806761] RDX: 000000000000003f RSI: 00007fd0d7ffb8f0 RDI:
00007fd0d7ffb8f0
[147070.806762] RBP: 00007fd0d7ffca40 R08: 0000000000000000 R09:
0000000000000030
[147070.806763] R10: fefefefefefefeff R11: 0000000000000246 R12:
00005642797b0f00
[147070.806764] R13: 00007fd0d7ffc3e0 R14: 000000000000084c R15:
0000000000000000
[147070.806768]  </TASK>
[147098.205093] watchdog: BUG: soft lockup - CPU#2 stuck for 82s!
[solAcctHashVer:3040]
[147098.205113] Modules linked in: nft_masq nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel
nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill
tcp_bbr ip_set nf_tables nfnetlink tun nct6775 nct6775_core hwmon_vid ipmi_ssif
vfat fat intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_intel
snd_intel_dspcfg snd_intel_sdw_acpi kvm snd_hda_codec snd_hda_core snd_hwdep
snd_pcm snd_timer irqbypass snd acpi_ipmi cdc_ether rapl wmi_bmof usbnet
ipmi_si soundcore i2c_piix4 k10temp mii ipmi_devintf ipmi_msghandler joydev
fuse loop xfs raid1 igb dca ast i2c_algo_bit crct10dif_pclmul nvme crc32_pclmul
crc32c_intel polyval_clmulni polyval_generic nvme_core ghash_clmulni_intel ccp
sha512_ssse3 wmi sp5100_tco nvme_common
[147098.205173] CPU: 2 PID: 3040 Comm: solAcctHashVer Tainted: G             L 
   6.5.5-200.fc38.x86_64 #1
[147098.205176] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./X570D4U, BIOS P1.20 05/19/2021
[147098.205177] RIP: 0010:xas_find+0xb3/0x1c0
[147098.205182] Code: b6 53 12 80 fa 40 0f 84 80 00 00 00 0f b6 c2 48 83 c0 04
48 8b 44 c6 08 48 89 c1 83 e1 03 48 83 f9 02 74 52 48 85 c0 74 11 5b <5d> 41 5c
e9 c5 67 02 00 48 3d fd 00 00 00 77 ef 83 c2 01 4c 89 c0
[147098.205184] RSP: 0018:ffffa80983a7bc00 EFLAGS: 00000286
[147098.205186] RAX: ffffdf5ac1990000 RBX: fffffffffffffffe RCX:
0000000000000000
[147098.205187] RDX: 0000000000000020 RSI: ffff95d4c5326da0 RDI:
ffffa80983a7bc20
[147098.205188] RBP: fffffffffffffffe R08: ffffffffffffffc0 R09:
0000000000000000
[147098.205189] R10: 0000000000000001 R11: 0000000000000002 R12:
0000000000013e20
[147098.205191] R13: ffffa80983a7bcb8 R14: ffff95d6a03d66b0 R15:
000000000000000d
[147098.205192] FS:  00007fd0d7fff6c0(0000) GS:ffff95f47ea80000(0000)
knlGS:0000000000000000
[147098.205193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[147098.205194] CR2: 00007fc5390d8980 CR3: 000000010e2f2000 CR4:
0000000000750ee0
[147098.205196] PKRU: 55555554
[147098.205197] Call Trace:
[147098.205198]  <IRQ>
[147098.205202]  ? watchdog_timer_fn+0x1b8/0x220
[147098.205205]  ? __pfx_watchdog_timer_fn+0x10/0x10
[147098.205207]  ? __hrtimer_run_queues+0x112/0x2b0
[147098.205211]  ? hrtimer_interrupt+0xf8/0x230
[147098.205214]  ? __sysvec_apic_timer_interrupt+0x61/0x130
[147098.205217]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[147098.205220]  </IRQ>
[147098.205220]  <TASK>
[147098.205221]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[147098.205227]  ? xas_find+0xb3/0x1c0
[147098.205230]  find_lock_entries+0x84/0x270
[147098.205235]  truncate_inode_pages_range+0xd7/0x400
[147098.205243]  evict+0x1b0/0x1d0
[147098.205248]  do_unlinkat+0x177/0x320
[147098.205252]  __x64_sys_unlink+0x42/0x70
[147098.205255]  do_syscall_64+0x60/0x90
[147098.205258]  ? srso_alias_return_thunk+0x5/0x7f
[147098.205261]  ? ksys_write+0xd8/0xf0
[147098.205264]  ? srso_alias_return_thunk+0x5/0x7f
[147098.205266]  ? syscall_exit_to_user_mode+0x2b/0x40
[147098.205268]  ? srso_alias_return_thunk+0x5/0x7f
[147098.205270]  ? do_syscall_64+0x6c/0x90
[147098.205272]  ? srso_alias_return_thunk+0x5/0x7f
[147098.205274]  ? irqtime_account_irq+0x40/0xc0
[147098.205276]  ? srso_alias_return_thunk+0x5/0x7f
[147098.205278]  ? __irq_exit_rcu+0x4b/0xc0
[147098.205282]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[147098.205284] RIP: 0033:0x7ffb38324aab
[147098.205302] Code: f0 ff ff 73 01 c3 48 8b 0d 5a 13 0d 00 f7 d8 64 89 01 48
83 c8 ff c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 57 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 29 13 0d 00 f7 d8
[147098.205304] RSP: 002b:00007fd0d7ffb8e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000057
[147098.205306] RAX: ffffffffffffffda RBX: 000000000000003f RCX:
00007ffb38324aab
[147098.205307] RDX: 000000000000003f RSI: 00007fd0d7ffb8f0 RDI:
00007fd0d7ffb8f0
[147098.205308] RBP: 00007fd0d7ffca40 R08: 0000000000000000 R09:
0000000000000030
[147098.205309] R10: fefefefefefefeff R11: 0000000000000246 R12:
00005642797b0f00
[147098.205310] R13: 00007fd0d7ffc3e0 R14: 000000000000084c R15:
0000000000000000
[147098.205313]  </TASK>
[165201.884167] md: data-check of RAID array md126
[165207.917051] md: delaying data-check of md127 until md126 has finished (they
share one or more physical units)
[166552.662760] md: md126: data-check done.
[166552.664316] md: data-check of RAID array md127
[166562.948199] md: md127: data-check done.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (16 preceding siblings ...)
  2023-10-08 17:35 ` bugzilla-daemon
@ 2023-10-08 22:13 ` bugzilla-daemon
  2023-11-02 15:27 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-10-08 22:13 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #17 from Ivan Mironov (mironov.ivan@gmail.com) ---
More of it:

[    0.000000] Linux version 6.5.5-200.fc38.x86_64
(mockbuild@d4d01d62c9c942e59de1ef4aa94df5a2) (gcc (GCC) 13.2.1 20230728 (Red
Hat 13.2.1-1), GNU ld version 2.39-9.fc38) #1 SMP PREEMPT_DYNAMIC Sun Sep 24
15:52:44 UTC 2023
[    0.000000] Command line: BOOT_IMAGE=(md/boot)/vmlinuz-6.5.5-200.fc38.x86_64
root=/dev/mapper/vg--bmsolv-root ro
rd.md.uuid=216337a3:789c28b0:81fbad29:6f190e56 rd.lvm.lv=vg-bmsolv/root
rd.md.uuid=252001b9:2095e731:f1dd5baa:8b672d56 clocksource=tsc tsc=reliable
amd_pstate=active rhgb quiet
...
[13024.332817] watchdog: BUG: soft lockup - CPU#8 stuck for 26s!
[rocksdb:low:6331]
[13024.332841] Modules linked in: nft_masq nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel
nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill
tcp_bbr ip_set nf_tables nfnetlink tun nct6775 nct6775_core hwmon_vid ipmi_ssif
vfat fat intel_rapl_msr intel_rapl_common snd_hda_intel edac_mce_amd
snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec kvm_amd snd_hda_core kvm
snd_hwdep snd_pcm snd_timer irqbypass acpi_ipmi rapl snd wmi_bmof cdc_ether
ipmi_si usbnet soundcore ipmi_devintf mii k10temp i2c_piix4 ipmi_msghandler
joydev fuse loop xfs raid1 igb ast dca i2c_algo_bit crct10dif_pclmul
crc32_pclmul crc32c_intel nvme polyval_clmulni polyval_generic nvme_core
ghash_clmulni_intel ccp sha512_ssse3 wmi sp5100_tco nvme_common
[13024.332904] CPU: 8 PID: 6331 Comm: rocksdb:low Not tainted
6.5.5-200.fc38.x86_64 #1
[13024.332906] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./X570D4U, BIOS P1.20 05/19/2021
[13024.332908] RIP: 0010:xas_load+0x45/0x50
[13024.332914] Code: 3d 00 10 00 00 77 07 5b 5d e9 77 77 02 00 0f b6 4b 10 48
8d 68 fe 38 48 fe 72 ec 48 89 ee 48 89 df e8 af fd ff ff 80 7d 00 00 <75> c7 eb
d9 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
[13024.332917] RSP: 0018:ffff9c6541a1fbc0 EFLAGS: 00000206
[13024.332919] RAX: ffff8ee4773b8b6a RBX: ffff9c6541a1fbd8 RCX:
0000000000000002
[13024.332921] RDX: 0000000000000000 RSI: ffff8ed9bb019ff0 RDI:
ffff9c6541a1fbd8
[13024.332923] RBP: ffff8ed9bb019ff0 R08: 0000000000000000 R09:
000000000000131c
[13024.332924] R10: 0000000000000000 R11: ffff8ed9e3a9f538 R12:
0000000000009010
[13024.332925] R13: ffff8eee64944900 R14: 000000000000900f R15:
ffff9c6541a1fe70
[13024.332927] FS:  00007f32871ff6c0(0000) GS:ffff8ef8fec00000(0000)
knlGS:0000000000000000
[13024.332929] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13024.332931] CR2: 00007ef3cac01000 CR3: 0000000101d7e000 CR4:
0000000000750ee0
[13024.332933] PKRU: 55555554
[13024.332934] Call Trace:
[13024.332937]  <IRQ>
[13024.332941]  ? watchdog_timer_fn+0x1b8/0x220
[13024.332946]  ? __pfx_watchdog_timer_fn+0x10/0x10
[13024.332949]  ? __hrtimer_run_queues+0x112/0x2b0
[13024.332954]  ? hrtimer_interrupt+0xf8/0x230
[13024.332957]  ? __sysvec_apic_timer_interrupt+0x61/0x130
[13024.332961]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[13024.332965]  </IRQ>
[13024.332966]  <TASK>
[13024.332968]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[13024.332974]  ? xas_load+0x45/0x50
[13024.332976]  filemap_get_read_batch+0x16e/0x250
[13024.332981]  filemap_get_pages+0xa6/0x630
[13024.332984]  ? srso_alias_return_thunk+0x5/0x7f
[13024.332988]  ? srso_alias_return_thunk+0x5/0x7f
[13024.332990]  ? touch_atime+0x48/0x1b0
[13024.332994]  ? srso_alias_return_thunk+0x5/0x7f
[13024.332996]  ? filemap_read+0x329/0x350
[13024.332999]  filemap_read+0xd9/0x350
[13024.333005]  xfs_file_buffered_read+0x52/0xd0 [xfs]
[13024.333107]  xfs_file_read_iter+0x77/0xe0 [xfs]
[13024.333216]  vfs_read+0x201/0x350
[13024.333225]  __x64_sys_pread64+0x98/0xd0
[13024.333229]  do_syscall_64+0x60/0x90
[13024.333232]  ? srso_alias_return_thunk+0x5/0x7f
[13024.333236]  ? __irq_exit_rcu+0x4b/0xc0
[13024.333241]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[13024.333245] RIP: 0033:0x7f32aa721105
[13024.333264] Code: e8 48 89 75 f0 89 7d f8 48 89 4d e0 e8 d4 99 f8 ff 4c 8b
55 e0 48 8b 55 e8 41 89 c0 48 8b 75 f0 8b 7d f8 b8 11 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 2b 44 89 c7 48 89 45 f8 e8 27 9a f8 ff 48 8b
[13024.333266] RSP: 002b:00007f32871f90d0 EFLAGS: 00000293 ORIG_RAX:
0000000000000011
[13024.333269] RAX: ffffffffffffffda RBX: 00007f32871f9220 RCX:
00007f32aa721105
[13024.333270] RDX: 000000000000131c RSI: 00007f3283e6fc00 RDI:
00000000000005c9
[13024.333272] RBP: 00007f32871f90f0 R08: 0000000000000000 R09:
00007f32871f9268
[13024.333273] R10: 000000000900f060 R11: 0000000000000293 R12:
000000000900f060
[13024.333275] R13: 000000000000131c R14: 00007f3283e6fc00 R15:
00007f3299010580
[13024.333280]  </TASK>
[13052.332283] watchdog: BUG: soft lockup - CPU#8 stuck for 52s!
[rocksdb:low:6331]
[13052.332303] Modules linked in: nft_masq nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel
nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill
tcp_bbr ip_set nf_tables nfnetlink tun nct6775 nct6775_core hwmon_vid ipmi_ssif
vfat fat intel_rapl_msr intel_rapl_common snd_hda_intel edac_mce_amd
snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec kvm_amd snd_hda_core kvm
snd_hwdep snd_pcm snd_timer irqbypass acpi_ipmi rapl snd wmi_bmof cdc_ether
ipmi_si usbnet soundcore ipmi_devintf mii k10temp i2c_piix4 ipmi_msghandler
joydev fuse loop xfs raid1 igb ast dca i2c_algo_bit crct10dif_pclmul
crc32_pclmul crc32c_intel nvme polyval_clmulni polyval_generic nvme_core
ghash_clmulni_intel ccp sha512_ssse3 wmi sp5100_tco nvme_common
[13052.332364] CPU: 8 PID: 6331 Comm: rocksdb:low Tainted: G             L    
6.5.5-200.fc38.x86_64 #1
[13052.332367] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./X570D4U, BIOS P1.20 05/19/2021
[13052.332368] RIP: 0010:xas_descend+0x3/0x90
[13052.332374] Code: 00 48 8b 57 10 48 89 07 48 c1 e8 20 48 89 57 08 e9 c2 79
02 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f b6 0e <48> 8b 57
08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48
[13052.332375] RSP: 0018:ffff9c6541a1fbb8 EFLAGS: 00000206
[13052.332377] RAX: ffff8ed9e8c56da2 RBX: ffff9c6541a1fbd8 RCX:
000000000000000c
[13052.332378] RDX: 0000000000000002 RSI: ffff8ed9e8c56da0 RDI:
ffff9c6541a1fbd8
[13052.332379] RBP: ffff8ed9e8c56da0 R08: 0000000000000000 R09:
000000000000131c
[13052.332380] R10: 0000000000000000 R11: ffff8ed9e3a9f538 R12:
0000000000009010
[13052.332381] R13: ffff8eee64944900 R14: 000000000000900f R15:
ffff9c6541a1fe70
[13052.332383] FS:  00007f32871ff6c0(0000) GS:ffff8ef8fec00000(0000)
knlGS:0000000000000000
[13052.332384] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13052.332385] CR2: 00007ef3cac01000 CR3: 0000000101d7e000 CR4:
0000000000750ee0
[13052.332387] PKRU: 55555554
[13052.332387] Call Trace:
[13052.332389]  <IRQ>
[13052.332391]  ? watchdog_timer_fn+0x1b8/0x220
[13052.332395]  ? __pfx_watchdog_timer_fn+0x10/0x10
[13052.332398]  ? __hrtimer_run_queues+0x112/0x2b0
[13052.332402]  ? hrtimer_interrupt+0xf8/0x230
[13052.332404]  ? __sysvec_apic_timer_interrupt+0x61/0x130
[13052.332407]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[13052.332410]  </IRQ>
[13052.332411]  <TASK>
[13052.332412]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[13052.332418]  ? xas_descend+0x3/0x90
[13052.332420]  ? srso_alias_return_thunk+0x5/0x7f
[13052.332423]  xas_load+0x41/0x50
[13052.332426]  filemap_get_read_batch+0x16e/0x250
[13052.332431]  filemap_get_pages+0xa6/0x630
[13052.332433]  ? srso_alias_return_thunk+0x5/0x7f
[13052.332436]  ? srso_alias_return_thunk+0x5/0x7f
[13052.332438]  ? touch_atime+0x48/0x1b0
[13052.332441]  ? srso_alias_return_thunk+0x5/0x7f
[13052.332443]  ? filemap_read+0x329/0x350
[13052.332445]  filemap_read+0xd9/0x350
[13052.332451]  xfs_file_buffered_read+0x52/0xd0 [xfs]
[13052.332548]  xfs_file_read_iter+0x77/0xe0 [xfs]
[13052.332633]  vfs_read+0x201/0x350
[13052.332641]  __x64_sys_pread64+0x98/0xd0
[13052.332643]  do_syscall_64+0x60/0x90
[13052.332646]  ? srso_alias_return_thunk+0x5/0x7f
[13052.332649]  ? __irq_exit_rcu+0x4b/0xc0
[13052.332652]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[13052.332655] RIP: 0033:0x7f32aa721105
[13052.332671] Code: e8 48 89 75 f0 89 7d f8 48 89 4d e0 e8 d4 99 f8 ff 4c 8b
55 e0 48 8b 55 e8 41 89 c0 48 8b 75 f0 8b 7d f8 b8 11 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 2b 44 89 c7 48 89 45 f8 e8 27 9a f8 ff 48 8b
[13052.332673] RSP: 002b:00007f32871f90d0 EFLAGS: 00000293 ORIG_RAX:
0000000000000011
[13052.332675] RAX: ffffffffffffffda RBX: 00007f32871f9220 RCX:
00007f32aa721105
[13052.332676] RDX: 000000000000131c RSI: 00007f3283e6fc00 RDI:
00000000000005c9
[13052.332677] RBP: 00007f32871f90f0 R08: 0000000000000000 R09:
00007f32871f9268
[13052.332678] R10: 000000000900f060 R11: 0000000000000293 R12:
000000000900f060
[13052.332679] R13: 000000000000131c R14: 00007f3283e6fc00 R15:
00007f3299010580
[13052.332683]  </TASK>
[13059.632827] rcu: INFO: rcu_preempt self-detected stall on CPU
[13059.632832] rcu:     8-....: (60000 ticks this GP)
idle=5d8c/1/0x4000000000000000 softirq=3001365/3001365 fqs=27154
[13059.632836] rcu:     (t=60001 jiffies g=4535761 q=1170639 ncpus=32)
[13059.632838] CPU: 8 PID: 6331 Comm: rocksdb:low Tainted: G             L    
6.5.5-200.fc38.x86_64 #1
[13059.632840] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./X570D4U, BIOS P1.20 05/19/2021
[13059.632841] RIP: 0010:xas_load+0x2d/0x50
[13059.632847] Code: fa 55 53 48 89 fb e8 22 ff ff ff 48 89 c2 83 e2 03 48 83
fa 02 75 08 48 3d 00 10 00 00 77 07 5b 5d e9 77 77 02 00 0f b6 4b 10 <48> 8d 68
fe 38 48 fe 72 ec 48 89 ee 48 89 df e8 af fd ff ff 80 7d
[13059.632849] RSP: 0018:ffff9c6541a1fbc0 EFLAGS: 00000282
[13059.632851] RAX: ffff8ed9bb019ff2 RBX: ffff9c6541a1fbd8 RCX:
0000000000000000
[13059.632852] RDX: 0000000000000002 RSI: ffff8ed9e8c56da0 RDI:
ffff9c6541a1fbd8
[13059.632853] RBP: ffff8ed9e8c56da0 R08: 0000000000000000 R09:
000000000000131c
[13059.632854] R10: 0000000000000000 R11: ffff8ed9e3a9f538 R12:
0000000000009010
[13059.632855] R13: ffff8eee64944900 R14: 000000000000900f R15:
ffff9c6541a1fe70
[13059.632856] FS:  00007f32871ff6c0(0000) GS:ffff8ef8fec00000(0000)
knlGS:0000000000000000
[13059.632858] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13059.632859] CR2: 00007ef3cac01000 CR3: 0000000101d7e000 CR4:
0000000000750ee0
[13059.632860] PKRU: 55555554
[13059.632861] Call Trace:
[13059.632863]  <IRQ>
[13059.632865]  ? rcu_dump_cpu_stacks+0xc4/0x100
[13059.632869]  ? rcu_sched_clock_irq+0x4f2/0x1170
[13059.632872]  ? srso_alias_return_thunk+0x5/0x7f
[13059.632875]  ? task_tick_fair+0x2fc/0x3f0
[13059.632879]  ? srso_alias_return_thunk+0x5/0x7f
[13059.632881]  ? srso_alias_return_thunk+0x5/0x7f
[13059.632883]  ? trigger_load_balance+0x73/0x390
[13059.632885]  ? srso_alias_return_thunk+0x5/0x7f
[13059.632888]  ? update_process_times+0x74/0xb0
[13059.632891]  ? tick_sched_handle+0x21/0x60
[13059.632894]  ? tick_sched_timer+0x6f/0x90
[13059.632896]  ? __pfx_tick_sched_timer+0x10/0x10
[13059.632897]  ? __hrtimer_run_queues+0x112/0x2b0
[13059.632901]  ? hrtimer_interrupt+0xf8/0x230
[13059.632903]  ? __sysvec_apic_timer_interrupt+0x61/0x130
[13059.632906]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[13059.632909]  </IRQ>
[13059.632910]  <TASK>
[13059.632911]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[13059.632916]  ? xas_load+0x2d/0x50
[13059.632918]  filemap_get_read_batch+0x16e/0x250
[13059.632923]  filemap_get_pages+0xa6/0x630
[13059.632925]  ? srso_alias_return_thunk+0x5/0x7f
[13059.632928]  ? srso_alias_return_thunk+0x5/0x7f
[13059.632929]  ? touch_atime+0x48/0x1b0
[13059.632933]  ? srso_alias_return_thunk+0x5/0x7f
[13059.632935]  ? filemap_read+0x329/0x350
[13059.632937]  filemap_read+0xd9/0x350
[13059.632944]  xfs_file_buffered_read+0x52/0xd0 [xfs]
[13059.633041]  xfs_file_read_iter+0x77/0xe0 [xfs]
[13059.633121]  vfs_read+0x201/0x350
[13059.633127]  __x64_sys_pread64+0x98/0xd0
[13059.633130]  do_syscall_64+0x60/0x90
[13059.633133]  ? srso_alias_return_thunk+0x5/0x7f
[13059.633135]  ? __irq_exit_rcu+0x4b/0xc0
[13059.633139]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[13059.633142] RIP: 0033:0x7f32aa721105
[13059.633158] Code: e8 48 89 75 f0 89 7d f8 48 89 4d e0 e8 d4 99 f8 ff 4c 8b
55 e0 48 8b 55 e8 41 89 c0 48 8b 75 f0 8b 7d f8 b8 11 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 2b 44 89 c7 48 89 45 f8 e8 27 9a f8 ff 48 8b
[13059.633159] RSP: 002b:00007f32871f90d0 EFLAGS: 00000293 ORIG_RAX:
0000000000000011
[13059.633161] RAX: ffffffffffffffda RBX: 00007f32871f9220 RCX:
00007f32aa721105
[13059.633162] RDX: 000000000000131c RSI: 00007f3283e6fc00 RDI:
00000000000005c9
[13059.633163] RBP: 00007f32871f90f0 R08: 0000000000000000 R09:
00007f32871f9268
[13059.633164] R10: 000000000900f060 R11: 0000000000000293 R12:
000000000900f060
[13059.633165] R13: 000000000000131c R14: 00007f3283e6fc00 R15:
00007f3299010580
[13059.633169]  </TASK>



* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (17 preceding siblings ...)
  2023-10-08 22:13 ` bugzilla-daemon
@ 2023-11-02 15:27 ` bugzilla-daemon
  2023-11-02 20:58   ` Dave Chinner
  2023-11-02 15:28 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  26 siblings, 1 reply; 32+ messages in thread
From: bugzilla-daemon @ 2023-11-02 15:27 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #18 from Christian Theune (ct@flyingcircus.io) ---
We updated a while ago and our fleet is not seeing improved results. Judging
by the number of alerts, things actually seem to have gotten worse.

We've had a multitude of crashes over the last few weeks, with the following
statistics:

6.1.31 - 2 affected machines
6.1.35 - 1 affected machine
6.1.37 - 1 affected machine
6.1.51 - 5 affected machines
6.1.55 - 2 affected machines
6.1.57 - 2 affected machines

Here's more detailed behaviour from one of the machines running 6.1.57.

$ uptime
 16:10:23  up 13 days 19:00,  1 user,  load average: 3.21, 1.24, 0.57

$ uname -a
Linux ts00 6.1.57 #1-NixOS SMP PREEMPT_DYNAMIC Tue Oct 10 20:00:46 UTC 2023
x86_64 GNU/Linux

And here's the stall:

[654042.623386] rcu: INFO: rcu_preempt self-detected stall on CPU
[654042.624109] rcu:    1-....: (21079 ticks this GP)
idle=380c/1/0x4000000000000000 softirq=136208646/136208648 fqs=7552
[654042.625253]         (t=21000 jiffies g=210623333 q=40912 ncpus=2)
[654042.625871] CPU: 1 PID: 1230375 Comm: nix-build Not tainted 6.1.57 #1-NixOS
[654042.626650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[654042.627898] RIP: 0010:xas_descend+0x22/0x90
[654042.628379] Code: cc cc cc cc cc cc cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83
e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 <48> 83 f9
02 75 08 48 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc
[654042.630402] RSP: 0018:ffffa213c4c07bf8 EFLAGS: 00000202
[654042.630993] RAX: ffff8f9da3bca492 RBX: ffffa213c4c07d78 RCX:
0000000000000002
[654042.631782] RDX: 0000000000000004 RSI: ffff8f9eb8700248 RDI:
ffffa213c4c07c08
[654042.632570] RBP: 000000000000010f R08: ffffa213c4c07e70 R09:
ffff8f9e54dc2138
[654042.633352] R10: ffffa213c4c07e68 R11: ffff8f9e54dc2138 R12:
000000000000010f
[654042.634140] R13: ffff8f9d44c7ad00 R14: 0000000000000100 R15:
ffffa213c4c07e98
[654042.634934] FS:  00007faf9514ff80(0000) GS:ffff8f9ebad00000(0000)
knlGS:0000000000000000
[654042.635823] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[654042.636468] CR2: 00007faf78168000 CR3: 00000000366d2000 CR4:
00000000000006e0
[654042.637264] Call Trace:
[654042.637560]  <IRQ>
[654042.637809]  ? rcu_dump_cpu_stacks+0xc8/0x100
[654042.638305]  ? rcu_sched_clock_irq.cold+0x15b/0x2fb
[654042.638862]  ? sched_slice+0x87/0x140
[654042.639281]  ? timekeeping_update+0xdd/0x130
[654042.639781]  ? __cgroup_account_cputime_field+0x5b/0xa0
[654042.640363]  ? update_process_times+0x77/0xb0
[654042.640862]  ? update_wall_time+0xc/0x20
[654042.641305]  ? tick_sched_handle+0x34/0x50
[654042.641773]  ? tick_sched_timer+0x6f/0x80
[654042.642224]  ? tick_sched_do_timer+0xa0/0xa0
[654042.642710]  ? __hrtimer_run_queues+0x112/0x2b0
[654042.643220]  ? hrtimer_interrupt+0xfe/0x220
[654042.643703]  ? __sysvec_apic_timer_interrupt+0x7f/0x170
[654042.644286]  ? sysvec_apic_timer_interrupt+0x99/0xc0
[654042.644849]  </IRQ>
[654042.645101]  <TASK>
[654042.645353]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[654042.645956]  ? xas_descend+0x22/0x90
[654042.646366]  xas_load+0x30/0x40
[654042.646738]  filemap_get_read_batch+0x16e/0x250
[654042.647253]  filemap_get_pages+0xa9/0x630
[654042.647714]  filemap_read+0xd2/0x340
[654042.648124]  ? __mod_memcg_lruvec_state+0x6e/0xd0
[654042.648670]  xfs_file_buffered_read+0x4f/0xd0 [xfs]
[654042.649307]  xfs_file_read_iter+0x6a/0xd0 [xfs]
[654042.649887]  vfs_read+0x23c/0x310
[654042.650276]  ksys_read+0x6b/0xf0
[654042.650658]  do_syscall_64+0x3a/0x90
[654042.651071]  entry_SYSCALL_64_after_hwframe+0x64/0xce
[654042.651650] RIP: 0033:0x7faf968ee78c
[654042.652085] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 a9 bb
f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00
f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff bb f8 ff 48
[654042.654113] RSP: 002b:00007fff8d7e72e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[654042.654954] RAX: ffffffffffffffda RBX: 00005572a3d2c5f0 RCX:
00007faf968ee78c
[654042.655745] RDX: 0000000000010000 RSI: 00005572a3d2c5f0 RDI:
000000000000000c
[654042.656540] RBP: 00007fff8d7e7380 R08: 0000000000000000 R09:
0000000000000000
[654042.657327] R10: 0000000000000022 R11: 0000000000000246 R12:
000000000000000c
[654042.658119] R13: 00007faf96dfe6a8 R14: 0000000000000001 R15:
0000000000000001
[654042.658916]  </TASK>

In previous situations this self-detected stall only happened after other
errors had occurred first. As far as I can tell, it is now happening
"standalone" without those other errors; maybe this is new info?

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (18 preceding siblings ...)
  2023-11-02 15:27 ` bugzilla-daemon
@ 2023-11-02 15:28 ` bugzilla-daemon
  2023-11-02 15:29 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-11-02 15:28 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

Christian Theune (ct@flyingcircus.io) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|XFS                         |Other
            Product|File System                 |Memory Management


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (19 preceding siblings ...)
  2023-11-02 15:28 ` bugzilla-daemon
@ 2023-11-02 15:29 ` bugzilla-daemon
  2023-11-02 16:23 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-11-02 15:29 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #19 from Christian Theune (ct@flyingcircus.io) ---
As Dave pointed out, this is more of an MM issue, so I'm moving it to Memory
Management, but I'm not sure what the correct component would be, so I placed
it in "Other". Maybe this helps get it in front of the right eyeballs. :)


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (20 preceding siblings ...)
  2023-11-02 15:29 ` bugzilla-daemon
@ 2023-11-02 16:23 ` bugzilla-daemon
  2023-11-02 20:59 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-11-02 16:23 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #20 from Christian Theune (ct@flyingcircus.io) ---
A colleague noticed that the assignee is still set to the XFS default virtual
assignee, but that's something I can't change ...


* Re: [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-11-02 15:27 ` bugzilla-daemon
@ 2023-11-02 20:58   ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2023-11-02 20:58 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: linux-xfs

On Thu, Nov 02, 2023 at 03:27:58PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217572
> 
> --- Comment #18 from Christian Theune (ct@flyingcircus.io) ---
> We've updated a while ago and our fleet is not seeing improved results. They've
> actually seemed to have gotten worse according to the number of alerts we've
> seen. 

This is still an unreproducible, unfixed bug in upstream kernels.
There is no known reproducer, so actually triggering it and hence
performing RCA is extremely difficult at this point in time. We don't
really even know what workload triggers it.

> We've had a multitude of crashes in the last weeks with the following
> statistics:
> 
> 6.1.31 - 2 affected machines
> 6.1.35 - 1 affected machine
> 6.1.37 - 1 affected machine
> 6.1.51 - 5 affected machines
> 6.1.55 - 2 affected machines
> 6.1.57 - 2 affected machines

Do these machines have ECC memory?

> Here's the more detailed behaviour of one of the machines with 6.1.57.
> 
> $ uptime
>  16:10:23  up 13 days 19:00,  1 user,  load average: 3.21, 1.24, 0.57

Yeah, that's the problem - such a rare, one off issue that we don't
really even know where to begin looking. :(

Given you seem to have a workload that occasionally triggers it,
could you try to craft a reproducer workload that does stuff similar
to your production workload and see if you can find out something
that makes this easier to trigger?

> $ uname -a
> Linux ts00 6.1.57 #1-NixOS SMP PREEMPT_DYNAMIC Tue Oct 10 20:00:46 UTC 2023
> x86_64 GNU/Linux
> 
> And here' the stall:
....
> [654042.645101]  <TASK>
> [654042.645353]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [654042.645956]  ? xas_descend+0x22/0x90
> [654042.646366]  xas_load+0x30/0x40
> [654042.646738]  filemap_get_read_batch+0x16e/0x250
> [654042.647253]  filemap_get_pages+0xa9/0x630
> [654042.647714]  filemap_read+0xd2/0x340
> [654042.648124]  ? __mod_memcg_lruvec_state+0x6e/0xd0
> [654042.648670]  xfs_file_buffered_read+0x4f/0xd0 [xfs]

This implies you are using memcg to constrain memory footprint of
the applications? Are these workloads running in memcgs that
experience random memcg OOM conditions? Or maybe the failure
correlates with global OOM conditions triggering memcg reclaim?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (21 preceding siblings ...)
  2023-11-02 16:23 ` bugzilla-daemon
@ 2023-11-02 20:59 ` bugzilla-daemon
  2023-11-03 12:52 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-11-02 20:59 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #21 from Dave Chinner (david@fromorbit.com) ---
On Thu, Nov 02, 2023 at 03:27:58PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217572
> 
> --- Comment #18 from Christian Theune (ct@flyingcircus.io) ---
> We've updated a while ago and our fleet is not seeing improved results.
> They've
> actually seemed to have gotten worse according to the number of alerts we've
> seen. 

This is still an unreproducible, unfixed bug in upstream kernels.
There is no known reproducer, so actually triggering it and hence
performing RCA is extremely difficult at this point in time. We don't
really even know what workload triggers it.

> We've had a multitude of crashes in the last weeks with the following
> statistics:
> 
> 6.1.31 - 2 affected machines
> 6.1.35 - 1 affected machine
> 6.1.37 - 1 affected machine
> 6.1.51 - 5 affected machines
> 6.1.55 - 2 affected machines
> 6.1.57 - 2 affected machines

Do these machines have ECC memory?

> Here's the more detailed behaviour of one of the machines with 6.1.57.
> 
> $ uptime
>  16:10:23  up 13 days 19:00,  1 user,  load average: 3.21, 1.24, 0.57

Yeah, that's the problem - such a rare, one off issue that we don't
really even know where to begin looking. :(

Given you seem to have a workload that occasionally triggers it,
could you try to craft a reproducer workload that does stuff similar
to your production workload and see if you can find out something
that makes this easier to trigger?

> $ uname -a
> Linux ts00 6.1.57 #1-NixOS SMP PREEMPT_DYNAMIC Tue Oct 10 20:00:46 UTC 2023
> x86_64 GNU/Linux
> 
> And here' the stall:
....
> [654042.645101]  <TASK>
> [654042.645353]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [654042.645956]  ? xas_descend+0x22/0x90
> [654042.646366]  xas_load+0x30/0x40
> [654042.646738]  filemap_get_read_batch+0x16e/0x250
> [654042.647253]  filemap_get_pages+0xa9/0x630
> [654042.647714]  filemap_read+0xd2/0x340
> [654042.648124]  ? __mod_memcg_lruvec_state+0x6e/0xd0
> [654042.648670]  xfs_file_buffered_read+0x4f/0xd0 [xfs]

This implies you are using memcg to constrain memory footprint of
the applications? Are these workloads running in memcgs that
experience random memcg OOM conditions? Or maybe the failure
correlates with global OOM conditions triggering memcg reclaim?

Cheers,

Dave.


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (22 preceding siblings ...)
  2023-11-02 20:59 ` bugzilla-daemon
@ 2023-11-03 12:52 ` bugzilla-daemon
  2023-11-07 10:11 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-11-03 12:52 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #22 from Christian Theune (ct@flyingcircus.io) ---
(In reply to Dave Chinner from comment #21)
> 
> This is still an unreproducable, unfixed bug in upstream kernels.
> There is no known reproducer, so actually triggering it and hence
> performing RCA is extremely difficult at this point in time. We don't
> really even know what workload triggers it.

It seems IO-pressure related, and we've seen it multiple times with various
PostgreSQL activities.

I've planned time for next week to analyze this further and to try to help
establish a reproducer.

> > We've had a multitude of crashes in the last weeks with the following
> > statistics:
> > 
> > 6.1.31 - 2 affected machines
> > 6.1.35 - 1 affected machine
> > 6.1.37 - 1 affected machine
> > 6.1.51 - 5 affected machines
> > 6.1.55 - 2 affected machines
> > 6.1.57 - 2 affected machines
> 
> Do these machines have ECC memory?

The physical hosts do. The affected systems are all Qemu/KVM virtual machines,
though.

> > Here's the more detailed behaviour of one of the machines with 6.1.57.
> > 
> > $ uptime
> >  16:10:23  up 13 days 19:00,  1 user,  load average: 3.21, 1.24, 0.57
> 
> Yeah, that's the problem - such a rare, one off issue that we don't
> really even know where to begin looking. :(
> 
> Given you seem to have a workload that occasionally triggers it,
> could you try to craft a reproducer workload that does stuff similar
> to your production workload and see if you can find out something
> that makes this easier to trigger?

Yup. I'm prioritizing this for the next weeks.


> This implies you are using memcg to constrain memory footprint of
> the applications? Are these workloads running in memcgs that
> experience random memcg OOM conditions? Or maybe the failure
> correlates with global OOM conditions triggering memcg reclaim?

I'll have to read up on what memcg is and whether we're doing anything with it
on purpose. At the moment I think this is just whatever we're getting from our
baseline environment with kernel or distro defaults. 

How do I notice a memcg OOM? I've always tried to correlate all kernel log
messages and haven't seen any tracebacks other than the ones I posted.

Global (so I guess a "regular") OOM wasn't involved in any case so far.

I can try digging deeper into system VM statistics. We're running
telegraf/prometheus and monitor a fairly exhaustive set of system metrics on
all systems. Is there anything specific I could look for?
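
In the meantime, here is the kind of check I'm planning to script for spotting
memcg OOMs, assuming cgroup v2 is mounted at /sys/fs/cgroup (the glob over
top-level groups is my assumption; deeper slices would need a recursive scan):

```shell
# Sketch: spot memcg OOMs via cgroup v2 memory.events.
# memory.events contains counters like "oom 3" and "oom_kill 2".

# Print the oom_kill counter from a memory.events stream (0 if absent).
print_oom_kills() {
    awk '$1 == "oom_kill" { n = $2 } END { print n + 0 }'
}

for f in /sys/fs/cgroup/*/memory.events; do
    if [ -e "$f" ]; then
        k=$(print_oom_kills < "$f")
        if [ "$k" -gt 0 ]; then
            echo "$f: oom_kill=$k"
        fi
    fi
done
```

A nonzero oom_kill counter there (or oom_kill messages in the kernel log)
would, if I understand the interface correctly, be the signal you're asking
about.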

> 
> Cheers,
> 
> Dave.


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (23 preceding siblings ...)
  2023-11-03 12:52 ` bugzilla-daemon
@ 2023-11-07 10:11 ` bugzilla-daemon
  2023-11-07 10:25 ` bugzilla-daemon
  2023-11-07 14:12 ` bugzilla-daemon
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-11-07 10:11 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #23 from Christian Theune (ct@flyingcircus.io) ---
I'm trying to come up with a reproducer and have reviewed our machines.
Currently all the affected machines are running one of our database
configurations (MySQL or PostgreSQL), and one difference is that they run with
increased readahead: the 128 KiB default on our fleet is raised to 1 MiB when
running MySQL or PostgreSQL.
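
For reference, the readahead bump amounts to roughly the following (a sketch;
/dev/vda stands in for the actual data disk):

```shell
# Sketch: raise block-device readahead from the 128 KiB default to 1 MiB.
# blockdev counts readahead in 512-byte sectors; /dev/vda is a placeholder.
DEV=/dev/vda
sectors=$((1024 * 1024 / 512))   # 1 MiB -> 2048 sectors
echo "$sectors"
if [ -b "$DEV" ]; then
    blockdev --getra "$DEV" || true             # current setting, e.g. 256 (= 128 KiB)
    blockdev --setra "$sectors" "$DEV" || true  # needs root
fi
```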

I also reviewed the RedHat bug and all the linked discussions. The RedHat bug
smells like there might be a reproducer as the kernel dumps show uptimes of
less than 30 minutes. Also, there's another database system involved as I've
seen RocksDB (and compactions) mentioned there.

Using the PostgreSQL benchmark did not trigger anything for me within an hour,
even under memory pressure. I'm going to check out the rocksdb benchmark now.
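
The benchmark run was roughly of this shape (a sketch, assuming stock pgbench;
the scale factor, client counts, and database name are placeholders, not our
exact production parameters):

```shell
# Sketch: initialize a pgbench dataset, then hammer it with 16 clients
# for an hour while the VM is under memory pressure. "stalltest" is a
# placeholder database name.
DB=stalltest
RUN="pgbench -c 16 -j 4 -T 3600 $DB"
echo "$RUN"
if command -v pgbench >/dev/null 2>&1; then
    pgbench -i -s 100 "$DB" && $RUN || true
fi
```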


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (24 preceding siblings ...)
  2023-11-07 10:11 ` bugzilla-daemon
@ 2023-11-07 10:25 ` bugzilla-daemon
  2023-11-07 14:12 ` bugzilla-daemon
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-11-07 10:25 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #24 from KN (kernel@nmitconsulting.co.uk) ---
Long time lurker here offering a potential workaround.


We experienced near-identical kernel issues to those described here, on a
completely different setup. We saw the issue on our OKD cluster (4.12 and 4.13) running on
Fedora CoreOS (37 and 38). We had ~70 nodes with a specific workload profile,
and of these, anywhere between 1 and 5 would run into this issue each night on
our production cluster. These nodes were very IO intensive (druid
middlemanager/ingest nodes) but not database related. The persistent volumes
that were contributing the majority of the disk IO were configured as xfs. We
tried for weeks to reproduce this error but could not.

Whilst we have to accept this is a kernel bug and not an xfs bug, we *resolved*
our issues by switching from xfs to ext4. We haven't had a single instance of
this error since we migrated our persistent volumes away from xfs.

3 weeks and counting and not a single failure.


* [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
  2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
                   ` (25 preceding siblings ...)
  2023-11-07 10:25 ` bugzilla-daemon
@ 2023-11-07 14:12 ` bugzilla-daemon
  26 siblings, 0 replies; 32+ messages in thread
From: bugzilla-daemon @ 2023-11-07 14:12 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #25 from Christian Theune (ct@flyingcircus.io) ---
Unfortunately, building the rocksdb benchmark utilities is weirdly complicated
(on NixOS) and I failed. We tried a number of further stress tests with MySQL
and PostgreSQL workloads today while stressing IO and swap, but could not find
a reproducer.

As we've seen slight data corruption issues due to this previously, we're going
to downgrade to the 5.15 series in our environment for now.

I'd love to be of more help. If anyone can ever recommend tests we could
perform, please feel free to reach out. We're happy to help, but I've exhausted
our capabilities for now. :/


end of thread, other threads:[~2023-11-07 14:12 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-19  8:29 [Bug 217572] New: Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13 bugzilla-daemon
2023-06-20 15:10 ` Christian Theune
2023-06-20 15:11   ` Christian Theune
2023-06-20 15:10 ` [Bug 217572] " bugzilla-daemon
2023-06-20 15:13 ` bugzilla-daemon
2023-06-20 15:21 ` bugzilla-daemon
2023-06-20 17:26 ` bugzilla-daemon
2023-07-03 14:10 ` bugzilla-daemon
2023-07-03 19:56 ` bugzilla-daemon
2023-07-03 22:30   ` Dave Chinner
2023-07-03 22:30 ` bugzilla-daemon
2023-07-04  4:22 ` bugzilla-daemon
2023-07-05 22:07 ` bugzilla-daemon
2023-09-28 12:39 ` bugzilla-daemon
2023-09-28 22:44   ` Dave Chinner
2023-09-28 13:06 ` bugzilla-daemon
2023-09-28 22:44 ` bugzilla-daemon
2023-09-29  4:54 ` bugzilla-daemon
2023-09-29  5:01 ` bugzilla-daemon
2023-10-05 14:31 ` bugzilla-daemon
2023-10-08 17:35 ` bugzilla-daemon
2023-10-08 22:13 ` bugzilla-daemon
2023-11-02 15:27 ` bugzilla-daemon
2023-11-02 20:58   ` Dave Chinner
2023-11-02 15:28 ` bugzilla-daemon
2023-11-02 15:29 ` bugzilla-daemon
2023-11-02 16:23 ` bugzilla-daemon
2023-11-02 20:59 ` bugzilla-daemon
2023-11-03 12:52 ` bugzilla-daemon
2023-11-07 10:11 ` bugzilla-daemon
2023-11-07 10:25 ` bugzilla-daemon
2023-11-07 14:12 ` bugzilla-daemon
