All of lore.kernel.org
 help / color / mirror / Atom feed
* btrfs send hangs after partial transfer and blocks all IO
@ 2018-09-13  8:34 Jürgen Herrmann
  2018-09-13  8:40 ` Nikolay Borisov
  0 siblings, 1 reply; 16+ messages in thread
From: Jürgen Herrmann @ 2018-09-13  8:34 UTC (permalink / raw)
  To: linux-btrfs

Hello!

I have a newly installed laptop running a freshly installed (abt. two 
months ago) laptop running latest linux mint 19. Root filesystem is on a 
1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G 
partition. Timeshift-btrfs is enabled for root (@) and home (@home) 
subvolumes. I want to transfer snapshots to a server with a separated 
disk via "btrfs send" and ssh.

Here's the list of snapshot directories, each containing tow snapshots 
for root and home:

drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01

"btrfs send 
/mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@ 
 >/dev/null" results in the btrfs task taking 100% cpu time on one cpu and then all IO is blocked -> only reboot can solve the hang.

The crash does not happen immediately, as i was on the road using 
cellular connection it seemed fine at first. That's how I found out that 
it transfers ~140MB of data before hanging. The snapshot is created on 
the server and contains data (du shows abt 140MB).

I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs 
4.17.1 compiled from source.

Here's the btrfs filesystem info:
Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
         Total devices 1 FS bytes used 342.85GiB
         devid    1 size 875.44GiB used 357.05GiB path 
/dev/mapper/sda3_crypt

A scrub shows no errors:
scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
         scrub started at Thu Sep 13 10:20:18 2018 and finished after 
00:12:19
         total bytes scrubbed: 342.78GiB with 0 errors

What can I do to help debugging this issue?

Best regards,
Jürgen
-- 
Jürgen Herrmann
https://t-5.eu
ALbertstraße 2
94327 Bogen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13  8:34 btrfs send hangs after partial transfer and blocks all IO Jürgen Herrmann
@ 2018-09-13  8:40 ` Nikolay Borisov
  2018-09-13 10:29   ` Jürgen Herrmann
  0 siblings, 1 reply; 16+ messages in thread
From: Nikolay Borisov @ 2018-09-13  8:40 UTC (permalink / raw)
  To: Jürgen Herrmann, linux-btrfs



On 13.09.2018 11:34, Jürgen Herrmann wrote:
> Hello!
> 
> I have a newly installed laptop running a freshly installed (abt. two
> months ago) laptop running latest linux mint 19. Root filesystem is on a
> 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
> partition. Timeshift-btrfs is enabled for root (@) and home (@home)
> subvolumes. I want to transfer snapshots to a server with a separated
> disk via "btrfs send" and ssh.
> 
> Here's the list of snapshot directories, each containing tow snapshots
> for root and home:
> 
> drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
> drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
> drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
> drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
> drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
> drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
> drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
> drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
> drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
> drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01
> 
> "btrfs send
> /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@
>>/dev/null" results in the btrfs task taking 100% cpu time on one cpu
> and then all IO is blocked -> only reboot can solve the hang.
> 
> The crash does not happen immediately, as i was on the road using
> cellular connection it seemed fine at first. That's how I found out that
> it transfers ~140MB of data before hanging. The snapshot is created on
> the server and contains data (du shows abt 140MB).
> 
> I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs
> 4.17.1 compiled from source.
> 
> Here's the btrfs filesystem info:
> Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
>         Total devices 1 FS bytes used 342.85GiB
>         devid    1 size 875.44GiB used 357.05GiB path
> /dev/mapper/sda3_crypt
> 
> A scrub shows no errors:
> scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
>         scrub started at Thu Sep 13 10:20:18 2018 and finished after
> 00:12:19
>         total bytes scrubbed: 342.78GiB with 0 errors
> 
> What can I do to help debugging this issue?


You should provide output of echo w > /proc/sysrq-trigger. Also
sample the stack of /proc/[pid of btrfs send]/stack to see if it is
changing.


> 
> Best regards,
> Jürgen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13  8:40 ` Nikolay Borisov
@ 2018-09-13 10:29   ` Jürgen Herrmann
  2018-09-13 10:50     ` Nikolay Borisov
  0 siblings, 1 reply; 16+ messages in thread
From: Jürgen Herrmann @ 2018-09-13 10:29 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

Am 13.9.2018 10:40, schrieb Nikolay Borisov:
> On 13.09.2018 11:34, Jürgen Herrmann wrote:
>> Hello!
>> 
>> I have a newly installed laptop running a freshly installed (abt. two
>> months ago) laptop running latest linux mint 19. Root filesystem is on 
>> a
>> 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
>> partition. Timeshift-btrfs is enabled for root (@) and home (@home)
>> subvolumes. I want to transfer snapshots to a server with a separated
>> disk via "btrfs send" and ssh.
>> 
>> Here's the list of snapshot directories, each containing tow snapshots
>> for root and home:
>> 
>> drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
>> drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
>> drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
>> drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
>> drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
>> drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
>> drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
>> drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
>> drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
>> drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01
>> 
>> "btrfs send
>> /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@
>>> /dev/null" results in the btrfs task taking 100% cpu time on one cpu
>> and then all IO is blocked -> only reboot can solve the hang.
>> 
>> The crash does not happen immediately, as i was on the road using
>> cellular connection it seemed fine at first. That's how I found out 
>> that
>> it transfers ~140MB of data before hanging. The snapshot is created on
>> the server and contains data (du shows abt 140MB).
>> 
>> I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs 
>> progs
>> 4.17.1 compiled from source.
>> 
>> Here's the btrfs filesystem info:
>> Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
>>         Total devices 1 FS bytes used 342.85GiB
>>         devid    1 size 875.44GiB used 357.05GiB path
>> /dev/mapper/sda3_crypt
>> 
>> A scrub shows no errors:
>> scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
>>         scrub started at Thu Sep 13 10:20:18 2018 and finished after
>> 00:12:19
>>         total bytes scrubbed: 342.78GiB with 0 errors
>> 
>> What can I do to help debugging this issue?
> 
> 
> You should provide output of echo w > /proc/sysrq-trigger. Also
> sample the stack of /proc/[pid of btrfs send]/stack to see if it is
> changing.
> 
> 
>> 
>> Best regards,
>> Jürgen

Hello!

dmesg output can be found here:
https://pastebin.com/g86dPGSZ

stacks can be found here:
https://pastebin.com/dCt1YgJp

Best regards,
Jürgen
-- 
Jürgen Herrmann
https://t-5.eu
ALbertstraße 2
94327 Bogen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 10:29   ` Jürgen Herrmann
@ 2018-09-13 10:50     ` Nikolay Borisov
  2018-09-13 10:56       ` Jürgen Herrmann
  0 siblings, 1 reply; 16+ messages in thread
From: Nikolay Borisov @ 2018-09-13 10:50 UTC (permalink / raw)
  To: Jürgen Herrmann; +Cc: linux-btrfs



On 13.09.2018 13:29, Jürgen Herrmann wrote:
> Am 13.9.2018 10:40, schrieb Nikolay Borisov:
>> On 13.09.2018 11:34, Jürgen Herrmann wrote:
>>> Hello!
>>>
>>> I have a newly installed laptop running a freshly installed (abt. two
>>> months ago) laptop running latest linux mint 19. Root filesystem is on a
>>> 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
>>> partition. Timeshift-btrfs is enabled for root (@) and home (@home)
>>> subvolumes. I want to transfer snapshots to a server with a separated
>>> disk via "btrfs send" and ssh.
>>>
>>> Here's the list of snapshot directories, each containing tow snapshots
>>> for root and home:
>>>
>>> drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
>>> drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
>>> drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
>>> drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
>>> drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
>>> drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
>>> drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
>>> drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
>>> drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
>>> drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01
>>>
>>> "btrfs send
>>> /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@
>>>> /dev/null" results in the btrfs task taking 100% cpu time on one cpu
>>> and then all IO is blocked -> only reboot can solve the hang.
>>>
>>> The crash does not happen immediately, as i was on the road using
>>> cellular connection it seemed fine at first. That's how I found out that
>>> it transfers ~140MB of data before hanging. The snapshot is created on
>>> the server and contains data (du shows abt 140MB).
>>>
>>> I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs
>>> 4.17.1 compiled from source.
>>>
>>> Here's the btrfs filesystem info:
>>> Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
>>>         Total devices 1 FS bytes used 342.85GiB
>>>         devid    1 size 875.44GiB used 357.05GiB path
>>> /dev/mapper/sda3_crypt
>>>
>>> A scrub shows no errors:
>>> scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
>>>         scrub started at Thu Sep 13 10:20:18 2018 and finished after
>>> 00:12:19
>>>         total bytes scrubbed: 342.78GiB with 0 errors
>>>
>>> What can I do to help debugging this issue?
>>
>>
>> You should provide output of echo w > /proc/sysrq-trigger. Also
>> sample the stack of /proc/[pid of btrfs send]/stack to see if it is
>> changing.
>>
>>
>>>
>>> Best regards,
>>> Jürgen
> 
> Hello!
> 
> dmesg output can be found here:
> https://pastebin.com/g86dPGSZ

So from what I see current transaction commit is waiting for
root->commit_root_sem and then other threads (in this case systemd) is
waiting for transaction commit to finish.
> 
> stacks can be found here:
> https://pastebin.com/dCt1YgJp

ANd your user process seems to be making some progress as evident from
the fact that the call trace of the process is actually changing over
the course of sampling. Is it possible that it just takes time to do the
IO ?
> 
> Best regards,
> Jürgen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 10:50     ` Nikolay Borisov
@ 2018-09-13 10:56       ` Jürgen Herrmann
  2018-09-13 11:04         ` Nikolay Borisov
  0 siblings, 1 reply; 16+ messages in thread
From: Jürgen Herrmann @ 2018-09-13 10:56 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

Both loops were started before the hang because after the hang I cannot do 
that anymore. That's why there is progress in the logs at first. The hang 
continues for at least 1.5 hours. No data is transferred anymore during 
this time. I never waited longer than 1.5 hours.

Best regards,
Jürgen

Am 13. September 2018 12:50:59 schrieb Nikolay Borisov <nborisov@suse.com>:

> On 13.09.2018 13:29, Jürgen Herrmann wrote:
>> Am 13.9.2018 10:40, schrieb Nikolay Borisov:
>>> On 13.09.2018 11:34, Jürgen Herrmann wrote:
>>>> Hello!
>>>>
>>>> I have a newly installed laptop running a freshly installed (abt. two
>>>> months ago) laptop running latest linux mint 19. Root filesystem is on a
>>>> 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
>>>> partition. Timeshift-btrfs is enabled for root (@) and home (@home)
>>>> subvolumes. I want to transfer snapshots to a server with a separated
>>>> disk via "btrfs send" and ssh.
>>>>
>>>> Here's the list of snapshot directories, each containing tow snapshots
>>>> for root and home:
>>>>
>>>> drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
>>>> drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
>>>> drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
>>>> drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
>>>> drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
>>>> drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
>>>> drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
>>>> drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
>>>> drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
>>>> drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01
>>>>
>>>> "btrfs send
>>>> /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@
>>>>> /dev/null" results in the btrfs task taking 100% cpu time on one cpu
>>>> and then all IO is blocked -> only reboot can solve the hang.
>>>>
>>>> The crash does not happen immediately, as i was on the road using
>>>> cellular connection it seemed fine at first. That's how I found out that
>>>> it transfers ~140MB of data before hanging. The snapshot is created on
>>>> the server and contains data (du shows abt 140MB).
>>>>
>>>> I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs progs
>>>> 4.17.1 compiled from source.
>>>>
>>>> Here's the btrfs filesystem info:
>>>> Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
>>>>         Total devices 1 FS bytes used 342.85GiB
>>>>         devid    1 size 875.44GiB used 357.05GiB path
>>>> /dev/mapper/sda3_crypt
>>>>
>>>> A scrub shows no errors:
>>>> scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
>>>>         scrub started at Thu Sep 13 10:20:18 2018 and finished after
>>>> 00:12:19
>>>>         total bytes scrubbed: 342.78GiB with 0 errors
>>>>
>>>> What can I do to help debugging this issue?
>>>
>>>
>>> You should provide output of echo w > /proc/sysrq-trigger. Also
>>> sample the stack of /proc/[pid of btrfs send]/stack to see if it is
>>> changing.
>>>
>>>
>>>>
>>>> Best regards,
>>>> Jürgen
>>
>> Hello!
>>
>> dmesg output can be found here:
>> https://pastebin.com/g86dPGSZ
>
> So from what I see current transaction commit is waiting for
> root->commit_root_sem and then other threads (in this case systemd) is
> waiting for transaction commit to finish.
>>
>> stacks can be found here:
>> https://pastebin.com/dCt1YgJp
>
> ANd your user process seems to be making some progress as evident from
> the fact that the call trace of the process is actually changing over
> the course of sampling. Is it possible that it just takes time to do the
> IO ?
>>
>> Best regards,
>> Jürgen


Mit AquaMail Android
https://www.mobisystems.com/aqua-mail

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 10:56       ` Jürgen Herrmann
@ 2018-09-13 11:04         ` Nikolay Borisov
  2018-09-13 11:50           ` Jürgen Herrmann
  0 siblings, 1 reply; 16+ messages in thread
From: Nikolay Borisov @ 2018-09-13 11:04 UTC (permalink / raw)
  To: Jürgen Herrmann; +Cc: linux-btrfs



On 13.09.2018 13:56, Jürgen Herrmann wrote:
> Both loops were started before the hang because after the hang I cannot
> do that anymore. That's why there is progress in the logs at first. The
> hang continues for at least 1.5 hours. No data is transferred anymore
> during this time. I never waited longer than 1.5 hours.

So these logs don't provide any useful information then. The other thing
which I can advise is to setup kdump and when the kernel hangs cause a
crashdump to be taken and try to upload it somewhere alongside your
vmlinux file for further debugging.


> 
> Best regards,
> Jürgen
> 
> Am 13. September 2018 12:50:59 schrieb Nikolay Borisov <nborisov@suse.com>:
> 
>> On 13.09.2018 13:29, Jürgen Herrmann wrote:
>>> Am 13.9.2018 10:40, schrieb Nikolay Borisov:
>>>> On 13.09.2018 11:34, Jürgen Herrmann wrote:
>>>>> Hello!
>>>>>
>>>>> I have a newly installed laptop running a freshly installed (abt. two
>>>>> months ago) laptop running latest linux mint 19. Root filesystem is
>>>>> on a
>>>>> 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
>>>>> partition. Timeshift-btrfs is enabled for root (@) and home (@home)
>>>>> subvolumes. I want to transfer snapshots to a server with a separated
>>>>> disk via "btrfs send" and ssh.
>>>>>
>>>>> Here's the list of snapshot directories, each containing tow snapshots
>>>>> for root and home:
>>>>>
>>>>> drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
>>>>> drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
>>>>> drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
>>>>> drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
>>>>> drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
>>>>> drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
>>>>> drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
>>>>> drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
>>>>> drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
>>>>> drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01
>>>>>
>>>>> "btrfs send
>>>>> /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@
>>>>>> /dev/null" results in the btrfs task taking 100% cpu time on one cpu
>>>>> and then all IO is blocked -> only reboot can solve the hang.
>>>>>
>>>>> The crash does not happen immediately, as i was on the road using
>>>>> cellular connection it seemed fine at first. That's how I found out
>>>>> that
>>>>> it transfers ~140MB of data before hanging. The snapshot is created on
>>>>> the server and contains data (du shows abt 140MB).
>>>>>
>>>>> I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs
>>>>> progs
>>>>> 4.17.1 compiled from source.
>>>>>
>>>>> Here's the btrfs filesystem info:
>>>>> Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
>>>>>         Total devices 1 FS bytes used 342.85GiB
>>>>>         devid    1 size 875.44GiB used 357.05GiB path
>>>>> /dev/mapper/sda3_crypt
>>>>>
>>>>> A scrub shows no errors:
>>>>> scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
>>>>>         scrub started at Thu Sep 13 10:20:18 2018 and finished after
>>>>> 00:12:19
>>>>>         total bytes scrubbed: 342.78GiB with 0 errors
>>>>>
>>>>> What can I do to help debugging this issue?
>>>>
>>>>
>>>> You should provide output of echo w > /proc/sysrq-trigger. Also
>>>> sample the stack of /proc/[pid of btrfs send]/stack to see if it is
>>>> changing.
>>>>
>>>>
>>>>>
>>>>> Best regards,
>>>>> Jürgen
>>>
>>> Hello!
>>>
>>> dmesg output can be found here:
>>> https://pastebin.com/g86dPGSZ
>>
>> So from what I see current transaction commit is waiting for
>> root->commit_root_sem and then other threads (in this case systemd) is
>> waiting for transaction commit to finish.
>>>
>>> stacks can be found here:
>>> https://pastebin.com/dCt1YgJp
>>
>> ANd your user process seems to be making some progress as evident from
>> the fact that the call trace of the process is actually changing over
>> the course of sampling. Is it possible that it just takes time to do the
>> IO ?
>>>
>>> Best regards,
>>> Jürgen
> 
> 
> Mit AquaMail Android
> https://www.mobisystems.com/aqua-mail
> 
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 11:04         ` Nikolay Borisov
@ 2018-09-13 11:50           ` Jürgen Herrmann
  2018-09-13 12:02             ` Nikolay Borisov
  0 siblings, 1 reply; 16+ messages in thread
From: Jürgen Herrmann @ 2018-09-13 11:50 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also 
after the hang because I started the loop before the hang. The dmesg output 
should show the hanging tasks from second 346 on or so. Still not useful?

Best regards,
Jürgen

Am 13. September 2018 13:04:39 schrieb Nikolay Borisov <nborisov@suse.com>:

> On 13.09.2018 13:56, Jürgen Herrmann wrote:
>> Both loops were started before the hang because after the hang I cannot
>> do that anymore. That's why there is progress in the logs at first. The
>> hang continues for at least 1.5 hours. No data is transferred anymore
>> during this time. I never waited longer than 1.5 hours.
>
> So these logs don't provide any useful information then. The other thing
> which I can advise is to setup kdump and when the kernel hangs cause a
> crashdump to be taken and try to upload it somewhere alongside your
> vmlinux file for further debugging.
>
>
>>
>> Best regards,
>> Jürgen
>>
>> Am 13. September 2018 12:50:59 schrieb Nikolay Borisov <nborisov@suse.com>:
>>
>>> On 13.09.2018 13:29, Jürgen Herrmann wrote:
>>>> Am 13.9.2018 10:40, schrieb Nikolay Borisov:
>>>>> On 13.09.2018 11:34, Jürgen Herrmann wrote:
>>>>>> Hello!
>>>>>>
>>>>>> I have a newly installed laptop running a freshly installed (abt. two
>>>>>> months ago) laptop running latest linux mint 19. Root filesystem is
>>>>>> on a
>>>>>> 1TB Samsung 860 M.2 SSD with btrfs on top of a LUKS encrypted 900G
>>>>>> partition. Timeshift-btrfs is enabled for root (@) and home (@home)
>>>>>> subvolumes. I want to transfer snapshots to a server with a separated
>>>>>> disk via "btrfs send" and ssh.
>>>>>>
>>>>>> Here's the list of snapshot directories, each containing tow snapshots
>>>>>> for root and home:
>>>>>>
>>>>>> drwxr-xr-x 1 root root 30 Sep 12 22:08 2018-08-16_20-00-01
>>>>>> drwxr-xr-x 1 root root 30 Aug 17 14:00 2018-08-17_14-00-02
>>>>>> drwxr-xr-x 1 root root 30 Aug 23 20:00 2018-08-23_20-00-01
>>>>>> drwxr-xr-x 1 root root 30 Aug 30 20:00 2018-08-30_20-00-01
>>>>>> drwxr-xr-x 1 root root 30 Sep  6 20:00 2018-09-06_20-00-01
>>>>>> drwxr-xr-x 1 root root 30 Sep  6 22:00 2018-09-06_22-00-01
>>>>>> drwxr-xr-x 1 root root 30 Sep  8 16:00 2018-09-08_16-00-01
>>>>>> drwxr-xr-x 1 root root 30 Sep 10 20:00 2018-09-10_20-00-02
>>>>>> drwxr-xr-x 1 root root 30 Sep 11 21:00 2018-09-11_21-00-02
>>>>>> drwxr-xr-x 1 root root 30 Sep 12 21:00 2018-09-12_21-00-01
>>>>>>
>>>>>> "btrfs send
>>>>>> /mnt/timeshift/backup/timeshift-btrfs/snapshots/2018-08-16_20-00-01/@
>>>>>>> /dev/null" results in the btrfs task taking 100% cpu time on one cpu
>>>>>> and then all IO is blocked -> only reboot can solve the hang.
>>>>>>
>>>>>> The crash does not happen immediately, as i was on the road using
>>>>>> cellular connection it seemed fine at first. That's how I found out
>>>>>> that
>>>>>> it transfers ~140MB of data before hanging. The snapshot is created on
>>>>>> the server and contains data (du shows abt 140MB).
>>>>>>
>>>>>> I am running vanilla kernel 4.18.6 (compiled by myself) and btrfs
>>>>>> progs
>>>>>> 4.17.1 compiled from source.
>>>>>>
>>>>>> Here's the btrfs filesystem info:
>>>>>> Label: none  uuid: a914c141-72bf-448b-847f-d64ee82d8b7b
>>>>>>         Total devices 1 FS bytes used 342.85GiB
>>>>>>         devid    1 size 875.44GiB used 357.05GiB path
>>>>>> /dev/mapper/sda3_crypt
>>>>>>
>>>>>> A scrub shows no errors:
>>>>>> scrub status for a914c141-72bf-448b-847f-d64ee82d8b7b
>>>>>>         scrub started at Thu Sep 13 10:20:18 2018 and finished after
>>>>>> 00:12:19
>>>>>>         total bytes scrubbed: 342.78GiB with 0 errors
>>>>>>
>>>>>> What can I do to help debugging this issue?
>>>>>
>>>>>
>>>>> You should provide output of echo w > /proc/sysrq-trigger. Also
>>>>> sample the stack of /proc/[pid of btrfs send]/stack to see if it is
>>>>> changing.
>>>>>
>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Jürgen
>>>>
>>>> Hello!
>>>>
>>>> dmesg output can be found here:
>>>> https://pastebin.com/g86dPGSZ
>>>
>>> So from what I see current transaction commit is waiting for
>>> root->commit_root_sem and then other threads (in this case systemd) is
>>> waiting for transaction commit to finish.
>>>>
>>>> stacks can be found here:
>>>> https://pastebin.com/dCt1YgJp
>>>
>>> ANd your user process seems to be making some progress as evident from
>>> the fact that the call trace of the process is actually changing over
>>> the course of sampling. Is it possible that it just takes time to do the
>>> IO ?
>>>>
>>>> Best regards,
>>>> Jürgen
>>
>>
>> Mit AquaMail Android
>> https://www.mobisystems.com/aqua-mail
>>
>>
>>


Mit AquaMail Android
https://www.mobisystems.com/aqua-mail

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 11:50           ` Jürgen Herrmann
@ 2018-09-13 12:02             ` Nikolay Borisov
  2018-09-13 12:30               ` Jürgen Herrmann
  0 siblings, 1 reply; 16+ messages in thread
From: Nikolay Borisov @ 2018-09-13 12:02 UTC (permalink / raw)
  To: Jürgen Herrmann; +Cc: linux-btrfs



On 13.09.2018 14:50, Jürgen Herrmann wrote:
> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
> after the hang because I started the loop before the hang. The dmesg
> output should show the hanging tasks from second 346 on or so. Still not
> useful?
> 

So from 346 it's evident that transaction commit is waiting for
commit_root_sem to be acquired. So something else is holding it and not
giving the transaction chance to finish committing. Now the only place
where send acquires this lock is in find_extent_clone around the  call
to extent_from_logical. The latter basically does an extent tree search
and doesn't loop so can't possibly deadlock. Furthermore I don't see any
userspace processes being hung in kernel space.

Additionally looking at the userspace processes they indicate that
find_extent_clone has finished and are blocked in send_write_or_clone
which does the write. But I guess this actually happens before the hang.


So at this point without looking at the stacktrace of the btrfs send
process after the hung has occurred I don't think much can be done

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 12:02             ` Nikolay Borisov
@ 2018-09-13 12:30               ` Jürgen Herrmann
  2018-09-13 12:35                 ` Nikolay Borisov
  0 siblings, 1 reply; 16+ messages in thread
From: Jürgen Herrmann @ 2018-09-13 12:30 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

OK, I will install kdump later and perform a dump after the hang.

One more noob question beforehand: does this dump contain sensitive 
information, for example the luks encryption key for the disk etc? A Google 
search only brings up one relevant search result which can only be viewed 
with a redhat subscription...

Best regards,
Jürgen

Am 13. September 2018 14:02:11 schrieb Nikolay Borisov <nborisov@suse.com>:

> On 13.09.2018 14:50, Jürgen Herrmann wrote:
>> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
>> after the hang because I started the loop before the hang. The dmesg
>> output should show the hanging tasks from second 346 on or so. Still not
>> useful?
>>
>
> So from 346 it's evident that transaction commit is waiting for
> commit_root_sem to be acquired. So something else is holding it and not
> giving the transaction chance to finish committing. Now the only place
> where send acquires this lock is in find_extent_clone around the  call
> to extent_from_logical. The latter basically does an extent tree search
> and doesn't loop so can't possibly deadlock. Furthermore I don't see any
> userspace processes being hung in kernel space.
>
> Additionally looking at the userspace processes they indicate that
> find_extent_clone has finished and are blocked in send_write_or_clone
> which does the write. But I guess this actually happens before the hang.
>
>
> So at this point without looking at the stacktrace of the btrfs send
> process after the hung has occurred I don't think much can be done


Mit AquaMail Android
https://www.mobisystems.com/aqua-mail

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 12:30               ` Jürgen Herrmann
@ 2018-09-13 12:35                 ` Nikolay Borisov
  2018-09-13 15:30                   ` Chris Murphy
  2018-09-19 19:41                   ` Jürgen Herrmann
  0 siblings, 2 replies; 16+ messages in thread
From: Nikolay Borisov @ 2018-09-13 12:35 UTC (permalink / raw)
  To: Jürgen Herrmann; +Cc: linux-btrfs



On 13.09.2018 15:30, Jürgen Herrmann wrote:
> OK, I will install kdump later and perform a dump after the hang.
> 
> One more noob question beforehand: does this dump contain sensitive
> information, for example the luks encryption key for the disk etc? A
> Google search only brings up one relevant search result which can only
> be viewed with a redhat subscription...


So a kdump will dump the kernel memory so it's possible that the LUKS
encryption keys could be extracted from that image. Bummer, it's
understandable why you wouldn't want to upload it :). In this case you'd
have to install also the 'crash' utility to open the crashdump and
extract the calltrace of the btrfs process. The rough process should be :


crash 'path to vm linux' 'path to vmcore file', then once inside the
crash utility :

set <pid of btrfs send process>, you can acquire the pid by issuing 'ps'
which will give you a ps-like output of all running processes at the
time of crash. After the context has been set you can run 'bt' which
will give you a backtrace of the send process.



> 
> Best regards,
> Jürgen
> 
> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov <nborisov@suse.com>:
> 
>> On 13.09.2018 14:50, Jürgen Herrmann wrote:
>>> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
>>> after the hang because I started the loop before the hang. The dmesg
>>> output should show the hanging tasks from second 346 on or so. Still not
>>> useful?
>>>
>>
>> So from 346 it's evident that transaction commit is waiting for
>> commit_root_sem to be acquired. So something else is holding it and not
>> giving the transaction chance to finish committing. Now the only place
>> where send acquires this lock is in find_extent_clone around the  call
>> to extent_from_logical. The latter basically does an extent tree search
>> and doesn't loop so can't possibly deadlock. Furthermore I don't see any
>> userspace processes being hung in kernel space.
>>
>> Additionally looking at the userspace processes they indicate that
>> find_extent_clone has finished and are blocked in send_write_or_clone
>> which does the write. But I guess this actually happens before the hang.
>>
>>
>> So at this point without looking at the stacktrace of the btrfs send
>> process after the hung has occurred I don't think much can be done
> 
> 
> Mit AquaMail Android
> https://www.mobisystems.com/aqua-mail
> 
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 12:35                 ` Nikolay Borisov
@ 2018-09-13 15:30                   ` Chris Murphy
  2018-09-13 15:44                     ` Nikolay Borisov
  2018-09-19 19:41                   ` Jürgen Herrmann
  1 sibling, 1 reply; 16+ messages in thread
From: Chris Murphy @ 2018-09-13 15:30 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Jürgen Herrmann, Btrfs BTRFS

This is the 2nd or 3rd thread containing hanging btrfs send, with
kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
and the other I can't find at the moment. In that case though the hang
is reproducible in 4.14.x and weirdly it only happens when a snapshot
contains (perhaps many) reflinks. Scrub and check lowmem find nothing
wrong.

I have snapshots with a few reflinks (cp --reflink and also
deduplication), and I see maybe 15-30 second hangs where nothing is
apparently happening (in top or iotop), but I'm also not seeing any
blocked tasks or high CPU usage. Perhaps in my case it's just
recovering quickly.

Are there any kernel config options in "# Debug Lockups and Hangs"
that might hint at what's going on? Some of these are enabled in
Fedora debug kernels, which are built practically daily, e.g. right
now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
translates to git 54eda9df17f3.


Chris Murphy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 15:30                   ` Chris Murphy
@ 2018-09-13 15:44                     ` Nikolay Borisov
  2018-09-13 16:22                       ` Chris Murphy
  0 siblings, 1 reply; 16+ messages in thread
From: Nikolay Borisov @ 2018-09-13 15:44 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Jürgen Herrmann, Btrfs BTRFS



On 13.09.2018 18:30, Chris Murphy wrote:
> This is the 2nd or 3rd thread containing hanging btrfs send, with
> kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
> and the other I can't find at the moment. In that case though the hang
> is reproducible in 4.14.x and weirdly it only happens when a snapshot
> contains (perhaps many) reflinks. Scrub and check lowmem find nothing
> wrong.
> 
> I have snapshots with a few reflinks (cp --reflink and also
> deduplication), and I see maybe 15-30 second hangs where nothing is
> apparently happening (in top or iotop), but I'm also not seeing any
> blocked tasks or high CPU usage. Perhaps in my case it's just
> recovering quickly.
> 
> Are there any kernel config options in "# Debug Lockups and Hangs"
> that might hint at what's going on? Some of these are enabled in
> Fedora debug kernels, which are built practically daily, e.g. right
> now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
> translates to git 54eda9df17f3.

If it's a lock-related problem then you need Lock Debugging => Lock
debugging: prove locking correctness

> 
> 
> Chris Murphy
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 15:44                     ` Nikolay Borisov
@ 2018-09-13 16:22                       ` Chris Murphy
  2018-09-19 19:35                         ` Jürgen Herrmann
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Murphy @ 2018-09-13 16:22 UTC (permalink / raw)
  To: Btrfs BTRFS; +Cc: Jürgen Herrmann, Nikolay Borisov

(resend to all)

On Thu, Sep 13, 2018 at 9:44 AM, Nikolay Borisov <nborisov@suse.com> wrote:
>
>
> On 13.09.2018 18:30, Chris Murphy wrote:
>> This is the 2nd or 3rd thread containing hanging btrfs send, with
>> kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
>> and the other I can't find at the moment. In that case though the hang
>> is reproducible in 4.14.x and weirdly it only happens when a snapshot
>> contains (perhaps many) reflinks. Scrub and check lowmem find nothing
>> wrong.
>>
>> I have snapshots with a few reflinks (cp --reflink and also
>> deduplication), and I see maybe 15-30 second hangs where nothing is
>> apparently happening (in top or iotop), but I'm also not seeing any
>> blocked tasks or high CPU usage. Perhaps in my case it's just
>> recovering quickly.
>>
>> Are there any kernel config options in "# Debug Lockups and Hangs"
>> that might hint at what's going on? Some of these are enabled in
>> Fedora debug kernels, which are built practically daily, e.g. right
>> now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
>> translates to git 54eda9df17f3.
>
> If it's a lock-related problem then you need Lock Debugging => Lock
> debugging: prove locking correctness

OK looks like that's under a different section as CONFIG_PROVE_LOCKING
which is enabled on Fedora debug kernels.


# Debug Lockups and Hangs
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
# Lock Debugging (spinlocks, mutexes, etc...)
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_LOCKDEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_LOCK_TORTURE_TEST=m



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 16:22                       ` Chris Murphy
@ 2018-09-19 19:35                         ` Jürgen Herrmann
  0 siblings, 0 replies; 16+ messages in thread
From: Jürgen Herrmann @ 2018-09-19 19:35 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Am 13.9.2018 18:22, schrieb Chris Murphy:
> (resend to all)
> 
> On Thu, Sep 13, 2018 at 9:44 AM, Nikolay Borisov <nborisov@suse.com> 
> wrote:
>> 
>> 
>> On 13.09.2018 18:30, Chris Murphy wrote:
>>> This is the 2nd or 3rd thread containing hanging btrfs send, with
>>> kernel 4.18.x. The subject of one is "btrfs send hung in pipe_wait"
>>> and the other I can't find at the moment. In that case though the 
>>> hang
>>> is reproducible in 4.14.x and weirdly it only happens when a snapshot
>>> contains (perhaps many) reflinks. Scrub and check lowmem find nothing
>>> wrong.
>>> 
>>> I have snapshots with a few reflinks (cp --reflink and also
>>> deduplication), and I see maybe 15-30 second hangs where nothing is
>>> apparently happening (in top or iotop), but I'm also not seeing any
>>> blocked tasks or high CPU usage. Perhaps in my case it's just
>>> recovering quickly.
>>> 
>>> Are there any kernel config options in "# Debug Lockups and Hangs"
>>> that might hint at what's going on? Some of these are enabled in
>>> Fedora debug kernels, which are built practically daily, e.g. right
>>> now the latest in the build system is 4.19.0-0.rc3.git2.1 - which
>>> translates to git 54eda9df17f3.
>> 
>> If it's a lock-related problem then you need Lock Debugging => Lock
>> debugging: prove locking correctness
> 
> OK looks like that's under a different section as CONFIG_PROVE_LOCKING
> which is enabled on Fedora debug kernels.
> 
> 
> # Debug Lockups and Hangs
> CONFIG_LOCKUP_DETECTOR=y
> CONFIG_SOFTLOCKUP_DETECTOR=y
> # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
> CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
> CONFIG_HARDLOCKUP_DETECTOR_PERF=y
> CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
> CONFIG_HARDLOCKUP_DETECTOR=y
> # CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
> CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
> # Lock Debugging (spinlocks, mutexes, etc...)
> CONFIG_LOCK_DEBUGGING_SUPPORT=y
> CONFIG_PROVE_LOCKING=y
> CONFIG_LOCK_STAT=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_LOCK_ALLOC=y
> CONFIG_LOCKDEP=y
> # CONFIG_DEBUG_LOCKDEP is not set
> # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
> CONFIG_LOCK_TORTURE_TEST=m

Hello again!

I have CONFIG_PROVE_LOCKING enabled in my kernel, but no change to
the observed behaviour.

Best regards, Jürgen


-- 
Jürgen Herrmann
https://t-5.eu
ALbertstraße 2
94327 Bogen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-13 12:35                 ` Nikolay Borisov
  2018-09-13 15:30                   ` Chris Murphy
@ 2018-09-19 19:41                   ` Jürgen Herrmann
  2018-09-20 17:25                     ` Chris Murphy
  1 sibling, 1 reply; 16+ messages in thread
From: Jürgen Herrmann @ 2018-09-19 19:41 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

Am 13.9.2018 14:35, schrieb Nikolay Borisov:
> On 13.09.2018 15:30, Jürgen Herrmann wrote:
>> OK, I will install kdump later and perform a dump after the hang.
>> 
>> One more noob question beforehand: does this dump contain sensitive
>> information, for example the luks encryption key for the disk etc? A
>> Google search only brings up one relevant search result which can only
>> be viewed with a redhat subscription...
> 
> 
> So a kdump will dump the kernel memory so it's possible that the LUKS
> encryption keys could be extracted from that image. Bummer, it's
> understandable why you wouldn't want to upload it :). In this case 
> you'd
> have to install also the 'crash' utility to open the crashdump and
> extract the calltrace of the btrfs process. The rough process should be 
> :
> 
> 
> crash 'path to vm linux' 'path to vmcore file', then once inside the
> crash utility :
> 
> set <pid of btrfs send process>, you can acquire the pid by issuing 
> 'ps'
> which will give you a ps-like output of all running processes at the
> time of crash. After the context has been set you can run 'bt' which
> will give you a backtrace of the send process.
> 
> 
> 
>> 
>> Best regards,
>> Jürgen
>> 
>> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov 
>> <nborisov@suse.com>:
>> 
>>> On 13.09.2018 14:50, Jürgen Herrmann wrote:
>>>> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work 
>>>> also
>>>> after the hang because I started the loop before the hang. The dmesg
>>>> output should show the hanging tasks from second 346 on or so. Still 
>>>> not
>>>> useful?
>>>> 
>>> 
>>> So from 346 it's evident that transaction commit is waiting for
>>> commit_root_sem to be acquired. So something else is holding it and 
>>> not
>>> giving the transaction chance to finish committing. Now the only 
>>> place
>>> where send acquires this lock is in find_extent_clone around the  
>>> call
>>> to extent_from_logical. The latter basically does an extent tree 
>>> search
>>> and doesn't loop so can't possibly deadlock. Furthermore I don't see 
>>> any
>>> userspace processes being hung in kernel space.
>>> 
>>> Additionally looking at the userspace processes they indicate that
>>> find_extent_clone has finished and are blocked in send_write_or_clone
>>> which does the write. But I guess this actually happens before the 
>>> hang.
>>> 
>>> 
>>> So at this point without looking at the stacktrace of the btrfs send
>>> process after the hung has occurred I don't think much can be done
>> 
I know this is probably not the correct list to ask this question but 
maybe someone of the devs can point me to the right list?

I cannot get kdump to work. The crashkernel is loaded and everything is 
setup for it afaict. I asked a question on this over at stackexchange 
but no answer yet.
https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing

So i did a little digging and added some debug printk() statements to 
see whats going on and it seems that panic() is never called. maybe the 
second stack trace is the reason?
Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN

Could someone please tell me where I can report this problem and get 
some help on this topic?

Best regards,
Jürgen

-- 
Jürgen Herrmann
https://t-5.eu
ALbertstraße 2
94327 Bogen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: btrfs send hangs after partial transfer and blocks all IO
  2018-09-19 19:41                   ` Jürgen Herrmann
@ 2018-09-20 17:25                     ` Chris Murphy
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Murphy @ 2018-09-20 17:25 UTC (permalink / raw)
  To: Jürgen Herrmann; +Cc: Nikolay Borisov, Btrfs BTRFS

On Wed, Sep 19, 2018 at 1:41 PM, Jürgen Herrmann <t-5@t-5.eu> wrote:
> Am 13.9.2018 14:35, schrieb Nikolay Borisov:
>>
>> On 13.09.2018 15:30, Jürgen Herrmann wrote:
>>>
>>> OK, I will install kdump later and perform a dump after the hang.
>>>
>>> One more noob question beforehand: does this dump contain sensitive
>>> information, for example the luks encryption key for the disk etc? A
>>> Google search only brings up one relevant search result which can only
>>> be viewed with a redhat subscription...
>>
>>
>>
>> So a kdump will dump the kernel memory so it's possible that the LUKS
>> encryption keys could be extracted from that image. Bummer, it's
>> understandable why you wouldn't want to upload it :). In this case you'd
>> have to install also the 'crash' utility to open the crashdump and
>> extract the calltrace of the btrfs process. The rough process should be :
>>
>>
>> crash 'path to vm linux' 'path to vmcore file', then once inside the
>> crash utility :
>>
>> set <pid of btrfs send process>, you can acquire the pid by issuing 'ps'
>> which will give you a ps-like output of all running processes at the
>> time of crash. After the context has been set you can run 'bt' which
>> will give you a backtrace of the send process.
>>
>>
>>
>>>
>>> Best regards,
>>> Jürgen
>>>
>>> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov
>>> <nborisov@suse.com>:
>>>
>>>> On 13.09.2018 14:50, Jürgen Herrmann wrote:
>>>>>
>>>>> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
>>>>> after the hang because I started the loop before the hang. The dmesg
>>>>> output should show the hanging tasks from second 346 on or so. Still
>>>>> not
>>>>> useful?
>>>>>
>>>>
>>>> So from 346 it's evident that transaction commit is waiting for
>>>> commit_root_sem to be acquired. So something else is holding it and not
>>>> giving the transaction chance to finish committing. Now the only place
>>>> where send acquires this lock is in find_extent_clone around the  call
>>>> to extent_from_logical. The latter basically does an extent tree search
>>>> and doesn't loop so can't possibly deadlock. Furthermore I don't see any
>>>> userspace processes being hung in kernel space.
>>>>
>>>> Additionally looking at the userspace processes they indicate that
>>>> find_extent_clone has finished and are blocked in send_write_or_clone
>>>> which does the write. But I guess this actually happens before the hang.
>>>>
>>>>
>>>> So at this point without looking at the stacktrace of the btrfs send
>>>> process after the hung has occurred I don't think much can be done
>>>
>>>
> I know this is probably not the correct list to ask this question but maybe
> someone of the devs can point me to the right list?
>
> I cannot get kdump to work. The crashkernel is loaded and everything is
> setup for it afaict. I asked a question on this over at stackexchange but no
> answer yet.
> https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing
>
> So i did a little digging and added some debug printk() statements to see
> whats going on and it seems that panic() is never called. maybe the second
> stack trace is the reason?
> Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN
>
> Could someone please tell me where I can report this problem and get some
> help on this topic?


Try kexec mailing list. They handle kdump.

http://lists.infradead.org/mailman/listinfo/kexec



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-09-20 23:10 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-13  8:34 btrfs send hangs after partial transfer and blocks all IO Jürgen Herrmann
2018-09-13  8:40 ` Nikolay Borisov
2018-09-13 10:29   ` Jürgen Herrmann
2018-09-13 10:50     ` Nikolay Borisov
2018-09-13 10:56       ` Jürgen Herrmann
2018-09-13 11:04         ` Nikolay Borisov
2018-09-13 11:50           ` Jürgen Herrmann
2018-09-13 12:02             ` Nikolay Borisov
2018-09-13 12:30               ` Jürgen Herrmann
2018-09-13 12:35                 ` Nikolay Borisov
2018-09-13 15:30                   ` Chris Murphy
2018-09-13 15:44                     ` Nikolay Borisov
2018-09-13 16:22                       ` Chris Murphy
2018-09-19 19:35                         ` Jürgen Herrmann
2018-09-19 19:41                   ` Jürgen Herrmann
2018-09-20 17:25                     ` Chris Murphy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.