linux-btrfs.vger.kernel.org archive mirror
* BUG: scheduling while atomic
@ 2018-08-05 14:09 James Courtier-Dutton
  2018-08-06  0:48 ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: James Courtier-Dutton @ 2018-08-05 14:09 UTC (permalink / raw)
  To: linux-btrfs

I am seeing a server halt, requiring a manual restart, which I think
might be related to btrfs.
I have attached the kernel log from it, in the hope that someone will
understand it better than I do.
Any clues?

https://paste.fedoraproject.org/paste/xSblK1RKANiwhKHQj31Cdw


Kind Regards

James


* Re: BUG: scheduling while atomic
  2018-08-05 14:09 BUG: scheduling while atomic James Courtier-Dutton
@ 2018-08-06  0:48 ` Qu Wenruo
       [not found]   ` <CAAMvbhGJdGobexGS0QEGDexNeZpZkJY3o_s9N4izZeEJ7-HWtA@mail.gmail.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2018-08-06  0:48 UTC (permalink / raw)
  To: James Courtier-Dutton, linux-btrfs





On 2018年08月05日 22:09, James Courtier-Dutton wrote:
> I am seeing a server halt, requiring a manual restart, which I think
> might be related to btrfs.
> I have attached the kernel log from it, in the hope that someone will
> understand it better than I do.
> Any clues?
> 
> https://paste.fedoraproject.org/paste/xSblK1RKANiwhKHQj31Cdw

When checking the dmesg, we should look at the first backtrace rather
than the later ones, which are normally just fallout from the first problem.

In your case, it's a general protection fault in timerqueue_add(), which
is not related to btrfs at all.

Thanks,
Qu
> 
> 
> Kind Regards
> 
> James
> 




* Re: BUG: scheduling while atomic
       [not found]   ` <CAAMvbhGJdGobexGS0QEGDexNeZpZkJY3o_s9N4izZeEJ7-HWtA@mail.gmail.com>
@ 2018-08-06  6:26     ` Qu Wenruo
  2018-08-10 22:14       ` James Courtier-Dutton
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2018-08-06  6:26 UTC (permalink / raw)
  To: James Courtier-Dutton; +Cc: linux-btrfs





On 2018年08月06日 14:07, James Courtier-Dutton wrote:
> 
> 
> On Mon, 6 Aug 2018, 01:48 Qu Wenruo, <quwenruo.btrfs@gmx.com
> <mailto:quwenruo.btrfs@gmx.com>> wrote:
> 
> 
> 
>     On 2018年08月05日 22:09, James Courtier-Dutton wrote:
>     > I am seeing a server halt, requiring a manual restart, which I think
>     > might be related to btrfs.
>     > I have attached the kernel log from it, in the hope that someone will
>     > understand it better than I do.
>     > Any clues?
>     >
>     > https://paste.fedoraproject.org/paste/xSblK1RKANiwhKHQj31Cdw
>     <https://paste.fedoraproject.org/paste/xSblK1RKANiwhKHQj31Cdw>
> 
>     When checking the dmesg, we should look at the first backtrace rather
>     than the later ones, which are normally just fallout from the first problem.
> 
>     In your case, it's a general protection fault in timerqueue_add(), which
>     is not related to btrfs at all.
> 
> 
> I forgot to mention: this log was captured via netconsole, as nothing is
> seen in the local logs. As such, the messages get jumbled up a bit.
> The first bad bit happens here:

Sorry, my fault.

Then it makes sense why a lot of messages look interrupted.

> 
> WARNING: CPU: 3 PID: 803 at
> /build/linux-hwe-SYRsgd/linux-hwe-4.15.0/fs/btrfs/extent_map.c:77
> free_extent_map+0x78/0x90 [btrfs]

Then it makes sense, as it's a WARN_ON() line showing that one extent map is
still in use.

If it gets freed anyway, it will definitely cause some rbtree corruption.
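
If it helps, here is a tiny user-space model of the pattern that WARN_ON()
guards (made-up names, not the btrfs code itself): roughly, the warning
corresponds to dropping the last reference on an extent map that is still
linked into the extent map tree.

/* Toy model of the refcount check behind the warning above.
 * Names are invented for illustration; this is not kernel code. */
#include <stdio.h>
#include <stdlib.h>

struct em_model {
    int refs;       /* reference count (a refcount_t in the kernel)    */
    int in_tree;    /* still linked into the shared extent-map tree?   */
};

static void em_put(struct em_model *em)
{
    if (--em->refs == 0) {
        /* Last reference dropped while the map is still reachable
         * through the tree: actually freeing it would leave a
         * dangling node behind and corrupt the rbtree. */
        if (em->in_tree)
            fprintf(stderr, "WARN: freeing extent map still in tree\n");
        free(em);
    }
}

int main(void)
{
    struct em_model *em = calloc(1, sizeof(*em));
    em->refs = 2;           /* one ref held by the tree, one by the reader */
    em->in_tree = 1;

    em_put(em);             /* reader drops its ref: fine                  */
    em_put(em);             /* an extra, unbalanced put while the map is   */
                            /* still in the tree triggers the warning      */
    return 0;
}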

> 
> But the associated stack trace is not the one immediately below it. The
> associated stack trace is:
> Jul 30 03:48:20 mum [453835.199561] RIP: 0010:free_extent_map+0x78/0x90
> [btrfs]
> Jul 30 03:48:20 mum [453835.199564] RSP: 0000:ffffb548417cb5d0 EFLAGS:
> 00010296
> Jul 30 03:48:20 mum [453835.199568] RAX: 0000000000000001 RBX:
> fffffffffffffffe RCX: ffff960c0aa20000
> Jul 30 03:48:20 mum [453835.199570] RDX: fffffffffffff000 RSI:
> 00000912e6968000 RDI: ffff960c0aa20000
> Jul 30 03:48:20 mum [453835.199572] RBP: ffffb548417cb5d8 R08:
> 0000000000000000 R09: 0000000000000000
> Jul 30 03:48:20 mum [453835.199572] R10: ffff960b584a2690 R11:
> 0000000000000000 R12: 0000000000000000
> Jul 30 03:48:20 mum [453835.199575] R13: 0000000000000000 R14:
> 00000912e6968000 R15: 0000000000001000
> Jul 30 03:48:20 mum [453835.199578] FS:  00007f1b4b92f700(0000)
> GS:ffff960c1fd80000(0000) knlGS:0000000000000000
> Jul 30 03:48:20 mum [453835.199579] CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Jul 30 03:48:20 mum [453835.199582] CR2: 00007f1b4c96489f CR3:
> 0000000213d42000 CR4: 00000000000006e0
> Jul 30 03:48:20 mum [453835.199584] Call Trace:
> Jul 30 03:48:20 mum [453835.199607]  __do_readpage+0x2bb/0x850 [btrfs]

It should be the only free_extent_map() call in the __do_readpage() function.
However, a first glance at this function shows nothing wrong, and no new
modifications to it either.
(Maybe it's the get_extent() hook?)
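
To show what I mean, here is a small fragment reusing the toy em_model and
em_put() from the sketch above (illustration only, not the real
__do_readpage() or get_extent() code): the read path looks an extent map up,
uses it, then drops its reference, so a lookup hook that handed back a map
without taking a reference for the caller would make that put unbalanced and
trip the warning.

/* Fragment, building on the em_model/em_put() toy above. */
static struct em_model *lookup_extent_map(struct em_model *tree_em, int buggy)
{
    if (!buggy)
        tree_em->refs++;    /* correct hook: take a reference for the caller */
    return tree_em;         /* buggy hook: returns the map without one       */
}

static void read_one_page(struct em_model *tree_em, int buggy)
{
    struct em_model *em = lookup_extent_map(tree_em, buggy);
    /* ... use em to map the page, as the real read path would ... */
    em_put(em);             /* unbalanced if the hook was buggy */
}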

Is there any reproducer? Or special workload?

And, have you tried "btrfs check --readonly <device>"? If there is any
error it would help a lot to locate the problem.

Thanks,
Qu

> Jul 30 03:48:20 mum [453835.199625]  ? free_root_pointers+0x70/0x70 [btrfs]
> Jul 30 03:48:20 mum [453835.199645]  __extent_read_full_page+0xe1/0xf0
> [btrfs]
> Jul 30 03:48:20 mum [453835.199663]  ? free_root_pointers+0x70/0x70 [btrfs]
> Jul 30 03:48:20 mum [453835.199683]  read_extent_buffer_pages+0x291/0x2c0 [btrfs]
> Jul 30 03:48:20 mum [453835.199701]  ? free_root_pointers+0x70/0x70 [btrfs]
> Jul 30 03:48:20 mum [453835.199718]  btree_read_extent_buffer_pages+0x9d/0x100 [btrfs]
> Jul 30 03:48:20 mum [453835.199736]  read_tree_block+0x32/0x50 [btrfs]
> Jul 30 03:48:20 mum [453835.199752]  read_block_for_search.isra.31+0x135/0x2f0 [btrfs]
> Jul 30 03:48:20 mum [453835.199767]  btrfs_search_slot+0x3a1/0x9d0 [btrfs]
> Jul 30 03:48:20 mum [453835.199785]  btrfs_lookup_csum+0x52/0x150 [btrfs]
> Jul 30 03:48:20 mum [453835.199802]  __btrfs_lookup_bio_sums+0x315/0x6b0
> [btrfs]
> Jul 30 03:48:20 mum [453835.199820]  ? btrfs_bio_wq_end_io+0x28/0x70 [btrfs]
> Jul 30 03:48:20 mum [453835.199837]  btrfs_lookup_bio_sums+0x16/0x20 [btrfs]
> Jul 30 03:48:20 mum [453835.199856]  btrfs_submit_bio_hook+0xc3/0x180
> [btrfs]
> Jul 30 03:48:20 mum [453835.199877]  submit_one_bio+0x64/0x90 [btrfs]
> Jul 30 03:48:20 mum [453835.199897]  extent_readpages+0x181/0x280 [btrfs]
> Jul 30 03:48:20 mum [453835.199945]  __do_page_cache_readahead+0x1be/0x2c0
> Jul 30 03:48:20 mum [453835.199949]  filemap_fault+0x53e/0x750
> Jul 30 03:48:20 mum [453835.199952]  ? filemap_fault+0x53e/0x750
> Jul 30 03:48:20 mum [453835.199953]  ? filemap_map_pages+0x185/0x3f0
> Jul 30 03:48:20 mum [453835.199957]  __do_fault+0x24/0xf0
> Jul 30 03:48:20 mum [453835.199960]  __handle_mm_fault+0xd1b/0x11f0
> Jul 30 03:48:20 mum [453835.199963]  ? __check_object_size+0xfc/0x1a0
> Jul 30 03:48:20 mum [453835.199966]  handle_mm_fault+0xcc/0x1c0
> Jul 30 03:48:20 mum [453835.199971]  __do_page_fault+0x260/0x500
> Jul 30 03:48:20 mum [453835.199974]  do_page_fault+0x2e/0xf0
> Jul 30 03:48:20 mum [453835.199978]  ? page_fault+0x2f/0x50
> Jul 30 03:48:20 mum [453835.199981]  page_fault+0x45/0x50
> Jul 30 03:48:20 mum [453835.199984] RIP: 0033:0x7f1b4c96489f
> Jul 30 03:48:20 mum [453835.199986] RSP: 002b:00007f1b4b90e590 EFLAGS:
> 00010203
> Jul 30 03:48:20 mum [453835.199988] RAX: 00000000000000cf RBX:
> 0000000000000000 RCX: 00007f1b4df8151d
> Jul 30 03:48:20 mum [453835.199989] RDX: 00000000000000cf RSI:
> 00007f1b4b90eda0 RDI: 0000000000000000
> Jul 30 03:48:20 mum [453835.199991] RBP: 000000000096c2c0 R08:
> 0000000000000000 R09: 0000000004000001
> Jul 30 03:48:20 mum [453835.199994] R10: 0000000000000001 R11:
> 0000000000000000 R12: 00007f1b4b90eda0
> Jul 30 03:48:20 mum [453835.199997] R13: 0000000000001fa0 R14:
> 0000000000001f9f R15: 00007f1b4b90ee03
> 
> 
> 




* Re: BUG: scheduling while atomic
  2018-08-06  6:26     ` Qu Wenruo
@ 2018-08-10 22:14       ` James Courtier-Dutton
  2018-08-10 23:26         ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: James Courtier-Dutton @ 2018-08-10 22:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On 6 August 2018 at 07:26, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>> WARNING: CPU: 3 PID: 803 at
>> /build/linux-hwe-SYRsgd/linux-hwe-4.15.0/fs/btrfs/extent_map.c:77
>> free_extent_map+0x78/0x90 [btrfs]
>
> Then it makes sense, as it's a WARN_ON() line showing that one extent map is
> still in use.
>
> If it gets freed anyway, it will definitely cause some rbtree corruption.
>
>
> It should be the only free_extent_map() call in the __do_readpage() function.
> However, a first glance at this function shows nothing wrong, and no new
> modifications to it either.
> (Maybe it's the get_extent() hook?)
>
> Is there any reproducer? Or special workload?
The workload is fairly simple:
1) The server receives 1 GByte files from across the network, at 10-minute
intervals, and stores them on the HDD.
2) A process reads the files, scanning them for certain patterns.
3) A cron job deletes the old files.
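
For what it's worth, here is a rough user-space sketch of that loop, in case
it helps as a starting point for a reproducer (paths, sizes and the dummy
pattern scan are placeholders, not our actual tooling; the scan uses mmap()
since the traces above go through the page-fault path):

/* Hypothetical stress sketch mimicking the workload described above. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define KEEP_FILES 8                        /* keep this many files around */
#define FILE_SIZE  (1024L * 1024 * 1024)    /* ~1 GByte per file           */
#define CHUNK      (1024 * 1024)            /* write in 1 MiB chunks       */

static void write_file(const char *path)
{
    static char buf[CHUNK];
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    memset(buf, 'x', sizeof(buf));
    for (long done = 0; done < FILE_SIZE; done += CHUNK)
        fwrite(buf, 1, CHUNK, f);
    fclose(f);
}

static void scan_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    long hits = 0;

    if (fd < 0) { perror(path); exit(1); }
    fstat(fd, &st);
    /* mmap and touch every page, standing in for the pattern scanner */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    for (off_t i = 0; i < st.st_size; i++)
        if (p[i] == 'x')
            hits++;
    munmap(p, st.st_size);
    close(fd);
    printf("%s: %ld hits\n", path, hits);
}

int main(void)
{
    char path[128], old[128];

    for (long seq = 0; ; seq++) {
        snprintf(path, sizeof(path), "/mnt/btrfs/stress/file-%ld.bin", seq);
        write_file(path);                   /* 1) receive/store a new file */
        scan_file(path);                    /* 2) read it back and scan it */
        if (seq >= KEEP_FILES) {            /* 3) delete the oldest file   */
            snprintf(old, sizeof(old),
                     "/mnt/btrfs/stress/file-%ld.bin", seq - KEEP_FILES);
            remove(old);
        }
    }
    return 0;
}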



>
> And, have you tried "btrfs check --readonly <device>"? If there is any
> error it would help a lot to locate the problem.
>
root@sandisk:~# btrfs check --readonly /dev/sda3
Checking filesystem on /dev/sda3
UUID: 8c9063b9-a4bb-48d1-92ba-6adf49af6fb5
checking extents
checking free space cache
block group 6874953940992 has wrong amount of free space
failed to load free space cache for block group 6874953940992
checking fs roots
checking csums
checking root refs
found 1488143566396 bytes used err is 0
total csum bytes: 1448276084
total tree bytes: 5079711744
total fs tree bytes: 3280207872
total extent tree bytes: 146997248
btree space waste bytes: 1100557047
file data blocks allocated: 2266345996288
 referenced 2235075653632
root@sandisk:~#


So, not much to see there.
Any more ideas?


* Re: BUG: scheduling while atomic
  2018-08-10 22:14       ` James Courtier-Dutton
@ 2018-08-10 23:26         ` Qu Wenruo
  0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2018-08-10 23:26 UTC (permalink / raw)
  To: James Courtier-Dutton; +Cc: linux-btrfs





On 8/11/18 6:14 AM, James Courtier-Dutton wrote:
> On 6 August 2018 at 07:26, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>> WARNING: CPU: 3 PID: 803 at
>>> /build/linux-hwe-SYRsgd/linux-hwe-4.15.0/fs/btrfs/extent_map.c:77
>>> free_extent_map+0x78/0x90 [btrfs]
>>
>> Then it makes sense, as it's a WARN_ON() line showing that one extent map is
>> still in use.
>>
>> If it gets freed anyway, it will definitely cause some rbtree corruption.
>>
>>
>> It should be the only free_extent_map() call in the __do_readpage() function.
>> However, a first glance at this function shows nothing wrong, and no new
>> modifications to it either.
>> (Maybe it's the get_extent() hook?)
>>
>> Is there any reproducer? Or special workload?
> The workload is fairly simple:
> 1) The server receives 1 GByte files from across the network, at 10-minute
> intervals, and stores them on the HDD.
> 2) A process reads the files, scanning them for certain patterns.
> 3) A cron job deletes the old files.

This looks pretty normal.
Shouldn't cause any problem.

> 
> 
> 
>>
>> And, have you tried "btrfs check --readonly <device>"? If there is any
>> error it would help a lot to locate the problem.
>>
> root@sandisk:~# btrfs check --readonly /dev/sda3
> Checking filesystem on /dev/sda3
> UUID: 8c9063b9-a4bb-48d1-92ba-6adf49af6fb5
> checking extents
> checking free space cache
> block group 6874953940992 has wrong amount of free space
> failed to load free space cache for block group 6874953940992

This problem can be solved by "btrfs check --clear-space-cache=v1".
Normally it shouldn't cause much of a problem.

> checking fs roots
> checking csums
> checking root refs
> found 1488143566396 bytes used err is 0
> total csum bytes: 1448276084
> total tree bytes: 5079711744
> total fs tree bytes: 3280207872
> total extent tree bytes: 146997248
> btree space waste bytes: 1100557047
> file data blocks allocated: 2266345996288
>  referenced 2235075653632
> root@sandisk:~#
> 
> 
> So, not much to see there.
> Any more ideas?

Not really.

If it's still reproducible, maybe it's worth trying the latest upstream
kernel.

Thanks,
Qu

> 



