linux-kernel.vger.kernel.org archive mirror
* [BUG - btrfs] kernel oops in extent_range_uptodate
From: Vincent Vanackere @ 2012-01-19 14:42 UTC (permalink / raw)
  To: linux-btrfs, Linux kernel mailing list; +Cc: Vincent Vanackere

Hi,

With the latest git kernel
(90a4c0f51e8e44111a926be6f4c87af3938a79c3) I'm still getting the same
reproducible kernel panic when trying to read a particular file stored
on a btrfs filesystem (as the log shows, there are indeed media errors
on this disk).
I'd like the "software" part of this to be fixed (btrfs should
definitely not oops, even in case of a media error) before returning
the disk under RMA. Is there anything I can do to make progress on this?

Regards,

Vincent

--------------------------------
ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata6.00: BMDMA stat 0x24
ata6.00: failed command: READ DMA EXT
ata6.00: cmd 25/00:08:5f:dc:2f/00:00:70:00:00/e0 tag 0 dma 4096 in
          res 51/40:00:61:dc:2f/40:00:70:00:00/e0 Emask 0x9 (media error)
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
sd 5:0:0:0: [sdd] Unhandled sense code
sd 5:0:0:0: [sdd]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 5:0:0:0: [sdd]  Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
         70 2f dc 61
sd 5:0:0:0: [sdd]  Add. Sense: Unrecovered read error - auto reallocate failed
sd 5:0:0:0: [sdd] CDB: Read(10): 28 00 70 2f dc 5f 00 00 08 00
end_request: I/O error, dev sdd, sector 1882184801
ata6: EH complete
BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffffa0191b09>] extent_range_uptodate+0x59/0xe0 [btrfs]
PGD 221bf8067 PUD 222864067 PMD 0
Oops: 0000 [#1] SMP
CPU 1
Modules linked in: ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat 
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT 
xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables 
bridge stp kvm_intel kvm parport_pc ppdev dm_crypt nfsd nfs lockd 
fscache auth_rpcgss nfs_acl binfmt_misc sunrpc snd_usb_audio joydev 
snd_usbmidi_lib snd_hda_codec_realtek snd_hda_intel snd_hda_codec 
snd_hwdep snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq 
snd_timer psmouse snd_seq_device serio_raw snd soundcore snd_page_alloc 
lp parport btrfs zlib_deflate libcrc32c hid_logitech ff_memless usbhid 
hid i915 r8169 drm_kms_helper drm pata_jmicron i2c_algo_bit video

Pid: 1003, comm: btrfs-endio-met Not tainted 3.2.0-custom-9429-g90a4c0f #3 Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R
RIP: 0010:[<ffffffffa0191b09>]  [<ffffffffa0191b09>] extent_range_uptodate+0x59/0xe0 [btrfs]
RSP: 0018:ffff88022191dde0  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 000000df57385000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 000000000df57385 RDI: 0000000000000000
RBP: ffff88022191de00 R08: 0000000000000000 R09: ffff8801da949ae0
R10: ffff8801fda37010 R11: 0000000000001000 R12: ffff88021b4487f0
R13: 000000df573853ff R14: ffff88022191de98 R15: ffff880221ac2ae8
FS:  0000000000000000(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000221bf9000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process btrfs-endio-met (pid: 1003, threadinfo ffff88022191c000, task ffff880221b9db80)
Stack:
  0000000000000000 ffff8801fdb64eb8 ffff8802221be840 ffff880220b3e000
  ffff88022191de30 ffffffffa016ad89 ffff880221ac2ae0 ffff8801fdb64ee0
  ffff880221ac2ae0 ffff880221ac2af8 ffff88022191dee0 ffffffffa019c18f
Call Trace:
  [<ffffffffa016ad89>] end_workqueue_fn+0x119/0x140 [btrfs]
  [<ffffffffa019c18f>] worker_loop+0x16f/0x5d0 [btrfs]
  [<ffffffffa019c020>] ? btrfs_queue_worker+0x310/0x310 [btrfs]
  [<ffffffff81070193>] kthread+0x93/0xa0
  [<ffffffff81636f24>] kernel_thread_helper+0x4/0x10
  [<ffffffff81070100>] ? kthread_freezable_should_stop+0x70/0x70
  [<ffffffff81636f20>] ? gs_change+0x13/0x13
Code: 01 f0 48 09 f0 a9 ff 0f 00 00 75 4e 49 39 dd b8 01 00 00 00 72 36 0f 1f 40 00 49 8b 7c 24 18 48 89 de 48 c1 ee 0c e8 e7 36 f8 e0 <48> 8b 10 83 e2 08 74 5f 48 89 c7 48 81 c3 00 10 00 00 e8 40 00
RIP  [<ffffffffa0191b09>] extent_range_uptodate+0x59/0xe0 [btrfs]
  RSP <ffff88022191dde0>
CR2: 0000000000000000
---[ end trace 4c48da444d2270f0 ]---
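The SCSI part of the log can be cross-checked by hand. The sketch below decodes the READ(10) CDB and the descriptor-format sense data printed above. It assumes 512-byte sectors and that the first sense descriptor is an Information descriptor (type 0x00; the bytes `00 0a 80 ...` read as type 0x00, additional length 0x0a, VALID bit set). The helper names are made up for illustration and are not kernel or sg3_utils APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512-byte logical sectors (8 sectors <-> "dma 4096" in the log). */
#define SECTOR_SIZE 512u

/* READ(10): byte 0 = opcode 0x28, bytes 2-5 = big-endian LBA. */
static uint32_t read10_lba(const uint8_t cdb[10])
{
	assert(cdb[0] == 0x28);
	return ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
	       ((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5];
}

/* READ(10): bytes 7-8 = big-endian transfer length in blocks. */
static uint32_t read10_blocks(const uint8_t cdb[10])
{
	assert(cdb[0] == 0x28);
	return ((uint32_t)cdb[7] << 8) | (uint32_t)cdb[8];
}

/* Descriptor-format sense (response code 0x72): byte 1 low nibble = sense key. */
static uint8_t sense_key(const uint8_t *sense)
{
	assert(sense[0] == 0x72);
	return sense[1] & 0x0f;
}

/* Descriptors start at byte 8. Assumes the first one is an Information
 * descriptor (type 0x00) whose 8-byte INFORMATION field sits at
 * descriptor bytes 4-11; that field holds the failing LBA. */
static uint64_t sense_info_lba(const uint8_t *sense)
{
	const uint8_t *d = sense + 8;
	uint64_t info = 0;

	assert(sense[0] == 0x72 && d[0] == 0x00);
	for (int i = 4; i < 12; i++)
		info = (info << 8) | d[i];
	return info;
}
```

Fed the bytes from the log, read10_lba() yields 1882184799 (0x702fdc5f) and read10_blocks() yields 8 (4096 bytes, matching `dma 4096`), sense_key() yields 0x03 (Medium Error), and sense_info_lba() yields 1882184801 (0x702fdc61), the sector end_request reports. In other words, the third sector of that 4 KiB read is the unreadable one.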



* Re: [BUG - btrfs] kernel oops in extent_range_uptodate
From: Mitch Harder @ 2012-01-19 16:24 UTC (permalink / raw)
  To: Vincent Vanackere; +Cc: linux-btrfs, Linux kernel mailing list

On Thu, Jan 19, 2012 at 8:42 AM, Vincent Vanackere
<vincent.vanackere@gmail.com> wrote:
> Hi,
>
> With the latest git kernel (90a4c0f51e8e44111a926be6f4c87af3938a79c3)
> I'm still getting the same reproducible kernel panic when trying to read a
> particular file stored on a btrfs filesystem (as the log shows, there are
> indeed media errors on this disk).
> I'd like the "software" part of this to be fixed (btrfs should definitely
> not oops, even in case of a media error) before returning the disk under
> RMA. Is there anything I can do to make progress on this?
>

Is this kernel compiled with "Compile the kernel with debug info" (in
the "Kernel hacking  --->" configuration section)?

It would be nice to have the specific line of code passing the NULL pointer.

* Re: [BUG - btrfs] kernel oops in extent_range_uptodate
From: Vincent Vanackere @ 2012-01-20 16:48 UTC (permalink / raw)
  To: Mitch Harder; +Cc: linux-btrfs, Linux kernel mailing list

On 01/19/2012 05:24 PM, Mitch Harder wrote:
> Is this kernel compiled with "Compile the kernel with debug info" (in
> the "Kernel hacking  --->" configuration section)?
>
> It would be nice to have the specific line of code passing the NULL pointer.

The kernel was compiled with debug information, but modern Linux
distributions seem to make it really hard to keep debug information :-(
I even had to build btrfs into the kernel (instead of as a module) to keep
the line numbers... Anyway, thanks to kexec/kdump I finally managed to get
this; I hope it helps:

crash> bt -l
PID: 939    TASK: ffff880218a4adc0  CPU: 0   COMMAND: "btrfs-endio-met"
  #0 [ffff88022316b9e0] machine_kexec at ffffffff810366aa
     /usr/src/linux/arch/x86/kernel/machine_kexec_64.c: 339
  #1 [ffff88022316ba50] crash_kexec at ffffffff810b2df8
     /usr/src/linux/kernel/kexec.c: 1101
  #2 [ffff88022316bb20] oops_end at ffffffff816afdd8
     /usr/src/linux/arch/x86/kernel/dumpstack.c: 228
  #3 [ffff88022316bb50] no_context at ffffffff816a3141
     /usr/src/linux/arch/x86/mm/fault.c: 690
  #4 [ffff88022316bbb0] __bad_area_nosemaphore at ffffffff816a3321
     /usr/src/linux/arch/x86/mm/fault.c: 767
  #5 [ffff88022316bc10] bad_area_nosemaphore at ffffffff816a3353
     /usr/src/linux/arch/x86/mm/fault.c: 775
  #6 [ffff88022316bc20] do_page_fault at ffffffff816b29b6
     /usr/src/linux/arch/x86/mm/fault.c: 1122
  #7 [ffff88022316bd30] page_fault at ffffffff816af235
     /usr/src/linux/arch/x86_64/kernel/entry.S
     [exception RIP: extent_range_uptodate+89]
     RIP: ffffffff812c7239  RSP: ffff88022316bde0  RFLAGS: 00010246
     RAX: 0000000000000000  RBX: 000000df57385000  RCX: 0000000000000000
     RDX: 0000000000000001  RSI: 000000000df57385  RDI: 0000000000000000
     RBP: ffff88022316be00   R8: 0000000000000000   R9: ffff88021f2823c0
     R10: ffff88022034d010  R11: 0000000000001000  R12: ffff880222908410
     R13: 000000df573853ff  R14: ffff88022316be98  R15: ffff88021a3e72a8
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #8 [ffff88022316be08] end_workqueue_fn at ffffffff812a04b9
     /usr/src/linux/fs/btrfs/disk-io.c: 1564
  #9 [ffff88022316be38] worker_loop at ffffffff812d18bf
     /usr/src/linux/arch/x86/include/asm/atomic.h: 107
#10 [ffff88022316bee8] kthread at ffffffff81070193
     /usr/src/linux/kernel/kthread.c: 121
#11 [ffff88022316bf48] kernel_thread_helper at ffffffff816b8124
     /usr/src/linux/arch/x86/kernel/entry_64.S: 1163
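The Code: bytes from the first oops line up with this. Assuming the 3.2-era loop computes `index = start >> PAGE_CACHE_SHIFT` and then calls PageUptodate(page): `48 89 de` / `48 c1 ee 0c` decode to `mov %rbx,%rsi; shr $0xc,%rsi` (RBX = start = 0xdf57385000, RSI = 0xdf57385), followed by the call (presumably to find_get_page()), and the faulting `48 8b 10` / `83 e2 08` decode to `mov (%rax),%rdx; and $0x8,%edx`, i.e. loading page->flags and testing the PG_uptodate bit. With RAX = 0 the load hits address 0, matching CR2. A small arithmetic check, with the constants stated as assumptions (the model_* names are made up):

```c
#include <stdint.h>

/* Assumptions: 4 KiB pages (PAGE_CACHE_SHIFT == 12 on x86-64) and
 * PG_uptodate being bit 3 of page->flags, as in the 3.2-era
 * enum pageflags (PG_locked, PG_error, PG_referenced, PG_uptodate, ...). */
#define MODEL_PAGE_CACHE_SHIFT 12
#define MODEL_PG_UPTODATE      3

/* "mov %rbx,%rsi; shr $0xc,%rsi": the page-cache index passed to the call. */
static uint64_t model_page_index(uint64_t start)
{
	return start >> MODEL_PAGE_CACHE_SHIFT;
}

/* "mov (%rax),%rdx; and $0x8,%edx": PageUptodate() reads page->flags
 * (the first word of struct page) and tests this mask. When the call
 * returned NULL (RAX == 0), the load itself faults at address 0. */
static uint64_t model_uptodate_mask(void)
{
	return UINT64_C(1) << MODEL_PG_UPTODATE;
}
```

Both RBX (0xdf57385000) and R13 (0xdf573853ff) fall in page index 0xdf57385, which is the value in RSI, and the mask comes out as 0x8, the immediate of the `and` instruction.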


* Re: [BUG - btrfs] kernel oops in extent_range_uptodate
From: Mitch Harder @ 2012-01-20 20:54 UTC (permalink / raw)
  To: Vincent Vanackere; +Cc: linux-btrfs, Linux kernel mailing list

On Fri, Jan 20, 2012 at 10:48 AM, Vincent Vanackere
<vincent.vanackere@gmail.com> wrote:
> The kernel was compiled with debug information, but modern Linux
> distributions seem to make it really hard to keep debug information :-(

I see where the find_get_page(...) function called in
extent_range_uptodate has the potential to return a NULL value.

Could you try the following patch, and if it solves your oops and
shows the included warning in your dmesg log, I'll simplify the patch
to drop the printk and submit it to the list.

I only included the printk since your current error log is ambiguous
regarding the specific point where we're getting the NULL pointer
dereference, but I'll pull it out if it works.

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9d09a4f..35c3a2a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3909,6 +3909,13 @@ int extent_range_uptodate(struct extent_io_tree *tree,
 	while (start <= end) {
 		index = start >> PAGE_CACHE_SHIFT;
 		page = find_get_page(tree->mapping, index);
+		if (unlikely(!page)) {
+			if (printk_ratelimit())
+				printk(KERN_WARNING
+				       "btrfs: NULL page in "
+				       "extent_range_uptodate()\n");
+			return 1;
+		}
 		uptodate = PageUptodate(page);
 		page_cache_release(page);
 		if (!uptodate) {
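The effect of the guard can be tried outside the kernel. The following is a hypothetical userspace model, not btrfs code (every model_* name is invented here): an array of pointers stands in for tree->mapping, a NULL slot plays the role of find_get_page() finding no cached page at that index, and the loop has the same shape as the patched one, including the patch's `return 1` for a missing page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define MODEL_NR_SLOTS   4

struct model_page {
	bool uptodate;
};

/* Stand-in for the page cache of tree->mapping. */
struct model_mapping {
	struct model_page *slots[MODEL_NR_SLOTS];
};

/* Like find_get_page(): NULL when nothing is cached at this index. */
static struct model_page *model_find_get_page(struct model_mapping *m,
					      uint64_t index)
{
	return index < MODEL_NR_SLOTS ? m->slots[index] : NULL;
}

/* Same shape as the patched loop: 1 = range uptodate, 0 = not. */
static int model_range_uptodate(struct model_mapping *m,
				uint64_t start, uint64_t end)
{
	while (start <= end) {
		struct model_page *page =
			model_find_get_page(m, start >> MODEL_PAGE_SHIFT);

		if (!page)
			return 1;	/* the patch's choice: no dereference */
		if (!page->uptodate)
			return 0;
		start += UINT64_C(1) << MODEL_PAGE_SHIFT;
	}
	return 1;
}
```

For example, with slots { &ok, &ok, NULL, &stale }, the range 0..16383 reaches the NULL slot and returns 1 instead of dereferencing it, while 12288..16383 returns 0 because that page is not uptodate. The model only illustrates that the NULL check now runs before PageUptodate(); it says nothing about whether 1 is the right answer for a missing page.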

* Re: [BUG - btrfs] kernel oops in extent_range_uptodate
From: Vincent Vanackere @ 2012-01-24 16:24 UTC (permalink / raw)
  To: Mitch Harder; +Cc: linux-btrfs, Linux kernel mailing list

On 01/20/2012 09:54 PM, Mitch Harder wrote:
> I see where the find_get_page(...) function called in
> extent_range_uptodate has the potential to return a NULL value.
>
> Could you try the following patch, and if it solves your oops and
> shows the included warning in your dmesg log, I'll simplify the patch
> to drop the printk and submit it to the list.
>
> I only included the printk since your current error log is ambiguous
> regarding the specific point where we're getting the NULL pointer
> dereference, but I'll pull it out if it works.

Indeed, your patch helps: no more kernel panic... but the task doesn't seem
to finish, and there's another problem to solve now:

sd 5:0:0:0: [sdd] Unhandled sense code
sd 5:0:0:0: [sdd]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 5:0:0:0: [sdd]  Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
         70 2f dc 61
sd 5:0:0:0: [sdd]  Add. Sense: Unrecovered read error - auto reallocate failed
sd 5:0:0:0: [sdd] CDB: Read(10): 28 00 70 2f dc 5f 00 00 08 00
end_request: I/O error, dev sdd, sector 1882184801
ata6: EH complete
btrfs: NULL page in extent_range_uptodate()
btrfs: NULL page in extent_range_uptodate()
btrfs bad tree block start 959241011200 959241011200
INFO: task cat:3099 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cat             D ffffffff8180c600     0  3099   3002 0x00000000
  ffff8801f2b0f618 0000000000000086 ffff8801f2b0f5d8 ffff880221018770
  ffff880222c65b80 ffff8801f2b0ffd8 ffff8801f2b0ffd8 ffff8801f2b0ffd8
  ffff8802241816e0 ffff880222c65b80 ffff8801f2b0f5e8 ffff88022fd13e88
Call Trace:
  [<ffffffff81114260>] ? __lock_page+0x70/0x70
  [<ffffffff8162c93f>] schedule+0x3f/0x60
  [<ffffffff8162c9ef>] io_schedule+0x8f/0xd0
  [<ffffffff8111426e>] sleep_on_page+0xe/0x20
  [<ffffffff8162b1ff>] __wait_on_bit+0x5f/0x90
  [<ffffffff811143d8>] wait_on_page_bit+0x78/0x80
  [<ffffffff81070c40>] ? autoremove_wake_function+0x40/0x40
  [<ffffffffa0192161>] read_extent_buffer_pages+0x471/0x4d0 [btrfs]
  [<ffffffffa01697b0>] ? verify_parent_transid+0x160/0x160 [btrfs]
  [<ffffffffa016a13a>] btree_read_extent_buffer_pages.isra.99+0x8a/0xc0 [btrfs]
  [<ffffffffa016c1e1>] read_tree_block+0x41/0x60 [btrfs]
  [<ffffffffa01526a3>] read_block_for_search.isra.34+0xf3/0x3d0 [btrfs]
  [<ffffffffa0154930>] btrfs_search_slot+0x300/0x8a0 [btrfs]
  [<ffffffffa0166ab4>] btrfs_lookup_csum+0x74/0x170 [btrfs]
  [<ffffffffa0166d5f>] __btrfs_lookup_bio_sums+0x1af/0x3b0 [btrfs]
  [<ffffffffa0166fb6>] btrfs_lookup_bio_sums+0x16/0x20 [btrfs]
  [<ffffffffa0173650>] btrfs_submit_bio_hook+0x140/0x170 [btrfs]
  [<ffffffffa01755d0>] ? btrfs_real_readdir+0x720/0x720 [btrfs]
  [<ffffffffa018c17a>] submit_one_bio+0x6a/0xa0 [btrfs]
  [<ffffffffa0190e34>] extent_readpages+0xe4/0x100 [btrfs]
  [<ffffffffa01755d0>] ? btrfs_real_readdir+0x720/0x720 [btrfs]
  [<ffffffffa0173ebf>] btrfs_readpages+0x1f/0x30 [btrfs]
  [<ffffffff81120a0f>] __do_page_cache_readahead+0x1af/0x250
  [<ffffffff81120e11>] ra_submit+0x21/0x30
  [<ffffffff81120f35>] ondemand_readahead+0x115/0x230
  [<ffffffff81137cd9>] ? __do_fault+0x419/0x530
  [<ffffffff81121131>] page_cache_sync_readahead+0x31/0x50
  [<ffffffff811165f8>] generic_file_aio_read+0x438/0x780
  [<ffffffff81173bb2>] do_sync_read+0xd2/0x110
  [<ffffffff81293e73>] ? security_file_permission+0x93/0xb0
  [<ffffffff81174031>] ? rw_verify_area+0x61/0xf0
  [<ffffffff81174510>] vfs_read+0xb0/0x180
  [<ffffffff8117462a>] sys_read+0x4a/0x90
  [<ffffffff81635ae9>] system_call_fastpath+0x16/0x1b


* Re: [BUG - btrfs] kernel oops in extent_range_uptodate
From: Mitch Harder @ 2012-01-25  3:30 UTC (permalink / raw)
  To: Vincent Vanackere; +Cc: linux-btrfs, Linux kernel mailing list

On Tue, Jan 24, 2012 at 10:24 AM, Vincent Vanackere
<vincent.vanackere@gmail.com> wrote:
> Indeed, your patch helps: no more kernel panic... but the task doesn't seem
> to finish, and there's another problem to solve now:

Good, looks like we're making progress.

We appear to be stuck now at wait_on_page_locked(page) in the
read_extent_buffer_pages(...) function in extent_io.c

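	for (i = start_i; i < num_pages; i++) {
		page = extent_buffer_page(eb, i);
		/* sleeps until the read completion unlocks the page */
		wait_on_page_locked(page);
		/* a failed read leaves the page !uptodate */
		if (!PageUptodate(page))
			ret = -EIO;
	}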

I tried looking around the kernel for how others have handled error
checking when using wait_on_page_locked(...), but I could not find
many examples.

http://lxr.free-electrons.com/ident?i=wait_on_page_locked
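For reference, the usual page-cache convention is that whoever completes the read unlocks the page even when the read fails, leaving PG_uptodate clear; the waiter then sees !PageUptodate(page) and turns that into -EIO, as the loop above already does. A page that some error path never unlocks would leave the reader sleeping in D state, which would be consistent with the hung cat task. The sketch below is a hypothetical userspace model of that handshake (a pthread mutex and condition variable stand in for the page lock; all model_* names are made up), not btrfs code.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page {
	pthread_mutex_t mu;
	pthread_cond_t cv;
	bool locked;	/* like PG_locked: set while the read is in flight */
	bool uptodate;	/* like PG_uptodate: set only on a successful read */
};

static void model_page_init_locked(struct model_page *p)
{
	pthread_mutex_init(&p->mu, NULL);
	pthread_cond_init(&p->cv, NULL);
	p->locked = true;
	p->uptodate = false;
}

/* Completion side: must unlock on success AND on error, otherwise
 * model_wait_and_check() below sleeps forever. */
static void model_end_io(struct model_page *p, int err)
{
	pthread_mutex_lock(&p->mu);
	p->uptodate = (err == 0);
	p->locked = false;
	pthread_cond_broadcast(&p->cv);
	pthread_mutex_unlock(&p->mu);
}

/* Waiter side, shaped like the loop in read_extent_buffer_pages(). */
static int model_wait_and_check(struct model_page *p)
{
	int ret = 0;

	pthread_mutex_lock(&p->mu);
	while (p->locked)
		pthread_cond_wait(&p->cv, &p->mu);
	if (!p->uptodate)
		ret = -EIO;
	pthread_mutex_unlock(&p->mu);
	return ret;
}
```

Completing one page with 0 and another with -EIO makes model_wait_and_check() return 0 and -EIO respectively; skipping model_end_io() for a page reproduces the hang.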

At this point, I believe I'll have to ask others on the list for help
on how to handle this issue.

Do you still have data you are trying to recover from this disk?

* Re: [BUG - btrfs] kernel oops in extent_range_uptodate
From: Vincent Vanackere @ 2012-01-25  8:29 UTC (permalink / raw)
  To: Mitch Harder; +Cc: linux-btrfs, Linux kernel mailing list, Vincent Vanackere

On Wed, Jan 25, 2012 at 04:30, Mitch Harder
<mitch.harder@sabayonlinux.org> wrote:
> Do you still have data you are trying to recover from this disk?

I have already recovered all the interesting data; I'm only keeping this
disk until I'm confident btrfs can deal with this particular I/O error...
Thanks for your help so far!

Vincent
