From: Bob Liu <bob.liu@oracle.com>
To: Evgenii Shatokhin <eshatokhin@virtuozzo.com>
Cc: Juergen Gross <jgross@suse.com>,
	Dario Faggioli <dario.faggioli@citrix.com>,
	George Dunlap <George.Dunlap@citrix.com>,
	xen-devel@lists.xen.org, David Vrabel <david.vrabel@citrix.com>,
	Konstantin Khorenko <khorenko@virtuozzo.com>,
	Roger Pau Monne <roger.paumonne@citrix.com>
Subject: Re: [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711
Date: Thu, 11 Aug 2016 10:10:10 +0800
Message-ID: <57ABDE82.6040102@oracle.com>
In-Reply-To: <57AB4037.7050303@virtuozzo.com>


On 08/10/2016 10:54 PM, Evgenii Shatokhin wrote:
> On 10.08.2016 15:49, Bob Liu wrote:
>>
>> On 08/10/2016 08:33 PM, Evgenii Shatokhin wrote:
>>> On 14.07.2016 15:04, Bob Liu wrote:
>>>>
>>>> On 07/14/2016 07:49 PM, Evgenii Shatokhin wrote:
>>>>> On 11.07.2016 15:04, Bob Liu wrote:
>>>>>>
>>>>>>
>>>>>> On 07/11/2016 04:50 PM, Evgenii Shatokhin wrote:
>>>>>>> On 06.06.2016 11:42, Dario Faggioli wrote:
>>>>>>>> Just Cc-ing some Linux, block, and Xen on CentOS people...
>>>>>>>>
>>>>>>>
>>>>>>> Ping.
>>>>>>>
>>>>>>> Any suggestions on how to debug this or what might cause the problem?
>>>>>>>
>>>>>>> Obviously, we cannot control Xen on Amazon's servers. But perhaps there is something we can do on the kernel side?
>>>>>>>
>>>>>>>> On Mon, 2016-06-06 at 11:24 +0300, Evgenii Shatokhin wrote:
>>>>>>>>> (Resending this bug report because the message I sent last week did not
>>>>>>>>> make it to the mailing list somehow.)
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> One of our users gets kernel panics from time to time when he tries to
>>>>>>>>> use his Amazon EC2 instance with CentOS7 x64 in it [1]. The kernel panic
>>>>>>>>> happens within minutes of the instance starting. The problem
>>>>>>>>> does not show up every time, however.
>>>>>>>>>
>>>>>>>>> The user first observed the problem with a custom kernel, but it was
>>>>>>>>> found later that the stock kernel 3.10.0-327.18.2.el7.x86_64 from
>>>>>>>>> CentOS7 was affected as well.
>>>>>>
>>>>>> Please try this patch:
>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7b0767502b5db11cb1f0daef2d01f6d71b1192dc
>>>>>>
>>>>>> Regards,
>>>>>> Bob
>>>>>>
>>>>>
>>>>> Unfortunately, it did not help. The same BUG_ON() in blkfront_setup_indirect() still triggers in our kernel based on RHEL's 3.10.0-327.18.2, where I added the patch.
>>>>>
>>>>> As far as I can see, the patch makes sure the indirect pages are added to the list only if (!info->feature_persistent) holds. I suppose it holds in our case and the pages are added to the list because the triggered BUG_ON() is here:
>>>>>
>>>>>       if (!info->feature_persistent && info->max_indirect_segments) {
>>>>>           <...>
>>>>>           BUG_ON(!list_empty(&info->indirect_pages));
>>>>>           <...>
>>>>>       }
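To make the invariant behind that BUG_ON() concrete, here is a toy model in Python (not the driver code). The class and function names are made up for illustration; the only things taken from the excerpt are the condition and the requirement that the list be empty on entry. It shows that any path which runs the setup twice without emptying the list in between would trip the check. Whether that is what actually happens on EC2 is not established.

```python
class ToyBlkfrontInfo:
    """Hypothetical stand-in for the fields used in the quoted condition."""

    def __init__(self, feature_persistent, max_indirect_segments):
        self.feature_persistent = feature_persistent
        self.max_indirect_segments = max_indirect_segments
        self.indirect_pages = []  # models info->indirect_pages


def toy_setup_indirect(info, num_pages=4):
    """Mimic the shape of the quoted check; the page count is arbitrary."""
    if not info.feature_persistent and info.max_indirect_segments:
        # Counterpart of BUG_ON(!list_empty(&info->indirect_pages)).
        if info.indirect_pages:
            raise RuntimeError("indirect_pages not empty on setup")
        info.indirect_pages.extend(object() for _ in range(num_pages))


def toy_free_indirect(info):
    """Hypothetical teardown that returns the list to the empty state."""
    info.indirect_pages.clear()
```

In this model a second `toy_setup_indirect()` call raises unless `toy_free_indirect()` ran in between, and nothing is checked at all when `feature_persistent` is set.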
>>>>>
>>>>
>>>> That's odd.
>>>> Could you please try to reproduce this issue with a recent upstream kernel?
>>>>
>>>> Thanks,
>>>> Bob
>>>
>>> No luck with the upstream kernel 4.7.0 so far due to unrelated issues (bad initrd, I suppose, so the system does not even boot).
>>>
>>> However, the problem reproduced with the stable upstream kernel 3.14.74. After the system booted the second time with this kernel, that BUG_ON triggered:
>>>       kernel BUG at drivers/block/xen-blkfront.c:1701
>>>
>>
>> Could you please provide more detail on how to reproduce this bug? I'd like to test it myself.
>>
>> Thanks!
>> Bob
> 
> As the user says, he uses an Amazon EC2 instance. Namely: HVM CentOS7 AMI on a c3.large instance with EBS magnetic storage.
> 

Oh, then it will be difficult to debug this issue.
xen-blkfront communicates with xen-blkback (in dom0 or a driver domain), but that part is a black box when running on Amazon EC2.
We can't see the source code of the backend side!

Can this bug be reproduced in your own environment (Xen + dom0)?

> At least 2 LVM partitions are needed:
> * /, 20-30 GB should be enough, ext4
> * /vz, 5-10 GB should be enough, ext4
> 
> Kernel 3.14.74 I was talking about: https://www.dropbox.com/s/bhus3mubza87z86/kernel-3.14.74-1.test.x86_64.rpm?dl=1
> 
> Not sure if it is relevant, but the user may have installed additional packages from https://download.openvz.org/virtuozzo/releases/7.0-rtm/x86_64/os/ repository. Namely: vzctl, vzmigrate, vzprocps, vztt-lib, vzctcalc, ploop, prlctl, centos-7-x86_64-ez.
> 
> After the kernel and the other mentioned packages have been installed,
> the user rebooted the instance to run that kernel 3.14.74.
> 
> Then - start the instance, wait 5 minutes, stop the instance, repeat. 2-20 such iterations were usually enough to reproduce the problem. Can be automated with the help of Amazon's API.
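The start / wait / stop loop described above could be scripted roughly as follows. This is only a sketch: it assumes the boto3 EC2 client (or any object with the same `start_instances`, `stop_instances`, `get_waiter` and `get_console_output` methods), and the function name, the 300-second uptime and the string it searches for are illustrative choices, not something taken from the report.

```python
import time

BUG_MARKER = "kernel BUG at drivers/block/xen-blkfront.c"


def cycle_until_bug(ec2, instance_id, max_iterations=20, uptime_s=300,
                    sleep=time.sleep):
    """Start/stop an instance repeatedly; return the iteration that hit the BUG.

    `ec2` is assumed to behave like boto3.client("ec2"). Returns None if the
    marker never shows up in the console output.
    """
    ids = [instance_id]
    for iteration in range(1, max_iterations + 1):
        ec2.start_instances(InstanceIds=ids)
        ec2.get_waiter("instance_running").wait(InstanceIds=ids)

        # Give the guest time to boot; the report says the panic shows up
        # within minutes of the start.
        sleep(uptime_s)

        # Console output can lag behind the guest, so a miss here does not
        # prove that the boot was clean.
        reply = ec2.get_console_output(InstanceId=instance_id)
        hit = BUG_MARKER in (reply.get("Output") or "")

        ec2.stop_instances(InstanceIds=ids)
        ec2.get_waiter("instance_stopped").wait(InstanceIds=ids)

        if hit:
            return iteration
    return None
```

With credentials configured, it would be called as `cycle_until_bug(boto3.client("ec2"), "<instance-id>")`. Passing the client in as a parameter keeps the loop testable with a stub, without touching a real instance.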
> 
> BTW, before the BUG_ON triggered this time, there was the following in dmesg. Not sure if it is related but still:
> 

Attaching the full dmesg would be better.

Regards,
Bob

> ----------------------
> [    2.835034] scsi0 : ata_piix
> [    2.840317] scsi1 : ata_piix
> [    2.842267] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc100 irq 14
> [    2.845861] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc108 irq 15
> [    2.853840] AVX version of gcm_enc/dec engaged.
> [    2.859963] xen_netfront: Initialising Xen virtual ethernet driver
> [    2.867156] alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
> [    2.885861] blkfront: xvda: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
> [    2.889046] alg: No test for crc32 (crc32-pclmul)
> [    2.899290]  xvda: xvda1
> [    2.997751] blkfront: xvdc: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
> [    3.007401]  xvdc: unknown partition table
> [    3.010465] Setting capacity to 31992832
> [    3.012922] xvdc: detected capacity change from 0 to 16380329984
> [    3.017408] blkfront: xvdd: flush diskcache: enabled; persistent grants: disabled; indirect descriptors: enabled;
> [    3.023861]  xvdd: unknown partition table
> [    3.026481] Setting capacity to 31992832
> [    3.029051] xvdd: detected capacity change from 0 to 16380329984
> [    3.033320] blkfront: xvdf: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
> [    3.040712] random: nonblocking pool is initialized
> [    3.057432]  xvdf: unknown partition table
> [    3.060807] Setting capacity to 41943040
> [    3.063194] xvdf: detected capacity change from 0 to 21474836480
> [    3.067684] blkfront: xvdb: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: enabled;
> [    3.076835]  xvdb: unknown partition table
> [    3.079692] Setting capacity to 16777216
> [    3.082112] xvdb: detected capacity change from 0 to 8589934592
> [    3.086853] vbd vbd-51712: 16 xlvbd_add at /local/domain/0/backend/vbd/9543/51712
> ----------------------
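As a sanity check on the excerpt above: the "Setting capacity to N" values are counts of 512-byte sectors, and multiplying by 512 gives the byte sizes in the "detected capacity change" lines that follow them. A small Python check, with the pairs copied from the log (the device names are inferred from the adjacent lines):

```python
SECTOR_BYTES = 512  # sector unit used by the Linux block layer

# (device, sectors from "Setting capacity to",
#  bytes from "detected capacity change from 0 to")
PAIRS = [
    ("xvdc", 31992832, 16380329984),
    ("xvdd", 31992832, 16380329984),
    ("xvdf", 41943040, 21474836480),
    ("xvdb", 16777216, 8589934592),
]


def sectors_to_bytes(sectors):
    """Convert a sector count to bytes."""
    return sectors * SECTOR_BYTES


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


def mismatches(pairs=PAIRS):
    """Return the devices whose two reported sizes disagree."""
    return [dev for dev, sectors, size in pairs
            if sectors_to_bytes(sectors) != size]
```

All four pairs agree; xvdf works out to exactly 20 GiB and xvdb to exactly 8 GiB.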
> 
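The last line of that excerpt, `vbd vbd-51712: 16 xlvbd_add at ...`, can also be decoded. The sketch below rests on two assumptions that should be checked against the kernel source in use: that 51712 follows the classic xvd numbering (`202 << 8 | disk << 4 | partition`), and that the number after the colon is a positive errno value as printed by the xenbus error helpers. The function names are illustrative.

```python
import errno
import re

XVD_MAJOR = 202  # major number used for xvd* devices


def decode_xvd(vdev):
    """Split a Xen virtual device number into (name, partition).

    Covers only the classic xvd encoding, 202 << 8 | disk << 4 | partition,
    i.e. the first 16 disks; anything else returns None.
    """
    if vdev >> 8 != XVD_MAJOR:
        return None
    disk = (vdev >> 4) & 0xF
    partition = vdev & 0xF
    return "xvd" + chr(ord("a") + disk), partition


def decode_vbd_line(line):
    """Parse 'vbd vbd-<vdev>: <errno> <message>' into its parts."""
    m = re.search(r"vbd-(\d+): (\d+) (.*)", line)
    if m is None:
        return None
    vdev, err, message = int(m.group(1)), int(m.group(2)), m.group(3)
    return decode_xvd(vdev), errno.errorcode.get(err, "?"), message
```

For the line in the log this gives xvda (whole disk) and EBUSY. That would be consistent with xvda already being registered when the add was attempted, but the full dmesg is needed to confirm it.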
>>
>>>>
>>>>> So the problem is still out there somewhere, it seems.
>>>>>
>>>>> Regards,
>>>>> Evgenii
>>>>>
>>>>>>>>>
>>>>>>>>> The part of the system log he was able to retrieve is attached. Here is
>>>>>>>>> the bug info, for convenience:
>>>>>>>>>
>>>>>>>>> ------------------------------------
>>>>>>>>> [    2.246912] kernel BUG at drivers/block/xen-blkfront.c:1711!
>>>>>>>>> [    2.246912] invalid opcode: 0000 [#1] SMP
>>>>>>>>> [    2.246912] Modules linked in: ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel xen_netfront xen_blkfront(+) aesni_intel lrw ata_piix gf128mul glue_helper ablk_helper cryptd libata serio_raw floppy sunrpc dm_mirror dm_region_hash dm_log dm_mod scsi_transport_iscsi
>>>>>>>>> [    2.246912] CPU: 1 PID: 50 Comm: xenwatch Not tainted 3.10.0-327.18.2.el7.x86_64 #1
>>>>>>>>> [    2.246912] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
>>>>>>>>> [    2.246912] task: ffff8800e9fcb980 ti: ffff8800e98bc000 task.ti: ffff8800e98bc000
>>>>>>>>> [    2.246912] RIP: 0010:[<ffffffffa015584f>]  [<ffffffffa015584f>] blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
>>>>>>>>> [    2.246912] RSP: 0018:ffff8800e98bfcd0  EFLAGS: 00010283
>>>>>>>>> [    2.246912] RAX: ffff8800353e15c0 RBX: ffff8800e98c52c8 RCX: 0000000000000020
>>>>>>>>> [    2.246912] RDX: ffff8800353e15b0 RSI: ffff8800e98c52b8 RDI: ffff8800353e15d0
>>>>>>>>> [    2.246912] RBP: ffff8800e98bfd20 R08: ffff8800353e15b0 R09: ffff8800eb403c00
>>>>>>>>> [    2.246912] R10: ffffffffa0155532 R11: ffffffffffffffe8 R12: ffff8800e98c4000
>>>>>>>>> [    2.246912] R13: ffff8800e98c52b8 R14: 0000000000000020 R15: ffff8800353e15c0
>>>>>>>>> [    2.246912] FS:  0000000000000000(0000) GS:ffff8800efc20000(0000) knlGS:0000000000000000
>>>>>>>>> [    2.246912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>> [    2.246912] CR2: 00007f1b615ef000 CR3: 00000000e2b44000 CR4: 00000000001406e0
>>>>>>>>> [    2.246912] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>> [    2.246912] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>>>>>>> [    2.246912] Stack:
>>>>>>>>> [    2.246912]  0000000000000020 0000000000000001 00000020a0157217 00000100e98bfdbc
>>>>>>>>> [    2.246912]  0000000027efa3ef ffff8800e98bfdbc ffff8800e98ce000 ffff8800e98c4000
>>>>>>>>> [    2.246912]  ffff8800e98ce040 0000000000000001 ffff8800e98bfe08 ffffffffa0155d4c
>>>>>>>>> [    2.246912] Call Trace:
>>>>>>>>> [    2.246912]  [<ffffffffa0155d4c>] blkback_changed+0x4ec/0xfc8 [xen_blkfront]
>>>>>>>>> [    2.246912]  [<ffffffff813a6fd0>] ? xenbus_gather+0x170/0x190
>>>>>>>>> [    2.246912]  [<ffffffff816322f5>] ? __slab_free+0x10e/0x277
>>>>>>>>> [    2.246912]  [<ffffffff813a805d>] xenbus_otherend_changed+0xad/0x110
>>>>>>>>> [    2.246912]  [<ffffffff813a7257>] ? xenwatch_thread+0x77/0x180
>>>>>>>>> [    2.246912]  [<ffffffff813a9ba3>] backend_changed+0x13/0x20
>>>>>>>>> [    2.246912]  [<ffffffff813a7246>] xenwatch_thread+0x66/0x180
>>>>>>>>> [    2.246912]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
>>>>>>>>> [    2.246912]  [<ffffffff813a71e0>] ? unregister_xenbus_watch+0x1f0/0x1f0
>>>>>>>>> [    2.246912]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
>>>>>>>>> [    2.246912]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
>>>>>>>>> [    2.246912]  [<ffffffff81646118>] ret_from_fork+0x58/0x90
>>>>>>>>> [    2.246912]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
>>>>>>>>> [    2.246912] Code: e1 48 85 c0 75 ce 49 8d 84 24 40 01 00 00 48 89 45 b8 e9 91 fd ff ff 4c 89 ff e8 8d ae 06 e1 e9 f2 fc ff ff 31 c0 e9 2e fe ff ff <0f> 0b e8 9a 57 f2 e0 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00
>>>>>>>>> [    2.246912] RIP  [<ffffffffa015584f>] blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]
>>>>>>>>> [    2.246912]  RSP <ffff8800e98bfcd0>
>>>>>>>>> [    2.491574] ---[ end trace 8a9b992812627c71 ]---
>>>>>>>>> [    2.495618] Kernel panic - not syncing: Fatal exception
>>>>>>>>> ------------------------------------
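For reading the oops above, the two most useful fields are the RIP location (`symbol+0xoffset/0xsize [module]`) and the `Code:` line, where the byte in angle brackets is the one at the faulting RIP. A minimal Python sketch for pulling both out follows. The helper names are illustrative, and the two sample strings are copied from the log with the line wrapping removed.

```python
import re

LOCATION_RE = re.compile(
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)


def parse_location(text):
    """Parse 'symbol+0xoff/0xsize [module]' as printed in RIP and call traces."""
    m = LOCATION_RE.search(text)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),  # None for symbols built into the kernel
    }


def trapping_bytes(code_line, count=2):
    """Return `count` bytes starting at the <xx> marker on a 'Code:' line."""
    tokens = code_line.split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            return [tok.strip("<>")] + tokens[i + 1:i + count]
    return []


RIP = "blkfront_setup_indirect+0x41f/0x430 [xen_blkfront]"
CODE = "ff ff <0f> 0b e8 9a 57 f2 e0 0f 0b 0f 1f 84 00"
```

Offset 0x41f in a function of size 0x430 puts the trap 17 bytes before the end of `blkfront_setup_indirect`. The bytes `0f 0b` are the x86 `ud2` instruction, which is what BUG() compiles to, hence the "invalid opcode" line.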

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

Thread overview: 15+ messages
2016-06-06  8:24 [BUG] kernel BUG at drivers/block/xen-blkfront.c:1711 Evgenii Shatokhin
2016-06-06  8:42 ` Dario Faggioli
2016-07-11  8:50   ` Evgenii Shatokhin
2016-07-11 10:37     ` George Dunlap
2016-07-11 14:34       ` Evgenii Shatokhin
2016-07-11 12:04     ` Bob Liu
2016-07-11 14:08       ` Evgenii Shatokhin
2016-07-14 11:49       ` Evgenii Shatokhin
2016-07-14 12:04         ` Bob Liu
2016-07-14 12:53           ` Evgenii Shatokhin
2016-08-10 12:33           ` Evgenii Shatokhin
2016-08-10 12:49             ` Bob Liu
2016-08-10 14:54               ` Evgenii Shatokhin
2016-08-11  2:10                 ` Bob Liu [this message]
2016-08-11  7:45                   ` Evgenii Shatokhin
