All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dexuan Cui <decui@microsoft.com>
To: "hch@lst.de" <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>,
	Bart Van Assche <Bart.VanAssche@sandisk.com>,
	"hare@suse.com" <hare@suse.com>, "hare@suse.de" <hare@suse.de>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"jth@kernel.org" <jth@kernel.org>,
	Nick Meier <Nick.Meier@microsoft.com>,
	"Alex Ng (LIS)" <alexng@microsoft.com>,
	Long Li <longli@microsoft.com>,
	"Adrian Suhov (Cloudbase Solutions SRL)" <v-adsuho@microsoft.com>,
	"Chris Valean (Cloudbase Solutions SRL)" <v-chvale@microsoft.com>
Subject: RE: Boot regression (was "Re: [PATCH] genhd: Do not hold event	lock when scheduling workqueue elements")
Date: Wed, 15 Feb 2017 13:51:33 +0000	[thread overview]
Message-ID: <MWHPR03MB26691FB87F0823E2E62868BABF5B0@MWHPR03MB2669.namprd03.prod.outlook.com> (raw)
In-Reply-To: <20170214163437.GA23956@lst.de>

> From: hch@lst.de [mailto:hch@lst.de]
> Sent: Wednesday, February 15, 2017 00:35
> > I tested today's linux-next (next-20170214) + the 2 patches just now an=
d
> got
> > a weird result:
> > sometimes the VM stills hung with a new calltrace (BUG: spinlock bad
> > magic) , but sometimes the VM did boot up despite the new calltrace!
> >
> > Attached is the log of a "good" boot.
> >
> > It looks we have a memory corruption issue somewhere...
>
> Yes.
It's due to an uninitialized spinlock. Please see the below.

> > Actually previously I saw the "BUG: spinlock bad magic" message once, b=
ut
> I
> > couldn't repro it later, so I didn't mention it to you.
>
> Interesting.
Ditto.
And probably my memory was inaccurate due to the long period of bisecting.
We should always see the message.

>
> A few questions on the dmesg:
>
> [    6.208794] sd 2:0:0:0: [storvsc] Sense Key : Illegal Request [current=
]
> [    6.209447] sd 2:0:0:0: [storvsc] Add. Sense: Invalid command operatio=
n
> code
> [    6.210043] sd 3:0:0:0: [storvsc] Sense Key : Illegal Request [current=
]
> [    6.210618] sd 3:0:0:0: [storvsc] Add. Sense: Invalid command operatio=
n
> code
> [    6.212272] sd 2:0:0:0: [storvsc] Sense Key : Illegal Request [current=
]
> [    6.212897] sd 2:0:0:0: [storvsc] Add. Sense: Invalid command operatio=
n
> code
> [    6.213474] sd 3:0:0:0: [storvsc] Sense Key : Illegal Request [current=
]
> [    6.214051] sd 3:0:0:0: [storvsc] Add. Sense: Invalid command operatio=
n
> code
>
> I didn't see anything like this in the other logs.  Are these messages
> something usual on HyperV VMs?

The messages should be normal, and I have always seen them for years
at least.

I think they're triggered by "/sbin/hdparm /dev/sda" or
"/lib/udev/ata_id /dev/sda"  in my Ubuntu 16.04 VM: during the VM
boot-up, the 2 programs are executed by the initrd's systemd-udev or other
system init scripts and the 2 programs try to run these 2 SCSI commands,
which are not supported by Hyper-V:

ATA PASS THROUGH (12) -- 0xa1
ATA PASS THROUGH (16) -- 0x85

IMO the commands are used when PATA/SATA devices are connected to
SCSI-to-ATA bridge, so it's understandable that Hyper-V doesn't support
them.

In the case of boot failure (hang), the 2 programs don't have a chance to
run, so we can't see the messages.

IMO we should ignore the messages, which should be harmless.

> [    6.358405] XFS (sdb1): Mounting V5 Filesystem
> [    6.404478] XFS (sdb1): Ending clean mount
> [    7.535174] BUG: spinlock bad magic on CPU#0, swapper/0/0
> [    7.536807]  lock: host_ts+0x30/0xffffffffffffe1a0 [hv_utils], .magic:
> 00000000, .owner: <none>/-1, .owner_cpu: 0
> [    7.538436] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-rc8-next-
> 20170214+ #1
> [    7.539142] Hardware name: Microsoft Corporation Virtual Machine/Virtu=
al
> Machine, BIOS 090006  04/28/2016
> [    7.539142] Call Trace:
> [    7.539142]  <IRQ>
> [    7.539142]  dump_stack+0x63/0x82
> [    7.539142]  spin_dump+0x78/0xc0
> [    7.539142]  do_raw_spin_lock+0xfd/0x160
> [    7.539142]  _raw_spin_lock_irqsave+0x4c/0x60
> [    7.539142]  ? timesync_onchannelcallback+0x153/0x220 [hv_utils]
> [    7.539142]  timesync_onchannelcallback+0x153/0x220 [hv_utils]
>
> Can you resolve this address using gdb to a line of code?  Once inside
> gdb do:
>
> l *(timesync_onchannelcallback+0x153)

(gdb) l *(timesync_onchannelcallback+0x153)
0xffffffffc0104593 is in timesync_onchannelcallback (drivers/hv/hv_util.c:2=
79).
274             } else {
275                     /*
276                      * Save the adjusted time sample from the host and =
the snapshot
277                      * of the current system time for PTP device.
278                      */
279                     spin_lock_irqsave(&host_ts.lock, flags);
280
281                     cur_reftime =3D hyperv_cs->read(hyperv_cs);
282                     host_ts.host_time =3D hosttime;
283                     host_ts.ref_time =3D cur_reftime;

It turns out the "host_ts.lock" isn't initialized with spin_lock_init().
I'll submit a patch for this.

However, the SCSI issue (i.e. *sometimes* the VM fails to boot) is still th=
ere, I think.
We need to continue to debug...

Thanks,
-- Dexuan

WARNING: multiple messages have this Message-ID (diff)
From: Dexuan Cui <decui@microsoft.com>
To: "hch@lst.de" <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>,
	Bart Van Assche <Bart.VanAssche@sandisk.com>,
	"hare@suse.com" <hare@suse.com>, "hare@suse.de" <hare@suse.de>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"jth@kernel.org" <jth@kernel.org>,
	Nick Meier <Nick.Meier@microsoft.com>,
	"Alex Ng (LIS)" <alexng@microsoft.com>,
	Long Li <longli@microsoft.com>,
	"Adrian Suhov (Cloudbase Solutions SRL)" <v-adsuho@microsoft.com>,
	"Chris Valean (Cloudbase Solutions SRL)" <v-chvale@microsoft.com>
Subject: RE: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue elements")
Date: Wed, 15 Feb 2017 13:51:33 +0000	[thread overview]
Message-ID: <MWHPR03MB26691FB87F0823E2E62868BABF5B0@MWHPR03MB2669.namprd03.prod.outlook.com> (raw)
In-Reply-To: <20170214163437.GA23956@lst.de>

> From: hch@lst.de [mailto:hch@lst.de]
> Sent: Wednesday, February 15, 2017 00:35
> > I tested today's linux-next (next-20170214) + the 2 patches just now and
> got
> > a weird result:
> > sometimes the VM stills hung with a new calltrace (BUG: spinlock bad
> > magic) , but sometimes the VM did boot up despite the new calltrace!
> >
> > Attached is the log of a "good" boot.
> >
> > It looks we have a memory corruption issue somewhere...
>
> Yes.
It's due to an uninitialized spinlock. Please see the below.

> > Actually previously I saw the "BUG: spinlock bad magic" message once, but
> I
> > couldn't repro it later, so I didn't mention it to you.
>
> Interesting.
Ditto.
And probably my memory was inaccurate due to the long period of bisecting.
We should always see the message.

>
> A few questions on the dmesg:
>
> [    6.208794] sd 2:0:0:0: [storvsc] Sense Key : Illegal Request [current]
> [    6.209447] sd 2:0:0:0: [storvsc] Add. Sense: Invalid command operation
> code
> [    6.210043] sd 3:0:0:0: [storvsc] Sense Key : Illegal Request [current]
> [    6.210618] sd 3:0:0:0: [storvsc] Add. Sense: Invalid command operation
> code
> [    6.212272] sd 2:0:0:0: [storvsc] Sense Key : Illegal Request [current]
> [    6.212897] sd 2:0:0:0: [storvsc] Add. Sense: Invalid command operation
> code
> [    6.213474] sd 3:0:0:0: [storvsc] Sense Key : Illegal Request [current]
> [    6.214051] sd 3:0:0:0: [storvsc] Add. Sense: Invalid command operation
> code
>
> I didn't see anything like this in the other logs.  Are these messages
> something usual on HyperV VMs?

The messages should be normal, and I have always seen them for years
at least.

I think they're triggered by "/sbin/hdparm /dev/sda" or
"/lib/udev/ata_id /dev/sda"  in my Ubuntu 16.04 VM: during the VM
boot-up, the 2 programs are executed by the initrd's systemd-udev or other
system init scripts and the 2 programs try to run these 2 SCSI commands,
which are not supported by Hyper-V:

ATA PASS THROUGH (12) -- 0xa1
ATA PASS THROUGH (16) -- 0x85

IMO the commands are used when PATA/SATA devices are connected to
SCSI-to-ATA bridge, so it's understandable that Hyper-V doesn't support
them.

In the case of boot failure (hang), the 2 programs don't have a chance to
run, so we can't see the messages.

IMO we should ignore the messages, which should be harmless.

> [    6.358405] XFS (sdb1): Mounting V5 Filesystem
> [    6.404478] XFS (sdb1): Ending clean mount
> [    7.535174] BUG: spinlock bad magic on CPU#0, swapper/0/0
> [    7.536807]  lock: host_ts+0x30/0xffffffffffffe1a0 [hv_utils], .magic:
> 00000000, .owner: <none>/-1, .owner_cpu: 0
> [    7.538436] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-rc8-next-
> 20170214+ #1
> [    7.539142] Hardware name: Microsoft Corporation Virtual Machine/Virtual
> Machine, BIOS 090006  04/28/2016
> [    7.539142] Call Trace:
> [    7.539142]  <IRQ>
> [    7.539142]  dump_stack+0x63/0x82
> [    7.539142]  spin_dump+0x78/0xc0
> [    7.539142]  do_raw_spin_lock+0xfd/0x160
> [    7.539142]  _raw_spin_lock_irqsave+0x4c/0x60
> [    7.539142]  ? timesync_onchannelcallback+0x153/0x220 [hv_utils]
> [    7.539142]  timesync_onchannelcallback+0x153/0x220 [hv_utils]
>
> Can you resolve this address using gdb to a line of code?  Once inside
> gdb do:
>
> l *(timesync_onchannelcallback+0x153)

(gdb) l *(timesync_onchannelcallback+0x153)
0xffffffffc0104593 is in timesync_onchannelcallback (drivers/hv/hv_util.c:279).
274             } else {
275                     /*
276                      * Save the adjusted time sample from the host and the snapshot
277                      * of the current system time for PTP device.
278                      */
279                     spin_lock_irqsave(&host_ts.lock, flags);
280
281                     cur_reftime = hyperv_cs->read(hyperv_cs);
282                     host_ts.host_time = hosttime;
283                     host_ts.ref_time = cur_reftime;

It turns out the "host_ts.lock" isn't initialized with spin_lock_init().
I'll submit a patch for this.

However, the SCSI issue (i.e. *sometimes* the VM fails to boot) is still there, I think.
We need to continue to debug...

Thanks,
-- Dexuan

  reply	other threads:[~2017-02-15 13:51 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-18  9:48 [PATCH] genhd: Do not hold event lock when scheduling workqueue elements Hannes Reinecke
2017-01-31  0:31 ` Bart Van Assche
2017-01-31  0:31   ` Bart Van Assche
2017-01-31 16:15   ` Hannes Reinecke
2017-01-31 16:15     ` Hannes Reinecke
2017-02-03 12:22     ` Dexuan Cui
2017-02-03 12:22       ` Dexuan Cui
2017-02-07  2:23       ` Dexuan Cui
2017-02-07  2:23         ` Dexuan Cui
2017-02-07  2:56         ` Bart Van Assche
2017-02-07  2:56           ` Bart Van Assche
2017-02-07  3:48           ` Dexuan Cui
2017-02-07  6:29             ` Dexuan Cui
2017-02-07  6:29               ` Dexuan Cui
2017-02-07 16:09               ` Jens Axboe
2017-02-08 10:48                 ` Dexuan Cui
2017-02-08 10:48                   ` Dexuan Cui
2017-02-08 17:43                   ` Boot regression (was "Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue elements") Jens Axboe
2017-02-08 18:03                     ` hch
2017-02-09  7:35                       ` Dexuan Cui
2017-02-09  7:35                         ` Dexuan Cui
2017-02-09 13:08                         ` hch
2017-02-10 14:49                           ` Dexuan Cui
2017-02-10 14:49                             ` Dexuan Cui
2017-02-14 13:47                             ` hch
2017-02-14 14:17                               ` Dexuan Cui
2017-02-14 14:17                                 ` Dexuan Cui
2017-02-14 14:28                                 ` hch
2017-02-14 14:46                                   ` Dexuan Cui
2017-02-14 14:46                                     ` Dexuan Cui
2017-02-14 14:51                                     ` hch
2017-02-14 15:54                                       ` Dexuan Cui
2017-02-14 15:54                                         ` Dexuan Cui
2017-02-14 16:34                                         ` hch
2017-02-15 13:51                                           ` Dexuan Cui [this message]
2017-02-15 13:51                                             ` Dexuan Cui

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=MWHPR03MB26691FB87F0823E2E62868BABF5B0@MWHPR03MB2669.namprd03.prod.outlook.com \
    --to=decui@microsoft.com \
    --cc=Bart.VanAssche@sandisk.com \
    --cc=Nick.Meier@microsoft.com \
    --cc=alexng@microsoft.com \
    --cc=axboe@kernel.dk \
    --cc=hare@suse.com \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=jth@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longli@microsoft.com \
    --cc=martin.petersen@oracle.com \
    --cc=v-adsuho@microsoft.com \
    --cc=v-chvale@microsoft.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.