* HP ProLiant DL360p Gen8 hangs with Linux 4.13+.
@ 2018-01-04 22:32 Vinson Lee
  2018-01-05 16:32 ` Bart Van Assche
  2018-01-14 23:40 ` Laurence Oberman
  0 siblings, 2 replies; 10+ messages in thread
From: Vinson Lee @ 2018-01-04 22:32 UTC (permalink / raw)
  To: linux-scsi, Don Brace

Hi.

HP ProLiant DL360p Gen8 with Smart Array P420i boots to the login
prompt and hangs with Linux 4.13 or later. I cannot log in on console
or SSH into the machine. Linux 4.12 and older boot fine.

I see these messages on the console.

[  242.843206] INFO: task scsi_eh_2:465 blocked for more than 120 seconds.
[  242.877835]       Not tainted 4.15.0-041500rc6-generic #201712312330
[  242.909228] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.945404] INFO: task xfsaild/sda2:625 blocked for more than 120 seconds.
[  242.945407]       Not tainted 4.15.0-041500rc6-generic #201712312330
[  242.945410] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.945896] INFO: task kworker/u130:4:1023 blocked for more than 120 seconds.
[  242.945897]       Not tainted 4.15.0-041500rc6-generic #201712312330
[  242.945897] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.946449] INFO: task modprobe:1550 blocked for more than 120 seconds.
[  242.946450]       Not tainted 4.15.0-041500rc6-generic #201712312330
[  242.946450] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.946943] INFO: task postfix:1704 blocked for more than 120 seconds.
[  242.946946]       Not tainted 4.15.0-041500rc6-generic #201712312330
[  242.946948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.947429] INFO: task (xinit.sh):1989 blocked for more than 120 seconds.
[  242.947432]       Not tainted 4.15.0-041500rc6-generic #201712312330
[  242.947434] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.674387] INFO: task scsi_eh_2:465 blocked for more than 120 seconds.
[  363.707741]       Not tainted 4.15.0-041500rc6-generic #201712312330
[  363.738601] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.774098] INFO: task xfsaild/sda2:625 blocked for more than 120 seconds.
[  363.804996]       Not tainted 4.15.0-041500rc6-generic #201712312330
[  363.833565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.869380] INFO: task kworker/u130:4:1023 blocked for more than 120 seconds.
[  363.901795]       Not tainted 4.15.0-041500rc6-generic #201712312330
[  363.930403] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.966228] INFO: task modprobe:1550 blocked for more than 120 seconds.
[  363.966231]       Not tainted 4.15.0-041500rc6-generic #201712312330
[  363.966233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Cheers,
Vinson
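
[Editorial sketch, not part of the original message.] The console log above
repeats the same hung-task warning for each blocked task every 120 seconds.
A small helper like the following can reduce a saved console log to the
distinct blocked tasks. The function name hung_tasks and the idea of saving
the console output to a file are assumptions made here for illustration.

```shell
#!/bin/sh
# hung_tasks FILE...
# Print "task-name pid" once for every distinct task that the hung-task
# detector reported in the given console log file(s).
hung_tasks() {
    # The task name may itself contain colons (e.g. kworker/u130:4), so the
    # greedy first group takes everything up to the last ":<pid> blocked".
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than.*/\1 \2/p' "$@" |
        sort -u
}
```

Running hung_tasks on the log in this message would list scsi_eh_2,
xfsaild/sda2, kworker/u130:4, modprobe, postfix and (xinit.sh), each with
its PID.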

* Re: HP ProLiant DL360p Gen8 hangs with Linux 4.13+.
  2018-01-04 22:32 HP ProLiant DL360p Gen8 hangs with Linux 4.13+ Vinson Lee
@ 2018-01-05 16:32 ` Bart Van Assche
  2018-01-06 20:45   ` Laurence Oberman
  2018-01-11  0:52   ` Vinson Lee
  2018-01-14 23:40 ` Laurence Oberman
  1 sibling, 2 replies; 10+ messages in thread
From: Bart Van Assche @ 2018-01-05 16:32 UTC (permalink / raw)
  To: linux-scsi, don.brace, vlee

On Thu, 2018-01-04 at 14:32 -0800, Vinson Lee wrote:
> HP ProLiant DL360p Gen8 with Smart Array P420i boots to the login
> prompt and hangs with Linux 4.13 or later. I cannot log in on console
> or SSH into the machine. Linux 4.12 and older boot fine.
> 
> I see these messages on the console.
> 
> [  242.843206] INFO: task scsi_eh_2:465 blocked for more than 120 seconds.
> [  242.877835]       Not tainted 4.15.0-041500rc6-generic #201712312330

It seems like something got stuck in the block layer. The traditional way to
debug this is to analyze the information that is available under
/sys/kernel/debug/block. However, since login is not possible we can't use
that approach. Would it be possible for you to check whether this has been
resolved in kernel v4.15-rc6, and if not, bisect this?

Thanks,

Bart.
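
[Editorial sketch, not part of the original message.] On a machine where
login is still possible, the block-layer state mentioned above can be
captured by dumping every file under /sys/kernel/debug/block. This assumes
debugfs is mounted at /sys/kernel/debug and the kernel was built with block
debugfs support; the function name dump_blk_debugfs is made up here.

```shell
#!/bin/sh
# dump_blk_debugfs [ROOT]
# Print the path and contents of every file below ROOT
# (default: /sys/kernel/debug/block), in sorted order.
dump_blk_debugfs() {
    root="${1:-/sys/kernel/debug/block}"
    if [ ! -d "$root" ]; then
        echo "no such directory: $root" >&2
        return 1
    fi
    find "$root" -type f | sort | while IFS= read -r f; do
        printf '== %s\n' "$f"
        # Some debugfs entries may not be readable; note that and continue.
        cat "$f" 2>/dev/null || echo "(unreadable)"
    done
}
```

Run as root, for example "dump_blk_debugfs > /tmp/blk-state.txt", and attach
the resulting file to the report.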

* Re: HP ProLiant DL360p Gen8 hangs with Linux 4.13+.
  2018-01-05 16:32 ` Bart Van Assche
@ 2018-01-06 20:45   ` Laurence Oberman
  2018-01-11  0:52   ` Vinson Lee
  1 sibling, 0 replies; 10+ messages in thread
From: Laurence Oberman @ 2018-01-06 20:45 UTC (permalink / raw)
  To: Bart Van Assche, linux-scsi, don.brace, vlee

On Fri, 2018-01-05 at 16:32 +0000, Bart Van Assche wrote:
> On Thu, 2018-01-04 at 14:32 -0800, Vinson Lee wrote:
> > HP ProLiant DL360p Gen8 with Smart Array P420i boots to the login
> > prompt and hangs with Linux 4.13 or later. I cannot log in on
> > console
> > or SSH into the machine. Linux 4.12 and older boot fine.
> > 
> > I see these messages on the console.
> > 
> > [  242.843206] INFO: task scsi_eh_2:465 blocked for more than 120
> > seconds.
> > [  242.877835]       Not tainted 4.15.0-041500rc6-generic
> > #201712312330
> 
> It seems like something got stuck in the block layer. The traditional
> way to
> debug this is to analyze the information that is available under
> /sys/kernel/debug/block. However, since login is not possible we
> can't use
> that approach. Would it be possible for you to check whether this has
> been
> resolved in kernel v4.15-rc6, and if not, bisect this?
> 
> Thanks,
> 
> Bart.

One way to debug this, given that it's an HP ProLiant server, is to follow
these steps.

1. Boot the working kernel.
2. Ensure kdump is activated and running on boot.
3. Add these to the /etc/sysctl.conf file:

   kernel.panic_on_io_nmi = 1
   kernel.panic_on_unrecovered_nmi = 1
   kernel.unknown_nmi_panic = 1

4. Boot the failing kernel. Once it hangs, go to the iLO page under
   admin/diagnostics and press the Virtual NMI button to generate a vmcore.

When you have a vmcore, I will give you a place to upload it so I
can look at it.
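
[Editorial sketch, not part of the original message.] Step 3 above can be
scripted. The helper below writes the three settings to a file of your
choosing; the function name write_nmi_sysctls and the drop-in path used in
the usage note are assumptions, and /etc/sysctl.conf works just as well.

```shell
#!/bin/sh
# write_nmi_sysctls FILE
# Write the three NMI-related panic settings from the steps above to FILE,
# replacing any previous contents of FILE.
write_nmi_sysctls() {
    conf="${1:?usage: write_nmi_sysctls FILE}"
    cat > "$conf" <<'EOF'
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.unknown_nmi_panic = 1
EOF
}
```

As root, "write_nmi_sysctls /etc/sysctl.d/90-nmi-panic.conf" followed by
"sysctl -p /etc/sysctl.d/90-nmi-panic.conf" applies the settings without a
reboot. How to confirm that kdump is armed depends on the distribution and
is not covered here.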

* Re: HP ProLiant DL360p Gen8 hangs with Linux 4.13+.
  2018-01-05 16:32 ` Bart Van Assche
  2018-01-06 20:45   ` Laurence Oberman
@ 2018-01-11  0:52   ` Vinson Lee
  2018-01-17  0:17     ` Vinson Lee
  1 sibling, 1 reply; 10+ messages in thread
From: Vinson Lee @ 2018-01-11  0:52 UTC (permalink / raw)
  To: Bart Van Assche, Thomas Gleixner, Christoph Hellwig; +Cc: linux-scsi, don.brace

On Fri, Jan 5, 2018 at 8:32 AM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Thu, 2018-01-04 at 14:32 -0800, Vinson Lee wrote:
>> HP ProLiant DL360p Gen8 with Smart Array P420i boots to the login
>> prompt and hangs with Linux 4.13 or later. I cannot log in on console
>> or SSH into the machine. Linux 4.12 and older boot fine.
>>
>> I see these messages on the console.
>>
>> [  242.843206] INFO: task scsi_eh_2:465 blocked for more than 120 seconds.
>> [  242.877835]       Not tainted 4.15.0-041500rc6-generic #201712312330
>
> It seems like something got stuck in the block layer. The traditional way to
> debug this is to analyze the information that is available under
> /sys/kernel/debug/block. However, since login is not possible we can't use
> that approach. Would it be possible for you to check whether this has been
> resolved in kernel v4.15-rc6, and if not, bisect this?
>
> Thanks,
>
> Bart.

Hi.

The machine still hangs with Linux 4.15-rc6.

I did a bisect. The hang is introduced with Linux 4.13-rc1 commit
c5cb83bb337c25caae995d992d1cdf9b317f83de "genirq/cpuhotplug: Handle
managed IRQs on CPU hotplug".

There is a startup script that disables hyperthreading by offlining
sibling CPUs.

for CPU in $(cut -s -d, -f2 $SYS_PATH/cpu*/topology/thread_siblings_list | sort -un); do
    echo 0 > /sys/devices/system/cpu/cpu$CPU/online
done

If the above script is not run, the machine does not hang with Linux 4.13.

Cheers,
Vinson
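
[Editorial sketch, not part of the original message.] For anyone trying to
reproduce the hang, the startup script can be split into a read-only listing
step and the step that actually offlines CPUs. SYS_PATH is not defined in
the snippet above; the sketch assumes it is /sys/devices/system/cpu, and the
function names are made up here.

```shell
#!/bin/sh
# list_sibling_cpus [SYS_PATH]
# Print the second CPU of each comma-separated thread_siblings_list entry,
# one per line, numerically sorted and de-duplicated.
list_sibling_cpus() {
    sys_path="${1:-/sys/devices/system/cpu}"
    cut -s -d, -f2 "$sys_path"/cpu*/topology/thread_siblings_list | sort -un
}

# offline_sibling_cpus [SYS_PATH]
# Write 0 to the "online" file of every CPU printed by list_sibling_cpus.
offline_sibling_cpus() {
    sys_path="${1:-/sys/devices/system/cpu}"
    for cpu in $(list_sibling_cpus "$sys_path"); do
        echo 0 > "$sys_path/cpu$cpu/online"
    done
}
```

Running list_sibling_cpus first shows which CPUs would go offline. Because
"cut -s" skips lines without a comma, sibling lists written as a range such
as "0-1" are ignored by this approach.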

* Re: HP ProLiant DL360p Gen8 hangs with Linux 4.13+.
  2018-01-04 22:32 HP ProLiant DL360p Gen8 hangs with Linux 4.13+ Vinson Lee
  2018-01-05 16:32 ` Bart Van Assche
@ 2018-01-14 23:40 ` Laurence Oberman
  2018-01-15 12:17   ` Ming Lei
  1 sibling, 1 reply; 10+ messages in thread
From: Laurence Oberman @ 2018-01-14 23:40 UTC (permalink / raw)
  To: Vinson Lee, linux-scsi, Don Brace; +Cc: Hellwig, Christoph, Jens Axboe

On Thu, 2018-01-04 at 14:32 -0800, Vinson Lee wrote:
> Hi.
> 
> HP ProLiant DL360p Gen8 with Smart Array P420i boots to the login
> prompt and hangs with Linux 4.13 or later. I cannot log in on console
> or SSH into the machine. Linux 4.12 and older boot fine.
> 
> 
...

...

This issue bit me for two straight days.
I was testing Mike Snitzer's combined tree, and this commit crept into
the latest combined tree.

commit 84676c1f21e8ff54befe985f4f14dc1edc10046b
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Jan 12 10:53:05 2018 +0800

    genirq/affinity: assign vectors to all possible CPUs

    Currently we assign managed interrupt vectors to all present CPUs.  This
    works fine for systems were we only online/offline CPUs.  But in case of
    systems that support physical CPU hotplug (or the virtualized version of
    it) this means the additional CPUs covered for in the ACPI tables or on
    the command line are not catered for.  To fix this we'd either need to
    introduce new hotplug CPU states just for this case, or we can start
    assining vectors to possible but not present CPUs.
   
    Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
    Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
    Cc: linux-kernel@vger.kernel.org
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

The reason I never thought of this as the cause of my latest hang
is that I have used Linus's tree all the way to 4.15-rc7 with no issues.

Vinson reporting it against 4.13 or later did not make sense to me because
I had not seen the hang until this weekend.

I checked, and the commit is in Linus's tree, but it's not an issue in the
generic 4.15-rc7 for me.

Anyway, it's possibly going to bite anybody running HP DL servers with
HPSA boot devices. I have not tried the workaround below.
From Vinson's message, repeated here:

"The machine still hangs with Linux 4.15-rc6.

I did a bisect. The hang is introduced with Linux 4.13-rc1 commit
c5cb83bb337c25caae995d992d1cdf9b317f83de "genirq/cpuhotplug: Handle
managed IRQs on CPU hotplug".

There is a startup script that disables hyperthreading by offlining
sibling CPUs.

for CPU in $(cut -s -d, -f2 $SYS_PATH/cpu*/topology/thread_siblings_list | sort -un); do
    echo 0 > /sys/devices/system/cpu/cpu$CPU/online
done

If the above script is not run, the machine does not hang with Linux
4.13.

Cheers,
Vinson"

Thanks
Laurence

* Re: HP ProLiant DL360p Gen8 hangs with Linux 4.13+.
  2018-01-14 23:40 ` Laurence Oberman
@ 2018-01-15 12:17   ` Ming Lei
  2018-01-15 12:51     ` Laurence Oberman
  0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2018-01-15 12:17 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Vinson Lee, linux-scsi, Don Brace, Hellwig, Christoph,
	Jens Axboe, Thomas Gleixner, linux-kernel

On Sun, Jan 14, 2018 at 06:40:40PM -0500, Laurence Oberman wrote:
> On Thu, 2018-01-04 at 14:32 -0800, Vinson Lee wrote:
> > Hi.
> > 
> > HP ProLiant DL360p Gen8 with Smart Array P420i boots to the login
> > prompt and hangs with Linux 4.13 or later. I cannot log in on console
> > or SSH into the machine. Linux 4.12 and older boot fine.
> > 
> > 
> ...
> 
> ...
> 
> This issue bit me for for two straight days.
> I was testing Mike Snitzers combined tree and this commit crept into
> the latest combined tree.
> 
> commit 84676c1f21e8ff54befe985f4f14dc1edc10046b
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Fri Jan 12 10:53:05 2018 +0800
> 
>     genirq/affinity: assign vectors to all possible CPUs
>    
>     Currently we assign managed interrupt vectors to all present
> CPUs.  This
>     works fine for systems were we only online/offline CPUs.  But in
> case of
>     systems that support physical CPU hotplug (or the virtualized
> version of
>     it) this means the additional CPUs covered for in the ACPI tables
> or on
>     the command line are not catered for.  To fix this we'd either need
> to
>     introduce new hotplug CPU states just for this case, or we can
> start
>     assining vectors to possible but not present CPUs.
>    
>     Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
>     Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
>     Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
>     Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
>     Cc: linux-kernel@vger.kernel.org
>     Cc: Thomas Gleixner <tglx@linutronix.de>
>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> Reason I never thought about this being my reason for the latest hang
> is I have used Linus' tree all the way to 4.15-rc7 with no issues.
> 
> Vinson reporting it against 4.13 or later was not making sense because
> I had not seen the hang until this weekend.
> 
> I checked  and its in Linus's tree but its not an issue in the generic
> 4.15-rc7 for me.

Hi Laurence,

Regarding your issue, I investigated a bit and found that it happens because
an irq vector may be assigned only to CPUs that are all offline. It may not
be the same issue as Vinson's.

The following patch should address your issue. I can prepare a formal
version if no one objects to this approach.

Thomas, Christoph, could you take a look at this patch?

---
 kernel/irq/affinity.c | 69 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 47 insertions(+), 22 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a37a3b4b6342..dfc1f6a9c488 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,6 +94,39 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
 	return nodes;
 }
 
+/*
+ * Spread the CPUs of @nmsk across up to min(@max_irqmsks, @max_ncpus) irq
+ * vectors; the resulting masks are stored starting at @irqmsk.
+ */
+static int irq_vecs_spread_affinity(struct cpumask *irqmsk,
+				    int max_irqmsks,
+				    struct cpumask *nmsk,
+				    int max_ncpus)
+{
+	int v, ncpus;
+	int vecs_to_assign, extra_vecs;
+
+	/* Calculate the number of cpus per vector */
+	ncpus = cpumask_weight(nmsk);
+	vecs_to_assign = min(max_ncpus, ncpus);
+
+	/* Account for rounding errors */
+	extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
+
+	for (v = 0; v < min(max_irqmsks, vecs_to_assign); v++) {
+		int cpus_per_vec = ncpus / vecs_to_assign;
+
+		/* Account for extra vectors to compensate rounding errors */
+		if (extra_vecs) {
+			cpus_per_vec++;
+			--extra_vecs;
+		}
+		irq_spread_init_one(irqmsk + v, nmsk, cpus_per_vec);
+	}
+
+	return v;
+}
+
 /**
  * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
  * @nvecs:	The total number of vectors
@@ -104,7 +137,7 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
 struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
-	int n, nodes, cpus_per_vec, extra_vecs, curvec;
+	int n, nodes, curvec;
 	int affv = nvecs - affd->pre_vectors - affd->post_vectors;
 	int last_affv = affv + affd->pre_vectors;
 	nodemask_t nodemsk = NODE_MASK_NONE;
@@ -154,33 +187,25 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	}
 
 	for_each_node_mask(n, nodemsk) {
-		int ncpus, v, vecs_to_assign, vecs_per_node;
+		int vecs_per_node;
 
 		/* Spread the vectors per node */
 		vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
-		/* Get the cpus on this node which are in the mask */
-		cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
 
-		/* Calculate the number of cpus per vector */
-		ncpus = cpumask_weight(nmsk);
-		vecs_to_assign = min(vecs_per_node, ncpus);
-
-		/* Account for rounding errors */
-		extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
-
-		for (v = 0; curvec < last_affv && v < vecs_to_assign;
-		     curvec++, v++) {
-			cpus_per_vec = ncpus / vecs_to_assign;
-
-			/* Account for extra vectors to compensate rounding errors */
-			if (extra_vecs) {
-				cpus_per_vec++;
-				--extra_vecs;
-			}
-			irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
-		}
+		/* spread non-online possible cpus */
+		cpumask_andnot(nmsk, node_to_possible_cpumask[n], cpu_online_mask);
+		irq_vecs_spread_affinity(&masks[curvec], last_affv - curvec,
+					 nmsk, vecs_per_node);
 
+		/*
+		 * spread online possible cpus to make sure each vector
+		 * can get one online cpu to handle
+		 */
+		cpumask_and(nmsk, node_to_possible_cpumask[n], cpu_online_mask);
+		curvec += irq_vecs_spread_affinity(&masks[curvec],
+						   last_affv - curvec,
+						   nmsk, vecs_per_node);
 		if (curvec >= last_affv)
 			break;
 		--nodes;
-- 
2.9.5


-- 
Ming
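
[Editorial sketch, not part of the original message.] The rounding
arithmetic in irq_vecs_spread_affinity() can be checked by hand with a small
shell model that mirrors only the counting: how many CPUs each vector
receives. It does not touch real cpumasks. The function name spread_counts
is made up here, and the early return for an empty CPU set or zero vectors
is an addition of this sketch, since the arithmetic as posted divides by
vecs_to_assign.

```shell
#!/bin/sh
# spread_counts NCPUS MAX_NCPUS MAX_IRQMSKS
# Model the per-vector CPU counts computed by the helper in the patch:
# NCPUS CPUs are spread over min(MAX_NCPUS, NCPUS) vectors, and at most
# MAX_IRQMSKS of those vectors are actually filled in.
spread_counts() {
    ncpus=$1
    max_ncpus=$2
    max_irqmsks=$3

    # vecs_to_assign = min(max_ncpus, ncpus)
    vecs=$max_ncpus
    [ "$ncpus" -lt "$vecs" ] && vecs=$ncpus
    # Guard added in this sketch: nothing to spread, avoid dividing by zero.
    [ "$vecs" -gt 0 ] || return 0

    per=$(( ncpus / vecs ))
    extra=$(( ncpus - vecs * per ))   # leftover CPUs after integer division

    limit=$vecs
    [ "$max_irqmsks" -lt "$limit" ] && limit=$max_irqmsks

    v=0
    while [ "$v" -lt "$limit" ]; do
        n=$per
        # The first "extra" vectors each take one additional CPU.
        if [ "$extra" -gt 0 ]; then
            n=$(( n + 1 ))
            extra=$(( extra - 1 ))
        fi
        echo "vector $v: $n cpus"
        v=$(( v + 1 ))
    done
}
```

For example, "spread_counts 10 4 8" reports 3, 3, 2 and 2 CPUs for vectors
0 to 3, which adds up to the 10 CPUs in the mask.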

* Re: HP ProLiant DL360p Gen8 hangs with Linux 4.13+.
  2018-01-15 12:17   ` Ming Lei
@ 2018-01-15 12:51     ` Laurence Oberman
  2018-01-15 15:01       ` Hellwig, Christoph
  0 siblings, 1 reply; 10+ messages in thread
From: Laurence Oberman @ 2018-01-15 12:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Vinson Lee, linux-scsi, Don Brace, Hellwig, Christoph,
	Jens Axboe, Thomas Gleixner, linux-kernel

On Mon, 2018-01-15 at 20:17 +0800, Ming Lei wrote:
> On Sun, Jan 14, 2018 at 06:40:40PM -0500, Laurence Oberman wrote:
> > On Thu, 2018-01-04 at 14:32 -0800, Vinson Lee wrote:
> > > Hi.
> > > 
> > > HP ProLiant DL360p Gen8 with Smart Array P420i boots to the login
> > > prompt and hangs with Linux 4.13 or later. I cannot log in on
> > > console
> > > or SSH into the machine. Linux 4.12 and older boot fine.
> > > 
> > > 
> > 
> > ...
> > 
> > ...
> > 
> > This issue bit me for for two straight days.
> > I was testing Mike Snitzers combined tree and this commit crept
> > into
> > the latest combined tree.
> > 
> > commit 84676c1f21e8ff54befe985f4f14dc1edc10046b
> > Author: Christoph Hellwig <hch@lst.de>
> > Date:   Fri Jan 12 10:53:05 2018 +0800
> > 
> >     genirq/affinity: assign vectors to all possible CPUs
> >    
> >     Currently we assign managed interrupt vectors to all present
> > CPUs.  This
> >     works fine for systems were we only online/offline CPUs.  But
> > in
> > case of
> >     systems that support physical CPU hotplug (or the virtualized
> > version of
> >     it) this means the additional CPUs covered for in the ACPI
> > tables
> > or on
> >     the command line are not catered for.  To fix this we'd either
> > need
> > to
> >     introduce new hotplug CPU states just for this case, or we can
> > start
> >     assining vectors to possible but not present CPUs.
> >    
> >     Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
> >     Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> >     Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
> >     Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present
> > CPU")
> >     Cc: linux-kernel@vger.kernel.org
> >     Cc: Thomas Gleixner <tglx@linutronix.de>
> >     Signed-off-by: Christoph Hellwig <hch@lst.de>
> >     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > 
> > Reason I never thought about this being my reason for the latest
> > hang
> > is I have used Linus' tree all the way to 4.15-rc7 with no issues.
> > 
> > Vinson reporting it against 4.13 or later was not making sense
> > because
> > I had not seen the hang until this weekend.
> > 
> > I checked  and its in Linus's tree but its not an issue in the
> > generic
> > 4.15-rc7 for me.
> 
> Hi Laurence,
> 
> Wrt. your issue, I have investigated a bit and found that it is
> because
> one irq vector may be assigned to all offline CPUs, and it may not be
> same with Vinson's.
> 
> And the following patch can address your issue, I may prepare a
> formal
> version if no one objects this approach.
> 
> Thomas, Christoph, could you take a look this patch?
> 
> ---
>  kernel/irq/affinity.c | 69 +++++++++++++++++++++++++++++++++++----
> ------------
>  1 file changed, 47 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index a37a3b4b6342..dfc1f6a9c488 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -94,6 +94,39 @@ static int get_nodes_in_cpumask(cpumask_var_t
> *node_to_possible_cpumask,
>  	return nodes;
>  }
>  
> +/*
> + * Spread the affinity of @nmsk into @nr_vecs irq vectors, and the
> + * result is stored to @start_irqmsk.
> + */
> +static int irq_vecs_spread_affinity(struct cpumask *irqmsk,
> +				    int max_irqmsks,
> +				    struct cpumask *nmsk,
> +				    int max_ncpus)
> +{
> +	int v, ncpus;
> +	int vecs_to_assign, extra_vecs;
> +
> +	/* Calculate the number of cpus per vector */
> +	ncpus = cpumask_weight(nmsk);
> +	vecs_to_assign = min(max_ncpus, ncpus);
> +
> +	/* Account for rounding errors */
> +	extra_vecs = ncpus - vecs_to_assign * (ncpus /
> vecs_to_assign);
> +
> +	for (v = 0; v < min(max_irqmsks, vecs_to_assign); v++) {
> +		int cpus_per_vec = ncpus / vecs_to_assign;
> +
> +		/* Account for extra vectors to compensate rounding
> errors */
> +		if (extra_vecs) {
> +			cpus_per_vec++;
> +			--extra_vecs;
> +		}
> +		irq_spread_init_one(irqmsk + v, nmsk, cpus_per_vec);
> +	}
> +
> +	return v;
> +}
> +
>  /**
>   * irq_create_affinity_masks - Create affinity masks for multiqueue
> spreading
>   * @nvecs:	The total number of vectors
> @@ -104,7 +137,7 @@ static int get_nodes_in_cpumask(cpumask_var_t
> *node_to_possible_cpumask,
>  struct cpumask *
>  irq_create_affinity_masks(int nvecs, const struct irq_affinity
> *affd)
>  {
> -	int n, nodes, cpus_per_vec, extra_vecs, curvec;
> +	int n, nodes, curvec;
>  	int affv = nvecs - affd->pre_vectors - affd->post_vectors;
>  	int last_affv = affv + affd->pre_vectors;
>  	nodemask_t nodemsk = NODE_MASK_NONE;
> @@ -154,33 +187,25 @@ irq_create_affinity_masks(int nvecs, const
> struct irq_affinity *affd)
>  	}
>  
>  	for_each_node_mask(n, nodemsk) {
> -		int ncpus, v, vecs_to_assign, vecs_per_node;
> +		int vecs_per_node;
>  
>  		/* Spread the vectors per node */
>  		vecs_per_node = (affv - (curvec - affd-
> >pre_vectors)) / nodes;
>  
> -		/* Get the cpus on this node which are in the mask
> */
> -		cpumask_and(nmsk, cpu_possible_mask,
> node_to_possible_cpumask[n]);
>  
> -		/* Calculate the number of cpus per vector */
> -		ncpus = cpumask_weight(nmsk);
> -		vecs_to_assign = min(vecs_per_node, ncpus);
> -
> -		/* Account for rounding errors */
> -		extra_vecs = ncpus - vecs_to_assign * (ncpus /
> vecs_to_assign);
> -
> -		for (v = 0; curvec < last_affv && v <
> vecs_to_assign;
> -		     curvec++, v++) {
> -			cpus_per_vec = ncpus / vecs_to_assign;
> -
> -			/* Account for extra vectors to compensate
> rounding errors */
> -			if (extra_vecs) {
> -				cpus_per_vec++;
> -				--extra_vecs;
> -			}
> -			irq_spread_init_one(masks + curvec, nmsk,
> cpus_per_vec);
> -		}
> +		/* spread non-online possible cpus */
> +		cpumask_andnot(nmsk, node_to_possible_cpumask[n],
> cpu_online_mask);
> +		irq_vecs_spread_affinity(&masks[curvec], last_affv -
> curvec,
> +					 nmsk, vecs_per_node);
>  
> +		/*
> +		 * spread online possible cpus to make sure each
> vector
> +		 * can get one online cpu to handle
> +		 */
> +		cpumask_and(nmsk, node_to_possible_cpumask[n],
> cpu_online_mask);
> +		curvec += irq_vecs_spread_affinity(&masks[curvec],
> +						   last_affv -
> curvec,
> +						   nmsk,
> vecs_per_node);
>  		if (curvec >= last_affv)
>  			break;
>  		--nodes;
> -- 
> 2.9.5
> 
> 

Hello Ming

I will test the patch. I did not spend a lot of time checking whether this
weekend's stalls were an exact match to Vinson's; I just knew that pulling
that patch resolved them.
Perhaps this explains why I was not seeing this on generic 4.15-rc7.

Thanks
Laurence

* Re: HP ProLiant DL360p Gen8 hangs with Linux 4.13+.
  2018-01-15 12:51     ` Laurence Oberman
@ 2018-01-15 15:01       ` Hellwig, Christoph
  2018-01-15 16:25         ` Laurence Oberman
  0 siblings, 1 reply; 10+ messages in thread
From: Hellwig, Christoph @ 2018-01-15 15:01 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Ming Lei, Vinson Lee, linux-scsi, Don Brace, Hellwig, Christoph,
	Jens Axboe, Thomas Gleixner, linux-kernel

Laurence, I'm a little confused.  Is this the same issue we just fixed,
or is this an issue showing up with the fix?

E.g. what kernel versions or trees are affected?

* Re: HP ProLiant DL360p Gen8 hangs with Linux 4.13+.
  2018-01-15 15:01       ` Hellwig, Christoph
@ 2018-01-15 16:25         ` Laurence Oberman
  0 siblings, 0 replies; 10+ messages in thread
From: Laurence Oberman @ 2018-01-15 16:25 UTC (permalink / raw)
  To: Hellwig, Christoph
  Cc: Ming Lei, Vinson Lee, linux-scsi, Don Brace, Jens Axboe,
	Thomas Gleixner, linux-kernel

On Mon, 2018-01-15 at 07:01 -0800, Hellwig, Christoph wrote:
> Laurence, I'm a little confused.  Is this the same issue we just
> fixed,
> or is this an issue showing up with the fix?
> 
> E.g. what kernel versions or trees are affected?

Hello Christoph

This showed up on a combined tree of Mike's and Jens's
(4.15.0-rc4.block.dm.4.16) that I was testing this weekend, but it was not
apparent on the generic upstream 4.15-rc7 from Linus.
I have to admit that was puzzling me.

When I removed your commit, the issue went away.

Ming has crafted a fix so that your original commit can remain in, and I
am testing that now against the same tree that was hanging before.

Ming has a handle on the issue, so I will report back after testing.

The kernel is building now.

Thanks
Laurence

* Re: HP ProLiant DL360p Gen8 hangs with Linux 4.13+.
  2018-01-11  0:52   ` Vinson Lee
@ 2018-01-17  0:17     ` Vinson Lee
  0 siblings, 0 replies; 10+ messages in thread
From: Vinson Lee @ 2018-01-17  0:17 UTC (permalink / raw)
  To: Bart Van Assche, Thomas Gleixner, Christoph Hellwig; +Cc: linux-scsi, don.brace

On Wed, Jan 10, 2018 at 4:52 PM, Vinson Lee <vlee@freedesktop.org> wrote:
> On Fri, Jan 5, 2018 at 8:32 AM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
>> On Thu, 2018-01-04 at 14:32 -0800, Vinson Lee wrote:
>>> HP ProLiant DL360p Gen8 with Smart Array P420i boots to the login
>>> prompt and hangs with Linux 4.13 or later. I cannot log in on console
>>> or SSH into the machine. Linux 4.12 and older boot fine.
>>>
>>> I see these messages on the console.
>>>
>>> [  242.843206] INFO: task scsi_eh_2:465 blocked for more than 120 seconds.
>>> [  242.877835]       Not tainted 4.15.0-041500rc6-generic #201712312330
>>
>> It seems like something got stuck in the block layer. The traditional way to
>> debug this is to analyze the information that is available under
>> /sys/kernel/debug/block. However, since login is not possible we can't use
>> that approach. Would it be possible for you to check whether this has been
>> resolved in kernel v4.15-rc6, and if not, bisect this?
>>
>> Thanks,
>>
>> Bart.
>
> Hi.
>
> The machine still hangs with Linux 4.15-rc6.
>
> I did a bisect. The hang is introduced with Linux 4.13-rc1 commit
> c5cb83bb337c25caae995d992d1cdf9b317f83de "genirq/cpuhotplug: Handle
> managed IRQs on CPU hotplug".
>
> There is a startup script that disables hyperthreading by offlining
> sibling CPUs.
>
> for CPU in $(cut -s -d, -f2 $SYS_PATH/cpu*/topology/thread_siblings_list | sort -un); do
>     echo 0 > /sys/devices/system/cpu/cpu$CPU/online
> done
>
> If the above script is not run, the machine does not hang with Linux 4.13.
>
> Cheers,
> Vinson

Hi.

HP ProLiant DL360p Gen8 still hangs with Linux 4.15-rc8.

I now also see hangs on another machine, with a Microsemi Adaptec RAID
71605 and the aacraid driver, on both Linux 4.13 and Linux 4.15-rc8.

Cheers,
Vinson

Thread overview: 10+ messages
2018-01-04 22:32 HP ProLiant DL360p Gen8 hangs with Linux 4.13+ Vinson Lee
2018-01-05 16:32 ` Bart Van Assche
2018-01-06 20:45   ` Laurence Oberman
2018-01-11  0:52   ` Vinson Lee
2018-01-17  0:17     ` Vinson Lee
2018-01-14 23:40 ` Laurence Oberman
2018-01-15 12:17   ` Ming Lei
2018-01-15 12:51     ` Laurence Oberman
2018-01-15 15:01       ` Hellwig, Christoph
2018-01-15 16:25         ` Laurence Oberman
