* Regression in 4.8 - CPU speed set very low
@ 2016-09-09 17:39 Larry Finger
  2016-09-14 16:00 ` Larry Finger
  0 siblings, 1 reply; 35+ messages in thread
From: Larry Finger @ 2016-09-09 17:39 UTC (permalink / raw)
  To: LKML

I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop 
with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu 
frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, 
thus a bisection is not possible. It usually happens under heavy load, such as a 
kernel build or the RPM build of VirtualBox, but it does not always fail with 
these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of 
396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the 
fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the maximum CPU 
frequency is changed, perhaps it would be possible to track this bug.
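
Short of a kernel patch, the moment of the change can at least be timestamped
from user space. The following is a minimal sketch, not something posted in
this thread: log_value_changes is a hypothetical helper that samples a file and
prints a line whenever its content changes. It cannot show which kernel code
path changed the limit, only when the change became visible.

```shell
# Hypothetical helper (an assumption, not part of any existing tool).
# log_value_changes FILE INTERVAL SAMPLES
#   FILE     file to sample, e.g. a cpufreq sysfs attribute
#   INTERVAL seconds to sleep between samples
#   SAMPLES  number of samples to take before returning
log_value_changes() {
    file=$1 interval=$2 samples=$3
    prev=
    i=0
    while [ "$i" -lt "$samples" ]; do
        cur=$(cat "$file" 2>/dev/null)
        if [ "$cur" != "$prev" ]; then
            # Timestamp, file name, old value and new value on one line.
            printf '%s %s: %s -> %s\n' \
                "$(date '+%Y-%m-%d %H:%M:%S')" "$file" "${prev:-<none>}" "$cur"
            prev=$cur
        fi
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

Assuming the affected machine exposes it, the helper could be pointed at
/sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq with a long sample
count and its output redirected to a log file, so that the time of the drop can
be compared against other system activity.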

Sorry that I cannot be more specific.

Larry

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Regression in 4.8 - CPU speed set very low
  2016-09-09 17:39 Regression in 4.8 - CPU speed set very low Larry Finger
@ 2016-09-14 16:00 ` Larry Finger
  2016-09-19  2:54   ` Larry Finger
  0 siblings, 1 reply; 35+ messages in thread
From: Larry Finger @ 2016-09-14 16:00 UTC (permalink / raw)
  To: LKML

On 09/09/2016 12:39 PM, Larry Finger wrote:
> I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
> frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
> thus a bisection is not possible. It usually happens under heavy load, such as a
> kernel build or the RPM build of VirtualBox, but it does not always fail with
> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
> fault is 3437 MHz. Nothing is logged when this happens.
>
> If I were to get a patch that would show a backtrace when the maximum CPU
> frequency is changed, perhaps it would be possible to track this bug.

I have not yet found the bad commit, but I have reduced the range of commits a 
bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 
day to appear in bad kernels, thus I am allowing three days before deciding that 
a given trial is good. I never saw the problem with 4.7 kernels, but I did in 
4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did 
not show the bug.
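
The cost of this approach can be estimated with simple arithmetic. A bisection
over N commits needs roughly ceil(log2(N)) trials, and every trial that does
not fail costs the full waiting period. The sketch below computes that worst
case. The names bisect_steps and bisect_days are illustrative, and the commit
count of 10000 used in the note after it is a made-up round number, not the
real size of the range between 4.7 and 4.8-rc1.

```shell
# bisect_steps N: rough number of trials needed to bisect N commits,
# computed as ceil(log2(N)) by repeated halving.
bisect_steps() {
    awk -v n="$1" 'BEGIN {
        steps = 0
        while (n > 1) { n = int((n + 1) / 2); steps++ }
        print steps
    }'
}

# bisect_days N DAYS_PER_TRIAL: worst case in which every trial has to run
# for the full waiting period before it is declared good.
bisect_days() {
    echo $(( $(bisect_steps "$1") * $2 ))
}
```

With these assumptions, `bisect_days 10000 3` prints 42, so a three-day waiting
period already implies up to six weeks of testing.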

Testing continues.

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-14 16:00 ` Larry Finger
@ 2016-09-19  2:54   ` Larry Finger
  2016-09-24  2:45     ` Larry Finger
  0 siblings, 1 reply; 35+ messages in thread
From: Larry Finger @ 2016-09-19  2:54 UTC (permalink / raw)
  To: LKML

On 09/14/2016 11:00 AM, Larry Finger wrote:
> On 09/09/2016 12:39 PM, Larry Finger wrote:
>> I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
>> thus a bisection is not possible. It usually happens under heavy load, such as a
>> kernel build or the RPM build of VirtualBox, but it does not always fail with
>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
>> fault is 3437 MHz. Nothing is logged when this happens.
>>
>> If I were to get a patch that would show a backtrace when the maximum CPU
>> frequency is changed, perhaps it would be possible to track this bug.
>
> I have not yet found the bad commit, but I have reduced the range of commits a
> bit. This bug has been difficult to trigger. So far, it has not taken over 1/2
> day to appear in bad kernels, thus I am allowing three days before deciding that
> a given trial is good. I never saw the problem with 4.7 kernels, but I did in
> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did
> not show the bug.
>
> Testing continues.

And still does. My bisection seemed to be trending toward an improbable set of 
commits, and I needed to do some other work with the machine, thus I started 
running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated 
that using 3 days to indicate a "good" trial was likely too short. I am 
currently testing the first of the trial and will run it for at least a week. It 
is unlikely that these tests will be complete before 4.8 is released, even if 
-rc8 is needed. I will keep attempting to find the faulty commit.
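
Why a three-day window can be too short is easy to see under a simple model.
Assume, purely for illustration, that failures of a bad kernel arrive at random
with a constant rate (an exponential waiting time) and that the 48 hours
observed here is the mean time to failure. Neither assumption is established by
this thread. The chance that a bad kernel survives a test of t hours is then
exp(-t / mean):

```shell
# survival_probability MEAN_HOURS TEST_HOURS
# Probability that a bad kernel shows no failure during TEST_HOURS, assuming
# exponentially distributed failure times with the given mean (an assumption).
survival_probability() {
    awk -v mean="$1" -v t="$2" 'BEGIN { printf "%.3f\n", exp(-t / mean) }'
}
```

Under those assumptions `survival_probability 48 72` gives 0.223 and
`survival_probability 48 168` gives 0.030. Roughly one bad kernel in five would
pass a three-day test, and about 3% would still pass a one-week test.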

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-19  2:54   ` Larry Finger
@ 2016-09-24  2:45     ` Larry Finger
  2016-09-26 11:37       ` Rafael J. Wysocki
  0 siblings, 1 reply; 35+ messages in thread
From: Larry Finger @ 2016-09-24  2:45 UTC (permalink / raw)
  To: LKML

On 09/18/2016 09:54 PM, Larry Finger wrote:
> On 09/14/2016 11:00 AM, Larry Finger wrote:
>> On 09/09/2016 12:39 PM, Larry Finger wrote:
>>> I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
>>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
>>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
>>> thus a bisection is not possible. It usually happens under heavy load, such as a
>>> kernel build or the RPM build of VirtualBox, but it does not always fail with
>>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
>>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
>>> fault is 3437 MHz. Nothing is logged when this happens.
>>>
>>> If I were to get a patch that would show a backtrace when the maximum CPU
>>> frequency is changed, perhaps it would be possible to track this bug.
>>
>> I have not yet found the bad commit, but I have reduced the range of commits a
>> bit. This bug has been difficult to trigger. So far, it has not taken over 1/2
>> day to appear in bad kernels, thus I am allowing three days before deciding that
>> a given trial is good. I never saw the problem with 4.7 kernels, but I did in
>> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did
>> not show the bug.
>>
>> Testing continues.
>
> And still does. My bisection seemed to be trending toward an improbable set of
> commits, and I needed to do some other work with the machine, thus I started
> running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated
> that using 3 days to indicate a "good" trial was likely too short. I am
> currently testing the first of the trial and will run it for at least a week. It
> is unlikely that these tests will be complete before 4.8 is released, even if
> -rc8 is needed. I will keep attempting to find the faulty commit.

My debugging continues. After 7 days of beating on commit f7816ad, I have 
concluded that it is likely good. Thus I think the bug lies between commit 
581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 
1b05cf6, which did not fail with a shorter run.

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-24  2:45     ` Larry Finger
@ 2016-09-26 11:37       ` Rafael J. Wysocki
  2016-09-26 16:15         ` Larry Finger
  0 siblings, 1 reply; 35+ messages in thread
From: Rafael J. Wysocki @ 2016-09-26 11:37 UTC (permalink / raw)
  To: Larry Finger; +Cc: LKML, Linux PM list

On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
> On 09/18/2016 09:54 PM, Larry Finger wrote:
> > On 09/14/2016 11:00 AM, Larry Finger wrote:
> >> On 09/09/2016 12:39 PM, Larry Finger wrote:
> >>> I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
> >>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
> >>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
> >>> thus a bisection is not possible. It usually happens under heavy load, such as a
> >>> kernel build or the RPM build of VirtualBox, but it does not always fail with
> >>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
> >>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
> >>> fault is 3437 MHz. Nothing is logged when this happens.
> >>>
> >>> If I were to get a patch that would show a backtrace when the maximum CPU
> >>> frequency is changed, perhaps it would be possible to track this bug.
> >>
> >> I have not yet found the bad commit, but I have reduced the range of commits a
> >> bit. This bug has been difficult to trigger. So far, it has not taken over 1/2
> >> day to appear in bad kernels, thus I am allowing three days before deciding that
> >> a given trial is good. I never saw the problem with 4.7 kernels, but I did in
> >> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did
> >> not show the bug.
> >>
> >> Testing continues.
> >
> > And still does. My bisection seemed to be trending toward an improbable set of
> > commits, and I needed to do some other work with the machine, thus I started
> > running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated
> > that using 3 days to indicate a "good" trial was likely too short. I am
> > currently testing the first of the trial and will run it for at least a week. It
> > is unlikely that these tests will be complete before 4.8 is released, even if
> > -rc8 is needed. I will keep attempting to find the faulty commit.
> 
> My debugging continues. After 7 days of beating on commit f7816ad, I have 
> concluded that it is likely good. Thus I think the bug lies between commit 
> 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 
> 1b05cf6, which did not fail with a shorter run.

581e0cd is not a valid mainline commit hash AFAICS.

What cpufreq driver do you use?

Thanks,
Rafael


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 11:37       ` Rafael J. Wysocki
@ 2016-09-26 16:15         ` Larry Finger
  2016-09-26 21:06           ` Rafael J. Wysocki
  0 siblings, 1 reply; 35+ messages in thread
From: Larry Finger @ 2016-09-26 16:15 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: LKML, Linux PM list

On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
> On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
>> On 09/18/2016 09:54 PM, Larry Finger wrote:
>>> On 09/14/2016 11:00 AM, Larry Finger wrote:
>>>> On 09/09/2016 12:39 PM, Larry Finger wrote:
>>>>> I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
>>>>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
>>>>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
>>>>> thus a bisection is not possible. It usually happens under heavy load, such as a
>>>>> kernel build or the RPM build of VirtualBox, but it does not always fail with
>>>>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
>>>>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
>>>>> fault is 3437 MHz. Nothing is logged when this happens.
>>>>>
>>>>> If I were to get a patch that would show a backtrace when the maximum CPU
>>>>> frequency is changed, perhaps it would be possible to track this bug.
>>>>
>>>> I have not yet found the bad commit, but I have reduced the range of commits a
>>>> bit. This bug has been difficult to trigger. So far, it has not taken over 1/2
>>>> day to appear in bad kernels, thus I am allowing three days before deciding that
>>>> a given trial is good. I never saw the problem with 4.7 kernels, but I did in
>>>> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did
>>>> not show the bug.
>>>>
>>>> Testing continues.
>>>
>>> And still does. My bisection seemed to be trending toward an improbable set of
>>> commits, and I needed to do some other work with the machine, thus I started
>>> running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated
>>> that using 3 days to indicate a "good" trial was likely too short. I am
>>> currently testing the first of the trial and will run it for at least a week. It
>>> is unlikely that these tests will be complete before 4.8 is released, even if
>>> -rc8 is needed. I will keep attempting to find the faulty commit.
>>
>> My debugging continues. After 7 days of beating on commit f7816ad, I have
>> concluded that it is likely good. Thus I think the bug lies between commit
>> 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit
>> 1b05cf6, which did not fail with a shorter run.
>
> 581e0cd is not a valid mainline commit hash AFAICS.

That was a typo. The correct value is 581e0c7.
>
> What cpufreq driver do you use?

My "Default CPUFreq governor" is ondemand.

Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in

CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
CONFIG_X86_PCC_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y
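
The question was about the driver, while the Kconfig output above mostly shows
governors and which drivers were built. Which driver and governor are actually
in use can be read at run time from the cpufreq sysfs attributes
scaling_driver and scaling_governor. A minimal sketch follows;
show_cpufreq_policy is a hypothetical helper name, and the default directory
assumes the usual per-CPU cpufreq layout.

```shell
# show_cpufreq_policy [DIR]
# Print the active cpufreq driver and governor found in DIR, which defaults
# to the cpufreq directory of CPU 0 (an assumed, common location).
show_cpufreq_policy() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpufreq}
    for attr in scaling_driver scaling_governor; do
        if [ -r "$dir/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$dir/$attr")"
        else
            printf '%s: <not available>\n' "$attr"
        fi
    done
}
```

Calling `show_cpufreq_policy` without arguments on the affected laptop would
answer the driver question directly.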

Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up going 
wrong. Further tests have shown that commit 351a4ded is bad. Once again, my 
bisection seems to be converging to a set of commits that seem unlikely to cause 
this problem. Perhaps commit f7816ad is not really good even though it survived 
7 days of heavy CPU usage.

I have been reluctant to post my entire .config on the list. It is available at 
http://pastebin.com/aMZaAKwL.

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 16:15         ` Larry Finger
@ 2016-09-26 21:06           ` Rafael J. Wysocki
  2016-09-26 21:26             ` Srinivas Pandruvada
  2016-09-26 21:28             ` Larry Finger
  0 siblings, 2 replies; 35+ messages in thread
From: Rafael J. Wysocki @ 2016-09-26 21:06 UTC (permalink / raw)
  To: Larry Finger; +Cc: LKML, Linux PM list, Srinivas Pandruvada

On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
> On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
> > On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
> >> On 09/18/2016 09:54 PM, Larry Finger wrote:
> >>> On 09/14/2016 11:00 AM, Larry Finger wrote:
> >>>> On 09/09/2016 12:39 PM, Larry Finger wrote:
> >>>>> I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
> >>>>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
> >>>>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
> >>>>> thus a bisection is not possible. It usually happens under heavy load, such as a
> >>>>> kernel build or the RPM build of VirtualBox, but it does not always fail with
> >>>>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
> >>>>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
> >>>>> fault is 3437 MHz. Nothing is logged when this happens.
> >>>>>
> >>>>> If I were to get a patch that would show a backtrace when the maximum CPU
> >>>>> frequency is changed, perhaps it would be possible to track this bug.
> >>>>
> >>>> I have not yet found the bad commit, but I have reduced the range of commits a
> >>>> bit. This bug has been difficult to trigger. So far, it has not taken over 1/2
> >>>> day to appear in bad kernels, thus I am allowing three days before deciding that
> >>>> a given trial is good. I never saw the problem with 4.7 kernels, but I did in
> >>>> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did
> >>>> not show the bug.
> >>>>
> >>>> Testing continues.
> >>>
> >>> And still does. My bisection seemed to be trending toward an improbable set of
> >>> commits, and I needed to do some other work with the machine, thus I started
> >>> running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated
> >>> that using 3 days to indicate a "good" trial was likely too short. I am
> >>> currently testing the first of the trial and will run it for at least a week. It
> >>> is unlikely that these tests will be complete before 4.8 is released, even if
> >>> -rc8 is needed. I will keep attempting to find the faulty commit.
> >>
> >> My debugging continues. After 7 days of beating on commit f7816ad, I have
> >> concluded that it is likely good. Thus I think the bug lies between commit
> >> 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit
> >> 1b05cf6, which did not fail with a shorter run.
> >
> > 581e0cd is not a valid mainline commit hash AFAICS.
> 
> That was a typo. The correct value is 581e0c7.
> >
> > What cpufreq driver do you use?
> 
> > My "Default CPUFreq governor" is ondemand.
> 
> Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in
> 
> CONFIG_ACPI_CPU_FREQ_PSS=y
> CONFIG_CPU_FREQ=y
> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
> CONFIG_CPU_FREQ_GOV_COMMON=y
> # CONFIG_CPU_FREQ_STAT is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
> CONFIG_CPU_FREQ_GOV_POWERSAVE=m
> CONFIG_CPU_FREQ_GOV_USERSPACE=m
> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
> CONFIG_X86_PCC_CPUFREQ=m
> CONFIG_X86_ACPI_CPUFREQ=m
> CONFIG_X86_ACPI_CPUFREQ_CPB=y
> 
> Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up going 
> > wrong. Further tests have shown that commit 351a4ded is bad. Once again, my 
> bisection seems to be converging to a set of commits that seem unlikely to cause 
> this problem. Perhaps commit f7816ad is not really good even though it survived 
> 7 days of heavy CPU usage.
> 
> I have been reluctant to post my entire .config on the list. It is available at 
> http://pastebin.com/aMZaAKwL.

If the governor is ondemand, the driver is acpi-cpufreq, most likely.

How do you measure the frequency?

Thanks,
Rafael


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 21:06           ` Rafael J. Wysocki
@ 2016-09-26 21:26             ` Srinivas Pandruvada
  2016-09-26 21:30               ` Rafael J. Wysocki
  2016-09-26 21:28             ` Larry Finger
  1 sibling, 1 reply; 35+ messages in thread
From: Srinivas Pandruvada @ 2016-09-26 21:26 UTC (permalink / raw)
  To: Rafael J. Wysocki, Larry Finger; +Cc: LKML, Linux PM list

On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:

[...]

> > I have been reluctant to post my entire .config on the list. It is
> > available at 
> > http://pastebin.com/aMZaAKwL.
> 
> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
> 
> How do you measure the frequency?
> 
Also, when you get into this situation, please dump:
# cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
# cat /sys/devices/system/cpu/intel_pstate/*
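
A plain cat over a glob prints the values without saying which file each one
came from. The following sketch, using a hypothetical helper named dump_named,
prints every value next to its path and silently skips patterns that match
nothing, which is what would happen with the intel_pstate directory if that
driver is not active.

```shell
# dump_named FILE...
# Print "path: content" for every argument that is a regular file.
# Glob patterns that match nothing reach the function as literal strings
# and are skipped by the -f test.
dump_named() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$f")"
        fi
    done
}
```

The two requests above would then become a single call:
dump_named /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq /sys/devices/system/cpu/intel_pstate/*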


Thanks,
Srinivas


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 21:06           ` Rafael J. Wysocki
  2016-09-26 21:26             ` Srinivas Pandruvada
@ 2016-09-26 21:28             ` Larry Finger
  2016-09-26 21:37               ` Rafael J. Wysocki
  2016-09-27 14:51               ` Lennart Sorensen
  1 sibling, 2 replies; 35+ messages in thread
From: Larry Finger @ 2016-09-26 21:28 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: LKML, Linux PM list, Srinivas Pandruvada

On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:
> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>> On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
>>> On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
>>>> On 09/18/2016 09:54 PM, Larry Finger wrote:
>>>>> On 09/14/2016 11:00 AM, Larry Finger wrote:
>>>>>> On 09/09/2016 12:39 PM, Larry Finger wrote:
>>>>>>> I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
>>>>>>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
>>>>>>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
>>>>>>> thus a bisection is not possible. It usually happens under heavy load, such as a
>>>>>>> kernel build or the RPM build of VirtualBox, but it does not always fail with
>>>>>>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
>>>>>>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
>>>>>>> fault is 3437 MHz. Nothing is logged when this happens.
>>>>>>>
>>>>>>> If I were to get a patch that would show a backtrace when the maximum CPU
>>>>>>> frequency is changed, perhaps it would be possible to track this bug.
>>>>>>
>>>>>> I have not yet found the bad commit, but I have reduced the range of commits a
>>>>>> bit. This bug has been difficult to trigger. So far, it has not taken over 1/2
>>>>>> day to appear in bad kernels, thus I am allowing three days before deciding that
>>>>>> a given trial is good. I never saw the problem with 4.7 kernels, but I did in
>>>>>> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did
>>>>>> not show the bug.
>>>>>>
>>>>>> Testing continues.
>>>>>
>>>>> And still does. My bisection seemed to be trending toward an improbable set of
>>>>> commits, and I needed to do some other work with the machine, thus I started
>>>>> running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated
>>>>> that using 3 days to indicate a "good" trial was likely too short. I am
>>>>> currently testing the first of the trial and will run it for at least a week. It
>>>>> is unlikely that these tests will be complete before 4.8 is released, even if
>>>>> -rc8 is needed. I will keep attempting to find the faulty commit.
>>>>
>>>> My debugging continues. After 7 days of beating on commit f7816ad, I have
>>>> concluded that it is likely good. Thus I think the bug lies between commit
>>>> 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit
>>>> 1b05cf6, which did not fail with a shorter run.
>>>
>>> 581e0cd is not a valid mainline commit hash AFAICS.
>>
>> That was a typo. The correct value is 581e0c7.
>>>
>>> What cpufreq driver do you use?
>>
>> My "Default CPUFreq governor" is ondemand.
>>
>> Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in
>>
>> CONFIG_ACPI_CPU_FREQ_PSS=y
>> CONFIG_CPU_FREQ=y
>> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
>> CONFIG_CPU_FREQ_GOV_COMMON=y
>> # CONFIG_CPU_FREQ_STAT is not set
>> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
>> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
>> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
>> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
>> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
>> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
>> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
>> CONFIG_CPU_FREQ_GOV_POWERSAVE=m
>> CONFIG_CPU_FREQ_GOV_USERSPACE=m
>> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
>> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
>> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
>> CONFIG_X86_PCC_CPUFREQ=m
>> CONFIG_X86_ACPI_CPUFREQ=m
>> CONFIG_X86_ACPI_CPUFREQ_CPB=y
>>
>> Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up going
>> wrong. Further tests have shown that commit 351a4ded is bad. Once again, my
>> bisection seems to be converging to a set of commits that seem unlikely to cause
>> this problem. Perhaps commit f7816ad is not really good even though it survived
>> 7 days of heavy CPU usage.
>>
>> I have been reluctant to post my entire .config on the list. It is available at
>> http://pastebin.com/aMZaAKwL.
>
> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
>
> How do you measure the frequency?

Mostly I use a KDE applet named "System load" and look at the "average clock", 
but the same info is also available in /proc/cpuinfo as "cpu MHz". When the bug 
triggers, the system gets very slow, and the cpu fan stops even though the cpu 
is still busy.
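
For a record that does not depend on a desktop applet, the "cpu MHz" lines of
/proc/cpuinfo can be summarized with a few lines of awk. This is only a sketch:
cpu_mhz_summary is a hypothetical helper, and how the KDE applet computes its
"average clock" is not stated in this thread, so the average printed here may
differ from what the applet shows.

```shell
# cpu_mhz_summary [FILE]
# Print the minimum and the arithmetic mean of all "cpu MHz" values found in
# FILE, which defaults to /proc/cpuinfo.
cpu_mhz_summary() {
    awk -F: '
        /^cpu MHz/ {
            v = $2 + 0                      # field after the colon, as a number
            sum += v
            if (n == 0 || v < min) min = v
            n++
        }
        END { if (n) printf "min=%.1f avg=%.1f\n", min, sum / n }
    ' "${1:-/proc/cpuinfo}"
}
```

Run periodically and appended to a log together with a timestamp, the output
would show when the frequency collapsed even if nobody was watching the
desktop at that moment.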

Commit f7816ad, which had run for 7 days without showing the bug, failed after 
about 2 hours today. All my testing since Sept. 9 has been wasted. Oh well, 
that's the way it goes!

Thanks,

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 21:26             ` Srinivas Pandruvada
@ 2016-09-26 21:30               ` Rafael J. Wysocki
  2016-09-26 21:41                 ` Srinivas Pandruvada
  0 siblings, 1 reply; 35+ messages in thread
From: Rafael J. Wysocki @ 2016-09-26 21:30 UTC (permalink / raw)
  To: Srinivas Pandruvada; +Cc: Rafael J. Wysocki, Larry Finger, LKML, Linux PM list

On Mon, Sep 26, 2016 at 11:26 PM, Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
> On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>
> [...]
>
>> > I have been reluctant to post my entire .config on the list. It is
>> > available at
>> > http://pastebin.com/aMZaAKwL.
>>
>> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
>>
>> How do you measure the frequency?
>>
> Also
> When you get into this situation, please dump:
> # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
> # cat /sys/devices/system/cpu/intel_pstate/*

The driver is not intel_pstate.

Thanks,
Rafael


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 21:28             ` Larry Finger
@ 2016-09-26 21:37               ` Rafael J. Wysocki
  2016-09-26 21:46                 ` Srinivas Pandruvada
  2016-09-26 22:09                 ` Larry Finger
  2016-09-27 14:51               ` Lennart Sorensen
  1 sibling, 2 replies; 35+ messages in thread
From: Rafael J. Wysocki @ 2016-09-26 21:37 UTC (permalink / raw)
  To: Larry Finger; +Cc: Rafael J. Wysocki, LKML, Linux PM list, Srinivas Pandruvada

On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
<Larry.Finger@lwfinger.net> wrote:
> On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:
>>
>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>>>
>>> On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
>>>>
>>>> On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
>>>>>
>>>>> On 09/18/2016 09:54 PM, Larry Finger wrote:
>>>>>>
>>>>>> On 09/14/2016 11:00 AM, Larry Finger wrote:
>>>>>>>
>>>>>>> On 09/09/2016 12:39 PM, Larry Finger wrote:
>>>>>>>>
>>>>>>>> I have found a regression in kernel 4.8-rc2 that causes the speed of
>>>>>>>> my laptop
>>>>>>>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a
>>>>>>>> maximum cpu
>>>>>>>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger
>>>>>>>> this problem,
>>>>>>>> thus a bisection is not possible. It usually happens under heavy
>>>>>>>> load, such as a
>>>>>>>> kernel build or the RPM build of VirtualBox, but it does not always
>>>>>>>> fail with
>>>>>>>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu
>>>>>>>> MHz of
>>>>>>>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock
>>>>>>>> before the
>>>>>>>> fault is 3437 MHz. Nothing is logged when this happens.
>>>>>>>>
>>>>>>>> If I were to get a patch that would show a backtrace when the
>>>>>>>> maximum CPU
>>>>>>>> frequency is changed, perhaps it would be possible to track this
>>>>>>>> bug.
>>>>>>>
>>>>>>>
>>>>>>> I have not yet found the bad commit, but I have reduced the range of
>>>>>>> commits a
>>>>>>> bit. This bug has been difficult to trigger. So far, it has not taken
>>>>>>> over 1/2
>>>>>>> day to appear in bad kernels, thus I am allowing three days before
>>>>>>> deciding that
>>>>>>> a given trial is good. I never saw the problem with 4.7 kernels, but
>>>>>>> I did in
>>>>>>> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit
>>>>>>> 1b05cf6 did
>>>>>>> not show the bug.
>>>>>>>
>>>>>>> Testing continues.
>>>>>>
>>>>>>
>>>>>> And still does. My bisection seemed to be trending toward an
>>>>>> improbable set of
>>>>>> commits, and I needed to do some other work with the machine, thus I
>>>>>> started
>>>>>> running 4.8-rc6. It failed nearly 48 hours after the reboot, which
>>>>>> indicated
>>>>>> that using 3 days to indicate a "good" trial was likely too short. I
>>>>>> am
>>>>>> currently testing the first of the trial and will run it for at least
>>>>>> a week. It
>>>>>> is unlikely that these tests will be complete before 4.8 is released,
>>>>>> even if
>>>>>> -rc8 is needed. I will keep attempting to find the faulty commit.
>>>>>
>>>>>
>>>>> My debugging continues. After 7 days of beating on commit f7816ad, I
>>>>> have
>>>>> concluded that it is likely good. Thus I think the bug lies between
>>>>> commit
>>>>> 581e0cd (bad) and f7816ad (good). I will need to do a long test on
>>>>> commit
>>>>> 1b05cf6, which did not fail with a shorter run.
>>>>
>>>>
>>>> 581e0cd is not a valid mainline commit hash AFAICS.
>>>
>>>
>>> That was a typo. The correct value is 581e0c7.
>>>>
>>>>
>>>> What cpufreq driver do you use?
>>>
>>>
>>> My "Default CPUFreq governor" is ondemand.
>>>
>>> Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in
>>>
>>> CONFIG_ACPI_CPU_FREQ_PSS=y
>>> CONFIG_CPU_FREQ=y
>>> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
>>> CONFIG_CPU_FREQ_GOV_COMMON=y
>>> # CONFIG_CPU_FREQ_STAT is not set
>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
>>> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
>>> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
>>> CONFIG_CPU_FREQ_GOV_POWERSAVE=m
>>> CONFIG_CPU_FREQ_GOV_USERSPACE=m
>>> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
>>> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
>>> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
>>> CONFIG_X86_PCC_CPUFREQ=m
>>> CONFIG_X86_ACPI_CPUFREQ=m
>>> CONFIG_X86_ACPI_CPUFREQ_CPB=y
>>>
>>> Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up
>>> going
>>> wrong. Further tests have shown that commit 351a4ded is bad. Once again,
>>> my
>>> bisection seems to be converging to a set of commits that seem unlikely
>>> to cause
>>> this problem. Perhaps commit f7816ad is not really good even though it
>>> survived
>>> 7 days of heavy CPU usage.
>>>
>>> I have been reluctant to post my entire .config on the list. It is
>>> available at
>>> http://pastebin.com/aMZaAKwL.
>>
>>
>> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
>>
>> How do you measure the frequency?
>
>
> Mostly I use a KDE applet named "System load" and look at the "average
> clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
> When the bug triggers, the system gets very slow, and the cpu fan stops even
> though the cpu is still busy.

That sounds like thermal throttling kicking in.

What's there under /sys/class/thermal/ on your system?

> Commit f7816ad, which had run for 7 days without showing the bug, failed
> after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
> well, that's the way it goes!

Are you confident that the issue was not reproducible before 4.8-rc2?
In particular, what about 4.8-rc1?

Thanks,
Rafael


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 21:30               ` Rafael J. Wysocki
@ 2016-09-26 21:41                 ` Srinivas Pandruvada
  2016-09-26 21:46                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 35+ messages in thread
From: Srinivas Pandruvada @ 2016-09-26 21:41 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Rafael J. Wysocki, Larry Finger, LKML, Linux PM list

On Mon, 2016-09-26 at 23:30 +0200, Rafael J. Wysocki wrote:
> On Mon, Sep 26, 2016 at 11:26 PM, Srinivas Pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> > 
> > On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
> > > 
> > > On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
> > 
> > [...]
> > 
> > > 
> > > > 
> > > > I have been reluctant to post my entire .config on the list. It
> > > > is
> > > > available at
> > > > http://pastebin.com/aMZaAKwL.
> > > 
> > > If the governor is ondemand, the driver is acpi-cpufreq, most
> > > likely.
> > > 
> > > How do you measure the frequency?
> > > 
> > Also
> > When you get into this situation, please dump:
> > # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
> > # cat /sys/devices/system/cpu/intel_pstate/*
> 
> The driver is not intel_pstate.
I guessed it was intel_pstate from
CONFIG_X86_INTEL_PSTATE=y
and because the reported frequency is not exactly 400 MHz but
something like 396.130.


Thanks,
Srinivas


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 21:37               ` Rafael J. Wysocki
@ 2016-09-26 21:46                 ` Srinivas Pandruvada
  2016-09-26 22:15                   ` Larry Finger
  2016-09-26 22:09                 ` Larry Finger
  1 sibling, 1 reply; 35+ messages in thread
From: Srinivas Pandruvada @ 2016-09-26 21:46 UTC (permalink / raw)
  To: Rafael J. Wysocki, Larry Finger; +Cc: Rafael J. Wysocki, LKML, Linux PM list

On Mon, 2016-09-26 at 23:37 +0200, Rafael J. Wysocki wrote:
> On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
> <Larry.Finger@lwfinger.net> wrote:
> > 
> > On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:
> > > 
> > > 
> > > On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
> > > > 
> > > > 
> > > > On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
> > > > > 
> > > > > 
> > > > > On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
> > > > > > 
> > > > > > 
> > > > > > On 09/18/2016 09:54 PM, Larry Finger wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 09/14/2016 11:00 AM, Larry Finger wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On 09/09/2016 12:39 PM, Larry Finger wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I have found a regression in kernel 4.8-rc2 that
> > > > > > > > > causes the speed of
> > > > > > > > > my laptop
> > > > > > > > > with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to
> > > > > > > > > suddenly have a
> > > > > > > > > maximum cpu
> > > > > > > > > frequency of ~400 MHz. Unfortunately, I do not know
> > > > > > > > > how to trigger
> > > > > > > > > this problem,
> > > > > > > > > thus a bisection is not possible. It usually happens
> > > > > > > > > under heavy
> > > > > > > > > load, such as a
> > > > > > > > > kernel build or the RPM build of VirtualBox, but it
> > > > > > > > > does not always
> > > > > > > > > fail with
> > > > > > > > > these loads. In my most recent failure, 'hwinfo --
> > > > > > > > > cpu' reports cpu
> > > > > > > > > MHz of
> > > > > > > > > 396.130 for #3. The bogomips value is 5787.73, and
> > > > > > > > > the cpu clock
> > > > > > > > > before the
> > > > > > > > > fault is 3437 MHz. Nothing is logged when this
> > > > > > > > > happens.
> > > > > > > > > 
> > > > > > > > > If I were to get a patch that would show a backtrace
> > > > > > > > > when the
> > > > > > > > > maximum CPU
> > > > > > > > > frequency is changed, perhaps it would be possible to
> > > > > > > > > track this
> > > > > > > > > bug.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > I have not yet found the bad commit, but I have reduced
> > > > > > > > the range of
> > > > > > > > commits a
> > > > > > > > bit. This bug has been difficult to trigger. So far, it
> > > > > > > > has not taken
> > > > > > > > over 1/2
> > > > > > > > day to appear in bad kernels, thus I am allowing three
> > > > > > > > days before
> > > > > > > > deciding that
> > > > > > > > a given trial is good. I never saw the problem with 4.7
> > > > > > > > kernels, but
> > > > > > > > I did in
> > > > > > > > 4.8-rc1. I also know that it appeared before commit
> > > > > > > > 581e0cd. Commit
> > > > > > > > 1b05cf6 did
> > > > > > > > not show the bug.
> > > > > > > > 
> > > > > > > > Testing continues.
> > > > > > > 
> > > > > > > 
> > > > > > > And still does. My bisection seemed to be trending toward
> > > > > > > an
> > > > > > > improbable set of
> > > > > > > commits, and I needed to do some other work with the
> > > > > > > machine, thus I
> > > > > > > started
> > > > > > > running 4.8-rc6. It failed nearly 48 hours after the
> > > > > > > reboot, which
> > > > > > > indicated
> > > > > > > that using 3 days to indicate a "good" trial was likely
> > > > > > > too short. I
> > > > > > > am
> > > > > > > currently testing the first of the trial and will run it
> > > > > > > for at least
> > > > > > > a week. It
> > > > > > > is unlikely that these tests will be complete before 4,8
> > > > > > > is released,
> > > > > > > even if
> > > > > > > -rc8 is needed. I will keep attempting to find the faulty
> > > > > > > commit.
> > > > > > 
> > > > > > 
> > > > > > My debugging continues. After 7 days of beating on commit
> > > > > > f7816ad, I
> > > > > > have
> > > > > > concluded that it is likely good. Thus I think the bug lies
> > > > > > between
> > > > > > commit
> > > > > > 581e0cd (bad) and f7816ad (good). I will need to do a long
> > > > > > test on
> > > > > > commit
> > > > > > 1b05cf6, which did not fail with a shorter run.
> > > > > 
> > > > > 
> > > > > 581e0cd is not a valid mainline commit hash AFAICS.
> > > > 
> > > > 
> > > > That was a typo. The correct value is 581e0c7.
> > > > > 
> > > > > 
> > > > > 
> > > > > What cpufreq driver do you use?
> > > > 
> > > > 
> > > > My "Default CPUFreq governor" is on demand.
> > > > 
> > > > Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config'
> > > > results in
> > > > 
> > > > CONFIG_ACPI_CPU_FREQ_PSS=y
> > > > CONFIG_CPU_FREQ=y
> > > > CONFIG_CPU_FREQ_GOV_ATTR_SET=y
> > > > CONFIG_CPU_FREQ_GOV_COMMON=y
> > > > # CONFIG_CPU_FREQ_STAT is not set
> > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
> > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
> > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
> > > > CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
> > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
> > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
> > > > CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
> > > > CONFIG_CPU_FREQ_GOV_POWERSAVE=m
> > > > CONFIG_CPU_FREQ_GOV_USERSPACE=m
> > > > CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> > > > CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
> > > > # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
> > > > CONFIG_X86_PCC_CPUFREQ=m
> > > > CONFIG_X86_ACPI_CPUFREQ=m
> > > > CONFIG_X86_ACPI_CPUFREQ_CPB=y
> > > > 
> > > > Commit 1b05cf6 did fail on longer testing, thus my bisection
> > > > had ended up
> > > > going
> > > > wrong. Further tests have shown that commit 351a4ded is bad.
> > > > Once again,
> > > > by
> > > > bisection seems to be converging to a set of commits that seem
> > > > unlikely
> > > > to cause
> > > > this problem. Perhaps commit f7816ad is not really good even
> > > > though it
> > > > survived
> > > > 7 days of heavy CPU usage.
> > > > 
> > > > I have been reluctant to post my entire .config on the list. It
> > > > is
> > > > available at
> > > > http://pastebin.com/aMZaAKwL.
> > > 
> > > 
> > > If the governor is ondemand, the driver is acpi-cpufreq, most
> > > likely.
> > > 
> > > How do you measure the frequency?
> > 
> > 
> > Mostly I use a KDE applet named "System load" and look at the
> > "average
> > clock", but the same info is also available in /proc/cpuinfo as
> > "cpu MHz".
> > When the bug triggers, the system gets very slow, and the cpu fan
> > stops even
> > though the cpu is still busy.
> 
> That sounds like thermal throttling kicking in.
> 
This will help to determine whether there is thermal throttling from the OS.
# cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
# grep -r . /sys/class/thermal/thermal_zone*

Thanks,
Srinivas


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 21:41                 ` Srinivas Pandruvada
@ 2016-09-26 21:46                   ` Rafael J. Wysocki
  0 siblings, 0 replies; 35+ messages in thread
From: Rafael J. Wysocki @ 2016-09-26 21:46 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, Larry Finger, LKML, Linux PM list

On Mon, Sep 26, 2016 at 11:41 PM, Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
> On Mon, 2016-09-26 at 23:30 +0200, Rafael J. Wysocki wrote:
>> On Mon, Sep 26, 2016 at 11:26 PM, Srinivas Pandruvada
>> <srinivas.pandruvada@linux.intel.com> wrote:
>> >
>> > On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
>> > >
>> > > On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>> >
>> > [...]
>> >
>> > >
>> > > >
>> > > > I have been reluctant to post my entire .config on the list. It
>> > > > is
>> > > > available at
>> > > > http://pastebin.com/aMZaAKwL.
>> > >
>> > > If the governor is ondemand, the driver is acpi-cpufreq, most
>> > > likely.
>> > >
>> > > How do you measure the frequency?
>> > >
>> > Also
>> > When you get into this situation, please dump:
>> > # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
>> > # cat /sys/devices/system/cpu/intel_pstate/*
>>
>> The driver is not intel_pstate.
> I guessed from
> CONFIG_X86_INTEL_PSTATE=y
> and
> Frequency is not 400 but something like 396.130

Ah.  Good catch!


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 21:37               ` Rafael J. Wysocki
  2016-09-26 21:46                 ` Srinivas Pandruvada
@ 2016-09-26 22:09                 ` Larry Finger
  2016-09-26 22:16                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 35+ messages in thread
From: Larry Finger @ 2016-09-26 22:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, LKML, Linux PM list, Srinivas Pandruvada

On 09/26/2016 04:37 PM, Rafael J. Wysocki wrote:
> On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
> <Larry.Finger@lwfinger.net> wrote:
>> On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:
>>>
>>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>>>>
>>>> On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
>>>>>
>>>>> On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
>>>>>>
>>>>>> On 09/18/2016 09:54 PM, Larry Finger wrote:
>>>>>>>
>>>>>>> On 09/14/2016 11:00 AM, Larry Finger wrote:
>>>>>>>>
>>>>>>>> On 09/09/2016 12:39 PM, Larry Finger wrote:
>>>>>>>>>
>>>>>>>>> I have found a regression in kernel 4.8-rc2 that causes the speed of
>>>>>>>>> my laptop
>>>>>>>>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a
>>>>>>>>> maximum cpu
>>>>>>>>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger
>>>>>>>>> this problem,
>>>>>>>>> thus a bisection is not possible. It usually happens under heavy
>>>>>>>>> load, such as a
>>>>>>>>> kernel build or the RPM build of VirtualBox, but it does not always
>>>>>>>>> fail with
>>>>>>>>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu
>>>>>>>>> MHz of
>>>>>>>>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock
>>>>>>>>> before the
>>>>>>>>> fault is 3437 MHz. Nothing is logged when this happens.
>>>>>>>>>
>>>>>>>>> If I were to get a patch that would show a backtrace when the
>>>>>>>>> maximum CPU
>>>>>>>>> frequency is changed, perhaps it would be possible to track this
>>>>>>>>> bug.
>>>>>>>>
>>>>>>>>
>>>>>>>> I have not yet found the bad commit, but I have reduced the range of
>>>>>>>> commits a
>>>>>>>> bit. This bug has been difficult to trigger. So far, it has not taken
>>>>>>>> over 1/2
>>>>>>>> day to appear in bad kernels, thus I am allowing three days before
>>>>>>>> deciding that
>>>>>>>> a given trial is good. I never saw the problem with 4.7 kernels, but
>>>>>>>> I did in
>>>>>>>> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit
>>>>>>>> 1b05cf6 did
>>>>>>>> not show the bug.
>>>>>>>>
>>>>>>>> Testing continues.
>>>>>>>
>>>>>>>
>>>>>>> And still does. My bisection seemed to be trending toward an
>>>>>>> improbable set of
>>>>>>> commits, and I needed to do some other work with the machine, thus I
>>>>>>> started
>>>>>>> running 4.8-rc6. It failed nearly 48 hours after the reboot, which
>>>>>>> indicated
>>>>>>> that using 3 days to indicate a "good" trial was likely too short. I
>>>>>>> am
>>>>>>> currently testing the first of the trial and will run it for at least
>>>>>>> a week. It
>>>>>>> is unlikely that these tests will be complete before 4,8 is released,
>>>>>>> even if
>>>>>>> -rc8 is needed. I will keep attempting to find the faulty commit.
>>>>>>
>>>>>>
>>>>>> My debugging continues. After 7 days of beating on commit f7816ad, I
>>>>>> have
>>>>>> concluded that it is likely good. Thus I think the bug lies between
>>>>>> commit
>>>>>> 581e0cd (bad) and f7816ad (good). I will need to do a long test on
>>>>>> commit
>>>>>> 1b05cf6, which did not fail with a shorter run.
>>>>>
>>>>>
>>>>> 581e0cd is not a valid mainline commit hash AFAICS.
>>>>
>>>>
>>>> That was a typo. The correct value is 581e0c7.
>>>>>
>>>>>
>>>>> What cpufreq driver do you use?
>>>>
>>>>
>>>> My "Default CPUFreq governor" is on demand.
>>>>
>>>> Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in
>>>>
>>>> CONFIG_ACPI_CPU_FREQ_PSS=y
>>>> CONFIG_CPU_FREQ=y
>>>> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
>>>> CONFIG_CPU_FREQ_GOV_COMMON=y
>>>> # CONFIG_CPU_FREQ_STAT is not set
>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
>>>> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
>>>> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
>>>> CONFIG_CPU_FREQ_GOV_POWERSAVE=m
>>>> CONFIG_CPU_FREQ_GOV_USERSPACE=m
>>>> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
>>>> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
>>>> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
>>>> CONFIG_X86_PCC_CPUFREQ=m
>>>> CONFIG_X86_ACPI_CPUFREQ=m
>>>> CONFIG_X86_ACPI_CPUFREQ_CPB=y
>>>>
>>>> Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up
>>>> going
>>>> wrong. Further tests have shown that commit 351a4ded is bad. Once again,
>>>> by
>>>> bisection seems to be converging to a set of commits that seem unlikely
>>>> to cause
>>>> this problem. Perhaps commit f7816ad is not really good even though it
>>>> survived
>>>> 7 days of heavy CPU usage.
>>>>
>>>> I have been reluctant to post my entire .config on the list. It is
>>>> available at
>>>> http://pastebin.com/aMZaAKwL.
>>>
>>>
>>> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
>>>
>>> How do you measure the frequency?
>>
>>
>> Mostly I use a KDE applet named "System load" and look at the "average
>> clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
>> When the bug triggers, the system gets very slow, and the cpu fan stops even
>> though the cpu is still busy.
>
> That sounds like thermal throttling kicking in.

I think it is because the cpu is idling. If thermal throttling is responsible,
why would it not fail for 168 hours, and then fail in 2?

> What's there under /sys/class/thermal/ on your system?

It contains the following directories:

cooling_device0  cooling_device1  cooling_device2  cooling_device3 
cooling_device4  thermal_zone0  thermal_zone1
>
>> Commit f7816ad, which had run for 7 days without showing the bug, failed
>> after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
>> well, that's the way it goes!
>
> Are you confident that the issue was not reproducible before 4.8-rc2?
> In particular, what about 4.8-rc1?

4.8-rc1 is definitely bad. I am now testing commit 5539204. In the bisect 
visualization, there are a number of cpufreq commits before the test case. If it 
is one of them, it may be a while before I dare call this one "good". In one 
respect, that is good as I will be traveling tomorrow and Wednesday.
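
As a rough illustration of why a 7-day pass can still be followed by a quick
failure, the sketch below assumes failures arrive as a memoryless process with
a constant mean time to failure. The 48-hour mean is only borrowed from the
4.8-rc6 run mentioned above; the real distribution is unknown, so the function
names and numbers here are illustrative assumptions, not measurements.

```python
import math


def survival_probability(hours, mean_hours_to_failure):
    """Probability that a bad kernel shows no failure within `hours`,
    assuming an exponential (memoryless) time-to-failure model."""
    return math.exp(-hours / mean_hours_to_failure)


def hours_for_confidence(confidence, mean_hours_to_failure):
    """Trial length after which a bad kernel would have failed with
    probability `confidence` under the same model."""
    return -mean_hours_to_failure * math.log(1.0 - confidence)


# Assumed mean time to failure, taken from the single 4.8-rc6 observation.
ASSUMED_MEAN_HOURS = 48.0

p_three_days = survival_probability(72, ASSUMED_MEAN_HOURS)
p_seven_days = survival_probability(168, ASSUMED_MEAN_HOURS)
```

Under that assumption a bad kernel would survive 72 hours about 22% of the
time and 168 hours about 3% of the time, so a single clean week is strong but
not conclusive evidence that a commit is good.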

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 21:46                 ` Srinivas Pandruvada
@ 2016-09-26 22:15                   ` Larry Finger
  0 siblings, 0 replies; 35+ messages in thread
From: Larry Finger @ 2016-09-26 22:15 UTC (permalink / raw)
  To: Srinivas Pandruvada, Rafael J. Wysocki
  Cc: Rafael J. Wysocki, LKML, Linux PM list

On 09/26/2016 04:46 PM, Srinivas Pandruvada wrote:
> On Mon, 2016-09-26 at 23:37 +0200, Rafael J. Wysocki wrote:
>> On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
>> <Larry.Finger@lwfinger.net> wrote:
>>>
>>> On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:
>>>>
>>>>
>>>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>>>>>
>>>>>
>>>>> On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
>>>>>>
>>>>>>
>>>>>> On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 09/18/2016 09:54 PM, Larry Finger wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 09/14/2016 11:00 AM, Larry Finger wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 09/09/2016 12:39 PM, Larry Finger wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have found a regression in kernel 4.8-rc2 that
>>>>>>>>>> causes the speed of
>>>>>>>>>> my laptop
>>>>>>>>>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to
>>>>>>>>>> suddenly have a
>>>>>>>>>> maximum cpu
>>>>>>>>>> frequency of ~400 MHz. Unfortunately, I do not know
>>>>>>>>>> how to trigger
>>>>>>>>>> this problem,
>>>>>>>>>> thus a bisection is not possible. It usually happens
>>>>>>>>>> under heavy
>>>>>>>>>> load, such as a
>>>>>>>>>> kernel build or the RPM build of VirtualBox, but it
>>>>>>>>>> does not always
>>>>>>>>>> fail with
>>>>>>>>>> these loads. In my most recent failure, 'hwinfo --
>>>>>>>>>> cpu' reports cpu
>>>>>>>>>> MHz of
>>>>>>>>>> 396.130 for #3. The bogomips value is 5787.73, and
>>>>>>>>>> the cpu clock
>>>>>>>>>> before the
>>>>>>>>>> fault is 3437 MHz. Nothing is logged when this
>>>>>>>>>> happens.
>>>>>>>>>>
>>>>>>>>>> If I were to get a patch that would show a backtrace
>>>>>>>>>> when the
>>>>>>>>>> maximum CPU
>>>>>>>>>> frequency is changed, perhaps it would be possible to
>>>>>>>>>> track this
>>>>>>>>>> bug.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have not yet found the bad commit, but I have reduced
>>>>>>>>> the range of
>>>>>>>>> commits a
>>>>>>>>> bit. This bug has been difficult to trigger. So far, it
>>>>>>>>> has not taken
>>>>>>>>> over 1/2
>>>>>>>>> day to appear in bad kernels, thus I am allowing three
>>>>>>>>> days before
>>>>>>>>> deciding that
>>>>>>>>> a given trial is good. I never saw the problem with 4.7
>>>>>>>>> kernels, but
>>>>>>>>> I did in
>>>>>>>>> 4.8-rc1. I also know that it appeared before commit
>>>>>>>>> 581e0cd. Commit
>>>>>>>>> 1b05cf6 did
>>>>>>>>> not show the bug.
>>>>>>>>>
>>>>>>>>> Testing continues.
>>>>>>>>
>>>>>>>>
>>>>>>>> And still does. My bisection seemed to be trending toward
>>>>>>>> an
>>>>>>>> improbable set of
>>>>>>>> commits, and I needed to do some other work with the
>>>>>>>> machine, thus I
>>>>>>>> started
>>>>>>>> running 4.8-rc6. It failed nearly 48 hours after the
>>>>>>>> reboot, which
>>>>>>>> indicated
>>>>>>>> that using 3 days to indicate a "good" trial was likely
>>>>>>>> too short. I
>>>>>>>> am
>>>>>>>> currently testing the first of the trial and will run it
>>>>>>>> for at least
>>>>>>>> a week. It
>>>>>>>> is unlikely that these tests will be complete before 4,8
>>>>>>>> is released,
>>>>>>>> even if
>>>>>>>> -rc8 is needed. I will keep attempting to find the faulty
>>>>>>>> commit.
>>>>>>>
>>>>>>>
>>>>>>> My debugging continues. After 7 days of beating on commit
>>>>>>> f7816ad, I
>>>>>>> have
>>>>>>> concluded that it is likely good. Thus I think the bug lies
>>>>>>> between
>>>>>>> commit
>>>>>>> 581e0cd (bad) and f7816ad (good). I will need to do a long
>>>>>>> test on
>>>>>>> commit
>>>>>>> 1b05cf6, which did not fail with a shorter run.
>>>>>>
>>>>>>
>>>>>> 581e0cd is not a valid mainline commit hash AFAICS.
>>>>>
>>>>>
>>>>> That was a typo. The correct value is 581e0c7.
>>>>>>
>>>>>>
>>>>>>
>>>>>> What cpufreq driver do you use?
>>>>>
>>>>>
>>>>> My "Default CPUFreq governor" is on demand.
>>>>>
>>>>> Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config'
>>>>> results in
>>>>>
>>>>> CONFIG_ACPI_CPU_FREQ_PSS=y
>>>>> CONFIG_CPU_FREQ=y
>>>>> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
>>>>> CONFIG_CPU_FREQ_GOV_COMMON=y
>>>>> # CONFIG_CPU_FREQ_STAT is not set
>>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
>>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
>>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
>>>>> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
>>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
>>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
>>>>> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
>>>>> CONFIG_CPU_FREQ_GOV_POWERSAVE=m
>>>>> CONFIG_CPU_FREQ_GOV_USERSPACE=m
>>>>> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
>>>>> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
>>>>> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
>>>>> CONFIG_X86_PCC_CPUFREQ=m
>>>>> CONFIG_X86_ACPI_CPUFREQ=m
>>>>> CONFIG_X86_ACPI_CPUFREQ_CPB=y
>>>>>
>>>>> Commit 1b05cf6 did fail on longer testing, thus my bisection
>>>>> had ended up
>>>>> going
>>>>> wrong. Further tests have shown that commit 351a4ded is bad.
>>>>> Once again,
>>>>> by
>>>>> bisection seems to be converging to a set of commits that seem
>>>>> unlikely
>>>>> to cause
>>>>> this problem. Perhaps commit f7816ad is not really good even
>>>>> though it
>>>>> survived
>>>>> 7 days of heavy CPU usage.
>>>>>
>>>>> I have been reluctant to post my entire .config on the list. It
>>>>> is
>>>>> available at
>>>>> http://pastebin.com/aMZaAKwL.
>>>>
>>>>
>>>> If the governor is ondemand, the driver is acpi-cpufreq, most
>>>> likely.
>>>>
>>>> How do you measure the frequency?
>>>
>>>
>>> Mostly I use a KDE applet named "System load" and look at the
>>> "average
>>> clock", but the same info is also available in /proc/cpuinfo as
>>> "cpu MHz".
>>> When the bug triggers, the system gets very slow, and the cpu fan
>>> stops even
>>> though the cpu is still busy.
>>
>> That sounds like thermal throttling kicking in.
>>
> This will help to know, if there is thermal throttle from OS.
> # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
> # grep -r . /sys/class/thermal/thermal_zone*

With the system OK, I get

finger@linux-1t8h:~/wireless-drivers-next> cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
3600000
3600000
3600000
3600000

finger@linux-1t8h:~/wireless-drivers-next> grep -r . /sys/class/thermal/thermal_zone*
grep: /sys/class/thermal/thermal_zone0/k_d: Input/output error
grep: /sys/class/thermal/thermal_zone0/k_i: Input/output error
grep: /sys/class/thermal/thermal_zone0/k_po: Input/output error
grep: /sys/class/thermal/thermal_zone0/k_pu: Input/output error
/sys/class/thermal/thermal_zone0/mode:enabled
/sys/class/thermal/thermal_zone0/temp:16000
/sys/class/thermal/thermal_zone0/type:acpitz
grep: /sys/class/thermal/thermal_zone0/integral_cutoff: Input/output error
/sys/class/thermal/thermal_zone0/power/control:auto
/sys/class/thermal/thermal_zone0/power/async:disabled
/sys/class/thermal/thermal_zone0/power/runtime_enabled:disabled
/sys/class/thermal/thermal_zone0/power/runtime_active_kids:0
/sys/class/thermal/thermal_zone0/power/runtime_active_time:0
grep: /sys/class/thermal/thermal_zone0/power/autosuspend_delay_ms: Input/output error
/sys/class/thermal/thermal_zone0/power/runtime_status:unsupported
/sys/class/thermal/thermal_zone0/power/runtime_usage:0
/sys/class/thermal/thermal_zone0/power/runtime_suspended_time:0
grep: /sys/class/thermal/thermal_zone0/slope: Input/output error
/sys/class/thermal/thermal_zone0/trip_point_0_temp:102000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
grep: /sys/class/thermal/thermal_zone0/offset: Input/output error
/sys/class/thermal/thermal_zone0/policy:step_wise
/sys/class/thermal/thermal_zone0/passive:0
/sys/class/thermal/thermal_zone0/available_policies:user_space bang_bang fair_share step_wise
grep: /sys/class/thermal/thermal_zone0/sustainable_power: Input/output error
/sys/class/thermal/thermal_zone1/k_d:0
/sys/class/thermal/thermal_zone1/k_i:0
/sys/class/thermal/thermal_zone1/k_po:0
/sys/class/thermal/thermal_zone1/k_pu:0
/sys/class/thermal/thermal_zone1/temp:43000
/sys/class/thermal/thermal_zone1/type:x86_pkg_temp
/sys/class/thermal/thermal_zone1/integral_cutoff:0
/sys/class/thermal/thermal_zone1/power/control:auto
/sys/class/thermal/thermal_zone1/power/async:disabled
/sys/class/thermal/thermal_zone1/power/runtime_enabled:disabled
/sys/class/thermal/thermal_zone1/power/runtime_active_kids:0
/sys/class/thermal/thermal_zone1/power/runtime_active_time:0
grep: /sys/class/thermal/thermal_zone1/power/autosuspend_delay_ms: Input/output error
/sys/class/thermal/thermal_zone1/power/runtime_status:unsupported
/sys/class/thermal/thermal_zone1/power/runtime_usage:0
/sys/class/thermal/thermal_zone1/power/runtime_suspended_time:0
/sys/class/thermal/thermal_zone1/slope:0
/sys/class/thermal/thermal_zone1/trip_point_0_temp:0
/sys/class/thermal/thermal_zone1/trip_point_0_type:passive
/sys/class/thermal/thermal_zone1/trip_point_1_temp:0
/sys/class/thermal/thermal_zone1/trip_point_1_type:passive
/sys/class/thermal/thermal_zone1/offset:0
/sys/class/thermal/thermal_zone1/policy:step_wise
/sys/class/thermal/thermal_zone1/available_policies:user_space bang_bang fair_share step_wise
/sys/class/thermal/thermal_zone1/sustainable_power:0

I will recheck once I trigger another failure.
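
To make the before/after comparison less error-prone, the two commands above
could be wrapped in a small script along these lines. This is a minimal
sketch: the helper names (read_attr, snapshot, diff) and the choice of
attributes are assumptions, and unreadable attributes (the Input/output
errors seen above) are recorded as None rather than aborting the run.

```python
import glob
import os


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if it
    cannot be read (some thermal attributes return EIO, as seen above)."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def snapshot(sysfs_root="/sys"):
    """Collect per-policy frequency limits and per-zone thermal readings."""
    result = {}
    wanted = (
        ("devices/system/cpu/cpufreq/policy*",
         ("scaling_max_freq", "scaling_cur_freq")),
        ("class/thermal/thermal_zone*",
         ("type", "temp", "policy")),
    )
    for pattern, names in wanted:
        for directory in sorted(glob.glob(os.path.join(sysfs_root, pattern))):
            for name in names:
                path = os.path.join(directory, name)
                result[os.path.relpath(path, sysfs_root)] = read_attr(path)
    return result


def diff(before, after):
    """Return {attribute: (old, new)} for every value that changed."""
    return {key: (before.get(key), value)
            for key, value in after.items()
            if before.get(key) != value}
```

Calling snapshot() while the system is healthy and again after the slowdown,
then passing both results to diff(), lists only the attributes that changed.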

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 22:09                 ` Larry Finger
@ 2016-09-26 22:16                   ` Rafael J. Wysocki
  2016-09-26 23:53                     ` Larry Finger
  0 siblings, 1 reply; 35+ messages in thread
From: Rafael J. Wysocki @ 2016-09-26 22:16 UTC (permalink / raw)
  To: Larry Finger
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, LKML, Linux PM list,
	Srinivas Pandruvada

On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger
<Larry.Finger@lwfinger.net> wrote:
> On 09/26/2016 04:37 PM, Rafael J. Wysocki wrote:
>>
>> On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
>> <Larry.Finger@lwfinger.net> wrote:
>>>
>>> On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:
>>>>
>>>>
>>>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:

[cut]

>>>
>>> Mostly I use a KDE applet named "System load" and look at the "average
>>> clock", but the same info is also available in /proc/cpuinfo as "cpu
>>> MHz".
>>> When the bug triggers, the system gets very slow, and the cpu fan stops
>>> even
>>> though the cpu is still busy.
>>
>>
>> That sounds like thermal throttling kicking in.
>
>
> I think it is because the cpu is idling. If a thermal throttling is
> responsible, why would it not fail for 168 hours, and then fail in 2?
>
>> What's there under /sys/class/thermal/ on your system?
>
>
> It contains the following directories:
>
> cooling_device0  cooling_device1  cooling_device2  cooling_device3
> cooling_device4  thermal_zone0  thermal_zone1
>>
>>
>>> Commit f7816ad, which had run for 7 days without showing the bug, failed
>>> after about 2 hours today. All my testing since Sept. 9 has been wasted.
>>> Oh
>>> well, that's the way it goes!
>>
>>
>> Are you confident that the issue was not reproducible before 4.8-rc2?
>> In particular, what about 4.8-rc1?
>
>
> 4.8-rc1 is definitely bad. I am now testing commit 5539204. In the bisect
> visualization, there are a number of cpufreq commits before the test case.

Maybe it's better to try to diagnose the problem instead of spending more
time on bisection.

I'd like to know whether or not 4.7 was definitely good, though.

> If it is one of them, it may be a while before I dare call this one "good".
> In one respect, that is good as I will be traveling tomorrow and Wednesday.

What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?

Thanks,
Rafael


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 22:16                   ` Rafael J. Wysocki
@ 2016-09-26 23:53                     ` Larry Finger
  2016-09-27  0:21                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 35+ messages in thread
From: Larry Finger @ 2016-09-26 23:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, LKML, Linux PM list, Srinivas Pandruvada

On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
> On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger
> <Larry.Finger@lwfinger.net> wrote:
>
> Maybe it's better to try diagnose the problem instead of spending more
> time on bisection.

In my original post, I asked for such help, but received nothing until today.
I had no idea what to check, but now I have a better idea.

> I'd like to know whether or not 4.7 was definitely good, though.

I never saw this problem with 4.7, but given the difficulty in triggering the 
problem, my tests may not have been definitive.
>
>> If it is one of them, it may be a while before I dare call this one "good".
>> In one respect, that is good as I will be traveling tomorrow and Wednesday.
>
> What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?

intel_pstate

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 23:53                     ` Larry Finger
@ 2016-09-27  0:21                       ` Rafael J. Wysocki
  2016-09-27  0:48                         ` Larry Finger
  0 siblings, 1 reply; 35+ messages in thread
From: Rafael J. Wysocki @ 2016-09-27  0:21 UTC (permalink / raw)
  To: Larry Finger
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, LKML, Linux PM list,
	Srinivas Pandruvada

On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
> On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
>>
>> On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger
>
> <Larry.Finger@lwfinger.net> wrote:
>>
>>
>> Maybe it's better to try diagnose the problem instead of spending more
>> time on bisection.
>
>
> In my original post, I asked for such help, but nothing until today. I had
> no idea what to check, but now I have a better idea.
>
>> I'd like to know whether or not 4.7 was definitely good, though.
>
>
> I never saw this problem with 4.7, but given the difficulty in triggering
> the problem, my tests may not have been definitive.
>>
>>
>>> If it is one of them, it may be a while before I dare call this one
>>> "good".
>>> In one respect, that is good as I will be traveling tomorrow and
>>> Wednesday.
>>
>>
>> What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?
>
>
> intel_pstate

You probably don't need to worry about all of the cpufreq changes in
4.8-rc, then.  Only a few of them affect intel_pstate and I don't see
how any of them may lead to the observed symptoms.

First off, if you have a reproducer, please run it on 4.7 and see if
you can trigger the issue in there.

Second, it would be good to have a look at the output from the
cpu_frequency and pstate_sample tracepoints around when the issue
triggers.  The pstate_sample one would be more interesting.

But for both we need a reproducer anyway.
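
A minimal sketch of how the two tracepoints named above might be switched on
through tracefs. The event names (power:cpu_frequency and power:pstate_sample)
come from the text; the tracefs mount point and the function name
enable_pstate_tracing are assumptions, and root is needed on a real system.

```shell
# Enable the cpu_frequency and pstate_sample tracepoints.
# $1 is the tracefs root; the default below is an assumed mount point
# (some systems mount tracefs at /sys/kernel/tracing instead).
enable_pstate_tracing() {
    local root="${1:-/sys/kernel/debug/tracing}"
    local ev
    for ev in cpu_frequency pstate_sample; do
        echo 1 > "$root/events/power/$ev/enable" || return 1
    done
    echo 1 > "$root/tracing_on"
}
```

Once enabled, reading trace_pipe under the same root into a file while the
reproducer runs should capture the samples around the frequency drop.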

It also would be good to rule out thermal throttling (as per
Srinivas' comments).

For now, please tell me what's in
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
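
A small sketch for collecting this value together with the other cpufreq
attributes discussed in the thread. The function name show_cpufreq is
hypothetical; the attribute names are standard cpufreq sysfs files, and any
that are missing on a given system are simply skipped.

```shell
# Print selected cpufreq attributes from one CPU's cpufreq directory.
# $1 is the directory; it defaults to cpu0's cpufreq directory.
show_cpufreq() {
    local dir="${1:-/sys/devices/system/cpu/cpu0/cpufreq}"
    local f
    for f in scaling_driver cpuinfo_min_freq cpuinfo_max_freq scaling_cur_freq; do
        [ -r "$dir/$f" ] && printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
    done
    return 0
}
```

The frequency values are in kHz, so 800000 corresponds to 800 MHz.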

Thanks,
Rafael


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-27  0:21                       ` Rafael J. Wysocki
@ 2016-09-27  0:48                         ` Larry Finger
  2016-09-27  1:30                           ` Srinivas Pandruvada
  2016-09-27  3:12                           ` Doug Smythies
  0 siblings, 2 replies; 35+ messages in thread
From: Larry Finger @ 2016-09-27  0:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, LKML, Linux PM list, Srinivas Pandruvada

On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
>> On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
>>>
>>> On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger
>>
>> <Larry.Finger@lwfinger.net> wrote:
>>>
>>>
>>> Maybe it's better to try diagnose the problem instead of spending more
>>> time on bisection.
>>
>>
>> In my original post, I asked for such help, but nothing until today. I had
>> no idea what to check, but now I have a better idea.
>>
>>> I'd like to know whether or not 4.7 was definitely good, though.
>>
>>
>> I never saw this problem with 4.7, but given the difficulty in triggering
>> the problem, my tests may not have been definitive.
>>>
>>>
>>>> If it is one of them, it may be a while before I dare call this one
>>>> "good".
>>>> In one respect, that is good as I will be traveling tomorrow and
>>>> Wednesday.
>>>
>>>
>>> What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?
>>
>>
>> intel_pstate
>
> You probably don't need to worry about all of the cpufreq changes in
> 4.8-rc, then.  Only a few of them affect intel_pstate and I don't see
> how any of them may lead to the observed symptoms.
>
> First off, if you have a reproducer, please run it on 4.7 and see if
> you can trigger the issue in there.

I'm running 4.8-rc7 at the moment hoping to trigger the problem and get the data 
requested by Srinivas. Once I get that, I will try 4.7 again.
>
> Second, it would be good to have a look at the output from the
> cpu_frequency and pstate_sample tracepoints around when the issue
> triggers.  The pstate_sample one would be more interesting.
>
> But for both we need a reproducer anyway.

I do not have a reliable reproducer. The condition has always happened when 
running a high-compute job such as a 'make -j8' on the kernel, or building the 
RPM for openSUSE's implementation of VirtualBox. The latter is what I'm using 
for most of my testing.

> It also would be good to rule out the thermal throttling (as per the
> Srinivas' comments).
>
> For now, please tell me what's in
> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq

800000

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-27  0:48                         ` Larry Finger
@ 2016-09-27  1:30                           ` Srinivas Pandruvada
  2016-09-27  2:53                             ` Larry Finger
  2016-09-27  3:12                           ` Doug Smythies
  1 sibling, 1 reply; 35+ messages in thread
From: Srinivas Pandruvada @ 2016-09-27  1:30 UTC (permalink / raw)
  To: Larry Finger, Rafael J. Wysocki; +Cc: Rafael J. Wysocki, LKML, Linux PM list

On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
> > 
> > On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger <Larry.Finger@lwfinge
> > r.net> wrote:
> > > 
> > > On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
> > > > 
> > > > 
> > > > On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger
> > > <Larry.Finger@lwfinger.net> wrote:
> > > > 
> > > > 
> > > > 
> > > > Maybe it's better to try diagnose the problem instead of
> > > > spending more
> > > > time on bisection.
> > > 
> > > In my original post, I asked for such help, but nothing until
> > > today. I had
> > > no idea what to check, but now I have a better idea.
> > > 
> > > > 
> > > > I'd like to know whether or not 4.7 was definitely good,
> > > > though.
> > > 
> > > I never saw this problem with 4.7, but given the difficulty in
> > > triggering
> > > the problem, my tests may not have been definitive.
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > If it is one of them, it may be a while before I dare call
> > > > > this one
> > > > > "good".
> > > > > In one respect, that is good as I will be traveling tomorrow
> > > > > and
> > > > > Wednesday.
> > > > 
> > > > What does "cat
> > > > /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?
> > > 
> > > intel_pstate
> > You probably don't need to worry about all of the cpufreq changes
> > in
> > 4.8-rc, then.  Only a few of them affect intel_pstate and I don't
> > see
> > how any of them may lead to the observed symptoms.
> > 
> > First off, if you have a reproducer, please run it on 4.7 and see
> > if
> > you can trigger the issue in there.
> I'm running 4.8-rc7 at the moment hoping to trigger the problem and
> get the data 
> requested by Srinivas. Once I get that, I will try 4.7 again.
> > 
> > 
> > Second, it would be good to have a look at the output from the
> > cpu_frequency and pstate_sample tracepoints around when the issue
> > triggers.  The pstate_sample one would be more interesting.
> > 
> > But for both we need a reproducer anyway.
> I do not have a reliable reproducer. The condition has always
> happened when 
> running a high-compute job such as a 'make -j8' on the kernel, or
> building the 
> RPM for openSUSE's implementation of VirtualBox. The latter is what
> I'm using 
> for most of my testing.
> 
> > 
> > It also would be good to rule out the thermal throttling (as per
> > the
> > Srinivas' comments).
> > 
> > For now, please tell me what's in
> > /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
> 800000
Your effective frequency is lower than 800 MHz. One possible reason is
thermal throttling.
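
One way to check for throttling after the fact, sketched under the assumption
that the kernel exposes the x86 per-CPU thermal_throttle counters in sysfs.
The function name total_core_throttles is hypothetical. A total that grows
across a slowdown would point toward thermal throttling; an unchanged total
would point away from it.

```shell
# Sum core_throttle_count over all CPUs found under a sysfs CPU root.
# $1 is the root directory (assumed layout: cpuN/thermal_throttle/...).
total_core_throttles() {
    local root="${1:-/sys/devices/system/cpu}"
    local f n total=0
    for f in "$root"/cpu[0-9]*/thermal_throttle/core_throttle_count; do
        [ -r "$f" ] || continue      # skips the unexpanded glob when nothing matches
        read -r n < "$f"
        total=$((total + n))
    done
    echo "$total"
}
```

Running it once before the stress test and once after the slowdown gives two
numbers to compare.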

What distro are you using?


Thanks,
Srinivas


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-27  1:30                           ` Srinivas Pandruvada
@ 2016-09-27  2:53                             ` Larry Finger
  0 siblings, 0 replies; 35+ messages in thread
From: Larry Finger @ 2016-09-27  2:53 UTC (permalink / raw)
  To: Srinivas Pandruvada, Rafael J. Wysocki
  Cc: Rafael J. Wysocki, LKML, Linux PM list

On 09/26/2016 08:30 PM, Srinivas Pandruvada wrote:
> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>>
>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger <Larry.Finger@lwfinge
>>> r.net> wrote:
>>>>
>>>> On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
>>>>>
>>>>>
>>>>> On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger
>>>> <Larry.Finger@lwfinger.net> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Maybe it's better to try diagnose the problem instead of
>>>>> spending more
>>>>> time on bisection.
>>>>
>>>> In my original post, I asked for such help, but nothing until
>>>> today. I had
>>>> no idea what to check, but now I have a better idea.
>>>>
>>>>>
>>>>> I'd like to know whether or not 4.7 was definitely good,
>>>>> though.
>>>>
>>>> I never saw this problem with 4.7, but given the difficulty in
>>>> triggering
>>>> the problem, my tests may not have been definitive.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> If it is one of them, it may be a while before I dare call
>>>>>> this one
>>>>>> "good".
>>>>>> In one respect, that is good as I will be traveling tomorrow
>>>>>> and
>>>>>> Wednesday.
>>>>>
>>>>> What does "cat
>>>>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?
>>>>
>>>> intel_pstate
>>> You probably don't need to worry about all of the cpufreq changes
>>> in
>>> 4.8-rc, then.  Only a few of them affect intel_pstate and I don't
>>> see
>>> how any of them may lead to the observed symptoms.
>>>
>>> First off, if you have a reproducer, please run it on 4.7 and see
>>> if
>>> you can trigger the issue in there.
>> I'm running 4.8-rc7 at the moment hoping to trigger the problem and
>> get the data
>> requested by Srinivas. Once I get that, I will try 4.7 again.
>>>
>>>
>>> Second, it would be good to have a look at the output from the
>>> cpu_frequency and pstate_sample tracepoints around when the issue
>>> triggers.  The pstate_sample one would be more interesting.
>>>
>>> But for both we need a reproducer anyway.
>> I do not have a reliable reproducer. The condition has always
>> happened when
>> running a high-compute job such as a 'make -j8' on the kernel, or
>> building the
>> RPM for openSUSE's implementation of VirtualBox. The latter is what
>> I'm using
>> for most of my testing.
>>
>>>
>>> It also would be good to rule out the thermal throttling (as per
>>> the
>>> Srinivas' comments).
>>>
>>> For now, please tell me what's in
>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>> 800000
> Your effective freq is lower than 800MHz. One of the possible reason is
> thermal throttling.
>
> What distro you are using?

openSUSE Leap 42.1.

Larry


* RE: Regression in 4.8 - CPU speed set very low
  2016-09-27  0:48                         ` Larry Finger
  2016-09-27  1:30                           ` Srinivas Pandruvada
@ 2016-09-27  3:12                           ` Doug Smythies
  2016-09-27  8:48                             ` Larry Finger
  1 sibling, 1 reply; 35+ messages in thread
From: Doug Smythies @ 2016-09-27  3:12 UTC (permalink / raw)
  To: 'Srinivas Pandruvada', 'Larry Finger',
	'Rafael J. Wysocki'
  Cc: 'Rafael J. Wysocki', 'LKML', 'Linux PM list'

On 2016.09.26 18:31 Srinivas Pandruvada wrote:
> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote: 
>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>> But for both we need a reproducer anyway.
>> I do not have a reliable reproducer. The condition has always
>> happened when 
>> running a high-compute job such as a 'make -j8' on the kernel, or
>> building the 
>> RPM for openSUSE's implementation of VirtualBox. The latter is what
>> I'm using 
>> for most of my testing.

Run some CPU stressor and get all your CPUs going at 100% load.
And watch your core temperatures while you do so.

> 
>>> It also would be good to rule out the thermal throttling (as per
>>> the Srinivas' comments).

It is almost certainly thermal throttling, or something similar, causing
clock modulation of, it seems, 50%.
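
One way to read the 50% figure, sketched as arithmetic: the reported
cpuinfo_min_freq is 800000 kHz, and half of that is 400 MHz, close to the
roughly 396 MHz observed. The function name modulated_khz is hypothetical, and
this only illustrates the numbers; it does not show which mechanism the
hardware actually used.

```shell
# Effective clock under a duty-cycle reduction: base frequency times duty cycle.
# $1 = base frequency in kHz, $2 = duty cycle in percent (both integers).
modulated_khz() {
    echo $(( $1 * $2 / 100 ))
}
```

For example, `modulated_khz 800000 50` prints 400000.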

>>> 
>>> For now, please tell me what's in
>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>> 800000
> Your effective freq is lower than 800MHz. One of the possible reason is
> thermal throttling.
>
> What distro you are using?

And what make and model of laptop?


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-27  3:12                           ` Doug Smythies
@ 2016-09-27  8:48                             ` Larry Finger
  2016-09-27 11:46                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 35+ messages in thread
From: Larry Finger @ 2016-09-27  8:48 UTC (permalink / raw)
  To: Doug Smythies, 'Srinivas Pandruvada',
	'Rafael J. Wysocki'
  Cc: 'Rafael J. Wysocki', 'LKML', 'Linux PM list'

On 09/26/2016 10:12 PM, Doug Smythies wrote:
> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>>> But for both we need a reproducer anyway.
>>> I do not have a reliable reproducer. The condition has always
>>> happened when
>>> running a high-compute job such as a 'make -j8' on the kernel, or
>>> building the
>>> RPM for openSUSE's implementation of VirtualBox. The latter is what
>>> I'm using
>>> for most of my testing.
>
> Run some CPU stressor and get all your CPU's going at 100% load.
> And watch your core temperatures while you do so.

for i in 1 2 3 4; do while : ; do : ; done & done

triggered the fault in a few minutes.
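
A slightly more general sketch of the same reproducer: it starts one busy loop
per online CPU and prints the PIDs so the loops can be stopped afterwards. The
function name start_busy_loops is hypothetical, and nproc from coreutils is
assumed to be available.

```shell
# Start N busy loops in the background (default: one per online CPU)
# and print their PIDs, one per line, so the caller can kill them later.
start_busy_loops() {
    local n="${1:-$(nproc)}"
    local i=0
    while [ "$i" -lt "$n" ]; do
        # Detach stdout/stderr so callers using $(...) are not left waiting.
        ( while : ; do : ; done ) >/dev/null 2>&1 &
        echo "$!"
        i=$((i + 1))
    done
}
```

Typical use would be `pids=$(start_busy_loops)`, then `kill $pids` once the
test is over.
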
>
>>
>>>> It also would be good to rule out the thermal throttling (as per
>>>> the Srinivas' comments).
>
> It is almost certainly thermal throttling, or similar causing
> Clock modulation, of it seems 50%.

While the infinite loops were running, the temps were:

finger@linux-1t8h:~/rtlwifi_new> sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +83.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0:         +83.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:         +74.0°C  (high = +84.0°C, crit = +100.0°C)

After the fault occurs, I get

finger@linux-1t8h:~/rtlwifi_new> sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +44.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0:         +43.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:         +41.0°C  (high = +84.0°C, crit = +100.0°C)

>
>>>>
>>>> For now, please tell me what's in
>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>> 800000
>> Your effective freq is lower than 800MHz. One of the possible reason is
>> thermal throttling.
>>
>> What distro you are using?
>
> And what make and model of LapTop?

Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
2.90GHz". That is a dual-core unit with hyperthreading.

@Rafael: As I write this, the system has been running the infinite loop test for 
almost 5 hours with kernel 4.7. I will leave that running while I'm gone, but I 
am certain that it is OK.

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-27  8:48                             ` Larry Finger
@ 2016-09-27 11:46                               ` Rafael J. Wysocki
  2016-09-29  2:22                                 ` Larry Finger
  0 siblings, 1 reply; 35+ messages in thread
From: Rafael J. Wysocki @ 2016-09-27 11:46 UTC (permalink / raw)
  To: Larry Finger
  Cc: Doug Smythies, Srinivas Pandruvada, Rafael J. Wysocki,
	Rafael J. Wysocki, LKML, Linux PM list

On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger
<Larry.Finger@lwfinger.net> wrote:
> On 09/26/2016 10:12 PM, Doug Smythies wrote:
>>
>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
>>>
>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>>>>
>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>>>>
>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>>>> But for both we need a reproducer anyway.
>>>>
>>>> I do not have a reliable reproducer. The condition has always
>>>> happened when
>>>> running a high-compute job such as a 'make -j8' on the kernel, or
>>>> building the
>>>> RPM for openSUSE's implementation of VirtualBox. The latter is what
>>>> I'm using
>>>> for most of my testing.
>>
>>
>> Run some CPU stressor and get all your CPU's going at 100% load.
>> And watch your core temperatures while you do so.
>
>
> for i in 1 2 3 4; do while : ; do : ; done & done
>
> triggered the fault in a few minutes.
>>
>>
>>>
>>>>> It also would be good to rule out the thermal throttling (as per
>>>>> the Srinivas' comments).
>>
>>
>> It is almost certainly thermal throttling, or similar causing
>> Clock modulation, of it seems 50%.
>
>
> While the infinite loops were running, the temps were:
>
> finger@linux-1t8h:~/rtlwifi_new> sensors
> coretemp-isa-0000
> Adapter: ISA adapter
> Physical id 0:  +83.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 0:         +83.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 1:         +74.0°C  (high = +84.0°C, crit = +100.0°C)

It looks like the trip point (high) temperature was exceeded causing
thermal throttling to kick in.

> After the fault occurs, I get
>
> finger@linux-1t8h:~/rtlwifi_new> sensors
> coretemp-isa-0000
> Adapter: ISA adapter
> Physical id 0:  +44.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 0:         +43.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 1:         +41.0°C  (high = +84.0°C, crit = +100.0°C)

So after that it stays at 400 MHz forever, right?

>>>>>
>>>>> For now, please tell me what's in
>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>>>
>>>> 800000
>>>
>>> Your effective freq is lower than 800MHz. One of the possible reason is
>>> thermal throttling.
>>>
>>> What distro you are using?
>>
>>
>> And what make and model of LapTop?
>
>
> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
> 2.90GHz. That is a dual-core unit with hyperthreading.
>
> @Rafael: As I write this, the system has been running the infinite loop test
> for almost 5 hours with kernel 4.7. I will leave that running while I'm
> gone, but I am certain that it is OK.

OK, and what temperatures do you see while doing this?

Thanks,
Rafael


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-26 21:28             ` Larry Finger
  2016-09-26 21:37               ` Rafael J. Wysocki
@ 2016-09-27 14:51               ` Lennart Sorensen
  2016-09-29  2:26                 ` Larry Finger
  1 sibling, 1 reply; 35+ messages in thread
From: Lennart Sorensen @ 2016-09-27 14:51 UTC (permalink / raw)
  To: Larry Finger; +Cc: Rafael J. Wysocki, LKML, Linux PM list, Srinivas Pandruvada

On Mon, Sep 26, 2016 at 04:28:29PM -0500, Larry Finger wrote:
> Mostly I use a KDE applet named "System load" and look at the "average
> clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
> When the bug triggers, the system gets very slow, and the cpu fan stops even
> though the cpu is still busy.
> 
> Commit f7816ad, which had run for 7 days without showing the bug, failed
> after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
> well, that's the way it goes!

Is it possible there is no bug and instead you have a hardware problem?

What I am thinking:

CPU fan stops, then CPU gets busy, CPU overheats, thermal throttling
kicks in to protect CPU and it gets VERY slow.

So maybe you have a bad CPU fan that is getting stuck.  Perhaps you even
have a motherboard that varies the CPU fan speed depending on need, and the
fan doesn't like the lowest speed and sometimes gets stuck when asked
to go slow.

Of course if the CPU fan is the problem that could explain why it takes
varying amounts of time to see the problem.

I suggest checking what the cpu temperature sensors are showing next
time it gets slow.

-- 
Len Sorensen


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-27 11:46                               ` Rafael J. Wysocki
@ 2016-09-29  2:22                                 ` Larry Finger
  2016-09-29 12:19                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 35+ messages in thread
From: Larry Finger @ 2016-09-29  2:22 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Doug Smythies, Srinivas Pandruvada, Rafael J. Wysocki, LKML,
	Linux PM list

On 09/27/2016 06:46 AM, Rafael J. Wysocki wrote:
> On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger
> <Larry.Finger@lwfinger.net> wrote:
>> On 09/26/2016 10:12 PM, Doug Smythies wrote:
>>>
>>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
>>>>
>>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>>>>>
>>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>>>>>
>>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>>>>> But for both we need a reproducer anyway.
>>>>>
>>>>> I do not have a reliable reproducer. The condition has always
>>>>> happened when
>>>>> running a high-compute job such as a 'make -j8' on the kernel, or
>>>>> building the
>>>>> RPM for openSUSE's implementation of VirtualBox. The latter is what
>>>>> I'm using
>>>>> for most of my testing.
>>>
>>>
>>> Run some CPU stressor and get all your CPU's going at 100% load.
>>> And watch your core temperatures while you do so.
>>
>>
>> for i in 1 2 3 4; do while : ; do : ; done & done
>>
>> triggered the fault in a few minutes.
>>>
>>>
>>>>
>>>>>> It also would be good to rule out the thermal throttling (as per
>>>>>> the Srinivas' comments).
>>>
>>>
>>> It is almost certainly thermal throttling, or similar causing
>>> Clock modulation, of it seems 50%.
>>
>>
>> While the infinite loops were running, the temps were:
>>
>> finger@linux-1t8h:~/rtlwifi_new> sensors
>> coretemp-isa-0000
>> Adapter: ISA adapter
>> Physical id 0:  +83.0°C  (high = +84.0°C, crit = +100.0°C)
>> Core 0:         +83.0°C  (high = +84.0°C, crit = +100.0°C)
>> Core 1:         +74.0°C  (high = +84.0°C, crit = +100.0°C)
>
> It looks like the trip point (high) temperature was exceeded causing
> thermal throttling to kick in.
>
>> After the fault occurs, I get
>>
>> finger@linux-1t8h:~/rtlwifi_new> sensors
>> coretemp-isa-0000
>> Adapter: ISA adapter
>> Physical id 0:  +44.0°C  (high = +84.0°C, crit = +100.0°C)
>> Core 0:         +43.0°C  (high = +84.0°C, crit = +100.0°C)
>> Core 1:         +41.0°C  (high = +84.0°C, crit = +100.0°C)
>
> So after that it stays at 400 MHz forever, right?
>
>>>>>>
>>>>>> For now, please tell me what's in
>>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>>>>
>>>>> 800000
>>>>
>>>> Your effective freq is lower than 800MHz. One of the possible reason is
>>>> thermal throttling.
>>>>
>>>> What distro you are using?
>>>
>>>
>>> And what make and model of LapTop?
>>
>>
>> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
>> 2.90GHz. That is a dual-core unit with hyperthreading.
>>
>> @Rafael: As I write this, the system has been running the infinite loop test
>> for almost 5 hours with kernel 4.7. I will leave that running while I'm
>> gone, but I am certain that it is OK.
>
> OK, and what temperatures do you see while doing this?

finger@linux-1t8h:~/linux-2.6> sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +90.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0:         +90.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:         +78.0°C  (high = +84.0°C, crit = +100.0°C)

Once again, the CPU temp is greater than the "high" value; however, the clock 
rate continues to hold near 3600 MHz.
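
The comparison against the "high" value can be done mechanically. A minimal
sketch, assuming the `sensors` output format shown above; the function name
over_high is hypothetical. It reads the output on stdin and prints the label
of every line whose reading is at or above its own "high" threshold.

```shell
# Read `sensors`-style lines on stdin; print the labels whose reading is
# at or above the "high" threshold given on the same line.
over_high() {
    awk -F: '/high =/ {
        rest = $2
        if (!match(rest, /[0-9]+\.[0-9]+/)) next
        temp = substr(rest, RSTART, RLENGTH) + 0      # current reading
        rest = substr(rest, RSTART + RLENGTH)
        if (!match(rest, /[0-9]+\.[0-9]+/)) next
        high = substr(rest, RSTART, RLENGTH) + 0      # "high" threshold
        if (temp >= high) print $1
    }'
}
```

With `sensors | over_high`, the readings above would list "Physical id 0" and
"Core 0" but not "Core 1".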

My laptop was inadvertently put to sleep while I was gone. I forgot to leave a 
note for my wife and she quieted the noisy cpu fan. :)

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-27 14:51               ` Lennart Sorensen
@ 2016-09-29  2:26                 ` Larry Finger
  2016-09-29 14:17                   ` Lennart Sorensen
  0 siblings, 1 reply; 35+ messages in thread
From: Larry Finger @ 2016-09-29  2:26 UTC (permalink / raw)
  To: Lennart Sorensen
  Cc: Rafael J. Wysocki, LKML, Linux PM list, Srinivas Pandruvada

On 09/27/2016 09:51 AM, Lennart Sorensen wrote:
> On Mon, Sep 26, 2016 at 04:28:29PM -0500, Larry Finger wrote:
>> Mostly I use a KDE applet named "System load" and look at the "average
>> clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
>> When the bug triggers, the system gets very slow, and the cpu fan stops even
>> though the cpu is still busy.
>>
>> Commit f7816ad, which had run for 7 days without showing the bug, failed
>> after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
>> well, that's the way it goes!
>
> Is it possible there is no bug and instead you have a hardware problem?
>
> What I am thinking:
>
> CPU fan stops, then CPU gets busy, CPU overheats, thermal throtling
> kicks in to protect CPU and it gets VERY slow.
>
> So maybe you have a bad CPU fan that is getting stuck.  Perhaps even if
> you have a motherboard that varies the CPU fan depending on need and the
> fan doesn't like the lowest speed and sometimes gets stuck when asked
> to go slow.
>
> Of course if the CPU fan is the problem that could explain why it takes
> varying amounts of time to see the problem.
>
> I suggest checking what the cpu temperature sensors are showing next
> time it gets slow.

By the time it gets slow, the CPU's cool, and one cannot see the temp just 
before that event happened.
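
One way around that would be to log both values continuously, so the readings
just before the slowdown are on record. A minimal sketch: the function name
log_sample is hypothetical, and the default hwmon path is an assumption that
has to be adjusted to whichever hwmon device is coretemp on the machine.

```shell
# Print one timestamped sample: the first "cpu MHz" value and one temperature.
# $1 = cpuinfo file, $2 = hwmon temperature file in millidegrees Celsius.
# The default hwmon path is an assumption; the coretemp index varies by system.
log_sample() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    local temp_file="${2:-/sys/class/hwmon/hwmon0/temp1_input}"
    local mhz milli
    mhz=$(awk -F': *' '/^cpu MHz/ { print $2; exit }' "$cpuinfo")
    read -r milli < "$temp_file"
    printf '%s %s MHz %d C\n' "$(date +%T)" "$mhz" "$((milli / 1000))"
}
```

Something like `while sleep 1; do log_sample; done >> freq-temp.log` left
running during the stress test would then show the temperature in the seconds
before the clock drops.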

The reason I suspect a bug is that it fails with 4.8-rcX, but not with 4.7. Of 
course, it could be something subtle that slightly changes the heat load, which 
causes the CPU temp to be a little higher so that the effect is triggered.

I am reasonably confident that it is not a hardware problem, but we may have to 
wait until 4.8 is released and gets wider usage. If no one else reports a 
problem, then I am certainly wrong.

Larry


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-29  2:22                                 ` Larry Finger
@ 2016-09-29 12:19                                   ` Rafael J. Wysocki
  2016-09-29 15:09                                     ` Larry Finger
  2016-09-29 15:56                                     ` Srinivas Pandruvada
  0 siblings, 2 replies; 35+ messages in thread
From: Rafael J. Wysocki @ 2016-09-29 12:19 UTC (permalink / raw)
  To: Larry Finger, Srinivas Pandruvada, Zhang Rui
  Cc: Rafael J. Wysocki, Doug Smythies, LKML, Linux PM list

On Wednesday, September 28, 2016 09:22:59 PM Larry Finger wrote:
> On 09/27/2016 06:46 AM, Rafael J. Wysocki wrote:
> > On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger
> > <Larry.Finger@lwfinger.net> wrote:
> >> On 09/26/2016 10:12 PM, Doug Smythies wrote:
> >>>
> >>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
> >>>>
> >>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
> >>>>>
> >>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
> >>>>>>
> >>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
> >>>>>> But for both we need a reproducer anyway.
> >>>>>
> >>>>> I do not have a reliable reproducer. The condition has always
> >>>>> happened when
> >>>>> running a high-compute job such as a 'make -j8' on the kernel, or
> >>>>> building the
> >>>>> RPM for openSUSE's implementation of VirtualBox. The latter is what
> >>>>> I'm using
> >>>>> for most of my testing.
> >>>
> >>>
> >>> Run some CPU stressor and get all your CPU's going at 100% load.
> >>> And watch your core temperatures while you do so.
> >>
> >>
> >> for i in 1 2 3 4; do while : ; do : ; done & done
> >>
> >> triggered the fault in a few minutes.
> >>>
> >>>
> >>>>
> >>>>>> It also would be good to rule out the thermal throttling (as per
> >>>>>> the Srinivas' comments).
> >>>
> >>>
> >>> It is almost certainly thermal throttling, or similar causing
> >>> Clock modulation, of it seems 50%.
> >>
> >>
> >> While the infinite loops were running, the temps were:
> >>
> >> finger@linux-1t8h:~/rtlwifi_new> sensors
> >> coretemp-isa-0000
> >> Adapter: ISA adapter
> >> Physical id 0:  +83.0°C  (high = +84.0°C, crit = +100.0°C)
> >> Core 0:         +83.0°C  (high = +84.0°C, crit = +100.0°C)
> >> Core 1:         +74.0°C  (high = +84.0°C, crit = +100.0°C)
> >
> > It looks like the trip point (high) temperature was exceeded causing
> > thermal throttling to kick in.
> >
> >> After the fault occurs, I get
> >>
> >> finger@linux-1t8h:~/rtlwifi_new> sensors
> >> coretemp-isa-0000
> >> Adapter: ISA adapter
> >> Physical id 0:  +44.0°C  (high = +84.0°C, crit = +100.0°C)
> >> Core 0:         +43.0°C  (high = +84.0°C, crit = +100.0°C)
> >> Core 1:         +41.0°C  (high = +84.0°C, crit = +100.0°C)
> >
> > So after that it stays at 400 MHz forever, right?
> >
> >>>>>>
> >>>>>> For now, please tell me what's in
> >>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
> >>>>>
> >>>>> 800000
> >>>>
> >>>> Your effective freq is lower than 800MHz. One of the possible reason is
> >>>> thermal throttling.
> >>>>
> >>>> What distro you are using?
> >>>
> >>>
> >>> And what make and model of LapTop?
> >>
> >>
> >> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
> >> 2.90GHz. That is a dual-core unit with hyperthreading.
> >>
> >> @Rafael: As I write this, the system has been running the infinite loop test
> >> for almost 5 hours with kernel 4.7. I will leave that running while I'm
> >> gone, but I am certain that it is OK.
> >
> > OK, and what temperatures do you see while doing this?
> 
> finger@linux-1t8h:~/linux-2.6> sensors
> coretemp-isa-0000
> Adapter: ISA adapter
> Physical id 0:  +90.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 0:         +90.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 1:         +78.0°C  (high = +84.0°C, crit = +100.0°C)
> 
> Once again, the CPU temp is greater than the "high" value; however, the clock 
> rate continues to hold near 3600 MHz.
> 
> My laptop was inadvertently put to sleep while I was gone. I forgot to leave a 
> note for my wife and she quieted the noisy cpu fan. :)

It looks like in 4.8-rc we made a change that caused the "high" trip point to
be acted on.

Srinivas, Rui, do you recall what that can be?

One more question (I think I asked it previously): In the failing case
(4.8-rc1 and later), when the frequency drops to 400 MHz, does it
ever go back up, or is it stuck at that level forever?

In any case, it may help to file a bug at bugzilla.kernel.org against
CPU/thermal or similar and let me know the bug number.  We'll need to
collect some tracepoint data to debug this, and we need a place to put it
for easy reference.

Thanks,
Rafael


* Re: Regression in 4.8 - CPU speed set very low
  2016-09-29  2:26                 ` Larry Finger
@ 2016-09-29 14:17                   ` Lennart Sorensen
  0 siblings, 0 replies; 35+ messages in thread
From: Lennart Sorensen @ 2016-09-29 14:17 UTC (permalink / raw)
  To: Larry Finger; +Cc: Rafael J. Wysocki, LKML, Linux PM list, Srinivas Pandruvada

On Wed, Sep 28, 2016 at 09:26:42PM -0500, Larry Finger wrote:
> By the time it gets slow, the CPU's cool, and one cannot see the temp just
> before that event happened.

Hmm, I would not expect the CPU to drop from 80 to 40 degrees in a few
seconds if the fan is not spinning.  I wouldn't even expect it if the
fan was spinning.  I would think at least 30 to 60 seconds if not more.

The only way I would think the temperature could change quickly would
be if the heatsink isn't even touching the CPU anymore so there is very
little material to hold the heat in the CPU.

> The reason I suspect a bug is that it fails with 4.8-rcX, but not with 4.7.
> Of course, it could be something subtle that slightly changes the heat load,
> which causes the CPU temp to be a little higher so that the effect is
> triggered.
> 
> I am reasonably confident that it is not a hardware problem, but we may have
> to wait until 4.8 is released and gets wider usage. If no one else reports a
> problem, then I am certainly wrong.

Well, hard-to-reproduce bugs are always really annoying.

This old bug sounds a lot like what you are seeing:
https://bugzilla.redhat.com/show_bug.cgi?id=924570
and it links to this:
https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.6-Thermal-Updates

Turning off turbo boost apparently stops the problem for a lot of
people in that case.  That doesn't explain why it started happening
recently, and of course it may have been a different problem in
the past.
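
For anyone who wants to try the turbo experiment, here is a rough sketch. It assumes the intel_pstate driver is in use, since the `no_turbo` file only exists under that driver; the function names are made up for illustration, and writing the real file needs root.

```shell
#!/bin/sh
# Assumed path; present only when the intel_pstate driver is active.
NO_TURBO=${NO_TURBO:-/sys/devices/system/cpu/intel_pstate/no_turbo}

# Report the current setting. 1 in no_turbo means turbo is disabled,
# 0 means turbo is allowed.
turbo_status() {
    case "$(cat "${1:-$NO_TURBO}")" in
        1) echo "turbo disabled" ;;
        0) echo "turbo enabled" ;;
        *) echo "unknown" ;;
    esac
}

# Write 0 or 1 to the no_turbo file ($2 overrides the path).
set_no_turbo() {
    echo "$1" > "${2:-$NO_TURBO}"
}
```

`set_no_turbo 1` followed by the usual load would show whether the slowdown still appears with turbo off; `set_no_turbo 0` restores the default.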

-- 
Len Sorensen

* Re: Regression in 4.8 - CPU speed set very low
  2016-09-29 12:19                                   ` Rafael J. Wysocki
@ 2016-09-29 15:09                                     ` Larry Finger
  2016-09-29 15:56                                     ` Srinivas Pandruvada
  1 sibling, 0 replies; 35+ messages in thread
From: Larry Finger @ 2016-09-29 15:09 UTC (permalink / raw)
  To: Rafael J. Wysocki, Srinivas Pandruvada, Zhang Rui
  Cc: Rafael J. Wysocki, Doug Smythies, LKML, Linux PM list

On 09/29/2016 07:19 AM, Rafael J. Wysocki wrote:
> On Wednesday, September 28, 2016 09:22:59 PM Larry Finger wrote:
>> On 09/27/2016 06:46 AM, Rafael J. Wysocki wrote:
>>> On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger
>>> <Larry.Finger@lwfinger.net> wrote:
>>>> On 09/26/2016 10:12 PM, Doug Smythies wrote:
>>>>>
>>>>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
>>>>>>
>>>>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>>>>>>>
>>>>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>>>>>>>
>>>>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>>>>>>> But for both we need a reproducer anyway.
>>>>>>>
>>>>>>> I do not have a reliable reproducer. The condition has always
>>>>>>> happened when running a high-compute job such as a 'make -j8' on
>>>>>>> the kernel, or building the RPM for openSUSE's implementation of
>>>>>>> VirtualBox. The latter is what I'm using for most of my testing.
>>>>>
>>>>>
>>>>> Run some CPU stressor and get all your CPUs going at 100% load.
>>>>> And watch your core temperatures while you do so.
>>>>
>>>>
>>>> for i in 1 2 3 4; do while : ; do : ; done & done
>>>>
>>>> triggered the fault in a few minutes.
>>>>>
>>>>>
>>>>>>
>>>>>>>> It also would be good to rule out thermal throttling (as per
>>>>>>>> Srinivas' comments).
>>>>>
>>>>>
>>>>> It is almost certainly thermal throttling, or something similar
>>>>> causing clock modulation of, it seems, 50%.
>>>>
>>>>
>>>> While the infinite loops were running, the temps were:
>>>>
>>>> finger@linux-1t8h:~/rtlwifi_new> sensors
>>>> coretemp-isa-0000
>>>> Adapter: ISA adapter
>>>> Physical id 0:  +83.0°C  (high = +84.0°C, crit = +100.0°C)
>>>> Core 0:         +83.0°C  (high = +84.0°C, crit = +100.0°C)
>>>> Core 1:         +74.0°C  (high = +84.0°C, crit = +100.0°C)
>>>
>>> It looks like the trip point (high) temperature was exceeded causing
>>> thermal throttling to kick in.
>>>
>>>> After the fault occurs, I get
>>>>
>>>> finger@linux-1t8h:~/rtlwifi_new> sensors
>>>> coretemp-isa-0000
>>>> Adapter: ISA adapter
>>>> Physical id 0:  +44.0°C  (high = +84.0°C, crit = +100.0°C)
>>>> Core 0:         +43.0°C  (high = +84.0°C, crit = +100.0°C)
>>>> Core 1:         +41.0°C  (high = +84.0°C, crit = +100.0°C)
>>>
>>> So after that it stays at 400 MHz forever, right?
>>>
>>>>>>>>
>>>>>>>> For now, please tell me what's in
>>>>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>>>>>>
>>>>>>> 800000
>>>>>>
>>>>>> Your effective freq is lower than 800MHz. One possible reason is
>>>>>> thermal throttling.
>>>>>>
>>>>>> What distro are you using?
>>>>>
>>>>>
>>>>> And what make and model of laptop?
>>>>
>>>>
>>>> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
>>>> 2.90GHz". That is a dual-core unit with hyperthreading.
>>>>
>>>> @Rafael: As I write this, the system has been running the infinite loop test
>>>> for almost 5 hours with kernel 4.7. I will leave that running while I'm
>>>> gone, but I am certain that it is OK.
>>>
>>> OK, and what temperatures do you see while doing this?
>>
>> finger@linux-1t8h:~/linux-2.6> sensors
>> coretemp-isa-0000
>> Adapter: ISA adapter
>> Physical id 0:  +90.0°C  (high = +84.0°C, crit = +100.0°C)
>> Core 0:         +90.0°C  (high = +84.0°C, crit = +100.0°C)
>> Core 1:         +78.0°C  (high = +84.0°C, crit = +100.0°C)
>>
>> Once again, the CPU temp is greater than the "high" value; however, the clock
>> rate continues to hold near 3600 MHz.
>>
>> My laptop was inadvertently put to sleep while I was gone. I forgot to leave a
>> note for my wife and she quieted the noisy cpu fan. :)
>
> It looks like in 4.8-rc we made a change that caused the "high" trip point to
> be acted on.
>
> Srinivas, Rui, do you recall what that can be?
>
> One more question (I think I asked it previously): In the failing case
> (4.8-rc1 and later), when the frequency drops to 400 MHz, does it
> ever go back higher or is it stuck at that level forever?
>
> In any case, it may help to file a bug at bugzilla.kernel.org against
> CPU/thermal or similar and let me know the bug number.  We'll need to
> collect some tracepoint data to debug this and some place to put them
> into for easy reference.

Sorry if I missed that earlier question. The CPU is stuck at that lower 
frequency until I reboot.
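
For anyone trying to confirm the stuck state on their own machine, a small check along these lines may help. It is only a sketch: the function name is made up, and it assumes the "cpu MHz" lines in /proc/cpuinfo reflect the effective clock, which is where the ~396 MHz figure showed up for me.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/cpuinfo text on stdin and print every
# logical CPU whose reported "cpu MHz" is below the given floor in kHz.
cpus_below_min() {
    # $1 = floor in kHz, e.g. the contents of cpuinfo_min_freq
    awk -F: -v min_khz="$1" '
        /^processor/ { cpu = $2 + 0 }
        /^cpu MHz/   { if ($2 * 1000 < min_khz) printf "cpu%d %.0f MHz\n", cpu, $2 }
    '
}
```

Typical use would be `cpus_below_min "$(cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq)" < /proc/cpuinfo`. Any output means a CPU is running below the advertised minimum, which cpufreq alone should never request.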

Bug report at https://bugzilla.kernel.org/show_bug.cgi?id=173361. I tried to 
cover the main points of the discussion. Please add the ones that I missed.

Larry

* Re: Regression in 4.8 - CPU speed set very low
  2016-09-29 12:19                                   ` Rafael J. Wysocki
  2016-09-29 15:09                                     ` Larry Finger
@ 2016-09-29 15:56                                     ` Srinivas Pandruvada
  2016-09-29 16:24                                       ` Larry Finger
                                                         ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Srinivas Pandruvada @ 2016-09-29 15:56 UTC (permalink / raw)
  To: Rafael J. Wysocki, Larry Finger, Zhang Rui
  Cc: Rafael J. Wysocki, Doug Smythies, LKML, Linux PM list

On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:

[...]

> > My laptop was inadvertently put to sleep while I was gone. I forgot
> > to leave a note for my wife and she quieted the noisy cpu fan. :)
> It looks like in 4.8-rc we made a change that caused the "high" trip
> point to be acted on.
We don't expose this high trip point in the thermal subsystem (the thermal
zone dump didn't show it anywhere as a trip). It is exposed by the
core-dts driver only. This is the point at which the BIOS is supposed to
act; I guess that's why you are seeing 50% clock modulation.
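
The 50% figure is a guess, but the arithmetic is consistent with the numbers reported earlier in the thread: a minimum frequency of 800000 kHz at a 50% duty cycle gives the ~400 MHz being observed. As a quick sketch (the function name is only for illustration):

```shell
#!/bin/sh
# Effective clock in kHz for a given frequency and modulation duty cycle.
effective_khz() {
    # $1 = frequency in kHz, $2 = duty cycle in percent
    echo $(( $1 * $2 / 100 ))
}

effective_khz 800000 50   # prints 400000
```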


Are you running thermald?

What is the output of:
# ps -e | grep thermald


> 
> Srinivas, Rui, do you recall what that can be?
> 
> One more question (I think I asked it previously): In the failing
> case (4.8-rc1 and later), when the frequency drops to 400 MHz,
> does it ever go back higher or is it stuck at that level forever?
> 
> In any case, it may help to file a bug at bugzilla.kernel.org against
> CPU/thermal or similar and let me know the bug number.  We'll need to
> collect some tracepoint data to debug this and some place to put them
> into for easy reference.
Yes, this is a good idea.

Thanks,
Srinivas

* Re: Regression in 4.8 - CPU speed set very low
  2016-09-29 15:56                                     ` Srinivas Pandruvada
@ 2016-09-29 16:24                                       ` Larry Finger
  2016-09-29 20:52                                       ` Rafael J. Wysocki
  2016-09-30 21:52                                       ` Larry Finger
  2 siblings, 0 replies; 35+ messages in thread
From: Larry Finger @ 2016-09-29 16:24 UTC (permalink / raw)
  To: Srinivas Pandruvada, Rafael J. Wysocki, Zhang Rui
  Cc: Rafael J. Wysocki, Doug Smythies, LKML, Linux PM list

On 09/29/2016 10:56 AM, Srinivas Pandruvada wrote:
> On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:
>
> [...]
>
>>> My laptop was inadvertently put to sleep while I was gone. I forgot
>>> to leave a note for my wife and she quieted the noisy cpu fan. :)
>> It looks like in 4.8-rc we made a change that caused the "high" trip
>> point to be acted on.
> We don't expose this high trip point in the thermal subsystem (the thermal
> zone dump didn't show it anywhere as a trip). It is exposed by the
> core-dts driver only. This is the point at which the BIOS is supposed to
> act; I guess that's why you are seeing 50% clock modulation.
>
> Are you running thermald?
>
> What is the output of:
> # ps -e | grep thermald

The output is blank. I am not running thermald.

Larry

* Re: Regression in 4.8 - CPU speed set very low
  2016-09-29 15:56                                     ` Srinivas Pandruvada
  2016-09-29 16:24                                       ` Larry Finger
@ 2016-09-29 20:52                                       ` Rafael J. Wysocki
  2016-09-30 21:52                                       ` Larry Finger
  2 siblings, 0 replies; 35+ messages in thread
From: Rafael J. Wysocki @ 2016-09-29 20:52 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Larry Finger, Zhang Rui, Rafael J. Wysocki, Doug Smythies, LKML,
	Linux PM list

On Thursday, September 29, 2016 08:56:16 AM Srinivas Pandruvada wrote:
> On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:
> 
> [...]
> 
> > > My laptop was inadvertently put to sleep while I was gone. I forgot
> > > to leave a note for my wife and she quieted the noisy cpu fan. :)
> > It looks like in 4.8-rc we made a change that caused the "high" trip
> > point to be acted on.
> We don't expose this high trip point in the thermal subsystem (the thermal
> zone dump didn't show it anywhere as a trip). It is exposed by the
> core-dts driver only. This is the point at which the BIOS is supposed to
> act; I guess that's why you are seeing 50% clock modulation.

Right.  That's SMM kicking in.

The real problem is that we get stuck at 400 MHz.

Thanks,
Rafael

* Re: Regression in 4.8 - CPU speed set very low
  2016-09-29 15:56                                     ` Srinivas Pandruvada
  2016-09-29 16:24                                       ` Larry Finger
  2016-09-29 20:52                                       ` Rafael J. Wysocki
@ 2016-09-30 21:52                                       ` Larry Finger
  2 siblings, 0 replies; 35+ messages in thread
From: Larry Finger @ 2016-09-30 21:52 UTC (permalink / raw)
  To: Srinivas Pandruvada, Rafael J. Wysocki, Zhang Rui
  Cc: Rafael J. Wysocki, Doug Smythies, LKML, Linux PM list

On 09/29/2016 10:56 AM, Srinivas Pandruvada wrote:
> On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:
>
> [...]
>
>>> My laptop was inadvertently put to sleep while I was gone. I forgot
>>> to leave a note for my wife and she quieted the noisy cpu fan. :)
>> It looks like in 4.8-rc we made a change that caused the "high" trip
>> point to be acted on.
> We don't expose this high trip point in the thermal subsystem (the thermal
> zone dump didn't show it anywhere as a trip). It is exposed by the
> core-dts driver only. This is the point at which the BIOS is supposed to
> act; I guess that's why you are seeing 50% clock modulation.
>
> Are you running thermald?
>
> What is the output of:
> # ps -e | grep thermald
>
>
>>
>> Srinivas, Rui, do you recall what that can be?
>>
>> One more question (I think I asked it previously): In the failing
>> case (4.8-rc1 and later), when the frequency drops to 400 MHz,
>> does it ever go back higher or is it stuck at that level forever?
>>
>> In any case, it may help to file a bug at bugzilla.kernel.org against
>> CPU/thermal or similar and let me know the bug number.  We'll need to
>> collect some tracepoint data to debug this and some place to put them
>> into for easy reference.
> Yes, this is a good idea.

To complete the record in this thread, the problem also happened with kernel 
4.7, thus it is not a regression in 4.8-rcX. The full discussion is at 
https://bugzilla.kernel.org/show_bug.cgi?id=173361.

Larry

Thread overview: 35+ messages
2016-09-09 17:39 Regression in 4.8 - CPU speed set very low Larry Finger
2016-09-14 16:00 ` Larry Finger
2016-09-19  2:54   ` Larry Finger
2016-09-24  2:45     ` Larry Finger
2016-09-26 11:37       ` Rafael J. Wysocki
2016-09-26 16:15         ` Larry Finger
2016-09-26 21:06           ` Rafael J. Wysocki
2016-09-26 21:26             ` Srinivas Pandruvada
2016-09-26 21:30               ` Rafael J. Wysocki
2016-09-26 21:41                 ` Srinivas Pandruvada
2016-09-26 21:46                   ` Rafael J. Wysocki
2016-09-26 21:28             ` Larry Finger
2016-09-26 21:37               ` Rafael J. Wysocki
2016-09-26 21:46                 ` Srinivas Pandruvada
2016-09-26 22:15                   ` Larry Finger
2016-09-26 22:09                 ` Larry Finger
2016-09-26 22:16                   ` Rafael J. Wysocki
2016-09-26 23:53                     ` Larry Finger
2016-09-27  0:21                       ` Rafael J. Wysocki
2016-09-27  0:48                         ` Larry Finger
2016-09-27  1:30                           ` Srinivas Pandruvada
2016-09-27  2:53                             ` Larry Finger
2016-09-27  3:12                           ` Doug Smythies
2016-09-27  8:48                             ` Larry Finger
2016-09-27 11:46                               ` Rafael J. Wysocki
2016-09-29  2:22                                 ` Larry Finger
2016-09-29 12:19                                   ` Rafael J. Wysocki
2016-09-29 15:09                                     ` Larry Finger
2016-09-29 15:56                                     ` Srinivas Pandruvada
2016-09-29 16:24                                       ` Larry Finger
2016-09-29 20:52                                       ` Rafael J. Wysocki
2016-09-30 21:52                                       ` Larry Finger
2016-09-27 14:51               ` Lennart Sorensen
2016-09-29  2:26                 ` Larry Finger
2016-09-29 14:17                   ` Lennart Sorensen
