Re: kernel CI: printk loglevels in kernel boot logs?

* Re: kernel CI: printk loglevels in kernel boot logs?
       [not found] ` <20171122015610.x3kgzqgtwywlurmz@wfg-t540p.sh.intel.com>
@ 2017-11-22  3:27   ` Fengguang Wu
  2017-11-22  5:26     ` Sergey Senozhatsky
  2017-11-22 11:34     ` Petr Mladek
  0 siblings, 2 replies; 20+ messages in thread
From: Fengguang Wu @ 2017-11-22  3:27 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Mark Brown, Greg Kroah-Hartman, LKML, Petr Mladek,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds

[CC LKML for possible printk improvements]

On Wed, Nov 22, 2017 at 09:56:10AM +0800, Fengguang Wu wrote:
>Hi Kevin,
>
>On Tue, Nov 21, 2017 at 12:27:48PM -0800, Kevin Hilman wrote:
>>Hi Fenguang,
>>
>>In automated testing, for ease of parsing kernel boot logs (especially
>>separating warnings and errors from debug, info etc.)
>>
>>Right now we can get this info from a "dmesg --raw" after bootup, but
>>it would be really nice in certain automation frameworks to have a
>>kernel command-line option to enable printing of loglevels in default
>>boot log.
>
>Agreed.
>
>>This is espeically useful when ingesting kernel logs into advanced
>>search/analytics frameworks (I'm playing with and ELK stack: Elastic
>>Search, Logstash, Kibana).
>>
>>Does that sound like a feature you'd be interested in?
>
>Yes, sure.
>
>>In kernelCI, we're considering submitting a patch to add a
>>"show_loglevel" command-line argument to enable that option on kernel
>>boot.
>
>Thanks for doing that patch! It'll obviously make it easier to catch
>various warnings, which will be useful when used with caution,
>especially when false warnings (wrt. real problems that should be
>fixed) can be effectively filtered out.

[...]

>As you may know I'm currently reporting kernel oops in mainline
>kernel, hoping to clear noisy oops there -- they obviously hurt
>catching and bisecting new oops.
>
>I have to say the warnings are much more noisy than kernel oops
>in 2 ways:
>
>1) It's common for a *normal* kernel boot to have a dozen of old
>warnings.
>
>2) Many warnings do not necessarily mean something should or could be
>fixed -- they may well be mentioning some HW problem, or an alert
>message to the user.
>
>So there is a much bigger and messy problem than catching the warnings:
>ways to effectively mark or filter real warnings that automated testing
>should catch and report.
>
>For filtering, we currently accumulated the below blacklist:
>
>https://github.com/fengguang/lkp-tests/blob/master/etc/kmsg-blacklist
>
[...]

For the marking part, I wonder if there can be a clear rule that
allows developers to distinguish 2 kind of information for users and
testers:

- "bug" message: indicating a regression that should be reported and fixed

- "fact" message: well there's a problem, but we kernel developers
  probably can do nothing about it. It's just to let the user know
  about the fact. The fix might be something like replacing a broken
  disk drive.

Those message types are orthogonal to severity of the problem (log
levels), so the current log levels are actually not sufficient for
distinguishing these kind of situations.

Here are the list of warning ids waiting for bisect in 0day test farm.
As you may see, only relatively few ones may worth bisecting and
reporting. We've already built a pretty large blacklist, however it's
proved to be awkward and ineffective -- it obviously suffers from both
coverage and expertise problems.

    777 hub#-#:#:config_failed,can't_get_hub_status(err#)
    328 Firmware_Bug]:TSC_DEADLINE_disabled_due_to_Errata;please_update_microcode_to_version:#(or_later)
     31 ACPI_Error:Field[CPB3]at_bit_offset/length#exceeds_size_of_target_Buffer(#bits)(#/dsopcode-#)
     30 ACPI_Error:Method_parse/execution_failed~_SB.PMI0._PMC,AE_NOT_EXIST(#/psparse-#)
     29 ACPI_Error:Method_parse/execution_failed~_SB.PMI0._GHL,AE_NOT_EXIST(#/psparse-#)
     28 ACPI_Error:Region_IPMI(ID=#)has_no_handler(#/exfldio-#)
     28 ACPI_Error:No_handler_for_Region[SYSI](#)[IPMI](#/evregion-#)
     27 ACPI_Error:Method_parse/execution_failed~_SB._OSC,AE_AML_BUFFER_LIMIT(#/psparse-#)
     25 tpm_tpm#:A_TPM_error(#)occurred_attempting_get_random
     24 IP-Config:Reopening_network_devices
     19 DHCP/BOOTP:Ignoring_fragmented_reply
      8 xhci_hcd#:#:#:init#:#:#fail
      8 xhci_hcd#:#:#:can't_setup
      6 in_atomic():#,irqs_disabled():#,pid:#,name:swapper
      4 megaraid_sas#:#:#:Init_cmd_return_status_SUCCESS_for_SCSI_host
      4 in_atomic():#,irqs_disabled():#,pid:#,name:kworker
      4 drm:drm_atomic_helper_commit_cleanup_done[drm_kms_helper]]*ERROR*[CRTC:#:pipe_A]flip_done_timed_out
      2 print_req_error:I/O_error,dev_loop#,sector
      2 IP-Config:Failed_to_open_erspan0
      2 in_atomic():#,irqs_disabled():#,pid:#,name:systemd-udevd
      2 in_atomic():#,irqs_disabled():#,pid:#,name:perf
      2 acerhdf:unknown(unsupported)BIOS_version_QEMU/Standard_PC(i440FX+PIIX,#).#-#,please_report,aborting
      1 XFS(sda8):metadata_I/O_error:block#(~xfs_readlink_bmap_ilocked~)error#numblks
      1 XFS(sda8):Metadata_CRC_error_detected_at_xfs_symlink_read_verify[xfs],xfs_symlink_block
      1 XFS(sda8):Metadata_CRC_error_detected_at_xfs_dir3_data_read_verify[xfs],xfs_dir3_data_block
      1 XFS(sda8):Metadata_CRC_error_detected_at_xfs_da3_node_read_verify[xfs],xfs_da3_node_block
      1 XFS(sda8):Metadata_CRC_error_detected_at_xfs_attr3_leaf_read_verify[xfs],xfs_attr3_leaf_block
      1 Memory_failure:#:recovery_action_for_huge_page:Recovered
      1 Memory_failure:#:recovery_action_for_clean_LRU_page:Recovered
      1 Memory_failure:#:Killing_tinjpage:#due_to_hardware_memory_corruption
      1 mce:[Hardware_Error]:TSC#MISC#df87b000d9eff
      1 in_atomic():#,irqs_disabled():#,pid:#,name:xargs
      1 in_atomic():#,irqs_disabled():#,pid:#,name:triad_loop
      1 in_atomic():#,irqs_disabled():#,pid:#,name:tchain_edit
      1 in_atomic():#,irqs_disabled():#,pid:#,name:sh
      1 in_atomic():#,irqs_disabled():#,pid:#,name:sed
      1 in_atomic():#,irqs_disabled():#,pid:#,name:run.sh
      1 in_atomic():#,irqs_disabled():#,pid:#,name:perf_test
      1 in_atomic():#,irqs_disabled():#,pid:#,name:noploop.sh
      1 in_atomic():#,irqs_disabled():#,pid:#,name:jbd2/sda1
      1 in_atomic():#,irqs_disabled():#,pid:#,name:grep
      1 in_atomic():#,irqs_disabled():#,pid:#,name:fallocate
      1 in_atomic():#,irqs_disabled():#,pid:#,name:dmesg
      1 in_atomic():#,irqs_disabled():#,pid:#,name:cc1
      1 do_IRQ:#No_irq_handler_for_vector
      1 Buffer_I/O_error_on_dev_dm-#,logical_block#,async_page_read

We already have a well known answer to "marking bugs", ie. to dump a
pile of call trace that no one can tolerate.

The down side is, for some kind of bugs the developer may not need the
call trace at all. And the call trace might make users unnecessarily
nervous. That's why we have this email thread -- to catch regressions
indicated by some 1-liner kernel warnings. For that kind of situation,
it may be enough to add a common prefix for such messages. For example,

        [kernel bug] your warning/error printk message
        ~~~~~~~~~~~~

Alternatively, some printk messages already clearly stated
"please report" (they seem more targeted for users than testers):

wfg /c/linux% git grep '".*please report'|head
block/elevator.c:645:                  "(nr_sorted=%u), please report this\n",
drivers/ata/pata_hpt37x.c:867:                  pr_err("Unknown HPT366 subtype, please report (%d)\n",
drivers/ata/pata_hpt37x.c:908:          pr_err("PCI table is bogus, please report (%d)\n", dev->device);
drivers/ata/pata_hpt3x2n.c:532:         pr_err("PCI table is bogus, please report (%d)\n", dev->device);
drivers/edac/mce_amd.c:725:                      " please report on LKML.\n");
drivers/gpu/drm/amd/amdgpu/amdgpu_display.c:395:                                DRM_INFO("  DDC: no ddc bus - possible BIOS bug - please report to xorg-driver-ati@lists.x.org\n");
drivers/gpu/drm/nouveau/nouveau_bios.c:1458:                                  "please report\n");
drivers/gpu/drm/radeon/radeon_display.c:803:                            DRM_INFO("  DDC: no ddc bus - possible BIOS bug - please report to xorg-driver-ati@lists.x.org\n");
drivers/hwmon/ibmaem.c:773:                             "Unknown AEM v%d; please report this to the maintainer.\n",
drivers/hwmon/w83781d.c:1367:                    "If reset=1 solved a problem you were having, please report!\n");

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 20+ messages in thread