All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v1 0/2] docs: add a document dedicated to regressions
@ 2022-01-03  9:50 Thorsten Leemhuis
  2022-01-03  9:50 ` [RFC PATCH v1 1/2] docs: add a document about regression handling Thorsten Leemhuis
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Thorsten Leemhuis @ 2022-01-03  9:50 UTC (permalink / raw)
  To: linux-doc, Linus Torvalds, Greg Kroah-Hartman
  Cc: workflows, Linux Kernel Mailing List, Randy Dunlap, Jonathan Corbet

'We don't cause regressions' might be the first rule of kernel development, but
it and other aspects of regressions nevertheless are hardly described in the
Linux kernel's documentation. These patches change this by creating a document
dedicated to the topic.

The second patch could easily be folded into the first one, but I kept it
separate, as it might be a bit controversial. This also allows the patch
description to explain some backgrounds for this part of the text. Additionally,
ACKs and Reviewed-by tags can be collected separately this way.

v1/RFC:
- initial version

Thorsten Leemhuis (2):
  docs: add a document about regression handling
  docs: regressions.rst: rules of thumb for handling regressions

 Documentation/admin-guide/index.rst       |   1 +
 Documentation/admin-guide/regressions.rst | 947 ++++++++++++++++++++++
 MAINTAINERS                               |   1 +
 3 files changed, 949 insertions(+)
 create mode 100644 Documentation/admin-guide/regressions.rst


base-commit: b36064425a18e29a3bad9c007b4dd1223f8aadc5
-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH v1 1/2] docs: add a document about regression handling
  2022-01-03  9:50 [RFC PATCH v1 0/2] docs: add a document dedicated to regressions Thorsten Leemhuis
@ 2022-01-03  9:50 ` Thorsten Leemhuis
  2022-01-03 17:07   ` Jakub Kicinski
  2022-01-04 14:17   ` Lukas Bulwahn
  2022-01-03  9:50 ` [RFC PATCH v1 2/2] docs: regressions.rst: rules of thumb for handling regressions Thorsten Leemhuis
  2022-01-03 14:01 ` [RFC PATCH v1 0/2] docs: add a document dedicated to regressions Greg Kroah-Hartman
  2 siblings, 2 replies; 15+ messages in thread
From: Thorsten Leemhuis @ 2022-01-03  9:50 UTC (permalink / raw)
  To: linux-doc, Linus Torvalds, Greg Kroah-Hartman
  Cc: workflows, Linux Kernel Mailing List, Randy Dunlap, Jonathan Corbet

Create a document explaining various aspects around regression handling
and tracking both for users and developers. Among others describe the
first rule of Linux kernel development and what it means in practice.
Also explain what a regression actually is and how to report them
properly. The text additionally provides a brief introduction to the bot
the kernel's regression tracker users to facilitate the work. To sum
things up, provide a few quotes from Linus to show how serious the he
takes regressions.

Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
---
 Documentation/admin-guide/index.rst       |   1 +
 Documentation/admin-guide/regressions.rst | 869 ++++++++++++++++++++++
 MAINTAINERS                               |   1 +
 3 files changed, 871 insertions(+)
 create mode 100644 Documentation/admin-guide/regressions.rst

diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 1bedab498104..17157ee5a416 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -36,6 +36,7 @@ problems and bugs in particular.
 
    reporting-issues
    security-bugs
+   regressions
    bug-hunting
    bug-bisect
    tainted-kernels
diff --git a/Documentation/admin-guide/regressions.rst b/Documentation/admin-guide/regressions.rst
new file mode 100644
index 000000000000..1ff6a0802fc9
--- /dev/null
+++ b/Documentation/admin-guide/regressions.rst
@@ -0,0 +1,869 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+..
+   If you want to distribute this text under CC-BY-4.0 only, please use 'The
+   Linux kernel developers' for author attribution and link this as source:
+   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/regressions.rst
+..
+   Note: Only the content of this RST file as found in the Linux kernel sources
+   is available under CC-BY-4.0, as versions of this text that were processed
+   (for example by the kernel's build system) might contain content taken from
+   files which use a more restrictive license.
+
+
+Regressions
++++++++++++
+
+The first rule of Linux kernel development: '*We don't cause regressions*'.
+Linux founder and lead developer Linus Torvalds strictly enforces the rule
+himself. This document describes what this means in practice and how the Linux
+kernel's development model ensues all reported regressions get addressed; it
+covers aspects relevant for both users and developers.
+
+The important bits for people affected by regressions
+=====================================================
+
+It's a regression if something running fine with one Linux kernel works worse or
+not at all with a newer version. Note, the newer kernel has to be compiled using
+a similar configuration -- for this and other fine print, check out below
+section "What is a 'regression' and what is the 'no regressions rule'?".
+
+Report your regression as outlined in
+`Documentation/admin-guide/reporting-issues.rst`, it already covers all aspects
+important for regressions. Below section "How do I report a regression?"
+highlights them for convenience.
+
+The most important aspect: CC for forward the report to `the regression mailing
+list <https://lore.kernel.org/regressions/>`_ (regressions@lists.linux.dev).
+When doing so, consider mentioning the version range where the regression
+started using a paragraph like this::
+
+       #regzbot introduced v5.13..v5.14-rc1
+
+The Linux kernel regression tracking bot 'regzbot' will then add the report to
+the list of tracked regressions. This is in your interest, as it gets the report
+on the radar of people ensuring all regressions are acted upon in timely manner.
+
+The important bits for people fixing regressions
+================================================
+
+When getting regression reports by mail, check if the reporter CCed `the
+regression mailing list <https://lore.kernel.org/regressions/>`_
+(regressions@lists.linux.dev). If not, forward or bounce the report to the Linux
+kernel's regression tracker (regressions@leemhuis.info), unless you plan sending
+a reply to the report anyway. In that case simply CC the list in a direct reply
+to the report. Also check, if the report included a 'regzbot command' like
+``#regzbot introduced v5.13..v5.14-rc1`` (see above); if not, please include a
+paragraph like the following, to get the regression tracked by the Linux kernel
+regression tracking bot 'regzbot'::
+
+       #regzbot ^introduced v5.13..v5.14-rc1
+
+If the report was filed in a public bug-tracker, forward it to the regression
+list; add the aforementioned paragraph, just omit the caret (the ^) before the
+``introduced``, which make regzbot treat your mail (and not the one you reply
+to) as the report.
+
+When submitting fixes for regressions, always include 'Link:' tags in the commit
+message that point to all places where the issue was reported, as explained in
+`Documentation/process/submitting-patches.rst` and
+:ref:`Documentation/process/5.Posting.rst <development_posting>`. Hence, link to
+any mails in the archive with reports about the issue as well as all bug-tracker
+entries::
+
+       Link: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
+       Link: https://bugzilla.kernel.org/show_bug.cgi?id=215375
+
+This is important for regression tracking, as this allows regzbot to
+automatically associate tracked regression reports with patch postings and
+commits fixing it.
+
+
+All the details on handling Linux kernel regressions
+====================================================
+
+The important basics
+--------------------
+
+What is a 'regression' and what is the 'no regressions rule'?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's a regression if some application or practical use case running fine on one
+Linux kernel works worse or not at all with a newer version compiled using a
+similar configuration. The 'no regressions rule' forbids this to happen. If a
+regression happens by accident, developers that caused it are expected to
+quickly fix the issue.
+
+It thus is a regression when a WiFi driver from Linux 5.13 works fine, but with
+5.14 doesn't work at all, works significantly slower, or misbehaves somehow.
+It's also a regression if a perfectly working application suddenly shows erratic
+behavior with a newer kernel version, which can be caused by changes in procfs,
+sysfs, or one of the many other interfaces Linux provides to userland software.
+But keep in mind, as mentioned earlier: 5.14 in this example needs to be build
+from a configuration similar to the one from 5.13. This can be achieved using
+``make olddefconfig``, as explained in more detail below.
+
+Note the 'practical use case' in the first sentence of this section: developers
+despite the 'no regressions' rule are free to change any aspect of the kernel
+and even APIs or ABIs to userland, as long as no existing application or
+use-case breaks.
+
+Also be aware the 'no regressions' rule covers only interfaces the kernel
+provides to the userland. It thus does not apply to kernel-internal interfaces
+like the module API, which some externally developed drivers use to hook into
+the kernel.
+
+What is the goal of the 'no regressions rule'?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Users should feel safe when updating kernel versions and not have to worry
+something might break. This is in the interest of the kernel developers to make
+updating attractive: they don't want users to stay on stable or longterm Linux
+series either abandoned or more than one and a half year old, as `those might
+have known problems, security issues, or other aspects already improved in later
+versions
+<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
+
+How hard is the rule enforced?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Extraordinarily strict, as can be seen by many mailing list posts from Linux
+creator and lead-developer Linus Torvalds, some of which are quoted at the end
+of this document.
+
+Exceptions to this rule are extremely rare; in the past developers almost always
+turned out to be wrong when they assumed a particular situation was warranting
+an exception.
+
+How is the rule enforced?
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's the duty of the subsystem maintainers, which are watched and supported by
+Linus Torvalds for mainline or stable/longterm tree maintainers like Greg
+Kroah-Hartman. All of them are supported by Thorsten Leemhuis: he's acting as
+'regressions tracker' for the Linux kernel and trying to ensure all regression
+reports are acted upon in timely manner.
+
+The distributed and slightly unstructured nature of the Linux kernel's
+development makes tracking regressions hard. That's why Thorsten relies on the
+help of his Linux kernel regression tracking robot 'regzbot'. It watches mailing
+lists and git trees to semi-automatically associate regression reports to patch
+submissions and commits fixing the issue, as this provides all necessary
+insights into the fixing progress.
+
+To ensure no regression falls through the cracks, the regression tracker or his
+bot need to get aware of every report. That's why you need to get them into the
+loop for regressions, as explained in the next section.
+
+How do I report a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Just report the issue as outlined in
+`Documentation/admin-guide/reporting-issues.rst`, it already describes the
+important points. The following aspects described there are especially relevant
+for regressions:
+
+ * When checking for existing reports to join, first check the `archives of the
+   Linux regressions mailing list <https://lore.kernel.org/regressions/>`_ and
+   `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
+
+ * In your report, mention the last kernel version that worked fine and the
+   first broken one. Even better: try to find the commit causing the regression
+   using a bisection.
+
+ * Remember to let the Linux regressions mailing list
+   (regressions@lists.linux.dev) known about your report:
+
+  * If you report the regression by mail, CC the regressions list.
+
+  * If you report your regression to some bug tracker, forward the filed report
+    by mail to the regressions list while CCing the maintainer and the mailing
+    list for the subsystem in question.
+
+Additionally, you in both cases should directly get the aforementioned Linux
+kernel regression tracking bot into the loop. To do that, include a paragraph
+like this in your report to tell the bot when the regression started to happen::
+
+       #regzbot introduced: v5.13..v5.14-rc1
+
+In this example, v5.13 was the last version that worked, while v5.14-rc1 was the
+first broken one. The smaller the range, the better, as that makes it easier to
+find out what's wrong and who's responsible. That's why you ideally should
+perform a bisection to find the commit causing the regression (the 'culprit').
+If you did, specify it instead::
+
+       #regzbot introduced: 1f2e3d4c5d
+
+Placing such a 'regzbot command' is in your interest, as it will ensure the
+report won't fall through the cracks unnoticed. If you omit this, the Linux
+kernel's regressions tracker will take care of telling regzbot about your
+regression, as long as you sent a copy to the regressions mailing lists. But the
+regression tracker is just one human which sometimes has to rest or occasionally
+might even enjoy some time away from computers (as crazy as that might sound).
+Relying on this person thus will result in an unnecessary delay before the
+regressions becomes mentioned `on the list of tracked and unresolved Linux
+kernel regressions <https://linux-regtracking.leemhuis.info/regzbot/>`_ and the
+weekly regression reports sent by regzbot. Such delays can result in Linus
+Torvalds being unaware of important regressions when deciding between 'continue
+development or call this finished by performing a release?'.
+
+How to add a regression to regzbot's tracking somebody else reported?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use your mailers 'Reply-all' function to send a reply where you CC the
+regressions list (regressions@lists.linux.dev). In that reply create a new
+paragraph with a regzbot command like this::
+
+       #regzbot ^introduced: v5.13..v5.14-rc1
+
+The caret (^) before the 'introduced' makes regzbot treat the parent mail (the
+one you reply to) as the report for the regression you want to see tracked.
+Instead of a version range you can also specify the commit causing the
+regression, as outlined in the previous section.
+
+If the report came in private from a bug tracker, forward it to the list;
+include the aforementioned line, just omit the caret (the ^) before the
+'introduced'; consider adding a line with the line '#regzbot link: <url>' (see
+below) pointing to the place with the initial report.
+
+Alternatively to all the above you can just forward or bounce the report to the
+Linux kernel's regression tracker, but consider the downsides already outlined
+in the previous section.
+
+Do really all regressions get fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Nearly all of them are, as long as the change causing the regression (the
+'culprit commit') gets reliably identified. Some regressions can be fixed
+without this, but often it's required.
+
+Who needs to find the commit causing a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's the reporter's duty to find the culprit, but developers of the affected
+subsystem should offer advice and reasonably help where they can.
+
+How can I find the change causing a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Perform a bisection, as roughly outlined in
+`Documentation/admin-guide/reporting-issues.rst` and described in more detail by
+`Documentation/admin-guide/bug-bisect.rst`. It might sound like a lot of work,
+but in many cases finds the culprit relative quickly. If it's hard or
+time-consuming to reliably reproduce the issue, consider teaming up with others
+affected by the problem to narrow down the search range together.
+
+Who can I ask for advice when it comes to regressions?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
+CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
+issue might better be dealt with in private, feel free to omit the list.
+
+
+More details about regressions relevant for reporters
+-----------------------------------------------------
+
+Does a regression need to be fixed, if it can be avoided by updating some other software?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Almost always: yes. If a developer tell you otherwise, ask the regression
+tracker for advice as outlined above.
+
+Does it qualify as a regression if a newer kernel works slower or makes the system consumes more energy?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It does, but the difference has to be significant. A five percent slow-down in a
+micro-benchmark thus is unlikely to qualify as regression, unless it also
+influences the results of a broad benchmark by more than one percent. If in a
+doubt, ask for advice.
+
+Is it a regression, if an externally developed kernel module is incompatible with a newer kernel?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+No, as the 'no regression' rule is about interfaces and services the Linux
+kernel provides to the userland. It thus does not cover building or running
+externally developed kernel modules, as they run in kernel-space and use
+occasionally changed internal interfaces to hook into the kernel.
+
+How are regressions handled that are caused by a fix for security vulnerability?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In extremely rare situations security issues can't be fixed without causing
+regressions; those are given way, as they are the lesser evil in the end.
+Luckily this almost always can be avoided, as key developers for the affected
+area and often Linus Torvalds himself try very hard to fix security issues
+without causing regressions.
+
+If you nevertheless face such a case, check the mailing list archives if people
+tried their best to avoid the regression; if in a doubt, ask for advice as
+outlined above.
+
+What happens if fixing a regression is impossible without causing another regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sadly these things happen, but luckily not very often; if they occur, expert
+developers of the affected code area should look into the issue to find a fix
+that avoids regressions or at least their impact. If you run into such a
+situation you thus do what was outlined already for regressions caused by
+security fixes: check earlier discussions if people already tried their best and
+ask for advice if in a doubt.
+
+A quick note while at it: these situations could be avoided, if you would
+regularly give mainline pre-releases (say v5.15-rc1 or -rc3) from each cycle a
+test run. This is best explained by imagining a change integrated between Linux
+v5.14 and v5.15-rc1 which causes a regression, but at the same time is a hard
+requirement for some other improvement applied for 5.15-rc1. All these changes
+often can simply be reverted and the regression thus solved, if someone finds
+and reports it before 5.15 is released. A few days or weeks later after the
+release this solution might become impossible, if some software starts to rely
+on aspects introduced by one of the follow-up changes: reverting all changes
+would cause regressions for users of said software and thus out of the question.
+
+A feature I relied on was removed months ago, but I only noticed now. Does that qualify as regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It does, but often it's hard to fix them due to the aspects outlined in the
+previous section. It hence needs to be dealt with on a case-by-case basis; this
+is another reason why it's in your interest to regular test mainline releases.
+
+Does the 'no regression' rule apply if I seem to be the only person in the world that is affected by a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It does, but only for practical usage: the Linux developers want to be free to
+remove support for hardware only to be found in attics and museums anymore.
+
+Note, sometimes regressions can't be avoided to make progress -- and the latter
+is needed to prevent Linux from stagnation. Hence, if only very few users seem
+to be affected by a regression, it for the greater good might be in their and
+everyone else interest to not insist on the rule. Especially if there is a easy
+way to circumvent the regression somehow, for example by updating some software
+or using a kernel parameter created just for this purpose.
+
+Does the regression rule apply for code in the staging tree as well?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Not according to the `help text for the configuration option covering all
+staging code <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
+which since its early days states::
+
+       Please note that these drivers are under heavy development, may or
+       may not work, and may contain userspace interfaces that most likely
+       will be changed in the near future.
+
+The staging developers nevertheless often adhere the 'no regressions' rule, but
+sometimes bend it to make progress. That's for example why some users had to
+deal with (often negligible) regressions when a WiFi driver from the staging
+tree got replaced by a totally different one written from scratch.
+
+Why do later versions have to be 'compiled with a similar configuration'?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because the Linux kernel developers sometimes integrate changes known to cause
+regressions, but make them optional and disable them in the kernel's default
+configuration. This trick allows progress, as the 'no regressions' rule
+otherwise would lead to stagnation. Consider for example a new security feature
+which blocks access to some kernel interfaces often abused by malware, but at
+the same time are required to run a few rarely used applications. The outlined
+trick makes both camps happy: people using these applications can leave the new
+security feature off, while everyone else can enable it without running into
+trouble.
+
+How to create a configuration similar to the one of an older kernel?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Start a known-good kernel and configure the newer Linux version with ``make
+olddefconfig``. This makes the kernel's build scripts pick up the configuration
+file (the `.config` file) from the running kernel as base for the new one you
+are about to compile; afterwards they set all new configuration options to their
+default value, which disables new features that might cause regressions.
+
+Can I report a regression with vanilla kernels provided by someone else to the upstream Linux kernel developers?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Only if the newer kernel was compiled with a similar configuration file as the
+older one (see above), as your provider might have enabled some known-to-be
+incompatible feature in the newer kernel. If in a doubt, report this problem to
+the provider and ask for advice.
+
+
+More details about regressions relevant for developers
+------------------------------------------------------
+
+What should I do, if I suspect a change I'm working on might cause regressions?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Evaluate how big the risk of regressions is, for example by performing a code
+search in Linux distributions and Git forges. Also consider asking other
+developers or projects likely to be affected to evaluate or even test the
+proposed change; if problems surface, maybe some middle ground acceptable for
+all can be found.
+
+If the risk of regressions in the end seems to be relative small, go ahead with
+the change, but let all involved parties know about the risk. Hence, make sure
+your patch description makes this aspect obvious. Once the change got merged,
+tell the Linux kernel's regression tracker and the regressions mailing list
+about the risk, so everyone has the change on the radar in case reports trickle
+in. Depending on the risk, you also might want to ask the subsystem maintainer
+to mention the issue in his pull request to mainline.
+
+
+Everything developers need to know about regression tracking
+------------------------------------------------------------
+
+Do I have to use regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's in the interest of everyone if you do, as kernel maintainers like Linus
+Torvalds partly rely on regzbot's tracking in their work -- for example when
+deciding to release a new version or extend the development phase. For this they
+need to be aware of all unfixed regression; to do that, Linus is known to look
+into the weekly reports sent by regzbot.
+
+Do I have to tell regzbot about every regression I stumble upon?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ideally yes: we are all humans and easily forget problems when something more
+important unexpectedly comes up -- for example a bigger problem in the Linux
+kernel or something in real life that's keeping us away from keyboards for a
+while. Hence, it's best to tell regzbot about every regression, except when you
+immediately write a fix and commit it to a tree regularly merged to the affected
+kernel series.
+
+Why does the Linux kernel need a regression tracker, and why does he utilize regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Rules like 'no regressions' need someone to enforce them, otherwise they are
+broken either accidentally or on purpose. History has shown that this is true
+for the Linux kernel as well. That's why Thorsten volunteered to keep an eye on
+things.
+
+Tracking regressions completely manually has proven to be exhausting and
+demotivating, which is why earlier attempts to establish it failed after a
+while. To prevent this from happening again, Thorsten developed Regzbot to
+facilitate the work, with the long term goal to automate regression tracking as
+much as possible for everyone involved.
+
+How does regression tracking work with regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bot keeps track of all the reports and monitor their fixing progress. It
+tries to do that with as little overhead as possible for both reporters and
+developers.
+
+In fact, only reporters or someone helping them gets an extra duty: they need to
+tell regzbot about the regression report using one of the ``#regzbot
+introduced`` commands outlined above.
+
+For developers there normally is no extra work involved, they just need to do
+something that's expected from them already: add 'Link:' tags to the patch
+description pointing to all reports about the issue fixed.
+
+Thanks to these tags regzbot can associate regression reports with patches to
+fix the issue, whenever they get posted for review or applied to a git tree. The
+bot additionally watches out for replies to the report. All this data combined
+provides a good impression about the current status of the fixing process.
+
+How to see which regressions regzbot tracks currently?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_
+for the latest info; alternatively, `search for the latest regression report
+<https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
+which regzbot normally sends out once a week on Sunday evening (UTC), which is a
+few hours before Linus usually publishes new (pre-)releases.
+
+What places is regzbot monitoring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Regzbot is watching the most important Linux mailing lists as well as the Linux
+next, mainline and stable/longterm git repositories.
+
+How to interact with regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Everyone can interact with the bot using mails containing `regzbot commands`,
+which need to be in their own paragraph (IOW: they need to be separated from the
+rest of the mail using blank lines). One such command is ``#regzbot introduced
+<version or commit>``, which adds a report to the tracking, as already described
+above; ``#regzbot ^introduced <version or commit>`` is another such command,
+which makes regzbot consider the parent mail as a report for a regression which
+it starts to track.
+
+Once one of those two commands has been utilized, other regzbot commands can be
+used. You can write them below one of the `introduced` commands or in replies to
+the mail that used one of them or itself is a reply to that mail:
+
+ * Set or update the title::
+
+       #regzbot title: foo
+
+ * Link to a related discussion (for example the posting of a patch to fix the
+   issue) and monitor it::
+
+       #regzbot monitor: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
+
+   Monitoring only works for lore.kernel.org; regzbot will consider all messages
+   in that thread as related to the fixing process.
+
+ * Point to a place with further details, like a bug-tracker or a related
+   mailing list post::
+
+       #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Mark a regression as fixed by a commit that is heading upstream or already
+   landed::
+
+       #regzbot fixed-by: 1f2e3d4c5d
+
+ * Mark a regression as a duplicate of another one already tracked by regzbot::
+
+       #regzbot dup-of: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
+
+ * Mark a regression as invalid::
+
+       #regzbot invalid: wasn't a regression, problem has always existed
+
+Is there more to tell about regzbot and its commands?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+More detailed and up-to-date information about the Linux kernels regression
+tracking bot can be found on its `project page <https://gitlab.com/knurd42/regzbot>`_,
+which among others contains a
+`getting started guide <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_
+and `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
+which both are more in-depth.
+
+
+Quotes from Linus about regression
+----------------------------------
+
+Find below a few real life examples of how Linus Torvalds expects regressions to
+be handled:
+
+ * From `2017-10-26 (1/2) <https://lore.kernel.org/lkml/CA+55aFwiiQYJ+YoLKCXjN_beDVfu38mg=Ggg5LFOcqHE8Qi7Zw@mail.gmail.com/>`_::
+
+       If you break existing user space setups THAT IS A REGRESSION.
+
+       It's not ok to say "but we'll fix the user space setup".
+
+       Really. NOT OK.
+
+       [...]
+
+       The first rule is:
+
+        - we don't cause regressions
+
+       and the corollary is that when regressions *do* occur, we admit to
+       them and fix them, instead of blaming user space.
+
+       The fact that you have apparently been denying the regression now for
+       three weeks means that I will revert, and I will stop pulling apparmor
+       requests until the people involved understand how kernel development
+       is done.
+
+ * From `2017-10-26 (2/2) <https://lore.kernel.org/lkml/CA+55aFxW7NMAMvYhkvz1UPbUTUJewRt6Yb51QAx5RtrWOwjebg@mail.gmail.com/>`_::
+
+       People should basically always feel like they can update their kernel
+       and simply not have to worry about it.
+
+       I refuse to introduce "you can only update the kernel if you also
+       update that other program" kind of limitations. If the kernel used to
+       work for you, the rule is that it continues to work for you.
+
+       There have been exceptions, but they are few and far between, and they
+       generally have some major and fundamental reasons for having happened,
+       that were basically entirely unavoidable, and people _tried_hard_ to
+       avoid them. Maybe we can't practically support the hardware any more
+       after it is decades old and nobody uses it with modern kernels any
+       more. Maybe there's a serious security issue with how we did things,
+       and people actually depended on that fundamentally broken model. Maybe
+       there was some fundamental other breakage that just _had_ to have a
+       flag day for very core and fundamental reasons.
+
+       And notice that this is very much about *breaking* peoples environments.
+
+       Behavioral changes happen, and maybe we don't even support some
+       feature any more. There's a number of fields in /proc/<pid>/stat that
+       are printed out as zeroes, simply because they don't even *exist* in
+       the kernel any more, or because showing them was a mistake (typically
+       an information leak). But the numbers got replaced by zeroes, so that
+       the code that used to parse the fields still works. The user might not
+       see everything they used to see, and so behavior is clearly different,
+       but things still _work_, even if they might no longer show sensitive
+       (or no longer relevant) information.
+
+       But if something actually breaks, then the change must get fixed or
+       reverted. And it gets fixed in the *kernel*. Not by saying "well, fix
+       your user space then". It was a kernel change that exposed the
+       problem, it needs to be the kernel that corrects for it, because we
+       have a "upgrade in place" model. We don't have a "upgrade with new
+       user space".
+
+       And I seriously will refuse to take code from people who do not
+       understand and honor this very simple rule.
+
+       This rule is also not going to change.
+
+       And yes, I realize that the kernel is "special" in this respect. I'm
+       proud of it.
+
+       I have seen, and can point to, lots of projects that go "We need to
+       break that use case in order to make progress" or "you relied on
+       undocumented behavior, it sucks to be you" or "there's a better way to
+       do what you want to do, and you have to change to that new better
+       way", and I simply don't think that's acceptable outside of very early
+       alpha releases that have experimental users that know what they signed
+       up for. The kernel hasn't been in that situation for the last two
+       decades.
+
+       We do API breakage _inside_ the kernel all the time. We will fix
+       internal problems by saying "you now need to do XYZ", but then it's
+       about internal kernel API's, and the people who do that then also
+       obviously have to fix up all the in-kernel users of that API. Nobody
+       can say "I now broke the API you used, and now _you_ need to fix it
+       up". Whoever broke something gets to fix it too.
+
+       And we simply do not break user space.
+
+ * From `2020-05-21 <https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/>`_::
+
+       The rules about regressions have never been about any kind of
+       documented behavior, or where the code lives.
+
+       The rules about regressions are always about "breaks user workflow".
+
+       Users are literally the _only_ thing that matters.
+
+       No amount of "you shouldn't have used this" or "that behavior was
+       undefined, it's your own fault your app broke" or "that used to work
+       simply because of a kernel bug" is at all relevant.
+
+       Now, reality is never entirely black-and-white. So we've had things
+       like "serious security issue" etc that just forces us to make changes
+       that may break user space. But even then the rule is that we don't
+       really have other options that would allow things to continue.
+
+       And obviously, if users take years to even notice that something
+       broke, or if we have sane ways to work around the breakage that
+       doesn't make for too much trouble for users (ie "ok, there are a
+       handful of users, and they can use a kernel command line to work
+       around it" kind of things) we've also been a bit less strict.
+
+       But no, "that was documented to be broken" (whether it's because the
+       code was in staging or because the man-page said something else) is
+       irrelevant. If staging code is so useful that people end up using it,
+       that means that it's basically regular kernel code with a flag saying
+       "please clean this up".
+
+       The other side of the coin is that people who talk about "API
+       stability" are entirely wrong. API's don't matter either. You can make
+       any changes to an API you like - as long as nobody notices.
+
+       Again, the regression rule is not about documentation, not about
+       API's, and not about the phase of the moon.
+
+       It's entirely about "we caused problems for user space that used to work".
+
+ * From `2012-07-06 <https://lore.kernel.org/all/CA+55aFwnLJ+0sjx92EGREGTWOx84wwKaraSzpTNJwPVV8edw8g@mail.gmail.com/>`_::
+
+       > Now this got me wondering if Debian _unstable_ actually qualifies as a
+       > standard distro userspace.
+
+       Oh, if the kernel breaks some standard user space, that counts. Tons
+       of people run Debian unstable (and from my limited interactions with
+       it, for damn good reasons: -stable tends to run so old versions of
+       everything that you have to sometimes deal with cuneiform writing when
+       using it)
+
+ * From `2017-11-05 <https://lore.kernel.org/all/CA+55aFzUvbGjD8nQ-+3oiMBx14c_6zOj2n7KLN3UsJ-qsd4Dcw@mail.gmail.com/>`_::
+
+       And our regression rule has never been "behavior doesn't change".
+       That would mean that we could never make any changes at all.
+
+       For example, we do things like add new error handling etc all the
+       time, which we then sometimes even add tests for in our kselftest
+       directory.
+
+       So clearly behavior changes all the time and we don't consider that a
+       regression per se.
+
+       The rule for a regression for the kernel is that some real user
+       workflow breaks. Not some test. Not a "look, I used to be able to do
+       X, now I can't".
+
+ * From `2018-08-03 <https://lore.kernel.org/all/CA+55aFwWZX=CXmWDTkDGb36kf12XmTehmQjbiMPCqCRG2hi9kw@mail.gmail.com/>`_::
+
+       YOU ARE MISSING THE #1 KERNEL RULE.
+
+       We do not regress, and we do not regress exactly because your are 100% wrong.
+
+       And the reason you state for your opinion is in fact exactly *WHY* you
+       are wrong.
+
+       Your "good reasons" are pure and utter garbage.
+
+       The whole point of "we do not regress" is so that people can upgrade
+       the kernel and never have to worry about it.
+
+       > Kernel had a bug which has been fixed
+
+       That is *ENTIRELY* immaterial.
+
+       Guys, whether something was buggy or not DOES NOT MATTER.
+
+       Why?
+
+       Bugs happen. That's a fact of life. Arguing that "we had to break
+       something because we were fixing a bug" is completely insane. We fix
+       tens of bugs every single day, thinking that "fixing a bug" means that
+       we can break something is simply NOT TRUE.
+
+       So bugs simply aren't even relevant to the discussion. They happen,
+       they get found, they get fixed, and it has nothing to do with "we
+       break users".
+
+       Because the only thing that matters IS THE USER.
+
+       How hard is that to understand?
+
+       Anybody who uses "but it was buggy" as an argument is entirely missing
+       the point. As far as the USER was concerned, it wasn't buggy - it
+       worked for him/her.
+
+       Maybe it worked *because* the user had taken the bug into account,
+       maybe it worked because the user didn't notice - again, it doesn't
+       matter. It worked for the user.
+
+       Breaking a user workflow for a "bug" is absolutely the WORST reason
+       for breakage you can imagine.
+
+       It's basically saying "I took something that worked, and I broke it,
+       but now it's better". Do you not see how f*cking insane that statement
+       is?
+
+       And without users, your program is not a program, it's a pointless
+       piece of code that you might as well throw away.
+
+       Seriously. This is *why* the #1 rule for kernel development is "we
+       don't break users". Because "I fixed a bug" is absolutely NOT AN
+       ARGUMENT if that bug fix broke a user setup. You actually introduced a
+       MUCH BIGGER bug by "fixing" something that the user clearly didn't
+       even care about.
+
+       And dammit, we upgrade the kernel ALL THE TIME without upgrading any
+       other programs at all. It is absolutely required, because flag-days
+       and dependencies are horribly bad.
+
+       And it is also required simply because I as a kernel developer do not
+       upgrade random other tools that I don't even care about as I develop
+       the kernel, and I want any of my users to feel safe doing the same
+       time.
+
+       So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel
+       without upgrading some other random binary, then we have a problem.
+
+ * From `2021-06-05 <https://lore.kernel.org/all/CAHk-=wiUVqHN76YUwhkjZzwTdjMMJf_zN4+u7vEJjmEGh3recw@mail.gmail.com/>`_::
+
+       THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.
+
+       Honestly, security people need to understand that "not working" is not
+       a success case of security. It's a failure case.
+
+       Yes, "not working" may be secure. But security in that case is *pointless*.
+
+ * From `2021-07-30 <https://lore.kernel.org/lkml/CAHk-=witY33b-vqqp=ApqyoFDpx9p+n4PwG9N-TvF8bq7-tsHw@mail.gmail.com/>`_::
+
+       But we have the policy that regressions aren't about documentation or
+       even sane behavior.
+
+       Regressions are about whether a user application broke in a noticeable way.
+
+ * From `2011-05-06 (1/3) <https://lore.kernel.org/all/BANLkTim9YvResB+PwRp7QTK-a5VNg2PvmQ@mail.gmail.com/>`_::
+
+       Binary compatibility is more important.
+
+       And if binaries don't use the interface to parse the format (or just
+       parse it wrongly - see the fairly recent example of adding uuid's to
+       /proc/self/mountinfo), then it's a regression.
+
+       And regressions get reverted, unless there are security issues or
+       similar that makes us go "Oh Gods, we really have to break things".
+
+       I don't understand why this simple logic is so hard for some kernel
+       developers to understand. Reality matters. Your personal wishes matter
+       NOT AT ALL.
+
+       If you made an interface that can be used without parsing the
+       interface description, then we're stuck with the interface. Theory
+       simply doesn't matter.
+
+       You could help fix the tools, and try to avoid the compatibility
+       issues that way. There aren't that many of them.
+
+ * From `2011-05-06 (2/3) <https://lore.kernel.org/all/BANLkTi=KVXjKR82sqsz4gwjr+E0vtqCmvA@mail.gmail.com/>`_::
+
+       it's clearly NOT an internal tracepoint. By definition. It's being
+       used by powertop.
+
+ * From `2011-05-06 (3/3) <https://lore.kernel.org/all/BANLkTinazaXRdGovYL7rRVp+j6HbJ7pzhg@mail.gmail.com/>`_::
+
+       We have programs that use that ABI and thus it's a regression if they break.
+
+ * From `2006-02-21 <https://lore.kernel.org/lkml/Pine.LNX.4.64.0602211631310.30245@g5.osdl.org/>`_::
+
+       The fact is, if changing the kernel breaks user-space, it's a regression.
+       IT DOES NOT MATTER WHETHER IT'S IN /sbin/hotplug OR ANYTHING ELSE. If it
+       was installed by a distribution, it's user-space. If it got installed by
+       "vmlinux", it's the kernel.
+
+       The only piece of user-space code we ship with the kernel is the system
+       call trampoline etc that the kernel sets up. THOSE interfaces we can
+       really change, because it changes with the kernel.
+
+ * From `2019-09-15 <https://lore.kernel.org/lkml/CAHk-=wiP4K8DRJWsCo=20hn_6054xBamGKF2kPgUzpB5aMaofA@mail.gmail.com/>`_::
+
+       One _particularly_ last-minute revert is the top-most commit (ignoring
+       the version change itself) done just before the release, and while
+       it's very annoying, it's perhaps also instructive.
+
+       What's instructive about it is that I reverted a commit that wasn't
+       actually buggy. In fact, it was doing exactly what it set out to do,
+       and did it very well. In fact it did it _so_ well that the much
+       improved IO patterns it caused then ended up revealing a user-visible
+       regression due to a real bug in a completely unrelated area.
+
+       The actual details of that regression are not the reason I point that
+       revert out as instructive, though. It's more that it's an instructive
+       example of what counts as a regression, and what the whole "no
+       regressions" kernel rule means. The reverted commit didn't change any
+       API's, and it didn't introduce any new bugs. But it ended up exposing
+       another problem, and as such caused a kernel upgrade to fail for a
+       user. So it got reverted.
+
+       The point here being that we revert based on user-reported _behavior_,
+       not based on some "it changes the ABI" or "it caused a bug" concept.
+       The problem was really pre-existing, and it just didn't happen to
+       trigger before. The better IO patterns introduced by the change just
+       happened to expose an old bug, and people had grown to depend on the
+       previously benign behavior of that old issue.
+
+       And never fear, we'll re-introduce the fix that improved on the IO
+       patterns once we've decided just how to handle the fact that we had a
+       bad interaction with an interface that people had then just happened
+       to rely on incidental behavior for before. It's just that we'll have
+       to hash through how to do that (there are no less than three different
+       patches by three different developers being discussed, and there might
+       be more coming...). In the meantime, I reverted the thing that exposed
+       the problem to users for this release, even if I hope it will be
+       re-introduced (perhaps even backported as a stable patch) once we have
+       consensus about the issue it exposed.
+
+       Take-away from the whole thing: it's not about whether you change the
+       kernel-userspace ABI, or fix a bug, or about whether the old code
+       "should never have worked in the first place". It's about whether
+       something breaks existing users' workflow.
+
+       Anyway, that was my little aside on the whole regression thing.  Since
+       it's that "first rule of kernel programming", I felt it is perhaps
+       worth just bringing it up every once in a while.
diff --git a/MAINTAINERS b/MAINTAINERS
index 27a83bb940d4..1b740c922867 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10351,6 +10351,7 @@ KERNEL REGRESSIONS
 M:	Thorsten Leemhuis <linux@leemhuis.info>
 L:	regressions@lists.linux.dev
 S:	Supported
+F:	Documentation/admin-guide/regressions.rst
 
 KERNEL SELFTEST FRAMEWORK
 M:	Shuah Khan <shuah@kernel.org>
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v1 2/2] docs: regressions.rst: rules of thumb for handling regressions
  2022-01-03  9:50 [RFC PATCH v1 0/2] docs: add a document dedicated to regressions Thorsten Leemhuis
  2022-01-03  9:50 ` [RFC PATCH v1 1/2] docs: add a document about regression handling Thorsten Leemhuis
@ 2022-01-03  9:50 ` Thorsten Leemhuis
  2022-01-04 12:16   ` Lukas Bulwahn
  2022-01-03 14:01 ` [RFC PATCH v1 0/2] docs: add a document dedicated to regressions Greg Kroah-Hartman
  2 siblings, 1 reply; 15+ messages in thread
From: Thorsten Leemhuis @ 2022-01-03  9:50 UTC (permalink / raw)
  To: linux-doc, Linus Torvalds, Greg Kroah-Hartman
  Cc: workflows, Linux Kernel Mailing List, Randy Dunlap, Jonathan Corbet

Add a section with a few rules of thumb about how quickly regressions
should be fixed. They are written after studying the quotes from Linus
found in the modified document and especially influenced by statements
like "Users are literally the _only_ thing that matters" and "without
users, your program is not a program, it's a pointless piece of code
that you might as well throw away". The author interpreted those in
perspective to how the various Linux kernel series are maintained and
what those practices might mean for users running into a regression when
updating.

That for example lead to the paragraph starting with "Aim to get fixes
for regressions mainlined within one week after identifying the culprit,
if the regression was introduced in a stable/longterm release or the
devel cycle for the latest mainline release". This is a pretty high bar,
but on the other hand needed to not leave users out in the cold for to
long. This can quickly happen, as the previous stable series is normally
stamped "End of Life" about three or four weeks after a new mainline
release, which makes a lot of users switch during this timeframe. Any of
them risk running into regressions not promptly fixed; even worse, once
the previous stable series is EOLed for real, users that face a
regression might be left with only three options:

 (1) continue running an outdated and thus potentially insecure kernel
     version from an abandoned stable series

 (2) run the kernel with the regression

 (3) downgrade to an earlier longterm series still supported

This is better avoided, as (1) puts users and their data in danger, (2)
will only be possible if it's a minor regression that doesn't interfere
with booting or serious usage, and (3) might be regression itself or
even impossible, as some users will require drivers or features only
introduced after the latest longterm series took of. In the end this
lead to the "Aim to fix regression within one week" part.

Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
Hi! A lot of developers are doing a good job in fixing regressions, but
I noticed it sometimes it takes many weeks to get even simple fixes for
regressions merged. Most of the time this is due to these factors:

 * it takes a long time to get the fix ready, as some developers
   apparently don't prioritize work on fixing regressions

 * fully developed fixes linger in git trees of maintainers for weeks,
   sometimes even without the fix being in linux-next

This is especially a problem for regressions introduced in mainline, but
only found after the release in the release or a stable kernel series
derived from it. Sometimes fixes for these regressions are even left
lying around for weeks until the next merge window, which contributes to
a huge pile of fixes getting backported to stable and longterm releases
during or shortly after merge windows. Asking developers to speed things
up rarely helped, as people have different options on how fast regression
fixes need to be developed and merged upstream.

That's why it would be a great help to my work as regression tracker if
we had some rough written down guideliones for handling regressions, as
proposed by the patch below. I'm well aware that the texts sets a pretty
high bar. That's because I approached primarily from the point of a
user, as can be seen by the patch description.

The proposed text likely will lead to some discussions, that's why I
submit this part separately from the rest of the new document on
regressions, which is added in patch 1/2; I also CCed Linus and Greg on
this patch and hope they state their opinion or ACK is. In the end I can
easily tone this down or write something totally different: that's
totally fine for me, I'm mainly interested to have some expectations
roughly documented to get everyone on the same page.
---
 Documentation/admin-guide/regressions.rst | 78 +++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/Documentation/admin-guide/regressions.rst b/Documentation/admin-guide/regressions.rst
index 1ff6a0802fc9..5f02a001e53c 100644
--- a/Documentation/admin-guide/regressions.rst
+++ b/Documentation/admin-guide/regressions.rst
@@ -63,6 +63,10 @@ list; add the aforementioned paragraph, just omit the caret (the ^) before the
 ``introduced``, which make regzbot treat your mail (and not the one you reply
 to) as the report.
 
+Try to fix regressions quickly once the culprit got identified. Fixes for most
+regressions should be mainlined within two weeks, but some should be addressed
+within two or three days.
+
 When submitting fixes for regressions, always include 'Link:' tags in the commit
 message that point to all places where the issue was reported, as explained in
 `Documentation/process/submitting-patches.rst` and
@@ -229,6 +233,80 @@ Alternatively to all the above you can just forward or bounce the report to the
 Linux kernel's regression tracker, but consider the downsides already outlined
 in the previous section.
 
+How quickly should regressions get fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Developers should fix any reported regression as quickly as possible, to provide
+affected users with a solution in timely manner and prevent more users from
+running into the issue; nevertheless developers need to take enough time and
+care to ensure regression fixes do not cause additional damage.
+
+In the end though, developers should give their best to prevent users from
+running into situations where a regression leaves them only three options: "run
+a kernel with a regression that seriously impacts usage", "continue running an
+outdated and thus potentially insecure kernel version for more than two weeks
+after a regression's culprit got identified", and "downgrade to a still
+supported kernel series that's missing required features".
+
+How to realize this depends a lot on the situation. Here are a few rules of
+thumb for developers, in order or importance:
+
+ * Prioritize work on handling reports about regression and fixing them over all
+   other Linux kernel work, unless the latter concerns acute security issues or
+   bugs causing data loss or damage.
+
+ * Always consider reverting the culprit commits and reapplying them later
+   together with necessary fixes, as this might be the least dangerous and
+   quickest way to fix a regression.
+
+ * Try to get any regressions introduced in the current development cycle
+   resolved before its end. If you fear a fix might be too risky to apply only
+   days before a new mainline release, let Linus decide: submit the fix
+   separately to him as soon as possible with the explanation of the
+   situation. He then can make a call and postpone the release if necessary,
+   for example if multiple such changes show up in his inbox.
+
+ * Address regressions in stable, longterm, or proper mainline releases with
+   more urgency than regressions in mainline pre-releases. That changes after
+   the release of the fifth pre-release, aka '-rc5': mainline then becomes as
+   important, to ensure all the improvements and fixes ideally get at least one
+   week of testing together before Linus releases a new mainline version.
+
+ * Fix regressions within two or three days, if they are critical for some
+   reason -- for example, if the issue is likely to affect many users of the
+   kernel series in question on all or certain architectures. This thus
+   includes fixes for compile errors in mainline, as they might prevent testers
+   and continuous integration systems from doing their work.
+
+ * Aim to get fixes for regressions mainlined within one week after the culprit
+   was identified, if the regression was introduced in a stable/longterm
+   release or the development cycle for the latest mainline release (say
+   v5.14). If possible, try to address the issue even quicker, if the previous
+   stable series (v5.13.y) will be abandoned soon or already got stamped
+   'End-of-Life' (EOL) -- this usually happens about three to four weeks after
+   a new mainline release.
+
+ * Try to fix all other regression within two weeks after the culprit was found.
+   Two or three additional weeks are acceptable for performance regressions and
+   other issues which are annoying, but don't prevent anyone from running Linux
+   -- unless it's an issue in the current development cycle, which should be
+   addressed before the release. A few weeks in total are also acceptable if a
+   regression can only be fixed with a risky change and at the same time is
+   affecting only a few users; as much time is also acceptable if the regression
+   is already present in the second newest longterm kernel series.
+
+Note: The aforementioned timeframes for getting a regression resolved are meant
+to include getting the fix tested, reviewed, and merged into mainline, ideally
+with the fix being in Linux next for two days. Developers need to keep in mind
+that each of these steps takes some time.
+
+Subsystem maintainer are expected to assist in reaching those periods by doing
+timely reviews and quick handling of accepted patches. They thus might have to
+send git-pull requests earlier or more often than usually; depending on the fix,
+it might even be acceptable to skip testing in Linux-next. Especially fixes for
+regressions in stable and longterm kernels need to be handled quickly, as the
+fix needs to reach mainline before it can be backported there.
+
 Do really all regressions get fixed?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 0/2] docs: add a document dedicated to regressions
  2022-01-03  9:50 [RFC PATCH v1 0/2] docs: add a document dedicated to regressions Thorsten Leemhuis
  2022-01-03  9:50 ` [RFC PATCH v1 1/2] docs: add a document about regression handling Thorsten Leemhuis
  2022-01-03  9:50 ` [RFC PATCH v1 2/2] docs: regressions.rst: rules of thumb for handling regressions Thorsten Leemhuis
@ 2022-01-03 14:01 ` Greg Kroah-Hartman
  2 siblings, 0 replies; 15+ messages in thread
From: Greg Kroah-Hartman @ 2022-01-03 14:01 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: linux-doc, Linus Torvalds, workflows, Linux Kernel Mailing List,
	Randy Dunlap, Jonathan Corbet

On Mon, Jan 03, 2022 at 10:50:49AM +0100, Thorsten Leemhuis wrote:
> 'We don't cause regressions' might be the first rule of kernel development, but
> it and other aspects of regressions nevertheless are hardly described in the
> Linux kernel's documentation. These patches change this by creating a document
> dedicated to the topic.
> 
> The second patch could easily be folded into the first one, but I kept it
> separate, as it might be a bit controversial. This also allows the patch
> description to explain some backgrounds for this part of the text. Additionally,
> ACKs and Reviewed-by tags can be collected separately this way.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 1/2] docs: add a document about regression handling
  2022-01-03  9:50 ` [RFC PATCH v1 1/2] docs: add a document about regression handling Thorsten Leemhuis
@ 2022-01-03 17:07   ` Jakub Kicinski
  2022-01-03 17:20     ` Thorsten Leemhuis
  2022-01-04 14:17   ` Lukas Bulwahn
  1 sibling, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2022-01-03 17:07 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: linux-doc, Linus Torvalds, Greg Kroah-Hartman, workflows,
	Linux Kernel Mailing List, Randy Dunlap, Jonathan Corbet

On Mon,  3 Jan 2022 10:50:50 +0100 Thorsten Leemhuis wrote:
> +How to see which regressions regzbot tracks currently?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_
> +for the latest info; alternatively, `search for the latest regression report
> +<https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
> +which regzbot normally sends out once a week on Sunday evening (UTC), which is a
> +few hours before Linus usually publishes new (pre-)releases.

Cool, I wonder if it would be a useful feature to be able to filter by
mailing lists involved or such to give maintainers a quick overview of
regressions they are on the hook for?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 1/2] docs: add a document about regression handling
  2022-01-03 17:07   ` Jakub Kicinski
@ 2022-01-03 17:20     ` Thorsten Leemhuis
  2022-01-03 17:55       ` Jakub Kicinski
  0 siblings, 1 reply; 15+ messages in thread
From: Thorsten Leemhuis @ 2022-01-03 17:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: linux-doc, Linus Torvalds, Greg Kroah-Hartman, workflows,
	Linux Kernel Mailing List, Randy Dunlap, Jonathan Corbet



On 03.01.22 18:07, Jakub Kicinski wrote:
> On Mon,  3 Jan 2022 10:50:50 +0100 Thorsten Leemhuis wrote:
>> +How to see which regressions regzbot tracks currently?
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_
>> +for the latest info; alternatively, `search for the latest regression report
>> +<https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
>> +which regzbot normally sends out once a week on Sunday evening (UTC), which is a
>> +few hours before Linus usually publishes new (pre-)releases.
> 
> Cool, I wonder if it would be a useful feature to be able to filter by
> mailing lists involved or such to give maintainers a quick overview of
> regressions they are on the hook for?

Ha, that's a great idea, many thx. I have been scratching my head for a
while already how to give maintainers a better overview, but the only
thing I came up with was "check the merge path a commit causing the
regression took", which has a few obvious downsides (it for example
won't work if the culprit is not known yet). This should work a lot better.

But be warned, will likely take a few weeks (months?) before I get to
implement that: I have less time to work on the regzbot code than in the
past weeks, as I have to take care of a few other things first (most of
them related to regzbot).

Ciao, Thorsten


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 1/2] docs: add a document about regression handling
  2022-01-03 17:20     ` Thorsten Leemhuis
@ 2022-01-03 17:55       ` Jakub Kicinski
  0 siblings, 0 replies; 15+ messages in thread
From: Jakub Kicinski @ 2022-01-03 17:55 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: linux-doc, Linus Torvalds, Greg Kroah-Hartman, workflows,
	Linux Kernel Mailing List, Randy Dunlap, Jonathan Corbet

On Mon, 3 Jan 2022 18:20:23 +0100 Thorsten Leemhuis wrote:
> On 03.01.22 18:07, Jakub Kicinski wrote:
> > On Mon,  3 Jan 2022 10:50:50 +0100 Thorsten Leemhuis wrote:  
> >> +How to see which regressions regzbot tracks currently?
> >> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> +
> >> +Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_
> >> +for the latest info; alternatively, `search for the latest regression report
> >> +<https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
> >> +which regzbot normally sends out once a week on Sunday evening (UTC), which is a
> >> +few hours before Linus usually publishes new (pre-)releases.  
> > 
> > Cool, I wonder if it would be a useful feature to be able to filter by
> > mailing lists involved or such to give maintainers a quick overview of
> > regressions they are on the hook for?  
> 
> Ha, that's a great idea, many thx. I have been scratching my head for a
> while already how to give maintainers a better overview, but the only
> thing I came up with was "check the merge path a commit causing the
> regression took", which has a few obvious downsides (it for example
> won't work if the culprit is not known yet). This should work a lot better.
> 
> But be warned, will likely take a few weeks (months?) before I get to
> implement that: I have less time to work on the regzbot code than in the
> past weeks, as I have to take care of a few other things first (most of
> them related to regzbot).

No worries, do ping when you got it ready tho :)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 2/2] docs: regressions.rst: rules of thumb for handling regressions
  2022-01-03  9:50 ` [RFC PATCH v1 2/2] docs: regressions.rst: rules of thumb for handling regressions Thorsten Leemhuis
@ 2022-01-04 12:16   ` Lukas Bulwahn
  2022-01-04 13:29     ` Thorsten Leemhuis
  0 siblings, 1 reply; 15+ messages in thread
From: Lukas Bulwahn @ 2022-01-04 12:16 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: open list:DOCUMENTATION, Linus Torvalds, Greg Kroah-Hartman,
	workflows, Linux Kernel Mailing List, Randy Dunlap,
	Jonathan Corbet

On Mon, Jan 3, 2022 at 3:23 PM Thorsten Leemhuis <linux@leemhuis.info> wrote:
>
> Add a section with a few rules of thumb about how quickly regressions
> should be fixed. They are written after studying the quotes from Linus
> found in the modified document and especially influenced by statements
> like "Users are literally the _only_ thing that matters" and "without
> users, your program is not a program, it's a pointless piece of code
> that you might as well throw away". The author interpreted those in
> perspective to how the various Linux kernel series are maintained and
> what those practices might mean for users running into a regression when
> updating.
>
> That for example lead to the paragraph starting with "Aim to get fixes
> for regressions mainlined within one week after identifying the culprit,
> if the regression was introduced in a stable/longterm release or the
> devel cycle for the latest mainline release". This is a pretty high bar,
> but on the other hand needed to not leave users out in the cold for to
> long. This can quickly happen, as the previous stable series is normally
> stamped "End of Life" about three or four weeks after a new mainline
> release, which makes a lot of users switch during this timeframe. Any of
> them risk running into regressions not promptly fixed; even worse, once
> the previous stable series is EOLed for real, users that face a
> regression might be left with only three options:
>
>  (1) continue running an outdated and thus potentially insecure kernel
>      version from an abandoned stable series
>
>  (2) run the kernel with the regression
>
>  (3) downgrade to an earlier longterm series still supported
>
> This is better avoided, as (1) puts users and their data in danger, (2)
> will only be possible if it's a minor regression that doesn't interfere
> with booting or serious usage, and (3) might be regression itself or
> even impossible, as some users will require drivers or features only
> introduced after the latest longterm series took of. In the end this
> lead to the "Aim to fix regression within one week" part.
>
> Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>
> ---
> Hi! A lot of developers are doing a good job in fixing regressions, but
> I noticed it sometimes it takes many weeks to get even simple fixes for
> regressions merged. Most of the time this is due to these factors:
>
>  * it takes a long time to get the fix ready, as some developers
>    apparently don't prioritize work on fixing regressions
>
>  * fully developed fixes linger in git trees of maintainers for weeks,
>    sometimes even without the fix being in linux-next
>
> This is especially a problem for regressions introduced in mainline, but
> only found after the release in the release or a stable kernel series
> derived from it. Sometimes fixes for these regressions are even left
> lying around for weeks until the next merge window, which contributes to
> a huge pile of fixes getting backported to stable and longterm releases
> during or shortly after merge windows. Asking developers to speed things
> up rarely helped, as people have different options on how fast regression
> fixes need to be developed and merged upstream.
>
> That's why it would be a great help to my work as regression tracker if
> we had some rough written down guideliones for handling regressions, as
> proposed by the patch below. I'm well aware that the texts sets a pretty
> high bar. That's because I approached primarily from the point of a
> user, as can be seen by the patch description.
>
> The proposed text likely will lead to some discussions, that's why I
> submit this part separately from the rest of the new document on
> regressions, which is added in patch 1/2; I also CCed Linus and Greg on
> this patch and hope they state their opinion or ACK is. In the end I can
> easily tone this down or write something totally different: that's
> totally fine for me, I'm mainly interested to have some expectations
> roughly documented to get everyone on the same page.
> ---
>  Documentation/admin-guide/regressions.rst | 78 +++++++++++++++++++++++
>  1 file changed, 78 insertions(+)
>
> diff --git a/Documentation/admin-guide/regressions.rst b/Documentation/admin-guide/regressions.rst
> index 1ff6a0802fc9..5f02a001e53c 100644
> --- a/Documentation/admin-guide/regressions.rst
> +++ b/Documentation/admin-guide/regressions.rst
> @@ -63,6 +63,10 @@ list; add the aforementioned paragraph, just omit the caret (the ^) before the
>  ``introduced``, which make regzbot treat your mail (and not the one you reply
>  to) as the report.
>
> +Try to fix regressions quickly once the culprit got identified. Fixes for most

s/got/gets/ --- at least, that is what the gmail grammar spelling suggests :)

> +regressions should be mainlined within two weeks, but some should be addressed
> +within two or three days.
> +
>  When submitting fixes for regressions, always include 'Link:' tags in the commit
>  message that point to all places where the issue was reported, as explained in
>  `Documentation/process/submitting-patches.rst` and
> @@ -229,6 +233,80 @@ Alternatively to all the above you can just forward or bounce the report to the
>  Linux kernel's regression tracker, but consider the downsides already outlined
>  in the previous section.
>
> +How quickly should regressions get fixed?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Developers should fix any reported regression as quickly as possible, to provide
> +affected users with a solution in timely manner and prevent more users from
> +running into the issue; nevertheless developers need to take enough time and
> +care to ensure regression fixes do not cause additional damage.
> +
> +In the end though, developers should give their best to prevent users from
> +running into situations where a regression leaves them only three options: "run
> +a kernel with a regression that seriously impacts usage", "continue running an
> +outdated and thus potentially insecure kernel version for more than two weeks
> +after a regression's culprit got identified", and "downgrade to a still
> +supported kernel series that's missing required features".
> +
> +How to realize this depends a lot on the situation. Here are a few rules of
> +thumb for developers, in order or importance:
> +
> + * Prioritize work on handling reports about regression and fixing them over all
> +   other Linux kernel work, unless the latter concerns acute security issues or
> +   bugs causing data loss or damage.
> +
> + * Always consider reverting the culprit commits and reapplying them later
> +   together with necessary fixes, as this might be the least dangerous and
> +   quickest way to fix a regression.
> +
> + * Try to get any regressions introduced in the current development cycle
> +   resolved before its end. If you fear a fix might be too risky to apply only
> +   days before a new mainline release, let Linus decide: submit the fix
> +   separately to him as soon as possible with the explanation of the
> +   situation. He then can make a call and postpone the release if necessary,
> +   for example if multiple such changes show up in his inbox.
> +
> + * Address regressions in stable, longterm, or proper mainline releases with
> +   more urgency than regressions in mainline pre-releases. That changes after
> +   the release of the fifth pre-release, aka '-rc5': mainline then becomes as
> +   important, to ensure all the improvements and fixes ideally get at least one
> +   week of testing together before Linus releases a new mainline version.
> +
> + * Fix regressions within two or three days, if they are critical for some
> +   reason -- for example, if the issue is likely to affect many users of the
> +   kernel series in question on all or certain architectures. This thus
> +   includes fixes for compile errors in mainline, as they might prevent testers
> +   and continuous integration systems from doing their work.
> +
> + * Aim to get fixes for regressions mainlined within one week after the culprit
> +   was identified, if the regression was introduced in a stable/longterm
> +   release or the development cycle for the latest mainline release (say
> +   v5.14). If possible, try to address the issue even quicker, if the previous
> +   stable series (v5.13.y) will be abandoned soon or already got stamped
> +   'End-of-Life' (EOL) -- this usually happens about three to four weeks after
> +   a new mainline release.
> +
> + * Try to fix all other regression within two weeks after the culprit was found.

s/regression/regressions/

> +   Two or three additional weeks are acceptable for performance regressions and
> +   other issues which are annoying, but don't prevent anyone from running Linux
> +   -- unless it's an issue in the current development cycle, which should be
> +   addressed before the release. A few weeks in total are also acceptable if a
> +   regression can only be fixed with a risky change and at the same time is
> +   affecting only a few users; as much time is also acceptable if the regression
> +   is already present in the second newest longterm kernel series.
> +
> +Note: The aforementioned timeframes for getting a regression resolved are meant

s/timeframes/time frames/

> +to include getting the fix tested, reviewed, and merged into mainline, ideally
> +with the fix being in Linux next for two days. Developers need to keep in mind

s/Linux next/linux-next/

> +that each of these steps takes some time.
> +
> +Subsystem maintainer are expected to assist in reaching those periods by doing
> +timely reviews and quick handling of accepted patches. They thus might have to
> +send git-pull requests earlier or more often than usually; depending on the fix,
> +it might even be acceptable to skip testing in Linux-next. Especially fixes for

s/Linux-next/linux-next/

Thorsten, thanks for this process documentation. It was a nice and
comprehensible read for me. Let us hope it helps contributors and
maintainers to adopt those recommendations. If you need any support of
any kind (more contributors, financial support) for such further
documentation on the development process, please reach out to me and I
will see what I can do.

Reviewed-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>

Lukas

> +regressions in stable and longterm kernels need to be handled quickly, as the
> +fix needs to reach mainline before it can be backported there.
> +
>  Do really all regressions get fixed?
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 2/2] docs: regressions.rst: rules of thumb for handling regressions
  2022-01-04 12:16   ` Lukas Bulwahn
@ 2022-01-04 13:29     ` Thorsten Leemhuis
  2022-01-04 14:42       ` Jonathan Corbet
  0 siblings, 1 reply; 15+ messages in thread
From: Thorsten Leemhuis @ 2022-01-04 13:29 UTC (permalink / raw)
  To: Lukas Bulwahn
  Cc: open list:DOCUMENTATION, Linus Torvalds, Greg Kroah-Hartman,
	workflows, Linux Kernel Mailing List, Randy Dunlap,
	Jonathan Corbet

On 04.01.22 13:16, Lukas Bulwahn wrote:
> On Mon, Jan 3, 2022 at 3:23 PM Thorsten Leemhuis <linux@leemhuis.info> wrote:

>> diff --git a/Documentation/admin-guide/regressions.rst b/Documentation/admin-guide/regressions.rst
>> index 1ff6a0802fc9..5f02a001e53c 100644
>> --- a/Documentation/admin-guide/regressions.rst
>> +++ b/Documentation/admin-guide/regressions.rst
>> @@ -63,6 +63,10 @@ list; add the aforementioned paragraph, just omit the caret (the ^) before the
>>  ``introduced``, which make regzbot treat your mail (and not the one you reply
>>  to) as the report.
>>
>> +Try to fix regressions quickly once the culprit got identified. Fixes for most
> 
> s/got/gets/ --- at least, that is what the gmail grammar spelling suggests :)

Hmm, LanguageTool didn't complain. Not totally sure, maybe both
approaches are okay. But the variant suggested by the gmail checker
might be the better one.

You comment made me put my text in google docs, which found about
fifteen other places where something was wrong. Should have done this
sooner, sorry. :-/

> [a lot of helpful comments]

Many thx, fixed all of them locally.

> Thorsten, thanks for this process documentation. It was a nice and
> comprehensible read for me. Let us hope it helps contributors and
> maintainers to adopt those recommendations.

Time will tell. Guess it will take a while.

> If you need any support of any kind (more contributors,

If you known people looking for a kernel docs text to work on, I have
two related ideas that might be of interest for them:

* the kernel docs IMHO could need a text explaining how to use "make
localmodconfig" to ordinary users -- for example, when preparing for a
bisection or a quick test of the latest mainline tree. Something like
this maybe, but modernized (and maybe with a explanation how to clone
the tree without getting the history from ten years ago):
http://www.h-online.com/open/features/Good-and-quick-kernel-configuration-creation-1403046.html
(that's a translation of a German text I wrote a decade ago...)

* the kernel docs contain a text explaining bisection, but it iirc is
brief and quite hard to understand for users that are new to this.
That's why I think it would be wise to improve or even rewrite the text,
to make it more accessible.

> financial support) for such further
> documentation on the development process, please reach out to me and I
> will see what I can do.

Sounds great. I might do that sooner or later for the two ideas I
outlined above, but that is unlikely to happen in the next few months.

> Reviewed-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>

Great, thx!

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 1/2] docs: add a document about regression handling
  2022-01-03  9:50 ` [RFC PATCH v1 1/2] docs: add a document about regression handling Thorsten Leemhuis
  2022-01-03 17:07   ` Jakub Kicinski
@ 2022-01-04 14:17   ` Lukas Bulwahn
  2022-01-04 17:57     ` Thorsten Leemhuis
  1 sibling, 1 reply; 15+ messages in thread
From: Lukas Bulwahn @ 2022-01-04 14:17 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: open list:DOCUMENTATION, Linus Torvalds, Greg Kroah-Hartman,
	workflows, Linux Kernel Mailing List, Randy Dunlap,
	Jonathan Corbet

On Mon, Jan 3, 2022 at 3:23 PM Thorsten Leemhuis <linux@leemhuis.info> wrote:
>
> Create a document explaining various aspects around regression handling
> and tracking both for users and developers. Among others describe the
> first rule of Linux kernel development and what it means in practice.
> Also explain what a regression actually is and how to report them
> properly. The text additionally provides a brief introduction to the bot
> the kernel's regression tracker users to facilitate the work. To sum
> things up, provide a few quotes from Linus to show how serious the he
> takes regressions.
>
> Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
> ---
>  Documentation/admin-guide/index.rst       |   1 +
>  Documentation/admin-guide/regressions.rst | 869 ++++++++++++++++++++++
>  MAINTAINERS                               |   1 +
>  3 files changed, 871 insertions(+)
>  create mode 100644 Documentation/admin-guide/regressions.rst
>
> diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> index 1bedab498104..17157ee5a416 100644
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -36,6 +36,7 @@ problems and bugs in particular.
>
>     reporting-issues
>     security-bugs
> +   regressions
>     bug-hunting
>     bug-bisect
>     tainted-kernels
> diff --git a/Documentation/admin-guide/regressions.rst b/Documentation/admin-guide/regressions.rst
> new file mode 100644
> index 000000000000..1ff6a0802fc9
> --- /dev/null
> +++ b/Documentation/admin-guide/regressions.rst
> @@ -0,0 +1,869 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
> +..
> +   If you want to distribute this text under CC-BY-4.0 only, please use 'The
> +   Linux kernel developers' for author attribution and link this as source:
> +   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/regressions.rst
> +..
> +   Note: Only the content of this RST file as found in the Linux kernel sources
> +   is available under CC-BY-4.0, as versions of this text that were processed
> +   (for example by the kernel's build system) might contain content taken from
> +   files which use a more restrictive license.
> +
> +
> +Regressions
> ++++++++++++
> +
> +The first rule of Linux kernel development: '*We don't cause regressions*'.
> +Linux founder and lead developer Linus Torvalds strictly enforces the rule
> +himself. This document describes what this means in practice and how the Linux
> +kernel's development model ensues all reported regressions get addressed; it

Did you mean "ensues" here or is it a typo of "ensures"?

> +covers aspects relevant for both users and developers.
> +
> +The important bits for people affected by regressions
> +=====================================================
> +
> +It's a regression if something running fine with one Linux kernel works worse or
> +not at all with a newer version. Note, the newer kernel has to be compiled using
> +a similar configuration -- for this and other fine print, check out below
> +section "What is a 'regression' and what is the 'no regressions rule'?".
> +
> +Report your regression as outlined in
> +`Documentation/admin-guide/reporting-issues.rst`, it already covers all aspects
> +important for regressions. Below section "How do I report a regression?"
> +highlights them for convenience.
> +
> +The most important aspect: CC for forward the report to `the regression mailings

s/for/or/ --- you mean CC or forward, right?

> +list <https://lore.kernel.org/regressions/>`_ (regressions@lists.linux.dev).
> +When doing so, consider mentioning the version range where the regression
> +started using a paragraph like this::
> +
> +       #regzbot introduced v5.13..v5.14-rc1
> +
> +The Linux kernel regression tracking bot 'regzbot' will then add the report to
> +the list of tracked regressions. This is in your interest, as it gets the report
> +on the radar of people ensuring all regressions are acted upon in timely manner.

s/in timely manner/in a timely manner/

> +
> +The important bits for people fixing regressions
> +================================================
> +
> +When getting regression reports by mail, check if the reporter CCed `the
> +regression mailing list <https://lore.kernel.org/regressions/>`_
> +(regressions@lists.linux.dev). If not, forward or bounce the report to the Linux
> +kernel's regression tracker (regressions@leemhuis.info), unless you plan sending
> +a reply to the report anyway. In that case simply CC the list in a direct reply
> +to the report. Also check, if the report included a 'regzbot command' like
> +``#regzbot introduced v5.13..v5.14-rc1`` (see above); if not, please include a
> +paragraph like the following, to get the regression tracked by the Linux kernel
> +regression tracking bot 'regzbot'::
> +
> +       #regzbot ^introduced v5.13..v5.14-rc1
> +
> +If the report was filed in a public bug-tracker, forward it to the regression

s/bug-tracker/bug tracker/

> +list; add the aforementioned paragraph, just omit the caret (the ^) before the
> +``introduced``, which make regzbot treat your mail (and not the one you reply

s/make/makes/

> +to) as the report.
> +
> +When submitting fixes for regressions, always include 'Link:' tags in the commit
> +message that point to all places where the issue was reported, as explained in
> +`Documentation/process/submitting-patches.rst` and
> +:ref:`Documentation/process/5.Posting.rst <development_posting>`. Hence, link to
> +any mails in the archive with reports about the issue as well as all bug-tracker
> +entries::
> +
> +       Link: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
> +       Link: https://bugzilla.kernel.org/show_bug.cgi?id=215375
> +
> +This is important for regression tracking, as this allows regzbot to
> +automatically associate tracked regression reports with patch postings and
> +commits fixing it.
> +
> +
> +All the details on handling Linux kernel regressions
> +====================================================
> +
> +The important basics
> +--------------------
> +
> +What is a 'regression' and what is the 'no regressions rule'?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It's a regression if some application or practical use case running fine on one
> +Linux kernel works worse or not at all with a newer version compiled using a
> +similar configuration. The 'no regressions rule' forbids this to happen. If a
> +regression happens by accident, developers that caused it are expected to
> +quickly fix the issue.
> +
> +It thus is a regression when a WiFi driver from Linux 5.13 works fine, but with
> +5.14 doesn't work at all, works significantly slower, or misbehaves somehow.
> +It's also a regression if a perfectly working application suddenly shows erratic
> +behavior with a newer kernel version, which can be caused by changes in procfs,
> +sysfs, or one of the many other interfaces Linux provides to userland software.
> +But keep in mind, as mentioned earlier: 5.14 in this example needs to be build

s/build/built/ --- if you put into Google docs, you probably have seen
that by now.

> +from a configuration similar to the one from 5.13. This can be achieved using
> +``make olddefconfig``, as explained in more detail below.
> +
> +Note the 'practical use case' in the first sentence of this section: developers
> +despite the 'no regressions' rule are free to change any aspect of the kernel
> +and even APIs or ABIs to userland, as long as no existing application or
> +use-case breaks.

s/use-case/use case/

> +
> +Also be aware the 'no regressions' rule covers only interfaces the kernel
> +provides to the userland. It thus does not apply to kernel-internal interfaces
> +like the module API, which some externally developed drivers use to hook into
> +the kernel.
> +
> +What is the goal of the 'no regressions rule'?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Users should feel safe when updating kernel versions and not have to worry
> +something might break. This is in the interest of the kernel developers to make
> +updating attractive: they don't want users to stay on stable or longterm Linux
> +series either abandoned or more than one and a half year old, as `those might
> +have known problems, security issues, or other aspects already improved in later
> +versions
> +<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
> +
Maybe add something like this:

A larger user community means more exposure and more confidence that
any critical bug introduced is likely to be found closer to the point
in time it was introduced, and hence the shipped kernels have less
critical bugs.

Just to close the line of thought here.

> +How hard is the rule enforced?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Extraordinarily strict, as can be seen by many mailing list posts from Linux
> +creator and lead-developer Linus Torvalds, some of which are quoted at the end

s/lead-developer/lead developer/

> +of this document.
> +
> +Exceptions to this rule are extremely rare; in the past developers almost always
> +turned out to be wrong when they assumed a particular situation was warranting
> +an exception.
> +
> +How is the rule enforced?
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It's the duty of the subsystem maintainers, which are watched and supported by
> +Linus Torvalds for mainline or stable/longterm tree maintainers like Greg
> +Kroah-Hartman. All of them are supported by Thorsten Leemhuis: he's acting as
> +'regressions tracker' for the Linux kernel and trying to ensure all regression
> +reports are acted upon in timely manner.
> +
> +The distributed and slightly unstructured nature of the Linux kernel's
> +development makes tracking regressions hard. That's why Thorsten relies on the
> +help of his Linux kernel regression tracking robot 'regzbot'. It watches mailing
> +lists and git trees to semi-automatically associate regression reports to patch
> +submissions and commits fixing the issue, as this provides all necessary
> +insights into the fixing progress.
> +
> +To ensure no regression falls through the cracks, the regression tracker or his
> +bot need to get aware of every report. That's why you need to get them into the
> +loop for regressions, as explained in the next section.
> +
> +How do I report a regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Just report the issue as outlined in
> +`Documentation/admin-guide/reporting-issues.rst`, it already describes the
> +important points. The following aspects described there are especially relevant
> +for regressions:
> +
> + * When checking for existing reports to join, first check the `archives of the
> +   Linux regressions mailing list <https://lore.kernel.org/regressions/>`_ and
> +   `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
> +
> + * In your report, mention the last kernel version that worked fine and the
> +   first broken one. Even better: try to find the commit causing the regression
> +   using a bisection.
> +
> + * Remember to let the Linux regressions mailing list
> +   (regressions@lists.linux.dev) known about your report:
> +
> +  * If you report the regression by mail, CC the regressions list.
> +
> +  * If you report your regression to some bug tracker, forward the filed report
> +    by mail to the regressions list while CCing the maintainer and the mailing
> +    list for the subsystem in question.
> +
> +Additionally, you in both cases should directly get the aforementioned Linux
> +kernel regression tracking bot into the loop. To do that, include a paragraph
> +like this in your report to tell the bot when the regression started to happen::
> +
> +       #regzbot introduced: v5.13..v5.14-rc1
> +
> +In this example, v5.13 was the last version that worked, while v5.14-rc1 was the
> +first broken one. The smaller the range, the better, as that makes it easier to
> +find out what's wrong and who's responsible. That's why you ideally should
> +perform a bisection to find the commit causing the regression (the 'culprit').
> +If you did, specify it instead::
> +
> +       #regzbot introduced: 1f2e3d4c5d
> +
> +Placing such a 'regzbot command' is in your interest, as it will ensure the
> +report won't fall through the cracks unnoticed. If you omit this, the Linux
> +kernel's regressions tracker will take care of telling regzbot about your
> +regression, as long as you sent a copy to the regressions mailing lists. But the
> +regression tracker is just one human which sometimes has to rest or occasionally
> +might even enjoy some time away from computers (as crazy as that might sound).
> +Relying on this person thus will result in an unnecessary delay before the
> +regressions becomes mentioned `on the list of tracked and unresolved Linux
> +kernel regressions <https://linux-regtracking.leemhuis.info/regzbot/>`_ and the
> +weekly regression reports sent by regzbot. Such delays can result in Linus
> +Torvalds being unaware of important regressions when deciding between 'continue
> +development or call this finished by performing a release?'.
> +
> +How to add a regression to regzbot's tracking somebody else reported?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Use your mailers 'Reply-all' function to send a reply where you CC the
> +regressions list (regressions@lists.linux.dev). In that reply create a new
> +paragraph with a regzbot command like this::
> +
> +       #regzbot ^introduced: v5.13..v5.14-rc1
> +
> +The caret (^) before the 'introduced' makes regzbot treat the parent mail (the
> +one you reply to) as the report for the regression you want to see tracked.
> +Instead of a version range you can also specify the commit causing the
> +regression, as outlined in the previous section.
> +
> +If the report came in private from a bug tracker, forward it to the list;
> +include the aforementioned line, just omit the caret (the ^) before the
> +'introduced'; consider adding a line with the line '#regzbot link: <url>' (see
> +below) pointing to the place with the initial report.
> +
> +Alternatively to all the above you can just forward or bounce the report to the
> +Linux kernel's regression tracker, but consider the downsides already outlined
> +in the previous section.
> +
> +Do really all regressions get fixed?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Nearly all of them are, as long as the change causing the regression (the
> +'culprit commit') gets reliably identified. Some regressions can be fixed
> +without this, but often it's required.
> +
> +Who needs to find the commit causing a regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It's the reporter's duty to find the culprit, but developers of the affected
> +subsystem should offer advice and reasonably help where they can.
> +
> +How can I find the change causing a regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Perform a bisection, as roughly outlined in
> +`Documentation/admin-guide/reporting-issues.rst` and described in more detail by
> +`Documentation/admin-guide/bug-bisect.rst`. It might sound like a lot of work,
> +but in many cases finds the culprit relative quickly. If it's hard or

s/relative/relatively/

> +time-consuming to reliably reproduce the issue, consider teaming up with others
> +affected by the problem to narrow down the search range together.
> +
> +Who can I ask for advice when it comes to regressions?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
> +CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
> +issue might better be dealt with in private, feel free to omit the list.
> +
> +
> +More details about regressions relevant for reporters
> +-----------------------------------------------------
> +
> +Does a regression need to be fixed, if it can be avoided by updating some other software?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Almost always: yes. If a developer tell you otherwise, ask the regression
> +tracker for advice as outlined above.
> +
> +Does it qualify as a regression if a newer kernel works slower or makes the system consumes more energy?

s/consumes/consume/

Okay, that is how far I got reading for now.

Lukas

> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It does, but the difference has to be significant. A five percent slow-down in a
> +micro-benchmark thus is unlikely to qualify as regression, unless it also
> +influences the results of a broad benchmark by more than one percent. If in a
> +doubt, ask for advice.
> +
> +Is it a regression, if an externally developed kernel module is incompatible with a newer kernel?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +No, as the 'no regression' rule is about interfaces and services the Linux
> +kernel provides to the userland. It thus does not cover building or running
> +externally developed kernel modules, as they run in kernel-space and use
> +occasionally changed internal interfaces to hook into the kernel.
> +
> +How are regressions handled that are caused by a fix for security vulnerability?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +In extremely rare situations security issues can't be fixed without causing
> +regressions; those are given way, as they are the lesser evil in the end.
> +Luckily this almost always can be avoided, as key developers for the affected
> +area and often Linus Torvalds himself try very hard to fix security issues
> +without causing regressions.
> +
> +If you nevertheless face such a case, check the mailing list archives if people
> +tried their best to avoid the regression; if in a doubt, ask for advice as
> +outlined above.
> +
> +What happens if fixing a regression is impossible without causing another regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Sadly these things happen, but luckily not very often; if they occur, expert
> +developers of the affected code area should look into the issue to find a fix
> +that avoids regressions or at least their impact. If you run into such a
> +situation you thus do what was outlined already for regressions caused by
> +security fixes: check earlier discussions if people already tried their best and
> +ask for advice if in a doubt.
> +
> +A quick note while at it: these situations could be avoided, if you would
> +regularly give mainline pre-releases (say v5.15-rc1 or -rc3) from each cycle a
> +test run. This is best explained by imagining a change integrated between Linux
> +v5.14 and v5.15-rc1 which causes a regression, but at the same time is a hard
> +requirement for some other improvement applied for 5.15-rc1. All these changes
> +often can simply be reverted and the regression thus solved, if someone finds
> +and reports it before 5.15 is released. A few days or weeks later after the
> +release this solution might become impossible, if some software starts to rely
> +on aspects introduced by one of the follow-up changes: reverting all changes
> +would cause regressions for users of said software and thus out of the question.
> +
> +A feature I relied on was removed months ago, but I only noticed now. Does that qualify as regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It does, but often it's hard to fix them due to the aspects outlined in the
> +previous section. It hence needs to be dealt with on a case-by-case basis; this
> +is another reason why it's in your interest to regular test mainline releases.
> +
> +Does the 'no regression' rule apply if I seem to be the only person in the world that is affected by a regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It does, but only for practical usage: the Linux developers want to be free to
> +remove support for hardware only to be found in attics and museums anymore.
> +
> +Note, sometimes regressions can't be avoided to make progress -- and the latter
> +is needed to prevent Linux from stagnation. Hence, if only very few users seem
> +to be affected by a regression, it for the greater good might be in their and
> +everyone else interest to not insist on the rule. Especially if there is a easy
> +way to circumvent the regression somehow, for example by updating some software
> +or using a kernel parameter created just for this purpose.
> +
> +Does the regression rule apply for code in the staging tree as well?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Not according to the `help text for the configuration option covering all
> +staging code <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
> +which since its early days states::
> +
> +       Please note that these drivers are under heavy development, may or
> +       may not work, and may contain userspace interfaces that most likely
> +       will be changed in the near future.
> +
> +The staging developers nevertheless often adhere the 'no regressions' rule, but
> +sometimes bend it to make progress. That's for example why some users had to
> +deal with (often negligible) regressions when a WiFi driver from the staging
> +tree got replaced by a totally different one written from scratch.
> +
> +Why do later versions have to be 'compiled with a similar configuration'?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Because the Linux kernel developers sometimes integrate changes known to cause
> +regressions, but make them optional and disable them in the kernel's default
> +configuration. This trick allows progress, as the 'no regressions' rule
> +otherwise would lead to stagnation. Consider for example a new security feature
> +which blocks access to some kernel interfaces often abused by malware, but at
> +the same time are required to run a few rarely used applications. The outlined
> +trick makes both camps happy: people using these applications can leave the new
> +security feature off, while everyone else can enable it without running into
> +trouble.
> +
> +How to create a configuration similar to the one of an older kernel?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Start a known-good kernel and configure the newer Linux version with ``make
> +olddefconfig``. This makes the kernel's build scripts pick up the configuration
> +file (the `.config` file) from the running kernel as base for the new one you
> +are about to compile; afterwards they set all new configuration options to their
> +default value, which disables new features that might cause regressions.
> +
> +Can I report a regression with vanilla kernels provided by someone else to the upstream Linux kernel developers?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Only if the newer kernel was compiled with a similar configuration file as the
> +older one (see above), as your provider might have enabled some known-to-be
> +incompatible feature in the newer kernel. If in a doubt, report this problem to
> +the provider and ask for advice.
> +
> +
> +More details about regressions relevant for developers
> +------------------------------------------------------
> +
> +What should I do, if I suspect a change I'm working on might cause regressions?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Evaluate how big the risk of regressions is, for example by performing a code
> +search in Linux distributions and Git forges. Also consider asking other
> +developers or projects likely to be affected to evaluate or even test the
> +proposed change; if problems surface, maybe some middle ground acceptable for
> +all can be found.
> +
> +If the risk of regressions in the end seems to be relative small, go ahead with
> +the change, but let all involved parties know about the risk. Hence, make sure
> +your patch description makes this aspect obvious. Once the change got merged,
> +tell the Linux kernel's regression tracker and the regressions mailing list
> +about the risk, so everyone has the change on the radar in case reports trickle
> +in. Depending on the risk, you also might want to ask the subsystem maintainer
> +to mention the issue in his pull request to mainline.
> +
> +
> +Everything developers need to know about regression tracking
> +------------------------------------------------------------
> +
> +Do I have to use regzbot?
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It's in the interest of everyone if you do, as kernel maintainers like Linus
> +Torvalds partly rely on regzbot's tracking in their work -- for example when
> +deciding to release a new version or extend the development phase. For this they
> +need to be aware of all unfixed regression; to do that, Linus is known to look
> +into the weekly reports sent by regzbot.
> +
> +Do I have to tell regzbot about every regression I stumble upon?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Ideally yes: we are all humans and easily forget problems when something more
> +important unexpectedly comes up -- for example a bigger problem in the Linux
> +kernel or something in real life that's keeping us away from keyboards for a
> +while. Hence, it's best to tell regzbot about every regression, except when you
> +immediately write a fix and commit it to a tree regularly merged to the affected
> +kernel series.
> +
> +Why does the Linux kernel need a regression tracker, and why does he utilize regzbot?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Rules like 'no regressions' need someone to enforce them, otherwise they are
> +broken either accidentally or on purpose. History has shown that this is true
> +for the Linux kernel as well. That's why Thorsten volunteered to keep an eye on
> +things.
> +
> +Tracking regressions completely manually has proven to be exhausting and
> +demotivating, which is why earlier attempts to establish it failed after a
> +while. To prevent this from happening again, Thorsten developed Regzbot to
> +facilitate the work, with the long term goal to automate regression tracking as
> +much as possible for everyone involved.
> +
> +How does regression tracking work with regzbot?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The bot keeps track of all the reports and monitor their fixing progress. It
> +tries to do that with as little overhead as possible for both reporters and
> +developers.
> +
> +In fact, only reporters or someone helping them gets an extra duty: they need to
> +tell regzbot about the regression report using one of the ``#regzbot
> +introduced`` commands outlined above.
> +
> +For developers there normally is no extra work involved, they just need to do
> +something that's expected from them already: add 'Link:' tags to the patch
> +description pointing to all reports about the issue fixed.
> +
> +Thanks to these tags regzbot can associate regression reports with patches to
> +fix the issue, whenever they get posted for review or applied to a git tree. The
> +bot additionally watches out for replies to the report. All this data combined
> +provides a good impression about the current status of the fixing process.
> +
> +How to see which regressions regzbot tracks currently?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_
> +for the latest info; alternatively, `search for the latest regression report
> +<https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
> +which regzbot normally sends out once a week on Sunday evening (UTC), which is a
> +few hours before Linus usually publishes new (pre-)releases.
> +
> +What places is regzbot monitoring?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Regzbot is watching the most important Linux mailing lists as well as the Linux
> +next, mainline and stable/longterm git repositories.
> +
> +How to interact with regzbot?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Everyone can interact with the bot using mails containing `regzbot commands`,
> +which need to be in their own paragraph (IOW: they need to be separated from the
> +rest of the mail using blank lines). One such command is ``#regzbot introduced
> +<version or commit>``, which adds a report to the tracking, as already described
> +above; ``#regzbot ^introduced <version or commit>`` is another such command,
> +which makes regzbot consider the parent mail as a report for a regression which
> +it starts to track.
> +
> +Once one of those two commands has been utilized, other regzbot commands can be
> +used. You can write them below one of the `introduced` commands or in replies to
> +the mail that used one of them or itself is a reply to that mail:
> +
> + * Set or update the title::
> +
> +       #regzbot title: foo
> +
> + * Link to a related discussion (for example the posting of a patch to fix the
> +   issue) and monitor it::
> +
> +       #regzbot monitor: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
> +
> +   Monitoring only works for lore.kernel.org; regzbot will consider all messages
> +   in that thread as related to the fixing process.
> +
> + * Point to a place with further details, like a bug-tracker or a related
> +   mailing list post::
> +
> +       #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
> +
> + * Mark a regression as fixed by a commit that is heading upstream or already
> +   landed::
> +
> +       #regzbot fixed-by: 1f2e3d4c5d
> +
> + * Mark a regression as a duplicate of another one already tracked by regzbot::
> +
> +       #regzbot dup-of: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
> +
> + * Mark a regression as invalid::
> +
> +       #regzbot invalid: wasn't a regression, problem has always existed
> +
> +Is there more to tell about regzbot and its commands?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +More detailed and up-to-date information about the Linux kernels regression
> +tracking bot can be found on its `project page <https://gitlab.com/knurd42/regzbot>`_,
> +which among others contains a
> +`getting started guide <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_
> +and `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
> +which both are more in-depth.
> +
> +
> +Quotes from Linus about regression
> +----------------------------------
> +
> +Find below a few real life examples of how Linus Torvalds expects regressions to
> +be handled:
> +
> + * From `2017-10-26 (1/2) <https://lore.kernel.org/lkml/CA+55aFwiiQYJ+YoLKCXjN_beDVfu38mg=Ggg5LFOcqHE8Qi7Zw@mail.gmail.com/>`_::
> +
> +       If you break existing user space setups THAT IS A REGRESSION.
> +
> +       It's not ok to say "but we'll fix the user space setup".
> +
> +       Really. NOT OK.
> +
> +       [...]
> +
> +       The first rule is:
> +
> +        - we don't cause regressions
> +
> +       and the corollary is that when regressions *do* occur, we admit to
> +       them and fix them, instead of blaming user space.
> +
> +       The fact that you have apparently been denying the regression now for
> +       three weeks means that I will revert, and I will stop pulling apparmor
> +       requests until the people involved understand how kernel development
> +       is done.
> +
> + * From `2017-10-26 (2/2) <https://lore.kernel.org/lkml/CA+55aFxW7NMAMvYhkvz1UPbUTUJewRt6Yb51QAx5RtrWOwjebg@mail.gmail.com/>`_::
> +
> +       People should basically always feel like they can update their kernel
> +       and simply not have to worry about it.
> +
> +       I refuse to introduce "you can only update the kernel if you also
> +       update that other program" kind of limitations. If the kernel used to
> +       work for you, the rule is that it continues to work for you.
> +
> +       There have been exceptions, but they are few and far between, and they
> +       generally have some major and fundamental reasons for having happened,
> +       that were basically entirely unavoidable, and people _tried_hard_ to
> +       avoid them. Maybe we can't practically support the hardware any more
> +       after it is decades old and nobody uses it with modern kernels any
> +       more. Maybe there's a serious security issue with how we did things,
> +       and people actually depended on that fundamentally broken model. Maybe
> +       there was some fundamental other breakage that just _had_ to have a
> +       flag day for very core and fundamental reasons.
> +
> +       And notice that this is very much about *breaking* peoples environments.
> +
> +       Behavioral changes happen, and maybe we don't even support some
> +       feature any more. There's a number of fields in /proc/<pid>/stat that
> +       are printed out as zeroes, simply because they don't even *exist* in
> +       the kernel any more, or because showing them was a mistake (typically
> +       an information leak). But the numbers got replaced by zeroes, so that
> +       the code that used to parse the fields still works. The user might not
> +       see everything they used to see, and so behavior is clearly different,
> +       but things still _work_, even if they might no longer show sensitive
> +       (or no longer relevant) information.
> +
> +       But if something actually breaks, then the change must get fixed or
> +       reverted. And it gets fixed in the *kernel*. Not by saying "well, fix
> +       your user space then". It was a kernel change that exposed the
> +       problem, it needs to be the kernel that corrects for it, because we
> +       have a "upgrade in place" model. We don't have a "upgrade with new
> +       user space".
> +
> +       And I seriously will refuse to take code from people who do not
> +       understand and honor this very simple rule.
> +
> +       This rule is also not going to change.
> +
> +       And yes, I realize that the kernel is "special" in this respect. I'm
> +       proud of it.
> +
> +       I have seen, and can point to, lots of projects that go "We need to
> +       break that use case in order to make progress" or "you relied on
> +       undocumented behavior, it sucks to be you" or "there's a better way to
> +       do what you want to do, and you have to change to that new better
> +       way", and I simply don't think that's acceptable outside of very early
> +       alpha releases that have experimental users that know what they signed
> +       up for. The kernel hasn't been in that situation for the last two
> +       decades.
> +
> +       We do API breakage _inside_ the kernel all the time. We will fix
> +       internal problems by saying "you now need to do XYZ", but then it's
> +       about internal kernel API's, and the people who do that then also
> +       obviously have to fix up all the in-kernel users of that API. Nobody
> +       can say "I now broke the API you used, and now _you_ need to fix it
> +       up". Whoever broke something gets to fix it too.
> +
> +       And we simply do not break user space.
> +
> + * From `2020-05-21 <https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/>`_::
> +
> +       The rules about regressions have never been about any kind of
> +       documented behavior, or where the code lives.
> +
> +       The rules about regressions are always about "breaks user workflow".
> +
> +       Users are literally the _only_ thing that matters.
> +
> +       No amount of "you shouldn't have used this" or "that behavior was
> +       undefined, it's your own fault your app broke" or "that used to work
> +       simply because of a kernel bug" is at all relevant.
> +
> +       Now, reality is never entirely black-and-white. So we've had things
> +       like "serious security issue" etc that just forces us to make changes
> +       that may break user space. But even then the rule is that we don't
> +       really have other options that would allow things to continue.
> +
> +       And obviously, if users take years to even notice that something
> +       broke, or if we have sane ways to work around the breakage that
> +       doesn't make for too much trouble for users (ie "ok, there are a
> +       handful of users, and they can use a kernel command line to work
> +       around it" kind of things) we've also been a bit less strict.
> +
> +       But no, "that was documented to be broken" (whether it's because the
> +       code was in staging or because the man-page said something else) is
> +       irrelevant. If staging code is so useful that people end up using it,
> +       that means that it's basically regular kernel code with a flag saying
> +       "please clean this up".
> +
> +       The other side of the coin is that people who talk about "API
> +       stability" are entirely wrong. API's don't matter either. You can make
> +       any changes to an API you like - as long as nobody notices.
> +
> +       Again, the regression rule is not about documentation, not about
> +       API's, and not about the phase of the moon.
> +
> +       It's entirely about "we caused problems for user space that used to work".
> +
> + * From `2012-07-06 <https://lore.kernel.org/all/CA+55aFwnLJ+0sjx92EGREGTWOx84wwKaraSzpTNJwPVV8edw8g@mail.gmail.com/>`_::
> +
> +       > Now this got me wondering if Debian _unstable_ actually qualifies as a
> +       > standard distro userspace.
> +
> +       Oh, if the kernel breaks some standard user space, that counts. Tons
> +       of people run Debian unstable (and from my limited interactions with
> +       it, for damn good reasons: -stable tends to run so old versions of
> +       everything that you have to sometimes deal with cuneiform writing when
> +       using it)
> +
> + * From `2017-11-05 <https://lore.kernel.org/all/CA+55aFzUvbGjD8nQ-+3oiMBx14c_6zOj2n7KLN3UsJ-qsd4Dcw@mail.gmail.com/>`_::
> +
> +       And our regression rule has never been "behavior doesn't change".
> +       That would mean that we could never make any changes at all.
> +
> +       For example, we do things like add new error handling etc all the
> +       time, which we then sometimes even add tests for in our kselftest
> +       directory.
> +
> +       So clearly behavior changes all the time and we don't consider that a
> +       regression per se.
> +
> +       The rule for a regression for the kernel is that some real user
> +       workflow breaks. Not some test. Not a "look, I used to be able to do
> +       X, now I can't".
> +
> + * From `2018-08-03 <https://lore.kernel.org/all/CA+55aFwWZX=CXmWDTkDGb36kf12XmTehmQjbiMPCqCRG2hi9kw@mail.gmail.com/>`_::
> +
> +       YOU ARE MISSING THE #1 KERNEL RULE.
> +
> +       We do not regress, and we do not regress exactly because your are 100% wrong.
> +
> +       And the reason you state for your opinion is in fact exactly *WHY* you
> +       are wrong.
> +
> +       Your "good reasons" are pure and utter garbage.
> +
> +       The whole point of "we do not regress" is so that people can upgrade
> +       the kernel and never have to worry about it.
> +
> +       > Kernel had a bug which has been fixed
> +
> +       That is *ENTIRELY* immaterial.
> +
> +       Guys, whether something was buggy or not DOES NOT MATTER.
> +
> +       Why?
> +
> +       Bugs happen. That's a fact of life. Arguing that "we had to break
> +       something because we were fixing a bug" is completely insane. We fix
> +       tens of bugs every single day, thinking that "fixing a bug" means that
> +       we can break something is simply NOT TRUE.
> +
> +       So bugs simply aren't even relevant to the discussion. They happen,
> +       they get found, they get fixed, and it has nothing to do with "we
> +       break users".
> +
> +       Because the only thing that matters IS THE USER.
> +
> +       How hard is that to understand?
> +
> +       Anybody who uses "but it was buggy" as an argument is entirely missing
> +       the point. As far as the USER was concerned, it wasn't buggy - it
> +       worked for him/her.
> +
> +       Maybe it worked *because* the user had taken the bug into account,
> +       maybe it worked because the user didn't notice - again, it doesn't
> +       matter. It worked for the user.
> +
> +       Breaking a user workflow for a "bug" is absolutely the WORST reason
> +       for breakage you can imagine.
> +
> +       It's basically saying "I took something that worked, and I broke it,
> +       but now it's better". Do you not see how f*cking insane that statement
> +       is?
> +
> +       And without users, your program is not a program, it's a pointless
> +       piece of code that you might as well throw away.
> +
> +       Seriously. This is *why* the #1 rule for kernel development is "we
> +       don't break users". Because "I fixed a bug" is absolutely NOT AN
> +       ARGUMENT if that bug fix broke a user setup. You actually introduced a
> +       MUCH BIGGER bug by "fixing" something that the user clearly didn't
> +       even care about.
> +
> +       And dammit, we upgrade the kernel ALL THE TIME without upgrading any
> +       other programs at all. It is absolutely required, because flag-days
> +       and dependencies are horribly bad.
> +
> +       And it is also required simply because I as a kernel developer do not
> +       upgrade random other tools that I don't even care about as I develop
> +       the kernel, and I want any of my users to feel safe doing the same
> +       time.
> +
> +       So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel
> +       without upgrading some other random binary, then we have a problem.
> +
> + * From `2021-06-05 <https://lore.kernel.org/all/CAHk-=wiUVqHN76YUwhkjZzwTdjMMJf_zN4+u7vEJjmEGh3recw@mail.gmail.com/>`_::
> +
> +       THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.
> +
> +       Honestly, security people need to understand that "not working" is not
> +       a success case of security. It's a failure case.
> +
> +       Yes, "not working" may be secure. But security in that case is *pointless*.
> +
> + * From `2021-07-30 <https://lore.kernel.org/lkml/CAHk-=witY33b-vqqp=ApqyoFDpx9p+n4PwG9N-TvF8bq7-tsHw@mail.gmail.com/>`_::
> +
> +       But we have the policy that regressions aren't about documentation or
> +       even sane behavior.
> +
> +       Regressions are about whether a user application broke in a noticeable way.
> +
> + * From `2011-05-06 (1/3) <https://lore.kernel.org/all/BANLkTim9YvResB+PwRp7QTK-a5VNg2PvmQ@mail.gmail.com/>`_::
> +
> +       Binary compatibility is more important.
> +
> +       And if binaries don't use the interface to parse the format (or just
> +       parse it wrongly - see the fairly recent example of adding uuid's to
> +       /proc/self/mountinfo), then it's a regression.
> +
> +       And regressions get reverted, unless there are security issues or
> +       similar that makes us go "Oh Gods, we really have to break things".
> +
> +       I don't understand why this simple logic is so hard for some kernel
> +       developers to understand. Reality matters. Your personal wishes matter
> +       NOT AT ALL.
> +
> +       If you made an interface that can be used without parsing the
> +       interface description, then we're stuck with the interface. Theory
> +       simply doesn't matter.
> +
> +       You could help fix the tools, and try to avoid the compatibility
> +       issues that way. There aren't that many of them.
> +
> + * From `2011-05-06 (2/3) <https://lore.kernel.org/all/BANLkTi=KVXjKR82sqsz4gwjr+E0vtqCmvA@mail.gmail.com/>`_::
> +
> +       it's clearly NOT an internal tracepoint. By definition. It's being
> +       used by powertop.
> +
> + * From `2011-05-06 (3/3) <https://lore.kernel.org/all/BANLkTinazaXRdGovYL7rRVp+j6HbJ7pzhg@mail.gmail.com/>`_::
> +
> +       We have programs that use that ABI and thus it's a regression if they break.
> +
> + * From `2006-02-21 <https://lore.kernel.org/lkml/Pine.LNX.4.64.0602211631310.30245@g5.osdl.org/>`_::
> +
> +       The fact is, if changing the kernel breaks user-space, it's a regression.
> +       IT DOES NOT MATTER WHETHER IT'S IN /sbin/hotplug OR ANYTHING ELSE. If it
> +       was installed by a distribution, it's user-space. If it got installed by
> +       "vmlinux", it's the kernel.
> +
> +       The only piece of user-space code we ship with the kernel is the system
> +       call trampoline etc that the kernel sets up. THOSE interfaces we can
> +       really change, because it changes with the kernel.
> +
> + * From `2019-09-15 <https://lore.kernel.org/lkml/CAHk-=wiP4K8DRJWsCo=20hn_6054xBamGKF2kPgUzpB5aMaofA@mail.gmail.com/>`_::
> +
> +       One _particularly_ last-minute revert is the top-most commit (ignoring
> +       the version change itself) done just before the release, and while
> +       it's very annoying, it's perhaps also instructive.
> +
> +       What's instructive about it is that I reverted a commit that wasn't
> +       actually buggy. In fact, it was doing exactly what it set out to do,
> +       and did it very well. In fact it did it _so_ well that the much
> +       improved IO patterns it caused then ended up revealing a user-visible
> +       regression due to a real bug in a completely unrelated area.
> +
> +       The actual details of that regression are not the reason I point that
> +       revert out as instructive, though. It's more that it's an instructive
> +       example of what counts as a regression, and what the whole "no
> +       regressions" kernel rule means. The reverted commit didn't change any
> +       API's, and it didn't introduce any new bugs. But it ended up exposing
> +       another problem, and as such caused a kernel upgrade to fail for a
> +       user. So it got reverted.
> +
> +       The point here being that we revert based on user-reported _behavior_,
> +       not based on some "it changes the ABI" or "it caused a bug" concept.
> +       The problem was really pre-existing, and it just didn't happen to
> +       trigger before. The better IO patterns introduced by the change just
> +       happened to expose an old bug, and people had grown to depend on the
> +       previously benign behavior of that old issue.
> +
> +       And never fear, we'll re-introduce the fix that improved on the IO
> +       patterns once we've decided just how to handle the fact that we had a
> +       bad interaction with an interface that people had then just happened
> +       to rely on incidental behavior for before. It's just that we'll have
> +       to hash through how to do that (there are no less than three different
> +       patches by three different developers being discussed, and there might
> +       be more coming...). In the meantime, I reverted the thing that exposed
> +       the problem to users for this release, even if I hope it will be
> +       re-introduced (perhaps even backported as a stable patch) once we have
> +       consensus about the issue it exposed.
> +
> +       Take-away from the whole thing: it's not about whether you change the
> +       kernel-userspace ABI, or fix a bug, or about whether the old code
> +       "should never have worked in the first place". It's about whether
> +       something breaks existing users' workflow.
> +
> +       Anyway, that was my little aside on the whole regression thing.  Since
> +       it's that "first rule of kernel programming", I felt it is perhaps
> +       worth just bringing it up every once in a while.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 27a83bb940d4..1b740c922867 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10351,6 +10351,7 @@ KERNEL REGRESSIONS
>  M:     Thorsten Leemhuis <linux@leemhuis.info>
>  L:     regressions@lists.linux.dev
>  S:     Supported
> +F:     Documentation/admin-guide/regressions.rst
>
>  KERNEL SELFTEST FRAMEWORK
>  M:     Shuah Khan <shuah@kernel.org>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 2/2] docs: regressions.rst: rules of thumb for handling regressions
  2022-01-04 13:29     ` Thorsten Leemhuis
@ 2022-01-04 14:42       ` Jonathan Corbet
  2022-01-04 15:09         ` Randy Dunlap
  0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Corbet @ 2022-01-04 14:42 UTC (permalink / raw)
  To: Thorsten Leemhuis, Lukas Bulwahn
  Cc: open list:DOCUMENTATION, Linus Torvalds, Greg Kroah-Hartman,
	workflows, Linux Kernel Mailing List, Randy Dunlap

Thorsten Leemhuis <linux@leemhuis.info> writes:

> On 04.01.22 13:16, Lukas Bulwahn wrote:
>> On Mon, Jan 3, 2022 at 3:23 PM Thorsten Leemhuis <linux@leemhuis.info> wrote:
>>> +Try to fix regressions quickly once the culprit got identified. Fixes for most
>> 
>> s/got/gets/ --- at least, that is what the gmail grammar spelling suggests :)
>
> Hmm, LanguageTool didn't complain. Not totally sure, maybe both
> approaches are okay. But the variant suggested by the gmail checker
> might be the better one.

So we're deeply into nit territory, but "gets" would be the correct
tense there.  Even better, though, is to avoid using "to get" in this
way at all.  I'm informed that "to get" is one of the hardest verbs for
non-native speakers, well, to get, so I try to avoid it in my own
writing.  "once the culprit is identified" or "has been identified"
would both be good here.

>> financial support) for such further
>> documentation on the development process, please reach out to me and I
>> will see what I can do.

Financial support for documentation work?  Now there's a nice idea...:)

(back to real work now)

jon

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 2/2] docs: regressions.rst: rules of thumb for handling regressions
  2022-01-04 14:42       ` Jonathan Corbet
@ 2022-01-04 15:09         ` Randy Dunlap
  2022-01-04 18:02           ` Thorsten Leemhuis
  0 siblings, 1 reply; 15+ messages in thread
From: Randy Dunlap @ 2022-01-04 15:09 UTC (permalink / raw)
  To: Jonathan Corbet, Thorsten Leemhuis, Lukas Bulwahn
  Cc: open list:DOCUMENTATION, Linus Torvalds, Greg Kroah-Hartman,
	workflows, Linux Kernel Mailing List



On 1/4/22 06:42, Jonathan Corbet wrote:
> Thorsten Leemhuis <linux@leemhuis.info> writes:
> 
>> On 04.01.22 13:16, Lukas Bulwahn wrote:
>>> On Mon, Jan 3, 2022 at 3:23 PM Thorsten Leemhuis <linux@leemhuis.info> wrote:
>>>> +Try to fix regressions quickly once the culprit got identified. Fixes for most
>>>
>>> s/got/gets/ --- at least, that is what the gmail grammar spelling suggests :)
>>
>> Hmm, LanguageTool didn't complain. Not totally sure, maybe both
>> approaches are okay. But the variant suggested by the gmail checker
>> might be the better one.
> 
> So we're deeply into nit territory, but "gets" would be the correct
> tense there.  Even better, though, is to avoid using "to get" in this
> way at all.  I'm informed that "to get" is one of the hardest verbs for
> non-native speakers, well, to get, so I try to avoid it in my own
> writing.  "once the culprit is identified" or "has been identified"
> would both be good here.

Agreed. Any uses of the verb get/got are best avoided.

-- 
~Randy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 1/2] docs: add a document about regression handling
  2022-01-04 14:17   ` Lukas Bulwahn
@ 2022-01-04 17:57     ` Thorsten Leemhuis
  2022-01-05  8:45       ` Lukas Bulwahn
  0 siblings, 1 reply; 15+ messages in thread
From: Thorsten Leemhuis @ 2022-01-04 17:57 UTC (permalink / raw)
  To: Lukas Bulwahn
  Cc: open list:DOCUMENTATION, Linus Torvalds, Greg Kroah-Hartman,
	workflows, Linux Kernel Mailing List, Randy Dunlap,
	Jonathan Corbet


On 04.01.22 15:17, Lukas Bulwahn wrote:
> On Mon, Jan 3, 2022 at 3:23 PM Thorsten Leemhuis <linux@leemhuis.info> wrote:
>>
>> Create a document explaining various aspects around regression handling
>> and tracking both for users and developers. Among others describe the
>> first rule of Linux kernel development and what it means in practice.
>> Also explain what a regression actually is and how to report them
>> properly. The text additionally provides a brief introduction to the bot
>> the kernel's regression tracker users to facilitate the work. To sum
>> things up, provide a few quotes from Linus to show how serious the he
>> takes regressions.
>>
>> [...]
>
> [lots of helpful suggestions for fixes and small improvements]

Many thx, addressed all of them, not worth commenting on each of them
individually.


>> +What is the goal of the 'no regressions rule'?
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Users should feel safe when updating kernel versions and not have to worry
>> +something might break. This is in the interest of the kernel developers to make
>> +updating attractive: they don't want users to stay on stable or longterm Linux
>> +series either abandoned or more than one and a half year old, as `those might
>> +have known problems, security issues, or other aspects already improved in later
>> +versions
>> +<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
>> +
> Maybe add something like this:
> 
> A larger user community means more exposure and more confidence that
> any critical bug introduced is likely to be found closer to the point
> in time it was introduced, and hence the shipped kernels have less
> critical bugs.
> 
> Just to close the line of thought here.

Hmmm. How about this instead:

The kernel developers also want to make it simple and appealing for
users to test the latest (pre-)release, as it's a lot easier to track
down and fix problems, if they are reported shortly after being introduced.
> Okay, that is how far I got reading for now.

Great, many thx for your help, much appreciated. FWIW, find below the
current version of the plain text which contains a few more fixes. Note,
thunderbird will insert wrong line breaks here.

Ciao, Thorsten



Does it qualify as a regression if a newer kernel works slower or makes
the system consume more energy?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It does, but the difference has to be significant. A five percent
slow-down in a micro-benchmark thus is unlikely to qualify as
regression, unless it also influences the results of a broad benchmark
by more than one percent. If in doubt, ask for advice.

Is it a regression, if an externally developed kernel module is
incompatible with a newer kernel?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No, as the 'no regression' rule is about interfaces and services the
Linux kernel provides to the userland. It thus does not cover building
or running externally developed kernel modules, as they run in
kernel-space and use occasionally changed internal interfaces to hook
into the kernel.

How are regressions handled that are caused by a fix for security
vulnerability?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In extremely rare situations security issues can't be fixed without
causing regressions; those are given way, as they are the lesser evil in
the end. Luckily this almost always can be avoided, as key developers
for the affected area and often Linus Torvalds himself try very hard to
fix security issues without causing regressions.

If you nevertheless face such a case, check the mailing list archives if
people tried their best to avoid the regression; if in doubt, ask for
advice as outlined above.

What happens if fixing a regression is impossible without causing
another regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sadly these things happen, but luckily not very often; if they occur,
expert developers of the affected code area should look into the issue
to find a fix that avoids regressions or at least their impact. If you
run into such a situation you thus do what was outlined already for
regressions caused by security fixes: check earlier discussions if
people already tried their best and ask for advice if in doubt.

A quick note while at it: these situations could be avoided, if you
would regularly give mainline pre-releases (say v5.15-rc1 or -rc3) from
each cycle a test run. This is best explained by imagining a change
integrated between Linux v5.14 and v5.15-rc1 which causes a regression,
but at the same time is a hard requirement for some other improvement
applied for 5.15-rc1. All these changes often can simply be reverted and
the regression thus solved, if someone finds and reports it before 5.15
is released. A few days or weeks later after the release this solution
might become impossible, if some software starts to rely on aspects
introduced by one of the follow-up changes: reverting all changes would
cause regressions for users of said software and thus out of the question.

A feature I relied on was removed months ago, but I only noticed now.
Does that qualify as regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It does, but often it's hard to fix them due to the aspects outlined in
the previous section. It hence needs to be dealt with on a case-by-case
basis; this is another reason why it's in your interest to regularly
test mainline releases.

Does the 'no regression' rule apply if I seem to be the only person in
the world that is affected by a regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It does, but only for practical usage: the Linux developers want to be
free to remove support for hardware only to be found in attics and
museums anymore.

Note, sometimes regressions can't be avoided to make progress -- and the
latter is needed to prevent Linux from stagnation. Hence, if only very
few users seem to be affected by a regression, it for the greater good
might be in their and everyone else's interest to not insist on the
rule. Especially if there is an easy way to circumvent the regression
somehow, for example by updating some software or using a kernel
parameter created just for this purpose.

Does the regression rule apply for code in the staging tree as well?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Not according to the `help text for the configuration option covering
all staging code
<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
which since its early days states::

       Please note that these drivers are under heavy development, may or
       may not work, and may contain userspace interfaces that most likely
       will be changed in the near future.

The staging developers nevertheless often adhere to the 'no regressions'
rule, but sometimes bend it to make progress. That's for example why
some users had to deal with (often negligible) regressions when a WiFi
driver from the staging tree was replaced by a totally different one
written from scratch.

Why do later versions have to be 'compiled with a similar configuration'?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because the Linux kernel developers sometimes integrate changes known to
cause regressions, but make them optional and disable them in the
kernel's default configuration. This trick allows progress, as the 'no
regressions' rule otherwise would lead to stagnation. Consider for
example a new security feature which blocks access to some kernel
interfaces often abused by malware, but at the same time are required to
run a few rarely used applications. The outlined trick makes both camps
happy: people using these applications can leave the new security
feature off, while everyone else can enable it without running into trouble.

How to create a configuration similar to the one of an older kernel?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Start a known-good kernel and configure the newer Linux version with
``make olddefconfig``. This makes the kernel's build scripts pick up the
configuration file (the `.config` file) from the running kernel as base
for the new one you are about to compile; afterwards they set all new
configuration options to their default value, which disables new
features that might cause regressions.

Can I report a regression with vanilla kernels provided by someone else
to the upstream Linux kernel developers?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Only if the newer kernel was compiled with a similar configuration file
as the older one (see above), as your provider might have enabled some
known-to-be incompatible feature in the newer kernel. If in a doubt,
report this problem to the provider and ask for advice.


More details about regressions relevant for developers
------------------------------------------------------

What should I do, if I suspect a change I'm working on might cause
regressions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Evaluate how big the risk of regressions is, for example by performing a
code search in Linux distributions and Git forges. Also consider asking
other developers or projects likely to be affected to evaluate or even
test the proposed change; if problems surface, maybe some middle ground
acceptable for all can be found.

If the risk of regressions in the end seems to be relatively small, go
ahead with the change, but let all involved parties know about the risk.
Hence, make sure your patch description makes this aspect obvious. Once
the change is merged, tell the Linux kernel's regression tracker and the
regressions mailing list about the risk, so everyone has the change on
the radar in case reports trickle in. Depending on the risk, you also
might want to ask the subsystem maintainer to mention the issue in his
pull request to mainline.


Everything developers need to know about regression tracking
------------------------------------------------------------

Do I have to use regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~

It's in the interest of everyone if you do, as kernel maintainers like
Linus Torvalds partly rely on regzbot's tracking in their work -- for
example when deciding to release a new version or extend the development
phase. For this they need to be aware of all unfixed regression; to do
that, Linus is known to look into the weekly reports sent by regzbot.

Do I have to tell regzbot about every regression I stumble upon?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ideally yes: we are all humans and easily forget problems when something
more important unexpectedly comes up -- for example a bigger problem in
the Linux kernel or something in real life that's keeping us away from
keyboards for a while. Hence, it's best to tell regzbot about every
regression, except when you immediately write a fix and commit it to a
tree regularly merged to the affected kernel series.

Why does the Linux kernel need a regression tracker, and why does he
utilize regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Rules like 'no regressions' need someone to enforce them, otherwise they
are broken either accidentally or on purpose. History has shown that
this is true for the Linux kernel as well. That's why Thorsten
volunteered to keep an eye on things.

Tracking regressions completely manually has proven to be exhausting and
demotivating, which is why earlier attempts to establish it failed after
a while. To prevent this from happening again, Thorsten developed
Regzbot to facilitate the work, with the long term goal to automate
regression tracking as much as possible for everyone involved.

How does regression tracking work with regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The bot keeps track of all the reports and monitors their fixing
progress. It tries to do that with as little overhead as possible for
both reporters and developers.

In fact, only reporters or someone helping them are burdened with an
extra duty: they need to tell regzbot about the regression report using
one of the ``#regzbot introduced`` commands outlined above.

For developers there normally is no extra work involved, they just need
to do something that's expected from them already: add 'Link:' tags to
the patch description pointing to all reports about the issue fixed.

Thanks to these tags regzbot can associate regression reports with
patches to fix the issue, whenever they are posted for review or applied
to a git tree. The bot additionally watches out for replies to the
report. All this data combined provides a good impression about the
current status of the fixing process.

How to see which regressions regzbot tracks currently?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check `regzbot's web-interface
<https://linux-regtracking.leemhuis.info/regzbot/>`_ for the latest
info; alternatively, `search for the latest regression report
<https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
which regzbot normally sends out once a week on Sunday evening (UTC),
which is a few hours before Linus usually publishes new (pre-)releases.

What places is regzbot monitoring?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Regzbot is watching the most important Linux mailing lists as well as
the linux-next, mainline and stable/longterm git repositories.

How to interact with regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Everyone can interact with the bot using mails containing `regzbot
commands`, which need to be in their own paragraph (IOW: they need to be
separated from the rest of the mail using blank lines). One such command
is ``#regzbot introduced <version or commit>``, which adds a report to
the tracking, as already described above; ``#regzbot ^introduced
<version or commit>`` is another such command, which makes regzbot
consider the parent mail as a report for a regression which it starts to
track.

Once one of those two commands has been utilized, other regzbot commands
can be used. You can write them below one of the `introduced` commands
or in replies to the mail that used one of them or itself is a reply to
that mail:

 * Set or update the title::

       #regzbot title: foo

 * Link to a related discussion (for example the posting of a patch to
fix the issue) and monitor it::

       #regzbot monitor:
https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/

   Monitoring only works for lore.kernel.org; regzbot will consider all
messages in that thread as related to the fixing process.

 * Point to a place with further details, like a bug tracker or a
related mailing list post::

       #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789

 * Mark a regression as fixed by a commit that is heading upstream or
already landed::

       #regzbot fixed-by: 1f2e3d4c5d

 * Mark a regression as a duplicate of another one already tracked by
regzbot::

       #regzbot dup-of:
https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/

 * Mark a regression as invalid::

       #regzbot invalid: wasn't a regression, problem has always existed

Is there more to tell about regzbot and its commands?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

More detailed and up-to-date information about the Linux kernels
regression tracking bot can be found on its `project page
<https://gitlab.com/knurd42/regzbot>`_, which among others contains a
`getting started guide
<https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_
and `reference documentation
<https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
which both are more in-depth.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 2/2] docs: regressions.rst: rules of thumb for handling regressions
  2022-01-04 15:09         ` Randy Dunlap
@ 2022-01-04 18:02           ` Thorsten Leemhuis
  0 siblings, 0 replies; 15+ messages in thread
From: Thorsten Leemhuis @ 2022-01-04 18:02 UTC (permalink / raw)
  To: Randy Dunlap, Jonathan Corbet, Lukas Bulwahn
  Cc: open list:DOCUMENTATION, Linus Torvalds, Greg Kroah-Hartman,
	workflows, Linux Kernel Mailing List

On 04.01.22 16:09, Randy Dunlap wrote:
> On 1/4/22 06:42, Jonathan Corbet wrote:
>> Thorsten Leemhuis <linux@leemhuis.info> writes:
>>
>>> On 04.01.22 13:16, Lukas Bulwahn wrote:
>>>> On Mon, Jan 3, 2022 at 3:23 PM Thorsten Leemhuis <linux@leemhuis.info> wrote:
>>>>> +Try to fix regressions quickly once the culprit got identified. Fixes for most
>>>>
>>>> s/got/gets/ --- at least, that is what the gmail grammar spelling suggests :)
>>>
>>> Hmm, LanguageTool didn't complain. Not totally sure, maybe both
>>> approaches are okay. But the variant suggested by the gmail checker
>>> might be the better one.
>>
>> So we're deeply into nit territory, but "gets" would be the correct
>> tense there.  Even better, though, is to avoid using "to get" in this
>> way at all.  I'm informed that "to get" is one of the hardest verbs for
>> non-native speakers, well, to get, so I try to avoid it in my own
>> writing.  "once the culprit is identified" or "has been identified"
>> would both be good here.
> 
> Agreed. Any uses of the verb get/got are best avoided.

Ahh, good to known, thx to both of you. I guess my English teachers
tried to put that into my head like 30 years ago, but I assume the lossy
compression algorithm in there threw it away...

Went through the document and removed all get/got, was not that hard
most of the time.

Ciao, Thorsten




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 1/2] docs: add a document about regression handling
  2022-01-04 17:57     ` Thorsten Leemhuis
@ 2022-01-05  8:45       ` Lukas Bulwahn
  0 siblings, 0 replies; 15+ messages in thread
From: Lukas Bulwahn @ 2022-01-05  8:45 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: open list:DOCUMENTATION, Linus Torvalds, Greg Kroah-Hartman,
	workflows, Linux Kernel Mailing List, Randy Dunlap,
	Jonathan Corbet

On Tue, Jan 4, 2022 at 6:57 PM Thorsten Leemhuis <linux@leemhuis.info> wrote:
>
>
> On 04.01.22 15:17, Lukas Bulwahn wrote:
> > On Mon, Jan 3, 2022 at 3:23 PM Thorsten Leemhuis <linux@leemhuis.info> wrote:
> >>
> >> Create a document explaining various aspects around regression handling
> >> and tracking both for users and developers. Among others describe the
> >> first rule of Linux kernel development and what it means in practice.
> >> Also explain what a regression actually is and how to report them
> >> properly. The text additionally provides a brief introduction to the bot
> >> the kernel's regression tracker users to facilitate the work. To sum
> >> things up, provide a few quotes from Linus to show how serious the he
> >> takes regressions.
> >>
> >> [...]
> >
> > [lots of helpful suggestions for fixes and small improvements]
>
> Many thx, addressed all of them, not worth commenting on each of them
> individually.
>
>
> >> +What is the goal of the 'no regressions rule'?
> >> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> +
> >> +Users should feel safe when updating kernel versions and not have to worry
> >> +something might break. This is in the interest of the kernel developers to make
> >> +updating attractive: they don't want users to stay on stable or longterm Linux
> >> +series either abandoned or more than one and a half year old, as `those might
> >> +have known problems, security issues, or other aspects already improved in later
> >> +versions
> >> +<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
> >> +
> > Maybe add something like this:
> >
> > A larger user community means more exposure and more confidence that
> > any critical bug introduced is likely to be found closer to the point
> > in time it was introduced, and hence the shipped kernels have less
> > critical bugs.
> >
> > Just to close the line of thought here.
>
> Hmmm. How about this instead:
>
> The kernel developers also want to make it simple and appealing for
> users to test the latest (pre-)release, as it's a lot easier to track
> down and fix problems, if they are reported shortly after being introduced.

Yes, your sentence conveys the same point and is much more down to
earth. My sentence looks much more "academic".

> > Okay, that is how far I got reading for now.
>
> Great, many thx for your help, much appreciated. FWIW, find below the
> current version of the plain text which contains a few more fixes. Note,
> thunderbird will insert wrong line breaks here.
>
> Ciao, Thorsten
>

All good, I will wait until the next version of this patch series shows up.

Lukas

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-01-05  8:46 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-03  9:50 [RFC PATCH v1 0/2] docs: add a document dedicated to regressions Thorsten Leemhuis
2022-01-03  9:50 ` [RFC PATCH v1 1/2] docs: add a document about regression handling Thorsten Leemhuis
2022-01-03 17:07   ` Jakub Kicinski
2022-01-03 17:20     ` Thorsten Leemhuis
2022-01-03 17:55       ` Jakub Kicinski
2022-01-04 14:17   ` Lukas Bulwahn
2022-01-04 17:57     ` Thorsten Leemhuis
2022-01-05  8:45       ` Lukas Bulwahn
2022-01-03  9:50 ` [RFC PATCH v1 2/2] docs: regressions.rst: rules of thumb for handling regressions Thorsten Leemhuis
2022-01-04 12:16   ` Lukas Bulwahn
2022-01-04 13:29     ` Thorsten Leemhuis
2022-01-04 14:42       ` Jonathan Corbet
2022-01-04 15:09         ` Randy Dunlap
2022-01-04 18:02           ` Thorsten Leemhuis
2022-01-03 14:01 ` [RFC PATCH v1 0/2] docs: add a document dedicated to regressions Greg Kroah-Hartman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.