All of lore.kernel.org
 help / color / mirror / Atom feed
* RFC - kernel test result specification (KTAP)
@ 2021-08-10 23:25 Rae Moar
  2021-08-11 20:53 ` Brendan Higgins
  2021-08-28  8:20 ` Kees Cook
  0 siblings, 2 replies; 14+ messages in thread
From: Rae Moar @ 2021-08-10 23:25 UTC (permalink / raw)
  To: Brendan Higgins, David Gow, Tim.Bird, shuah, Daniel Latypov
  Cc: kunit-dev, linux-kselftest, linux-kernel

We are looking to further standardise the output format used by kernel
test frameworks like kselftest and KUnit. Thus far we have used the
TAP (Test Anything Protocol) specification, but it has been extended
in many different ways, so we would like to agree on a common "Kernel
TAP" (KTAP) format to resolve these differences. Thus, below is a
draft of a specification of KTAP. Note that this specification is
largely based on the current format of test results for KUnit tests.

Additionally, this specification was heavily inspired by the KTAP
specification draft by Tim Bird
(https://lore.kernel.org/linux-kselftest/CY4PR13MB1175B804E31E502221BC8163FD830@CY4PR13MB1175.namprd13.prod.outlook.com/T/).
However, there are some notable differences to his specification. One
such difference is the format of nested tests is more fully specified
in the following specification. However, they are specified in a way
which may not be compatible with many kselftest nested tests.

=====================
Specification of KTAP
=====================

TAP, or the Test Anything Protocol is a format for specifying test
results used by a number of projects. It's website and specification
are found at: https://testanything.org/. The Linux Kernel uses TAP
output for test results. However, KUnit (and other Kernel testing
frameworks such as kselftest) have some special needs for test results
which don't gel perfectly with the original TAP specification. Thus, a
"Kernel TAP" (KTAP) format is specified to extend and alter TAP to
support these use-cases.

KTAP Output consists of 5 major elements (all line-based):
- The version line
- Plan lines
- Test case result lines
- Diagnostic lines
- A bail out line

An important component in this specification of KTAP is the
specification of the format of nested tests. This can be found in the
section on nested tests below.

The version line
----------------

The first line of KTAP output must be the version line. As this
specification documents the first version of KTAP,  the recommended
version line is "KTAP version 1". However, since all kernel testing
frameworks use TAP version lines, "TAP version 14" and "TAP version
13" are all acceptable version lines. Version lines with other
versions of TAP or KTAP will not cause the parsing of the test results
to fail but it will produce an error.

Plan lines
----------

Plan lines must follow the format of "1..N" where N is the number of
subtests. The second line of KTAP output must be a plan line, which
indicates the number of tests at the highest level, such that the
tests do not have a parent. Also, in the instance of a test having
subtests, the second line of the test after the subtest header must be
a plan line which indicates the number of subtests within that test.

Test case result lines
----------------------

Test case result lines must have the format:

    <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]

The result can be either "ok", which indicates the test case passed,
or "not ok", which indicates that the test case failed.

The number represents the number of the test case or suite being
performed. The first test case or suite must have the number 1 and the
number must increase by 1 for each additional test case or result at
the same level and within the same testing suite.

The "-" character is optional.

The description is a description of the test, generally the name of
the test, and can be any string of words (can't include #). The
description is optional.

The directive is used to indicate if a test was skipped. The format
for the directive is: "# SKIP [<skip_description>]". The
skip_description is optional and can be any string of words to
describe why the test was skipped. The result of the test case result
line can be either "ok" or "not ok" if the skip directive is used.
Finally, note that TAP 14 specification includes TODO directives but
these are not supported for KTAP.

Examples of test case result lines:

Test passed:

    ok 1 - test_case_name

Test was skipped:

    not ok 1 - test_case_name # SKIP test_case_name should be skipped

Test failed:

    not_ok 1 - test_case_name

Diagnostic lines
----------------

Diagnostic lines are used for description of testing operations.
Diagnostic lines are generally formatted as "#
<diagnostic_description>", where the description can be any string.
However, in practice, diagnostic lines are all lines that don't follow
the format of any other KTAP line format. Diagnostic lines can be
anywhere in the test output after the first two lines. There are a few
special diagnostic lines. Diagnostic lines of the format "# Subtest:
<test_name>" indicate the start of a test with subtests. Also,
diagnostic lines of the format "# <test_name>: <description>" refer to
a specific test and tend to occur before the test result line of that
test but are optional.

Bail out line
-------------

A bail out line can occur anywhere in the KTAP output and will
indicate that a test has crashed. The format of a bail out line is
"Bail out! [<description>]",  where the description can give
information on why the bail out occurred and can be any string.

Nested tests
------------

The new specification for KTAP will support an arbitrary number of
nested subtests. Thus, tests can now have subtests and those subtests
can have subtests. This can be useful to further categorize tests and
organize test results.

The new required format for a test with subtests consists of: a
subtest header line, a plan line, all subtests, and a final test
result line.

The first line of the test must be the subtest header line with the
format: "# Subtest: <test_name>".

The second line of the test must be the plan line, which is formatted
as "1..N", where N is the number of subtests.

Following the plan line, all lines pertaining to the subtests will follow.

Finally, the last line of the test is a final test result line with
the format: "(ok|not ok) <number> [-] [<description>] [<directive>]
[<diagnostic data>]", which follows the same format as the general
test result lines described in this section. The result line should
indicate the result of the subtests. Thus, if one of the subtests
fail, the test should fail. The description in the final test result
line should match the name of the test in the subtest header.

An example format:

KTAP version 1
1..1
    # Subtest: test_suite
    1..2
    ok 1 - test_1
    ok 2 - test_2
ok 1 - test_suite

An example format with multiple levels of nested testing:

KTAP version 1
1..1
    # Subtest: test_suite
    1..2
        # Subtest: sub_test_suite
        1..2
        ok 1 - test_1
        ok 2 test_2
    ok 1 - sub_test_suite
    ok 2 - test
ok 1 - test_suite

In the instance that the plan line is missing, the end of the test
will be denoted by the final result line containing a description that
matches the name of the test given in the subtest header. Note that
thus, if the plan line is missing and one of the subtests have a
matching name to the test suite this will cause errors.

Lastly, indentation is also recommended for improved readability.


Major differences between TAP 14 and KTAP specification
-------------------------------------------------------

Note that the major differences between TAP 14 and KTAP specification:
- yaml and json are not allowed in diagnostic messages
- TODO directive not allowed
- KTAP allows for an arbitrary number of tests to be nested with
specified nested test format

Example of KTAP
---------------

KTAP version 1
1..1
    # Subtest: test_suite
    1..1
        # Subtest: sub_test_suite
        1..2
        ok 1 - test_1
        ok 2 test_2
    ok 1 - sub_test_suite
ok 1 - test_suite

=========================================
Note on incompatibilities with kselftests
=========================================

To my knowledge, the above specification seems to generally accept the
TAP format of many non-nested test results of kselftests.

An example of a common kselftests TAP format for non-nested test
results that are accepted by the above specification:

TAP version 13
1..2
# selftests: vDSO: vdso_test_gettimeofday
# The time is 1628024856.096879
ok 1 selftests: vDSO: vdso_test_gettimeofday
# selftests: vDSO: vdso_test_getcpu
# Could not find __vdso_getcpu
ok 2 selftests: vDSO: vdso_test_getcpu # SKIP

However, one major difference noted with kselftests is the use of more
directives than the "# SKIP" directive. kselftest also supports XPASS
and XFAIL directives. Some additional examples found in kselftests:

    not ok 5 selftests: netfilter: nft_concat_range.sh # TIMEOUT 45 seconds

    not ok 45 selftests: kvm: kvm_binary_stats_test # exit=127

Should the specification be expanded to include these directives?

However, the general format for kselftests with nested test results
seems to differ from the above specification. It seems that a general
format for nested tests is as follows:

TAP version 13
1..2
# selftests: membarrier: membarrier_test_single_thread
# TAP version 13
# 1..2
# ok 1 sys_membarrier available
# ok 2 sys membarrier invalid command test: command = -1, flags = 0,
errno = 22. Failed as expected
ok 1 selftests: membarrier: membarrier_test_single_thread
# selftests: membarrier: membarrier_test_multi_thread
# TAP version 13
# 1..2
# ok 1 sys_membarrier available
# ok 2 sys membarrier invalid command test: command = -1, flags = 0,
errno = 22. Failed as expected
ok 2 selftests: membarrier: membarrier_test_multi_thread

The major differences here, that do not match the above specification,
are use of "# " as an indentation and using a TAP version line to
denote a new test with subtests rather than the subtest header line
described above. If these are widely utilized formats in kselftests,
should we include both versions in the specification or should we
attempt to agree on a single format for nested tests? I personally
believe we should try to agree on a single format for nested tests.
This would allow for a cleaner specification of KTAP and would reduce
possible confusion.

====

So what do people think about the above specification?
How should we handle the differences with kselftests?
If this specification is accepted, where should the specification be documented?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC - kernel test result specification (KTAP)
  2021-08-10 23:25 RFC - kernel test result specification (KTAP) Rae Moar
@ 2021-08-11 20:53 ` Brendan Higgins
  2021-08-28  8:20 ` Kees Cook
  1 sibling, 0 replies; 14+ messages in thread
From: Brendan Higgins @ 2021-08-11 20:53 UTC (permalink / raw)
  To: Rae Moar
  Cc: David Gow, Tim.Bird, shuah, Daniel Latypov, kunit-dev,
	linux-kselftest, linux-kernel

On Tue, Aug 10, 2021 at 4:25 PM Rae Moar <rmoar@google.com> wrote:
>
> We are looking to further standardise the output format used by kernel
> test frameworks like kselftest and KUnit. Thus far we have used the
> TAP (Test Anything Protocol) specification, but it has been extended
> in many different ways, so we would like to agree on a common "Kernel
> TAP" (KTAP) format to resolve these differences. Thus, below is a
> draft of a specification of KTAP. Note that this specification is
> largely based on the current format of test results for KUnit tests.
>
> Additionally, this specification was heavily inspired by the KTAP
> specification draft by Tim Bird
> (https://lore.kernel.org/linux-kselftest/CY4PR13MB1175B804E31E502221BC8163FD830@CY4PR13MB1175.namprd13.prod.outlook.com/T/).
> However, there are some notable differences to his specification. One
> such difference is the format of nested tests is more fully specified
> in the following specification. However, they are specified in a way
> which may not be compatible with many kselftest nested tests.

Thanks for putting this together, Rae! I would definitely like to see
us moving forward with standardizing fully on the KTAP spec.

I am surprised to not yet see any comments here.

I was thinking that this might make a good BoF topic at LPC, but I
guess that depends on how interested people are in this topic.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC - kernel test result specification (KTAP)
  2021-08-10 23:25 RFC - kernel test result specification (KTAP) Rae Moar
  2021-08-11 20:53 ` Brendan Higgins
@ 2021-08-28  8:20 ` Kees Cook
  2021-08-30 17:34   ` Brendan Higgins
                     ` (3 more replies)
  1 sibling, 4 replies; 14+ messages in thread
From: Kees Cook @ 2021-08-28  8:20 UTC (permalink / raw)
  To: Rae Moar
  Cc: Brendan Higgins, David Gow, Tim.Bird, Shuah Khan, Daniel Latypov,
	kunit-dev, linux-kselftest, linux-kernel, kernelci,
	Guillaume Tucker

[+kernelci, +Guillaume]

Hi!

Please keep me in CC for these kinds of proposals. And thanks for looking
at this again! Please understand that while I may be coming across as
rather negative here, I would like to have a rational and documented
specification for the kernel's test output, too. My main objection here
is that we already _have_ a specification, and it's already being parsed
by machines, so making changes without strong justification is going to
be met with resistance. :) So, with that in mind, on to my reply...

On Tue, Aug 10, 2021 at 04:25:03PM -0700, Rae Moar wrote:
> We are looking to further standardise the output format used by kernel
> test frameworks like kselftest and KUnit. Thus far we have used the
> TAP (Test Anything Protocol) specification, but it has been extended
> in many different ways, so we would like to agree on a common "Kernel
> TAP" (KTAP) format to resolve these differences. Thus, below is a
> draft of a specification of KTAP. Note that this specification is
> largely based on the current format of test results for KUnit tests.

The kernel's largest producer of TAP is kselftest, and the largest
consumer is LAVA[1]. I would want buy-in from at least those responsible
for either side of those two things. (And I'm one of the people working
on both sides of it.)

The fundamental purpose of the kernel's TAP is to have many independent
tests runnable and parseable, specifically without any embedded framework
knowledge (or, even, any knowledge of TAP).

The tests run by kselftest come from 2 different TAP-producing harnesses
(kselftest.h for C, kselftest/runner.sh for TAP-agnostic tests) as well
as several hand-rolled instances in Shell, Python, and C. There used to
be more, but I have been steadily fixing their syntax[2] and merging[3]
separate implementations for a while now.

[1] https://github.com/Linaro/test-definitions/commit/8bd338bbcfa5a03efcf1d12e25b5d341d5a29cbc
[2] https://git.kernel.org/linus/b0df366bbd701c45e93af0dcb87ce22398589d1d
[3] https://git.kernel.org/linus/e80068be21824e4d2726eeea41cac6178d212415

> Additionally, this specification was heavily inspired by the KTAP
> specification draft by Tim Bird
> (https://lore.kernel.org/linux-kselftest/CY4PR13MB1175B804E31E502221BC8163FD830@CY4PR13MB1175.namprd13.prod.outlook.com/T/).
> However, there are some notable differences to his specification. One
> such difference is the format of nested tests is more fully specified
> in the following specification. However, they are specified in a way
> which may not be compatible with many kselftest nested tests.

I commented extensively on that thread. :)

> 
> =====================
> Specification of KTAP
> =====================
> 
> TAP, or the Test Anything Protocol is a format for specifying test
> results used by a number of projects. It's website and specification
> are found at: https://testanything.org/. The Linux Kernel uses TAP
> output for test results. However, KUnit (and other Kernel testing
> frameworks such as kselftest) have some special needs for test results
> which don't gel perfectly with the original TAP specification. Thus, a
> "Kernel TAP" (KTAP) format is specified to extend and alter TAP to
> support these use-cases.
> 
> KTAP Output consists of 5 major elements (all line-based):
> - The version line
> - Plan lines
> - Test case result lines

These are required.

> - Diagnostic lines

This is optional.

> - A bail out line

Bail out should be optional, and possibly not included at all. (See
below.)

> 
> An important component in this specification of KTAP is the
> specification of the format of nested tests. This can be found in the
> section on nested tests below.
> 
> The version line
> ----------------
> 
> The first line of KTAP output must be the version line. As this
> specification documents the first version of KTAP,  the recommended
> version line is "KTAP version 1". However, since all kernel testing
> frameworks use TAP version lines, "TAP version 14" and "TAP version
> 13" are all acceptable version lines. Version lines with other
> versions of TAP or KTAP will not cause the parsing of the test results
> to fail but it will produce an error.

Maybe "TAP version Linux.1" ? I don't want to needlessly break existing
parsers.

> 
> Plan lines
> ----------
> 
> Plan lines must follow the format of "1..N" where N is the number of
> subtests. The second line of KTAP output must be a plan line, which
> indicates the number of tests at the highest level, such that the
> tests do not have a parent. Also, in the instance of a test having
> subtests, the second line of the test after the subtest header must be
> a plan line which indicates the number of subtests within that test.

I do not want "subtests" as a specification concept, only "nested
tests".

> 
> Test case result lines
> ----------------------
> 
> Test case result lines must have the format:
> 
>     <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]

"[<diagnostic data>]" is not defined below. I think what you mean is
"directive details".

I suggest:

<result> <number>[ [- ]<description>][# <directive> [<directive details>]]

> The result can be either "ok", which indicates the test case passed,
> or "not ok", which indicates that the test case failed.

Yes.

> 
> The number represents the number of the test case or suite being
> performed. The first test case or suite must have the number 1 and the
> number must increase by 1 for each additional test case or result at
> the same level and within the same testing suite.

Yes, though parsers should handle ordering failures and missed test
results (this is the purpose of the "plan" line).

> 
> The "-" character is optional.
> 
> The description is a description of the test, generally the name of
> the test, and can be any string of words (can't include #). The
> description is optional.

Yes, though the optional "- " is strictly part of the optional
description.

> The directive is used to indicate if a test was skipped. The format

The directive is a single word -- currently only "SKIP" is recognized
by TAP 14(?), but kselftest uses also XFAIL, XPASS, TIMEOUT, and
error. (Though I would love to hear what "error" is intended to be used
for, if different from regular "not ok".)

(This could also be called "context" rather than "directive".)

> for the directive is: "# SKIP [<skip_description>]". The
> skip_description is optional and can be any string of words to
> describe why the test was skipped.

I would call this "directive details".

> The result of the test case result
> line can be either "ok" or "not ok" if the skip directive is used.
> Finally, note that TAP 14 specification includes TODO directives but
> these are not supported for KTAP.
> 
> Examples of test case result lines:
> 
> Test passed:
> 
>     ok 1 - test_case_name
> 
> Test was skipped:
> 
>     not ok 1 - test_case_name # SKIP test_case_name should be skipped
> 
> Test failed:
> 
>     not_ok 1 - test_case_name

This isn't valid. No "_" is recognized for "ok" vs "not ok".

> 
> Diagnostic lines
> ----------------
> 
> Diagnostic lines are used for description of testing operations.
> Diagnostic lines are generally formatted as "#
> <diagnostic_description>", where the description can be any string.
> However, in practice, diagnostic lines are all lines that don't follow
> the format of any other KTAP line format.

I still think there should be a distinction between "diagnostic lines"
and "unknown lines". For example, if kselftest is running on a console,
the dmesg output may be intermixed, and possibly temporally offset. Such
lines may not be useful, and may not be strictly correlated, where-as
the output from kselftest is at least self-consistent.

> Diagnostic lines can be
> anywhere in the test output after the first two lines.

I don't see a reason for this strictness. They can be anywhere.

> There are a few
> special diagnostic lines.

No; diagnostic lines must have no meaning: they are for humans and nothing
else. If other details are needed for machines, we should explicitly
create new format lines. I made a mistake when I used a diagnostic line
for embedding the test names. :( There is a need for parsers to discover
the name of a given test, though, so we do likely need something for this.

> Diagnostic lines of the format "# Subtest:
> <test_name>" indicate the start of a test with subtests. Also,
> diagnostic lines of the format "# <test_name>: <description>" refer to
> a specific test and tend to occur before the test result line of that
> test but are optional.

I don't think the above should be included in the spec -- diag lines
should have no parseable meaning.

> 
> Bail out line
> -------------
> 
> A bail out line can occur anywhere in the KTAP output and will
> indicate that a test has crashed. The format of a bail out line is
> "Bail out! [<description>]",  where the description can give
> information on why the bail out occurred and can be any string.

I'm not a fan of the Bail out line. It's not a problem, exactly, but I
find it redundant. If we want an "end of test" line, let's make one.
"Bail out" is a special case of exit. If we want to handle test exit,
let's define it. E.g. make kselftest's test summary lines not
diagnostic lines:

# FAILED: 85 / 87 tests passed.
# Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0

Also, parsers should treat a new "TAP version..." line at the same
nesting level as indication that the prior test has ended (and any tests
without result lines should be considered failed).

> 
> Nested tests
> ------------
> 
> The new specification for KTAP will support an arbitrary number of
> nested subtests. Thus, tests can now have subtests and those subtests
> can have subtests. This can be useful to further categorize tests and
> organize test results.
> 
> The new required format for a test with subtests consists of: a
> subtest header line, a plan line, all subtests, and a final test
> result line.
> 
> The first line of the test must be the subtest header line with the
> format: "# Subtest: <test_name>".

Please no. There is no reason to force a nested test to suddenly have
to know about its test execution depth/environment. A subtest is just a
nested TAP result. That it is nested is only meaningful to the parser, not
the executing test, and it must have the same output, nested or not. (e.g.
running the test alone or running the test under a full kselftest run,
the only difference is the indentation level.)

> The second line of the test must be the plan line, which is formatted
> as "1..N", where N is the number of subtests.
> 
> Following the plan line, all lines pertaining to the subtests will follow.
> 
> Finally, the last line of the test is a final test result line with
> the format: "(ok|not ok) <number> [-] [<description>] [<directive>]
> [<diagnostic data>]", which follows the same format as the general
> test result lines described in this section. The result line should
> indicate the result of the subtests. Thus, if one of the subtests
> fail, the test should fail. The description in the final test result
> line should match the name of the test in the subtest header.
> 
> An example format:
> 
> KTAP version 1
> 1..1
>     # Subtest: test_suite
>     1..2
>     ok 1 - test_1
>     ok 2 - test_2
> ok 1 - test_suite

Again, I see only downsides to this. Nesting for the spec is simple
indentation-based recursion. Let's just keep what we already have:

TAP version 13
1..1
# TAP version 13
# 1..2
# ok 1 - test_1
# ok 2 - test_2
ok 1 - test_suite


> An example format with multiple levels of nested testing:
> 
> KTAP version 1
> 1..1
>     # Subtest: test_suite
>     1..2
>         # Subtest: sub_test_suite
>         1..2
>         ok 1 - test_1
>         ok 2 test_2
>     ok 1 - sub_test_suite
>     ok 2 - test
> ok 1 - test_suite
> 
> In the instance that the plan line is missing, the end of the test
> will be denoted by the final result line containing a description that
> matches the name of the test given in the subtest header. Note that
> thus, if the plan line is missing and one of the subtests have a
> matching name to the test suite this will cause errors.

A plan line is required. No special cases are needed. :)

If nesting level is lost, a parser will understand the nested test to
have ended, but probably so did the test runner:

TAP version 13
1..1
    TAP version 13
    1..2
        TAP version 13
        1..3
        ok 1 - test_1
        ok 2 test_2
    not ok 1 - sub test unfinished plan
    ok 2 - test
not ok 1 - test_suite

> Lastly, indentation is also recommended for improved readability.

Indentation is _required_. :)

Whether this is "# " or "  " I don't really care, as long as the change
is coordinated. "  " is easier for humans to read, but may make parsing of
"unknown" lines more difficult for machines.

> 
> Major differences between TAP 14 and KTAP specification
> -------------------------------------------------------
> 
> Note that the major differences between TAP 14 and KTAP specification:
> - yaml and json are not allowed in diagnostic messages

Agreed -- these are overkill (and very difficult to implement as they
_follow_ the test result line: anything generating them has already
finished running).

> - TODO directive not allowed

I would just say "unrecognized".

> - KTAP allows for an arbitrary number of tests to be nested with
> specified nested test format

Yup.

> 
> Example of KTAP
> ---------------
> 
> KTAP version 1
> 1..1
>     # Subtest: test_suite
>     1..1
>         # Subtest: sub_test_suite
>         1..2
>         ok 1 - test_1
>         ok 2 test_2
>     ok 1 - sub_test_suite
> ok 1 - test_suite

For a good example, please include all the possible combinations (SKIP,
not ok, diagnostics, etc etc)

> 
> =========================================
> Note on incompatibilities with kselftests
> =========================================
> 
> To my knowledge, the above specification seems to generally accept the
> TAP format of many non-nested test results of kselftests.
> 
> An example of a common kselftests TAP format for non-nested test
> results that are accepted by the above specification:
> 
> TAP version 13
> 1..2
> # selftests: vDSO: vdso_test_gettimeofday
> # The time is 1628024856.096879
> ok 1 selftests: vDSO: vdso_test_gettimeofday
> # selftests: vDSO: vdso_test_getcpu
> # Could not find __vdso_getcpu
> ok 2 selftests: vDSO: vdso_test_getcpu # SKIP
> 
> However, one major difference noted with kselftests is the use of more
> directives than the "# SKIP" directive. kselftest also supports XPASS
> and XFAIL directives. Some additional examples found in kselftests:
> 
>     not ok 5 selftests: netfilter: nft_concat_range.sh # TIMEOUT 45 seconds
> 
>     not ok 45 selftests: kvm: kvm_binary_stats_test # exit=127
> 
> Should the specification be expanded to include these directives?

Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
be reported without the '#')

> 
> However, the general format for kselftests with nested test results
> seems to differ from the above specification. It seems that a general
> format for nested tests is as follows:
> 
> TAP version 13
> 1..2
> # selftests: membarrier: membarrier_test_single_thread
> # TAP version 13
> # 1..2
> # ok 1 sys_membarrier available
> # ok 2 sys membarrier invalid command test: command = -1, flags = 0,
> errno = 22. Failed as expected
> ok 1 selftests: membarrier: membarrier_test_single_thread
> # selftests: membarrier: membarrier_test_multi_thread
> # TAP version 13
> # 1..2
> # ok 1 sys_membarrier available
> # ok 2 sys membarrier invalid command test: command = -1, flags = 0,
> errno = 22. Failed as expected
> ok 2 selftests: membarrier: membarrier_test_multi_thread
> 
> The major differences here, that do not match the above specification,
> are use of "# " as an indentation and using a TAP version line to
> denote a new test with subtests rather than the subtest header line
> described above. If these are widely utilized formats in kselftests,
> should we include both versions in the specification or should we
> attempt to agree on a single format for nested tests? I personally
> believe we should try to agree on a single format for nested tests.
> This would allow for a cleaner specification of KTAP and would reduce
> possible confusion.

We already use "# " and the nested TAP lines to denote subtests. Without
a good reason to change it, we should avoid the churn with the existing
parsers.

> ====
> 
> So what do people think about the above specification?
> How should we handle the differences with kselftests?

I'm probably a broken record by now, but kselftests _is_ the
specification. ;) What about it needs changing, and why?

> If this specification is accepted, where should the specification be documented?

I imagine some .rst file under Documentation/dev-tools/, linked to from
kselftest.rst and kunit/...rst

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC - kernel test result specification (KTAP)
  2021-08-28  8:20 ` Kees Cook
@ 2021-08-30 17:34   ` Brendan Higgins
  2021-08-30 19:26     ` Kees Cook
  2021-08-30 17:48   ` Tim.Bird
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Brendan Higgins @ 2021-08-30 17:34 UTC (permalink / raw)
  To: Kees Cook
  Cc: Rae Moar, David Gow, Tim.Bird, Shuah Khan, Daniel Latypov,
	kunit-dev, linux-kselftest, linux-kernel, kernelci,
	Guillaume Tucker

On Sat, Aug 28, 2021 at 1:20 AM Kees Cook <keescook@chromium.org> wrote:
>
> [+kernelci, +Guillaume]
>
> Hi!
>
> Please keep me in CC for these kinds of proposals. And thanks for looking

Oh sorry, we will make sure to do that in the future. I figured most
people who were interested who weren't directly CC'ed were on the
kselftest list.

> at this again! Please understand that while I may be coming across as
> rather negative here, I would like to have a rational and documented
> specification for the kernel's test output, too. My main objection here
> is that we already _have_ a specification, and it's already being parsed

We do?! Where is it? Can you link it here?

> by machines, so making changes without strong justification is going to
> be met with resistance. :) So, with that in mind, on to my reply...

Of course, but if there is no specification or it is not being
universally followed, then changes will need to be made, no?

> On Tue, Aug 10, 2021 at 04:25:03PM -0700, Rae Moar wrote:
> > We are looking to further standardise the output format used by kernel
> > test frameworks like kselftest and KUnit. Thus far we have used the
> > TAP (Test Anything Protocol) specification, but it has been extended
> > in many different ways, so we would like to agree on a common "Kernel
> > TAP" (KTAP) format to resolve these differences. Thus, below is a
> > draft of a specification of KTAP. Note that this specification is
> > largely based on the current format of test results for KUnit tests.
>
> The kernel's largest producer of TAP is kselftest, and the largest
> consumer is LAVA[1]. I would want buy-in from at least those responsible
> for either side of those two things. (And I'm one of the people working
> on both sides of it.)

Of course, that's why we sent it to the kselftest list. In terms of
LAVA as the biggest consumer, thank you for CCing them. My apologies
for failing to do so.

> The fundamental purpose of the kernel's TAP is to have many independent
> tests runnable and parseable, specifically without any embedded framework
> knowledge (or, even, any knowledge of TAP).
>
> The tests run by kselftest come from 2 different TAP-producing harnesses
> (kselftest.h for C, kselftest/runner.sh for TAP-agnostic tests) as well
> as several hand-rolled instances in Shell, Python, and C. There used to
> be more, but I have been steadily fixing their syntax[2] and merging[3]
> separate implementations for a while now.

Yep! And I believe there are still some inconsistencies - hence part
of the motivation.

> [1] https://github.com/Linaro/test-definitions/commit/8bd338bbcfa5a03efcf1d12e25b5d341d5a29cbc
> [2] https://git.kernel.org/linus/b0df366bbd701c45e93af0dcb87ce22398589d1d
> [3] https://git.kernel.org/linus/e80068be21824e4d2726eeea41cac6178d212415
>
> > Additionally, this specification was heavily inspired by the KTAP
> > specification draft by Tim Bird
> > (https://lore.kernel.org/linux-kselftest/CY4PR13MB1175B804E31E502221BC8163FD830@CY4PR13MB1175.namprd13.prod.outlook.com/T/).
> > However, there are some notable differences to his specification. One
> > such difference is the format of nested tests is more fully specified
> > in the following specification. However, they are specified in a way
> > which may not be compatible with many kselftest nested tests.
>
> I commented extensively on that thread. :)

Yep, and if I recall correctly, there was no resolution to that
thread, so we are trying to pick that conversation back up here.

> >
> > =====================
> > Specification of KTAP
> > =====================
> >
> > TAP, or the Test Anything Protocol is a format for specifying test
> > results used by a number of projects. It's website and specification
> > are found at: https://testanything.org/. The Linux Kernel uses TAP
> > output for test results. However, KUnit (and other Kernel testing
> > frameworks such as kselftest) have some special needs for test results
> > which don't gel perfectly with the original TAP specification. Thus, a
> > "Kernel TAP" (KTAP) format is specified to extend and alter TAP to
> > support these use-cases.
> >
> > KTAP Output consists of 5 major elements (all line-based):
> > - The version line
> > - Plan lines
> > - Test case result lines
>
> These are required.
>
> > - Diagnostic lines
>
> This is optional.
>
> > - A bail out line
>
> Bail out should be optional, and possibly not included at all. (See
> below.)
>
> >
> > An important component in this specification of KTAP is the
> > specification of the format of nested tests. This can be found in the
> > section on nested tests below.
> >
> > The version line
> > ----------------
> >
> > The first line of KTAP output must be the version line. As this
> > specification documents the first version of KTAP,  the recommended
> > version line is "KTAP version 1". However, since all kernel testing
> > frameworks use TAP version lines, "TAP version 14" and "TAP version
> > 13" are all acceptable version lines. Version lines with other
> > versions of TAP or KTAP will not cause the parsing of the test results
> > to fail but it will produce an error.
>
> Maybe "TAP version Linux.1" ? I don't want to needlessly break existing
> parsers.

"TAP version Linux.1" would not break existing parsers? If so, that
would be great, I just expect that it would break them too.

> >
> > Plan lines
> > ----------
> >
> > Plan lines must follow the format of "1..N" where N is the number of
> > subtests. The second line of KTAP output must be a plan line, which
> > indicates the number of tests at the highest level, such that the
> > tests do not have a parent. Also, in the instance of a test having
> > subtests, the second line of the test after the subtest header must be
> > a plan line which indicates the number of subtests within that test.
>
> I do not want "subtests" as a specification concept, only "nested
> tests".

Why not? I got the impression on the previous thread that everyone was
amenable to this?

> >
> > Test case result lines
> > ----------------------
> >
> > Test case result lines must have the format:
> >
> >     <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
>
> "[<diagnostic data>]" is not defined below. I think what you mean is
> "directive details".
>
> I suggest:
>
> <result> <number>[ [- ]<description>][# <directive> [<directive details>]]

I think that is probably fine, but I think Rae would probably have a
better idea.

> > The result can be either "ok", which indicates the test case passed,
> > or "not ok", which indicates that the test case failed.
>
> Yes.
>
> >
> > The number represents the number of the test case or suite being
> > performed. The first test case or suite must have the number 1 and the
> > number must increase by 1 for each additional test case or result at
> > the same level and within the same testing suite.
>
> Yes, though parsers should handle ordering failures and missed test
> results (this is the purpose of the "plan" line).

Right.

> >
> > The "-" character is optional.
> >
> > The description is a description of the test, generally the name of
> > the test, and can be any string of words (can't include #). The
> > description is optional.
>
> Yes, though the optional "- " is strictly part of the optional
> description.
>
> > The directive is used to indicate if a test was skipped. The format
>
> The directive is a single word -- currently only "SKIP" is recognized
> by TAP 14(?), but kselftest uses also XFAIL, XPASS, TIMEOUT, and
> error. (Though I would love to hear what "error" is intended to be used
> for, if different from regular "not ok".)

Can you explain (or link to documentation) on what these other directives mean?

> (This could also be called "context" rather than "directive".)
>
> > for the directive is: "# SKIP [<skip_description>]". The
> > skip_description is optional and can be any string of words to
> > describe why the test was skipped.
>
> I would call this "directive details".
>
> > The result of the test case result
> > line can be either "ok" or "not ok" if the skip directive is used.
> > Finally, note that TAP 14 specification includes TODO directives but
> > these are not supported for KTAP.
> >
> > Examples of test case result lines:
> >
> > Test passed:
> >
> >     ok 1 - test_case_name
> >
> > Test was skipped:
> >
> >     not ok 1 - test_case_name # SKIP test_case_name should be skipped
> >
> > Test failed:
> >
> >     not_ok 1 - test_case_name
>
> This isn't valid. No "_" is recognized for "ok" vs "not ok".

Oh yeah, I think that was a typo.

> >
> > Diagnostic lines
> > ----------------
> >
> > Diagnostic lines are used for description of testing operations.
> > Diagnostic lines are generally formatted as "#
> > <diagnostic_description>", where the description can be any string.
> > However, in practice, diagnostic lines are all lines that don't follow
> > the format of any other KTAP line format.
>
> I still think there should be a distinction between "diagnostic lines"
> and "unknown lines". For example, if kselftest is running on a console,
> the dmesg output may be intermixed, and possibly temporally offset. Such
> lines may not be useful, and may not be strictly correlated, where-as
> the output from kselftest is at least self-consistent.

Interesting. I think David probably would have some thoughts here.

> > Diagnostic lines can be
> > anywhere in the test output after the first two lines.
>
> I don't see a reason for this strictness. They can be anywhere.
>
> > There are a few
> > special diagnostic lines.
>
> No; diagnostic lines must have no meaning: they are for humans and nothing
> else. If other details are needed for machines, we should explicitly
> create new format lines. I made a mistake when I used a diagnostic line
> for embedding the test names. :( There is a need for parsers to discover
> the name of a given test, though, so we do likely need something for this.
>
> > Diagnostic lines of the format "# Subtest:
> > <test_name>" indicate the start of a test with subtests. Also,
> > diagnostic lines of the format "# <test_name>: <description>" refer to
> > a specific test and tend to occur before the test result line of that
> > test but are optional.
>
> I don't think the above should be included in the spec -- diag lines
> should have no parseable meaning.
>
> >
> > Bail out line
> > -------------
> >
> > A bail out line can occur anywhere in the KTAP output and will
> > indicate that a test has crashed. The format of a bail out line is
> > "Bail out! [<description>]",  where the description can give
> > information on why the bail out occurred and can be any string.
>
> I'm not a fan of the Bail out line. It's not a problem, exactly, but I
> find it redundant. If we want an "end of test" line, let's make one.
> "Bail out" is a special case of exit. If we want to handle test exit,
> let's define it. E.g. make kselftest's test summary lines not
> diagnostic lines:
>
> # FAILED: 85 / 87 tests passed.
> # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0

Cool, any ideas on what we should do then? I personally don't have
strong feelings on it.

> Also, parsers should treat a new "TAP version..." line at the same
> nesting level as indication that the prior test has ended (and any tests
> without result lines should be considered failed).

OK, that is definitely different. I think this will require some
further discussion.

> > Nested tests
> > ------------
> >
> > The new specification for KTAP will support an arbitrary number of
> > nested subtests. Thus, tests can now have subtests and those subtests
> > can have subtests. This can be useful to further categorize tests and
> > organize test results.
> >
> > The new required format for a test with subtests consists of: a
> > subtest header line, a plan line, all subtests, and a final test
> > result line.
> >
> > The first line of the test must be the subtest header line with the
> > format: "# Subtest: <test_name>".
>
> Please no. There is no reason to force a nested test to suddenly have
> to know about its test execution depth/environment. A subtest is just a
> nested TAP result. That it is nested is only meaningful to the parser, not
> the executing test, and it must have the same output, nested or not. (e.g.
> running the test alone or running the test under a full kselftest run,
> the only difference is the indentation level.)

Well it doesn't have to, right? The parent test can include the
directive. In anycase, something has to know enough to print a test
plan.

> > The second line of the test must be the plan line, which is formatted
> > as "1..N", where N is the number of subtests.
> >
> > Following the plan line, all lines pertaining to the subtests will follow.
> >
> > Finally, the last line of the test is a final test result line with
> > the format: "(ok|not ok) <number> [-] [<description>] [<directive>]
> > [<diagnostic data>]", which follows the same format as the general
> > test result lines described in this section. The result line should
> > indicate the result of the subtests. Thus, if one of the subtests
> > fail, the test should fail. The description in the final test result
> > line should match the name of the test in the subtest header.
> >
> > An example format:
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..2
> >     ok 1 - test_1
> >     ok 2 - test_2
> > ok 1 - test_suite
>
> Again, I see only downsides to this. Nesting for the spec is simple
> indentation-based recursion. Let's just keep what we already have:

What spec? I didn't see a final spec come from the previous discussion.

> TAP version 13
> 1..1
> # TAP version 13
> # 1..2
> # ok 1 - test_1
> # ok 2 - test_2
> ok 1 - test_suite
>
>
> > An example format with multiple levels of nested testing:
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..2
> >         # Subtest: sub_test_suite
> >         1..2
> >         ok 1 - test_1
> >         ok 2 test_2
> >     ok 1 - sub_test_suite
> >     ok 2 - test
> > ok 1 - test_suite
> >
> > In the instance that the plan line is missing, the end of the test
> > will be denoted by the final result line containing a description that
> > matches the name of the test given in the subtest header. Note that
> > thus, if the plan line is missing and one of the subtests have a
> > matching name to the test suite this will cause errors.
>
> A plan line is required. No special cases are needed. :)
>
> If nesting level is lost, a parser will understand the nested test to
> have ended, but probably so did the test runner:
>
> TAP version 13
> 1..1
>     TAP version 13
>     1..2
>         TAP version 13
>         1..3
>         ok 1 - test_1
>         ok 2 test_2
>     not ok 1 - sub test unfinished plan
>     ok 2 - test
> not ok 1 - test_suite
>
> > Lastly, indentation is also recommended for improved readability.
>
> Indentation is _required_. :)

KUnit cannot support indentation being required. We print test results
to dmesg. At anytime something could get printed to dmesg that could
potentially mess up indentation.

> Whether this is "# " or "  " I don't really care, as long as the change
> is coordinated. "  " is easier for humans to read, but may make parsing of
> "unknown" lines more difficult for machines.
>
> >
> > Major differences between TAP 14 and KTAP specification
> > -------------------------------------------------------
> >
> > Note that the major differences between TAP 14 and KTAP specification:
> > - yaml and json are not allowed in diagnostic messages
>
> Agreed -- these are overkill (and very difficult to implement as they
> _follow_ the test result line: anything generating them has already
> finished running).
>
> > - TODO directive not allowed
>
> I would just say "unrecognized".
>
> > - KTAP allows for an arbitrary number of tests to be nested with
> > specified nested test format
>
> Yup.
>
> >
> > Example of KTAP
> > ---------------
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..1
> >         # Subtest: sub_test_suite
> >         1..2
> >         ok 1 - test_1
> >         ok 2 test_2
> >     ok 1 - sub_test_suite
> > ok 1 - test_suite
>
> For a good example, please include all the possible combinations (SKIP,
> not ok, diagnostics, etc etc)
>
> >
> > =========================================
> > Note on incompatibilities with kselftests
> > =========================================
> >
> > To my knowledge, the above specification seems to generally accept the
> > TAP format of many non-nested test results of kselftests.
> >
> > An example of a common kselftests TAP format for non-nested test
> > results that are accepted by the above specification:
> >
> > TAP version 13
> > 1..2
> > # selftests: vDSO: vdso_test_gettimeofday
> > # The time is 1628024856.096879
> > ok 1 selftests: vDSO: vdso_test_gettimeofday
> > # selftests: vDSO: vdso_test_getcpu
> > # Could not find __vdso_getcpu
> > ok 2 selftests: vDSO: vdso_test_getcpu # SKIP
> >
> > However, one major difference noted with kselftests is the use of more
> > directives than the "# SKIP" directive. kselftest also supports XPASS
> > and XFAIL directives. Some additional examples found in kselftests:
> >
> >     not ok 5 selftests: netfilter: nft_concat_range.sh # TIMEOUT 45 seconds
> >
> >     not ok 45 selftests: kvm: kvm_binary_stats_test # exit=127
> >
> > Should the specification be expanded to include these directives?
>
> Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
> be reported without the '#')
>
> >
> > However, the general format for kselftests with nested test results
> > seems to differ from the above specification. It seems that a general
> > format for nested tests is as follows:
> >
> > TAP version 13
> > 1..2
> > # selftests: membarrier: membarrier_test_single_thread
> > # TAP version 13
> > # 1..2
> > # ok 1 sys_membarrier available
> > # ok 2 sys membarrier invalid command test: command = -1, flags = 0,
> > errno = 22. Failed as expected
> > ok 1 selftests: membarrier: membarrier_test_single_thread
> > # selftests: membarrier: membarrier_test_multi_thread
> > # TAP version 13
> > # 1..2
> > # ok 1 sys_membarrier available
> > # ok 2 sys membarrier invalid command test: command = -1, flags = 0,
> > errno = 22. Failed as expected
> > ok 2 selftests: membarrier: membarrier_test_multi_thread
> >
> > The major differences here, that do not match the above specification,
> > are use of "# " as an indentation and using a TAP version line to
> > denote a new test with subtests rather than the subtest header line
> > described above. If these are widely utilized formats in kselftests,
> > should we include both versions in the specification or should we
> > attempt to agree on a single format for nested tests? I personally
> > believe we should try to agree on a single format for nested tests.
> > This would allow for a cleaner specification of KTAP and would reduce
> > possible confusion.
>
> We already use "# " and the nested TAP lines to denote subtests. Without
> a good reason to change it, we should avoid the churn with the existing
> parsers.
>
> > ====
> >
> > So what do people think about the above specification?
> > How should we handle the differences with kselftests?
>
> I'm probably a broken record by now, but kselftests _is_ the
> specification. ;) What about it needs changing, and why?

Wait, what?! An implementation is not a specification. I thought Tim's
attempt at standardizing the TAP that exists under kselftest, KUnit,
and elsewhere was recognized as important or at least worthwhile.

The problem that was recognized, as I understand, was that there are
multiple "interpretations" of TAP floating around the kernel and that
goes against the original point of trying to standardize around TAP.

I know KUnit's usage is pretty minor in comparison to kselftest, but
people do use it and there is no point in us, KUnit, purporting to use
TAP and remain compatible with any particular version of it if it is
incompatible with kselftest's TAP.

Additionally, there is no way that we are going to be able to stay on
a compatible implementation of TAP unless we specify what TAP is
separate from the implementation.

Maybe there is something I am missing here. If so, please explain.

> > If this specification is accepted, where should the specification be documented?
>
> I imagine some .rst file under Documentation/dev-tools/, linked to from
> kselftest.rst and kunit/...rst
>
> --
> Kees Cook

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: RFC - kernel test result specification (KTAP)
  2021-08-28  8:20 ` Kees Cook
  2021-08-30 17:34   ` Brendan Higgins
@ 2021-08-30 17:48   ` Tim.Bird
  2021-08-30 19:58     ` Kees Cook
  2021-08-30 23:39   ` David Gow
  2021-08-31  2:14   ` Rae Moar
  3 siblings, 1 reply; 14+ messages in thread
From: Tim.Bird @ 2021-08-30 17:48 UTC (permalink / raw)
  To: keescook, rmoar
  Cc: brendanhiggins, davidgow, shuah, dlatypov, kunit-dev,
	linux-kselftest, linux-kernel, kernelci, guillaume.tucker



> -----Original Message-----
> From: Kees Cook <keescook@chromium.org>
> 
> [+kernelci, +Guillaume]
> 
> Hi!
> 
> Please keep me in CC for these kinds of proposals. And thanks for looking
> at this again! Please understand that while I may be coming across as
> rather negative here, I would like to have a rational and documented
> specification for the kernel's test output, too. My main objection here
> is that we already _have_ a specification, and it's already being parsed
> by machines, so making changes without strong justification is going to
> be met with resistance. :) So, with that in mind, on to my reply...
> 
> On Tue, Aug 10, 2021 at 04:25:03PM -0700, Rae Moar wrote:
> > We are looking to further standardise the output format used by kernel
> > test frameworks like kselftest and KUnit. Thus far we have used the
> > TAP (Test Anything Protocol) specification, but it has been extended
> > in many different ways, so we would like to agree on a common "Kernel
> > TAP" (KTAP) format to resolve these differences. Thus, below is a
> > draft of a specification of KTAP. Note that this specification is
> > largely based on the current format of test results for KUnit tests.
> 
> The kernel's largest producer of TAP is kselftest, and the largest
> consumer is LAVA[1]. I would want buy-in from at least those responsible
> for either side of those two things. (And I'm one of the people working
> on both sides of it.)
> 
> The fundamental purpose of the kernel's TAP is to have many independent
> tests runnable and parseable, specifically without any embedded framework
> knowledge (or, even, any knowledge of TAP).
> 
> The tests run by kselftest come from 2 different TAP-producing harnesses
> (kselftest.h for C, kselftest/runner.sh for TAP-agnostic tests) as well
> as several hand-rolled instances in Shell, Python, and C. There used to
> be more, but I have been steadily fixing their syntax[2] and merging[3]
> separate implementations for a while now.
> 
> [1] https://github.com/Linaro/test-definitions/commit/8bd338bbcfa5a03efcf1d12e25b5d341d5a29cbc
> [2] https://git.kernel.org/linus/b0df366bbd701c45e93af0dcb87ce22398589d1d
> [3] https://git.kernel.org/linus/e80068be21824e4d2726eeea41cac6178d212415
> 
> > Additionally, this specification was heavily inspired by the KTAP
> > specification draft by Tim Bird
> > (https://lore.kernel.org/linux-
> kselftest/CY4PR13MB1175B804E31E502221BC8163FD830@CY4PR13MB1175.namprd13.prod.outlook.com/T/).
> > However, there are some notable differences to his specification. One
> > such difference is the format of nested tests is more fully specified
> > in the following specification. However, they are specified in a way
> > which may not be compatible with many kselftest nested tests.
> 
> I commented extensively on that thread. :)
> 
> >
> > =====================
> > Specification of KTAP
> > =====================
> >
> > TAP, or the Test Anything Protocol is a format for specifying test
> > results used by a number of projects. It's website and specification
> > are found at: https://testanything.org/. The Linux Kernel uses TAP
> > output for test results. However, KUnit (and other Kernel testing
> > frameworks such as kselftest) have some special needs for test results
> > which don't gel perfectly with the original TAP specification. Thus, a
> > "Kernel TAP" (KTAP) format is specified to extend and alter TAP to
> > support these use-cases.
> >
> > KTAP Output consists of 5 major elements (all line-based):
> > - The version line
> > - Plan lines
> > - Test case result lines
> 
> These are required.
> 
> > - Diagnostic lines
> 
> This is optional.
> 
> > - A bail out line
> 
> Bail out should be optional, and possibly not included at all. (See
> below.)
> 
> >
> > An important component in this specification of KTAP is the
> > specification of the format of nested tests. This can be found in the
> > section on nested tests below.
> >
> > The version line
> > ----------------
> >
> > The first line of KTAP output must be the version line. As this
> > specification documents the first version of KTAP,  the recommended
> > version line is "KTAP version 1". However, since all kernel testing
> > frameworks use TAP version lines, "TAP version 14" and "TAP version
> > 13" are all acceptable version lines. Version lines with other
> > versions of TAP or KTAP will not cause the parsing of the test results
> > to fail but it will produce an error.
> 
> Maybe "TAP version Linux.1" ? I don't want to needlessly break existing
> parsers.
> 
> >
> > Plan lines
> > ----------
> >
> > Plan lines must follow the format of "1..N" where N is the number of
> > subtests. The second line of KTAP output must be a plan line, which
> > indicates the number of tests at the highest level, such that the
> > tests do not have a parent. Also, in the instance of a test having
> > subtests, the second line of the test after the subtest header must be
> > a plan line which indicates the number of subtests within that test.
> 
> I do not want "subtests" as a specification concept, only "nested
> tests".
> 
> >
> > Test case result lines
> > ----------------------
> >
> > Test case result lines must have the format:
> >
> >     <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
> 
> "[<diagnostic data>]" is not defined below. I think what you mean is
> "directive details".

IMHO it's useful to keep this called 'diagnostic data'.

The directive is optional, and having this be diagnostic data means you can
put explanatory text, or result details, that are independent of the description,
on the same line as the result.  This is convenient in some cases.

I think diagnostic data should always be prefixed by a hash sign.
Or, in the case where diagnostic data appears on the result line,
the hash serves as the delimiter.  This means that the description
text must not include a hash character.

For example:
ok 5 - check return code # rcode=0
or
not ok 5 - check return code # rcode=17

> 
> I suggest:
> 
> <result> <number>[ [- ]<description>][# <directive> [<directive details>]]
> 
> > The result can be either "ok", which indicates the test case passed,
> > or "not ok", which indicates that the test case failed.
> 
> Yes.
> 
> >
> > The number represents the number of the test case or suite being
> > performed. The first test case or suite must have the number 1 and the
> > number must increase by 1 for each additional test case or result at
> > the same level and within the same testing suite.
> 
> Yes, though parsers should handle ordering failures and missed test
> results (this is the purpose of the "plan" line).
> 
> >
> > The "-" character is optional.
> >
> > The description is a description of the test, generally the name of
> > the test, and can be any string of words (can't include #). The
> > description is optional.
> 
> Yes, though the optional "- " is strictly part of the optional
> description.

It's mildly annoying that "-" is optional.  It's trivial to deal with in the parser
to just ignore it if it's present.  But it has no semantic meaning whatsoever.
IMHO it would be nice to either mandate it or remove it, for consistency's sake.
This could be based solely on the consensus for whether it added or detracted
from human readability, since parsers don't care.

> 
> > The directive is used to indicate if a test was skipped. The format
> 
> The directive is a single word -- currently only "SKIP" is recognized
> by TAP 14(?), but kselftest uses also XFAIL, XPASS, TIMEOUT, and
> error. (Though I would love to hear what "error" is intended to be used
> for, if different from regular "not ok".)

ERROR usually means that there was some kind of failure
in the testing process that makes the testcase result indeterminate.
(like loss of connection to a server, or crash of the program executing
the testcase).  This can be due to a bug in the test or the test system.
I'm, not sure if ERROR is used this way in kselftest.  It is, IMHO, a useful
distinction to make.  TIMEOUT is a special case of ERROR, IMHO.

As an aside, I've never liked XFAIL or XPASS.  Semantically, I believe they
are quite confusing, and depend on the level of testcase execution.  XPASS is also
redundant.  Isn't the default case always XPASS?

In the case of XFAIL, does it mean that an operation that was expected to fail
all the time, failed, and so the test passed?  Or does it mean that this is a known
bug that a particular developer or site has decided to ignore (as in "we know this
is a bug, but we expected it to fail, so just ignore it for now).  The latter, I really
don't think should be encoded in the tests themselves, but should be put
into a layer above the test and parser, that is applied according to developer
or testlab policy.

> 
> (This could also be called "context" rather than "directive".)
> 
> > for the directive is: "# SKIP [<skip_description>]". The
> > skip_description is optional and can be any string of words to
> > describe why the test was skipped.
> 
> I would call this "directive details".
I would not, since it is human readable text, and may be present
whether the directive is there or not, I would still call it 'diagnostic data'.

> 
> > The result of the test case result
> > line can be either "ok" or "not ok" if the skip directive is used.
> > Finally, note that TAP 14 specification includes TODO directives but
> > these are not supported for KTAP.
> >
> > Examples of test case result lines:
> >
> > Test passed:
> >
> >     ok 1 - test_case_name
> >
> > Test was skipped:
> >
> >     not ok 1 - test_case_name # SKIP test_case_name should be skipped
> >
> > Test failed:
> >
> >     not_ok 1 - test_case_name
> 
> This isn't valid. No "_" is recognized for "ok" vs "not ok".
> 
> >
> > Diagnostic lines
> > ----------------
> >
> > Diagnostic lines are used for description of testing operations.
> > Diagnostic lines are generally formatted as "#
> > <diagnostic_description>", where the description can be any string.
> > However, in practice, diagnostic lines are all lines that don't follow
> > the format of any other KTAP line format.
> 
> I still think there should be a distinction between "diagnostic lines"
> and "unknown lines". For example, if kselftest is running on a console,
> the dmesg output may be intermixed, and possibly temporally offset. Such
> lines may not be useful, and may not be strictly correlated, where-as
> the output from kselftest is at least self-consistent.

I think this would be a very good distinction to make.  You might have
unknown lines that (inadvertently?) start with a hash, but trying to 
make sure that the test itself always outputs diagnostic data with a preceding
hash gives a reader potentially more information about where the message
came from, and could be useful.

> 
> > Diagnostic lines can be
> > anywhere in the test output after the first two lines.
> 
> I don't see a reason for this strictness. They can be anywhere.
> 
> > There are a few
> > special diagnostic lines.
> 
> No; diagnostic lines must have no meaning: they are for humans and nothing
> else. If other details are needed for machines, we should explicitly
> create new format lines. I made a mistake when I used a diagnostic line
> for embedding the test names. :( There is a need for parsers to discover
> the name of a given test, though, so we do likely need something for this.

I agree it is handy to have the test name for sub-tests.  However, I don't
see any semantic difference whether we define sub-test declarations via
diagnostic lines with special syntax or by creating a new syntax separate
from diagnostic lines.  Either way, we have a new syntax for parsers to
understand.

> 
> > Diagnostic lines of the format "# Subtest:
> > <test_name>" indicate the start of a test with subtests. Also,
> > diagnostic lines of the format "# <test_name>: <description>" refer to
> > a specific test and tend to occur before the test result line of that
> > test but are optional.
> 
> I don't think the above should be included in the spec -- diag lines
> should have no parseable meaning.
> 
> >
> > Bail out line
> > -------------
> >
> > A bail out line can occur anywhere in the KTAP output and will
> > indicate that a test has crashed. The format of a bail out line is
> > "Bail out! [<description>]",  where the description can give
> > information on why the bail out occurred and can be any string.
> 
> I'm not a fan of the Bail out line. It's not a problem, exactly, but I
> find it redundant. If we want an "end of test" line, let's make one.
> "Bail out" is a special case of exit. If we want to handle test exit,
> let's define it. E.g. make kselftest's test summary lines not
> diagnostic lines:
> 
> # FAILED: 85 / 87 tests passed.
> # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0

How consistent is it for selftests to produce test summary lines?

> 
> Also, parsers should treat a new "TAP version..." line at the same
> nesting level as indication that the prior test has ended (and any tests
> without result lines should be considered failed).


Agreed.

> 
> >
> > Nested tests
> > ------------
> >
> > The new specification for KTAP will support an arbitrary number of
> > nested subtests. Thus, tests can now have subtests and those subtests
> > can have subtests. This can be useful to further categorize tests and
> > organize test results.
> >
> > The new required format for a test with subtests consists of: a
> > subtest header line, a plan line, all subtests, and a final test
> > result line.
> >
> > The first line of the test must be the subtest header line with the
> > format: "# Subtest: <test_name>".
> 
> Please no. There is no reason to force a nested test to suddenly have
> to know about its test execution depth/environment. A subtest is just a
> nested TAP result. That it is nested is only meaningful to the parser, not
> the executing test, and it must have the same output, nested or not. (e.g.
> running the test alone or running the test under a full kselftest run,
> the only difference is the indentation level.)
I agree that the sub-test should not know about this line.  I thought
the "parent" test wrote the line.

> 
> > The second line of the test must be the plan line, which is formatted
> > as "1..N", where N is the number of subtests.
> >
> > Following the plan line, all lines pertaining to the subtests will follow.
> >
> > Finally, the last line of the test is a final test result line with
> > the format: "(ok|not ok) <number> [-] [<description>] [<directive>]
> > [<diagnostic data>]", which follows the same format as the general
> > test result lines described in this section. The result line should
> > indicate the result of the subtests. Thus, if one of the subtests
> > fail, the test should fail. The description in the final test result
> > line should match the name of the test in the subtest header.
> >
> > An example format:
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..2
> >     ok 1 - test_1
> >     ok 2 - test_2
> > ok 1 - test_suite
> 
> Again, I see only downsides to this. Nesting for the spec is simple
> indentation-based recursion. Let's just keep what we already have:
> 
> TAP version 13
> 1..1
> # TAP version 13
> # 1..2
> # ok 1 - test_1
> # ok 2 - test_2
> ok 1 - test_suite
>

I prefer spaces, but whatever.  Is the prefix on the 'TAP version 13' line guaranteed
to be the prefix on all lines for the (sub-)test?
 
> 
> > An example format with multiple levels of nested testing:
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..2
> >         # Subtest: sub_test_suite
> >         1..2
> >         ok 1 - test_1
> >         ok 2 test_2
> >     ok 1 - sub_test_suite
> >     ok 2 - test
> > ok 1 - test_suite
> >
> > In the instance that the plan line is missing, the end of the test
> > will be denoted by the final result line containing a description that
> > matches the name of the test given in the subtest header. Note that
> > thus, if the plan line is missing and one of the subtests have a
> > matching name to the test suite this will cause errors.
> 
> A plan line is required. No special cases are needed. :)
> 
> If nesting level is lost, a parser will understand the nested test to
> have ended, but probably so did the test runner:
> 
> TAP version 13
> 1..1
>     TAP version 13
>     1..2
>         TAP version 13
>         1..3
>         ok 1 - test_1
>         ok 2 test_2
>     not ok 1 - sub test unfinished plan
>     ok 2 - test
> not ok 1 - test_suite
> 
> > Lastly, indentation is also recommended for improved readability.
> 
> Indentation is _required_. :)
I strongly agree.  You can't disambiguate some outputs, unless the
indentation is present.

> 
> Whether this is "# " or "  " I don't really care, as long as the change
> is coordinated. "  " is easier for humans to read, but may make parsing of
> "unknown" lines more difficult for machines.
I prefer spaces for human readability.

Just as a side note, in some Fuego tests, it was very useful to include an identifier
in thethe prefix nested tests.  The output looked like this:

TAP version 13
1..2
[batch_id 4] TAP version 13
[batch_id 4] 1..2
[batch_id 4] ok 1 - cyclictest with 1000 cycles
[batch_id 4] # problem setting CLOCK_REALTIME
[batch_id 4] not ok 2 - cyclictest with CLOCK_REALTIME
not ok 1 - check realtime
[batch_id 4] TAP version 13
[batch_id 4] 1..1
[batch_id 4] ok 1 - IOZone read/write 4k blocks
ok 2 - check I/O performance

Can I propose that the prefix not be fixed by the spec, but that the spec indicates that
whatever the prefix is on the TAP version line, that prefix must be used with the output for
all lines from the test (with the exception of unknown lines)?

> 
> >
> > Major differences between TAP 14 and KTAP specification
> > -------------------------------------------------------
> >
> > Note that the major differences between TAP 14 and KTAP specification:
> > - yaml and json are not allowed in diagnostic messages
> 
> Agreed -- these are overkill (and very difficult to implement as they
> _follow_ the test result line: anything generating them has already
> finished running).

I thought that some tests put the diagnostic data before the result line.

I don't care whether it is before or after (and indeed Fuego's parser
can handle either case).  But it can't handle it when some testcases
put diagnostic data before the result line, and some testcases put
it after, in the same test.  There's no delimiter for the separation
between the two different testcases diagnostic data.

Could we possibly have a rule about diagnostic lines positioning relative
to the testcase result line?  Maybe a rule like this:  For a single test,
diagnostic lines should either always preceded the testcase they
are related to, or follow the testcase they are related to.

> 
> > - TODO directive not allowed
> 
> I would just say "unrecognized".
> 
> > - KTAP allows for an arbitrary number of tests to be nested with
> > specified nested test format
> 
> Yup.
> 
> >
> > Example of KTAP
> > ---------------
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..1
> >         # Subtest: sub_test_suite
> >         1..2
> >         ok 1 - test_1
> >         ok 2 test_2
> >     ok 1 - sub_test_suite
> > ok 1 - test_suite
> 
> For a good example, please include all the possible combinations (SKIP,
> not ok, diagnostics, etc etc)
> 
> >
> > =========================================
> > Note on incompatibilities with kselftests
> > =========================================
> >
> > To my knowledge, the above specification seems to generally accept the
> > TAP format of many non-nested test results of kselftests.
> >
> > An example of a common kselftests TAP format for non-nested test
> > results that are accepted by the above specification:
> >
> > TAP version 13
> > 1..2
> > # selftests: vDSO: vdso_test_gettimeofday
> > # The time is 1628024856.096879
> > ok 1 selftests: vDSO: vdso_test_gettimeofday
> > # selftests: vDSO: vdso_test_getcpu
> > # Could not find __vdso_getcpu
> > ok 2 selftests: vDSO: vdso_test_getcpu # SKIP
> >
> > However, one major difference noted with kselftests is the use of more
> > directives than the "# SKIP" directive. kselftest also supports XPASS
> > and XFAIL directives. Some additional examples found in kselftests:
> >
> >     not ok 5 selftests: netfilter: nft_concat_range.sh # TIMEOUT 45 seconds
> >
> >     not ok 45 selftests: kvm: kvm_binary_stats_test # exit=127
> >
> > Should the specification be expanded to include these directives?
> 
> Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
> be reported without the '#')

Please No!  (Unless I'm misinterpreting that line).

If you remove the '#', that makes the 'exit=127' part of the test description.
The test description should be invariant between success and failure cases.
The exit=127 (if I'm reading that line right) is part of data explaining or
describing the failure. It would be 'exit=0' if the test succeeded.

In order to correlate test cases (ie show results over time), the description
should not change from one invocation of the test to the next.  At least Fuego
uses the test description this way.  That is, the testcase description is used as
a testcase identifier, and if it changes from run to run things get confusing.

> 
> >
> > However, the general format for kselftests with nested test results
> > seems to differ from the above specification. It seems that a general
> > format for nested tests is as follows:
> >
> > TAP version 13
> > 1..2
> > # selftests: membarrier: membarrier_test_single_thread
> > # TAP version 13
> > # 1..2
> > # ok 1 sys_membarrier available
> > # ok 2 sys membarrier invalid command test: command = -1, flags = 0,
> > errno = 22. Failed as expected
> > ok 1 selftests: membarrier: membarrier_test_single_thread
> > # selftests: membarrier: membarrier_test_multi_thread
> > # TAP version 13
> > # 1..2
> > # ok 1 sys_membarrier available
> > # ok 2 sys membarrier invalid command test: command = -1, flags = 0,
> > errno = 22. Failed as expected
> > ok 2 selftests: membarrier: membarrier_test_multi_thread
> >
> > The major differences here, that do not match the above specification,
> > are use of "# " as an indentation and using a TAP version line to
> > denote a new test with subtests rather than the subtest header line
> > described above. If these are widely utilized formats in kselftests,
> > should we include both versions in the specification or should we
> > attempt to agree on a single format for nested tests? I personally
> > believe we should try to agree on a single format for nested tests.
> > This would allow for a cleaner specification of KTAP and would reduce
> > possible confusion.
> 
> We already use "# " and the nested TAP lines to denote subtests. Without
> a good reason to change it, we should avoid the churn with the existing
> parsers.
> 
> > ====
> >
> > So what do people think about the above specification?
> > How should we handle the differences with kselftests?
> 
> I'm probably a broken record by now, but kselftests _is_ the
> specification. ;) What about it needs changing, and why?
> 
> > If this specification is accepted, where should the specification be documented?
> 
> I imagine some .rst file under Documentation/dev-tools/, linked to from
> kselftest.rst and kunit/...rst
> 
> --
> Kees Cook

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC - kernel test result specification (KTAP)
  2021-08-30 17:34   ` Brendan Higgins
@ 2021-08-30 19:26     ` Kees Cook
  0 siblings, 0 replies; 14+ messages in thread
From: Kees Cook @ 2021-08-30 19:26 UTC (permalink / raw)
  To: Brendan Higgins
  Cc: Rae Moar, David Gow, Tim.Bird, Shuah Khan, Daniel Latypov,
	kunit-dev, linux-kselftest, linux-kernel, kernelci,
	Guillaume Tucker

On Mon, Aug 30, 2021 at 10:34:14AM -0700, Brendan Higgins wrote:
> On Sat, Aug 28, 2021 at 1:20 AM Kees Cook <keescook@chromium.org> wrote:
> > Please keep me in CC for these kinds of proposals. And thanks for looking
> 
> Oh sorry, we will make sure to do that in the future. I figured most
> people who were interested who weren't directly CC'ed were on the
> kselftest list.

I mainly use mailing list member as a way to have thread history
available. I have too many subscriptions to read them all. :)

> > is that we already _have_ a specification, and it's already being parsed
> 
> We do?! Where is it? Can you link it here?

I am, perhaps, too glib. What I mean is that we have an
_implementation_, and we cannot create a specification that doesn't
match the existing implementation without very good reason.

> > by machines, so making changes without strong justification is going to
> > be met with resistance. :) So, with that in mind, on to my reply...
> 
> Of course, but if there is no specification or it is not being
> universally followed, then changes will need to be made, no?

Sure, but I'm being pragmatic: the implementation is already operating.
We can't pretend it doesn't exist. :)

> > The fundamental purpose of the kernel's TAP is to have many independent
> > tests runnable and parseable, specifically without any embedded framework
> > knowledge (or, even, any knowledge of TAP).
> >
> > The tests run by kselftest come from 2 different TAP-producing harnesses
> > (kselftest.h for C, kselftest/runner.sh for TAP-agnostic tests) as well
> > as several hand-rolled instances in Shell, Python, and C. There used to
> > be more, but I have been steadily fixing their syntax[2] and merging[3]
> > separate implementations for a while now.
> 
> Yep! And I believe there are still some inconsistencies - hence part
> of the motivation.

Agreed -- I'm happy to create a specification around what we _have_ (and
fix any existing mistakes). But the proposed spec very much did not
match what exists.

> > Maybe "TAP version Linux.1" ? I don't want to needlessly break existing
> > parsers.
> 
> "TAP version Linux.1" would not break existing parsers? If so, that
> would be great, I just expect that it would break them too.

https://metacpan.org/dist/Test-Harness/source/lib/TAP/Parser.pm
and
https://github.com/python-tap/tappy/blob/master/tap/parser.py
won't like anything non-numeric.

But neither have any idea about "#" nor nested tests either. (Those
considerations are entirely from the ability to construct automated
parsing of test-separated logs.)

LAVA doesn't parse the TAP line at all, as it chunks by known test
names, doing a simple mapping of name to outcome for the LAVA results:
https://github.com/Linaro/test-definitions/blob/master/automated/linux/kselftest/kselftest.sh

So, to that end, I guess "KTAP version 1" is fine.

> > >
> > > Plan lines
> > > ----------
> > >
> > > Plan lines must follow the format of "1..N" where N is the number of
> > > subtests. The second line of KTAP output must be a plan line, which
> > > indicates the number of tests at the highest level, such that the
> > > tests do not have a parent. Also, in the instance of a test having
> > > subtests, the second line of the test after the subtest header must be
> > > a plan line which indicates the number of subtests within that test.
> >
> > I do not want "subtests" as a specification concept, only "nested
> > tests".
> 
> Why not? I got the impression on the previous thread that everyone was
> amenable to this?

Because it breaks kselftests's agnosticism. More details on this
below...

> > The directive is a single word -- currently only "SKIP" is recognized
> > by TAP 14(?), but kselftest uses also XFAIL, XPASS, TIMEOUT, and
> > error. (Though I would love to hear what "error" is intended to be used
> > for, if different from regular "not ok".)
> 
> Can you explain (or link to documentation) on what these other directives mean?

I can link to code. :)

SKIP:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/kselftest.h?h=v5.14#n181
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/kselftest/runner.sh?h=v5.14#n84
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/arm64/bti/test.c?h=v5.14#n118

TIMEOUT:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/kselftest/runner.sh?h=v5.14#n87

XFAIL:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/kselftest.h?h=v5.14#n167

error:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/kselftest.h?h=v5.14#n195

xpass (not actually a directive, and nothing uses it, I say we remove
it):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/kselftest.h?h=v5.14#n255

"exit=..." (improperly formed directive -- this should probably be
changed into a preceding diagnostic line)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/kselftest/runner.sh?h=v5.14#n89

Besides the two kselftest TAP harnesses, there are a few open-coded
TAP producers:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/arm64/bti/test.c?h=v5.14#n170
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/exec/binfmt_script?h=v5.14#n105
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/size/get_size.c?h=v5.14#n91
(This last one has a very confused idea about where to put the plan line.)

> > >
> > > Diagnostic lines
> > > ----------------
> > >
> > > Diagnostic lines are used for description of testing operations.
> > > Diagnostic lines are generally formatted as "#
> > > <diagnostic_description>", where the description can be any string.
> > > However, in practice, diagnostic lines are all lines that don't follow
> > > the format of any other KTAP line format.
> >
> > I still think there should be a distinction between "diagnostic lines"
> > and "unknown lines". For example, if kselftest is running on a console,
> > the dmesg output may be intermixed, and possibly temporally offset. Such
> > lines may not be useful, and may not be strictly correlated, where-as
> > the output from kselftest is at least self-consistent.
> 
> Interesting. I think David probably would have some thoughts here.

The intention with the current implementation is to be as robust as
possible against other lines of output existing, since kselftest output
ends up in weird places fairly regularly.

> > # FAILED: 85 / 87 tests passed.
> > # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0
> 
> Cool, any ideas on what we should do then? I personally don't have
> strong feelings on it.

For this? I say leave it diagnostic. These are mainly for humans to see
the results of a specific (usually long high-output) test run.

> > Also, parsers should treat a new "TAP version..." line at the same
> > nesting level as indication that the prior test has ended (and any tests
> > without result lines should be considered failed).
> 
> OK, that is definitely different. I think this will require some
> further discussion.

This is mostly about making good parsers. Parsers should be able to
understand enough context to gracefully recover from unexpected output.

> > Please no. There is no reason to force a nested test to suddenly have
> > to know about its test execution depth/environment. A subtest is just a
> > nested TAP result. That it is nested is only meaningful to the parser, not
> > the executing test, and it must have the same output, nested or not. (e.g.
> > running the test alone or running the test under a full kselftest run,
> > the only difference is the indentation level.)
> 
> Well it doesn't have to, right? The parent test can include the
> directive. In anycase, something has to know enough to print a test
> plan.

What happens in kselftest is that the kselftest harness has no idea
what is or isn't going to produce TAP output. That is an intentional
design goal of kselftest: authors should be able to drop a tool into
the tree that returns 0 for "success", and 1 for "failure". (The full
set of return codes is actually: "0" for "success", "4" for "skip",
"124" for "timeout", and anything else for failure.)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/kselftest/runner.sh?h=v5.14#n83

The runner can see how many binaries it expects to run in a given
directory, so it can build TAP output (with a plan line), for a
given directory. However, some of those tests have tons of output and
effectively have subtests -- but those binaries are _standalone_, in
the sense that they produce TAP output without any knowledge of their
surroundings. As such, the kselftest runner must indent them to make
the results parsable.

Having different output when in one harness or another has and will lead
to pain in comparing the results. So, a test run outside of kselftest
will be (should be) identical to running inside kselftest, but with
indentation.

We can _call_ this a subtest, but it's just nested. A subtest should not
need to know it is a subtest.

> > If nesting level is lost, a parser will understand the nested test to
> > have ended, but probably so did the test runner:
> >
> > TAP version 13
> > 1..1
> >     TAP version 13
> >     1..2
> >         TAP version 13
> >         1..3
> >         ok 1 - test_1
> >         ok 2 test_2
> >     not ok 1 - sub test unfinished plan
> >     ok 2 - test
> > not ok 1 - test_suite
> >
> > > Lastly, indentation is also recommended for improved readability.
> >
> > Indentation is _required_. :)
> 
> KUnit cannot support indentation being required. We print test results
> to dmesg. At anytime something could get printed to dmesg that could
> potentially mess up indentation.

Can you explain this? KUnit controls what it prints, yes? Anything
outside of the TAP output is an "unknown line". This is why I used "#"
for indentation: it is distinguishable from "unknown (dmesg) lines".

> The problem that was recognized, as I understand, was that there are
> multiple "interpretations" of TAP floating around the kernel and that
> goes against the original point of trying to standardize around TAP.

Right -- this is the part I disagree with. There is 1 interpretation,
and a couple mistakes in the various kernel implementations. Let's get
that documented first.

> I know KUnit's usage is pretty minor in comparison to kselftest, but
> people do use it and there is no point in us, KUnit, purporting to use
> TAP and remain compatible with any particular version of it if it is
> incompatible with kselftest's TAP.

I have tried to keep the kernel's TAP output parsable by TAP 13
consumers. I don't think we have much of a fork currently (the
diagnostic lines are a simple addition, and the nesting is trivially
extractable with no additional knowledge beyond recognizing the "TAP"
line within diagnostic lines).

> Additionally, there is no way that we are going to be able to stay on
> a compatible implementation of TAP unless we specify what TAP is
> separate from the implementation.

My goals are:

- have kernel test output (and nest test output) parsable by existing
  tools, with a strong desire to not break LAVA (though arguably LAVA
  isn't a TAP parser).

- have kernel test output distinguishable from unrelated dmesg output
  since kselftest is very frequently collected from the console.

- not needlessly break existing tooling without showing a meaningful
  benefit. (i.e. flag days in parsing under LAVA has been difficult,
  and I want to minimize those.)

- be able to map meaningful test execution state to a common set of
  states ("ok" and "not ok" are rarely sufficient), which includes
  timeouts, expected failures (e.g. missing CONFIG_*s, wrong arch,
  etc).

- be able to include diagnostic runtime output _in the test result_,
  because too frequently the binary yes/no of a test is way to far away
  from the actual test details, which make debugging extremely
  difficult.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC - kernel test result specification (KTAP)
  2021-08-30 17:48   ` Tim.Bird
@ 2021-08-30 19:58     ` Kees Cook
  2021-08-30 22:04       ` Tim.Bird
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Kees Cook @ 2021-08-30 19:58 UTC (permalink / raw)
  To: Tim.Bird
  Cc: rmoar, brendanhiggins, davidgow, shuah, dlatypov, kunit-dev,
	linux-kselftest, linux-kernel, kernelci, guillaume.tucker

On Mon, Aug 30, 2021 at 05:48:07PM +0000, Tim.Bird@sony.com wrote:
> From: Kees Cook <keescook@chromium.org>
> > > Test case result lines
> > > ----------------------
> > >
> > > Test case result lines must have the format:
> > >
> > >     <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
> > 
> > "[<diagnostic data>]" is not defined below. I think what you mean is
> > "directive details".
> 
> IMHO it's useful to keep this called 'diagnostic data'.
> 
> The directive is optional, and having this be diagnostic data means you can
> put explanatory text, or result details, that are independent of the description,
> on the same line as the result.  This is convenient in some cases.
> 
> I think diagnostic data should always be prefixed by a hash sign.
> Or, in the case where diagnostic data appears on the result line,
> the hash serves as the delimiter.  This means that the description
> text must not include a hash character.
> 
> For example:
> ok 5 - check return code # rcode=0
> or
> not ok 5 - check return code # rcode=17

The directive is currently used to answer the question "why?" as in,
"why is this test not ok?" ("a timeout occurred") or "why is this test
ok?" ("a SKIP")

The "supporting evidence" for the "why?" answer are specific details
about that: "the exit was was NNN" or "the timeout was YYY seconds".

Test diagnostics are all the lines preceding that (currently prefixed
with a "#").

> > Yes, though the optional "- " is strictly part of the optional
> > description.
> 
> It's mildly annoying that "-" is optional.  It's trivial to deal with in the parser
> to just ignore it if it's present.  But it has no semantic meaning whatsoever.
> IMHO it would be nice to either mandate it or remove it, for consistency's sake.
> This could be based solely on the consensus for whether it added or detracted
> from human readability, since parsers don't care.

I have no strong opinion on the "-". I was surprised to encounter it
as it's not part of the TAP 13 spec. I would prefer to drop it, if I had
to choose.

> ERROR usually means that there was some kind of failure
> in the testing process that makes the testcase result indeterminate.
> (like loss of connection to a server, or crash of the program executing
> the testcase).  This can be due to a bug in the test or the test system.
> I'm, not sure if ERROR is used this way in kselftest.  It is, IMHO, a useful
> distinction to make.  TIMEOUT is a special case of ERROR, IMHO.

Well, I think TIMEOUT is a "why not ok?" answer. "not ok" seems to be
the same as ERROR, but I guess it's really a catch-all, but must come
with the "evidence" portion:

not ok 1 description goes here # TIMEOUT 45 seconds with no results
not ok 1 description goes here # ERROR lost connection to server
not ok 1 description goes here # ERROR non-zero exit code: NNN

I might go so far as to say the directive is _required_ for the "not ok"
state.

> As an aside, I've never liked XFAIL or XPASS.  Semantically, I believe they
> are quite confusing, and depend on the level of testcase execution.  XPASS is also
> redundant.  Isn't the default case always XPASS?

Nothing currently uses XPASS, and we should remove it.

XFAIL capture the condition of the test logic in that a test for
something arm64-specific on a non-arm64 machine is _not_ a "pass". It
will fail. But it's an expected failure. And, as a directive, must come
with the evidence bit:

not ok 1 PAN catches bad accessg # XFAIL PAN CPU feature is arm64 only

As a test author, I want to see this as distinct from failure and
success.

> In the case of XFAIL, does it mean that an operation that was expected to fail
> all the time, failed, and so the test passed?  Or does it mean that this is a known
> bug that a particular developer or site has decided to ignore (as in "we know this
> is a bug, but we expected it to fail, so just ignore it for now).  The latter, I really
> don't think should be encoded in the tests themselves, but should be put
> into a layer above the test and parser, that is applied according to developer
> or testlab policy.

I agree that "just ignore it for now" shouldn't be there.

Other arguments are that xfail tests shouldn't be run at all, but I
don't think that's right because it makes it very hard to do large-scale
test result comparisons because random stuff is missing, depending on
various configs, archs, etc. It blocks a 1-to-1 comparison, and begs
questions like "why is the test here in one case and not in another?"
where as an XFAIL will encode the reason it is an XFAIL in its output,
providing a self-documenting test result.

> > I still think there should be a distinction between "diagnostic lines"
> > and "unknown lines". For example, if kselftest is running on a console,
> > the dmesg output may be intermixed, and possibly temporally offset. Such
> > lines may not be useful, and may not be strictly correlated, where-as
> > the output from kselftest is at least self-consistent.
> 
> I think this would be a very good distinction to make.  You might have
> unknown lines that (inadvertently?) start with a hash, but trying to 
> make sure that the test itself always outputs diagnostic data with a preceding
> hash gives a reader potentially more information about where the message
> came from, and could be useful.

Right, so lines with a hash shouldn't be machine-parsed. (And this is a
mistake I made in the current kselftest output for the LAVA parser.) The
good news for the kernel, is that nothing else in dmesg leads with a
"#" currently. :)

> > No; diagnostic lines must have no meaning: they are for humans and nothing
> > else. If other details are needed for machines, we should explicitly
> > create new format lines. I made a mistake when I used a diagnostic line
> > for embedding the test names. :( There is a need for parsers to discover
> > the name of a given test, though, so we do likely need something for this.
> 
> I agree it is handy to have the test name for sub-tests.  However, I don't
> see any semantic difference whether we define sub-test declarations via
> diagnostic lines with special syntax or by creating a new syntax separate
> from diagnostic lines.  Either way, we have a new syntax for parsers to
> understand.

Pragmatically, I agree, which is what lead me to my mistake. However, in
reconsidering it, I realize this leads to a slippery slope of just
dumping stuff into diagnostic lines and pretending nothing will read it.

I suspect what we need is an optional "starting test" line, like:

test N DESC
# diag...
# diag...
# diag...
ok N

The nesting looks like:

TAP version 13
1..3
test 1 seccomp
# TAP version 13
# 1..27
# test 1 prctl
# ok 1
# test 2 strict
# ok 2
# ...
# test 27 user_notif
# # eek missing CONFIG_....
# not ok 27 user_notif # ERROR can't use ioctl
not ok 1 seccomp # ERROR non-zero exit code 1
test 2 netfilter
ok 2 netfilter
test 3 bpf
# exciting things happen
ok 3 bpf

This is what we have now except basically just replacing "# $name" with
"test $num $name"

> > 
> > > Diagnostic lines of the format "# Subtest:
> > > <test_name>" indicate the start of a test with subtests. Also,
> > > diagnostic lines of the format "# <test_name>: <description>" refer to
> > > a specific test and tend to occur before the test result line of that
> > > test but are optional.
> > 
> > I don't think the above should be included in the spec -- diag lines
> > should have no parseable meaning.
> > 
> > >
> > > Bail out line
> > > -------------
> > >
> > > A bail out line can occur anywhere in the KTAP output and will
> > > indicate that a test has crashed. The format of a bail out line is
> > > "Bail out! [<description>]",  where the description can give
> > > information on why the bail out occurred and can be any string.
> > 
> > I'm not a fan of the Bail out line. It's not a problem, exactly, but I
> > find it redundant. If we want an "end of test" line, let's make one.
> > "Bail out" is a special case of exit. If we want to handle test exit,
> > let's define it. E.g. make kselftest's test summary lines not
> > diagnostic lines:
> > 
> > # FAILED: 85 / 87 tests passed.
> > # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0
> 
> How consistent is it for selftests to produce test summary lines?

Any of the tools built with kselftest.h will spit it out. (Which is to
say several of the "subtests".) The bulk of kselftest output is from the
runner, which doesn't produce this summary.

> > Please no. There is no reason to force a nested test to suddenly have
> > to know about its test execution depth/environment. A subtest is just a
> > nested TAP result. That it is nested is only meaningful to the parser, not
> > the executing test, and it must have the same output, nested or not. (e.g.
> > running the test alone or running the test under a full kselftest run,
> > the only difference is the indentation level.)
> I agree that the sub-test should not know about this line.  I thought
> the "parent" test wrote the line.

The suggestion was that the subtest should _not_ emit the "TAP" line,
which is what I strongly disagree with: it should be a self-contained
TAP-emitting test.

> > Again, I see only downsides to this. Nesting for the spec is simple
> > indentation-based recursion. Let's just keep what we already have:
> > 
> > TAP version 13
> > 1..1
> > # TAP version 13
> > # 1..2
> > # ok 1 - test_1
> > # ok 2 - test_2
> > ok 1 - test_suite
> 
> I prefer spaces, but whatever.  Is the prefix on the 'TAP version 13' line guaranteed
> to be the prefix on all lines for the (sub-)test?

That's the current implementation, yes.

As for tabs vs spaces, I think it's fine to swap "# " for "  ". We just
need to patch LAVA to deal with it.

> Just as a side note, in some Fuego tests, it was very useful to include an identifier
> in thethe prefix nested tests.  The output looked like this:
> 
> TAP version 13
> 1..2
> [batch_id 4] TAP version 13
> [batch_id 4] 1..2
> [batch_id 4] ok 1 - cyclictest with 1000 cycles
> [batch_id 4] # problem setting CLOCK_REALTIME
> [batch_id 4] not ok 2 - cyclictest with CLOCK_REALTIME
> not ok 1 - check realtime
> [batch_id 4] TAP version 13
> [batch_id 4] 1..1
> [batch_id 4] ok 1 - IOZone read/write 4k blocks
> ok 2 - check I/O performance
> 
> Can I propose that the prefix not be fixed by the spec, but that the spec indicates that
> whatever the prefix is on the TAP version line, that prefix must be used with the output for
> all lines from the test (with the exception of unknown lines)?

Oh, interesting. This would also allow parallel (unique) test execution
to be parsable. That sounds workable. (Again, this needs LAVA patching
again...)

> > > Major differences between TAP 14 and KTAP specification
> > > -------------------------------------------------------
> > >
> > > Note that the major differences between TAP 14 and KTAP specification:
> > > - yaml and json are not allowed in diagnostic messages
> > 
> > Agreed -- these are overkill (and very difficult to implement as they
> > _follow_ the test result line: anything generating them has already
> > finished running).
> 
> I thought that some tests put the diagnostic data before the result line.
> 
> I don't care whether it is before or after (and indeed Fuego's parser
> can handle either case).  But it can't handle it when some testcases
> put diagnostic data before the result line, and some testcases put
> it after, in the same test.  There's no delimiter for the separation
> between the two different testcases diagnostic data.

Right. The current implementation puts it always before. TAP 13 (14?) has
the YAML _after_ the results line, but that's totally unworkable for
kselftest, which is designed to be first human readable and second machine
readable. Waiting to see diagnostics from a slow test is terrible --
we can't collect it and spit it out later. Also, a parser would have
no idea about when a test is finished if diagnostics follow the last
test result of a plan.

Making it "always before" is fine by me, and the motivation for making
the "summary" lines be non-diagnostics because they technically followed
the last test result.

> Could we possibly have a rule about diagnostic lines positioning relative
> to the testcase result line?  Maybe a rule like this:  For a single test,
> diagnostic lines should either always preceded the testcase they
> are related to, or follow the testcase they are related to.

For the kernel, I would rather avoid anything after because that just
confuses things.

> > Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
> > be reported without the '#')
> 
> Please No!  (Unless I'm misinterpreting that line).
> 
> If you remove the '#', that makes the 'exit=127' part of the test description.
> The test description should be invariant between success and failure cases.
> The exit=127 (if I'm reading that line right) is part of data explaining or
> describing the failure. It would be 'exit=0' if the test succeeded.
> 
> In order to correlate test cases (ie show results over time), the description
> should not change from one invocation of the test to the next.  At least Fuego
> uses the test description this way.  That is, the testcase description is used as
> a testcase identifier, and if it changes from run to run things get confusing.

Agreed, I think it should be part of an ERROR directive:

not ok 14 awesome test # ERROR exit=127


-- 
Kees Cook

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: RFC - kernel test result specification (KTAP)
  2021-08-30 19:58     ` Kees Cook
@ 2021-08-30 22:04       ` Tim.Bird
  2021-08-31  7:46         ` David Gow
  2021-09-01 12:01         ` Mark Brown
  2021-08-30 22:21       ` Tim.Bird
  2022-02-04  2:44       ` Frank Rowand
  2 siblings, 2 replies; 14+ messages in thread
From: Tim.Bird @ 2021-08-30 22:04 UTC (permalink / raw)
  To: keescook
  Cc: rmoar, brendanhiggins, davidgow, shuah, dlatypov, kunit-dev,
	linux-kselftest, linux-kernel, kernelci, guillaume.tucker



> -----Original Message-----
> From: Kees Cook <keescook@chromium.org>
> 
> On Mon, Aug 30, 2021 at 05:48:07PM +0000, Tim.Bird@sony.com wrote:
> > From: Kees Cook <keescook@chromium.org>
> > > > Test case result lines
> > > > ----------------------
> > > >
> > > > Test case result lines must have the format:
> > > >
> > > >     <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
> > >
> > > "[<diagnostic data>]" is not defined below. I think what you mean is
> > > "directive details".
> >
> > IMHO it's useful to keep this called 'diagnostic data'.
> >
> > The directive is optional, and having this be diagnostic data means you can
> > put explanatory text, or result details, that are independent of the description,
> > on the same line as the result.  This is convenient in some cases.
> >
> > I think diagnostic data should always be prefixed by a hash sign.
> > Or, in the case where diagnostic data appears on the result line,
> > the hash serves as the delimiter.  This means that the description
> > text must not include a hash character.
> >
> > For example:
> > ok 5 - check return code # rcode=0
> > or
> > not ok 5 - check return code # rcode=17
> 
> The directive is currently used to answer the question "why?" as in,
> "why is this test not ok?" ("a timeout occurred") or "why is this test
> ok?" ("a SKIP")
> 
> The "supporting evidence" for the "why?" answer are specific details
> about that: "the exit was was NNN" or "the timeout was YYY seconds".
> 
> Test diagnostics are all the lines preceding that (currently prefixed
> with a "#").
> 
> > > Yes, though the optional "- " is strictly part of the optional
> > > description.
> >
> > It's mildly annoying that "-" is optional.  It's trivial to deal with in the parser
> > to just ignore it if it's present.  But it has no semantic meaning whatsoever.
> > IMHO it would be nice to either mandate it or remove it, for consistency's sake.
> > This could be based solely on the consensus for whether it added or detracted
> > from human readability, since parsers don't care.
> 
> I have no strong opinion on the "-". I was surprised to encounter it
> as it's not part of the TAP 13 spec. I would prefer to drop it, if I had
> to choose.
> 
> > ERROR usually means that there was some kind of failure
> > in the testing process that makes the testcase result indeterminate.
> > (like loss of connection to a server, or crash of the program executing
> > the testcase).  This can be due to a bug in the test or the test system.
> > I'm, not sure if ERROR is used this way in kselftest.  It is, IMHO, a useful
> > distinction to make.  TIMEOUT is a special case of ERROR, IMHO.
> 
> Well, I think TIMEOUT is a "why not ok?" answer. "not ok" seems to be
> the same as ERROR, but I guess it's really a catch-all, but must come
> with the "evidence" portion:

'not ok' should not mean the same thing as ERROR.  'not ok' means the 
testcase was run successfully, and the result is that the test determined
that the testcase failed.  ERROR means that the testcase was not
run successfully, so no result can be assumed.

But ERROR is confusing.  Maybe there's a better word for the concept
that is less likely to be conflated with the result.

> 
> not ok 1 description goes here # TIMEOUT 45 seconds with no results
> not ok 1 description goes here # ERROR lost connection to server
> not ok 1 description goes here # ERROR non-zero exit code: NNN
> 
> I might go so far as to say the directive is _required_ for the "not ok"
> state.
As I said above, I don't think they are the same.  I would reserve
ERROR for the case where a test error occurred and the test case
result could not be obtained.

If we go to the trouble of making our own spec, maybe it would
be better to support a third result value: 'unknown'.  This would
break all the TAP parsers, but it's much clearer, IMHO.

'ok 1 description' = test case executed successfully, and the testcase condition passed
'not ok 1 description' = test case executed successfully, and the testcase condition failed
'unknown 1 description # SKIP # reason' = test case does not apply for some reason
'unknown 1 description # ERROR # reason' = test case could not be executed, for some reason

In the case of SKIP and ERROR, the result should be ignored as it is either inapplicable
or wasn't obtained.

I remember we had a discussion about whether SKIP should have an "ok" or "not ok"
result.  IMHO it doesn't matter, because the result has no meaning when a SKIP directive
is used.  (Although it's useful for humans to see why something was skipped.)

> 
> > As an aside, I've never liked XFAIL or XPASS.  Semantically, I believe they
> > are quite confusing, and depend on the level of testcase execution.  XPASS is also
> > redundant.  Isn't the default case always XPASS?
> 
> Nothing currently uses XPASS, and we should remove it.
> 
> XFAIL capture the condition of the test logic in that a test for
> something arm64-specific on a non-arm64 machine is _not_ a "pass". It
> will fail. But it's an expected failure. And, as a directive, must come
> with the evidence bit:
> 
> not ok 1 PAN catches bad accessg # XFAIL PAN CPU feature is arm64 only
> 
> As a test author, I want to see this as distinct from failure and
> success.
I agree that this case is different from a normal failure.

In the example you provide of a feature that applies to only specific
architectures, I would use a SKIP instead.  It's not that the testcase failed.
It's that it does not even make sense for the device under test.

Whether something is 1) a testcase failure or 2) a testcase that does not
apply is a matter of perspective.  Is it the case that the feature should  exist
on all architectures,  and it's just not been written yet?  I could call this a
'not ok' or an XFAIL. However, if it's a feature that is never intended to be
written for arm32 or i386,  I would report that as a SKIP.

But all of this differs from what I originally described, which is labs using
XFAIL to annotate bugs that they knew about but didn't intend to fix.

> 
> > In the case of XFAIL, does it mean that an operation that was expected to fail
> > all the time, failed, and so the test passed?  Or does it mean that this is a known
> > bug that a particular developer or site has decided to ignore (as in "we know this
> > is a bug, but we expected it to fail, so just ignore it for now).  The latter, I really
> > don't think should be encoded in the tests themselves, but should be put
> > into a layer above the test and parser, that is applied according to developer
> > or testlab policy.
> 
> I agree that "just ignore it for now" shouldn't be there.
> 
> Other arguments are that xfail tests shouldn't be run at all, but I
> don't think that's right because it makes it very hard to do large-scale
> test result comparisons because random stuff is missing, depending on
> various configs, archs, etc. It blocks a 1-to-1 comparison, and begs
> questions like "why is the test here in one case and not in another?"
I completely agree with this.  It really messes up tables of results when
the list of testcases varies depending on some board
or configuration attribute.  It's very hard for users to parse and compare
results in that case.

> where as an XFAIL will encode the reason it is an XFAIL in its output,
> providing a self-documenting test result.


> 
> > > I still think there should be a distinction between "diagnostic lines"
> > > and "unknown lines". For example, if kselftest is running on a console,
> > > the dmesg output may be intermixed, and possibly temporally offset. Such
> > > lines may not be useful, and may not be strictly correlated, where-as
> > > the output from kselftest is at least self-consistent.
> >
> > I think this would be a very good distinction to make.  You might have
> > unknown lines that (inadvertently?) start with a hash, but trying to
> > make sure that the test itself always outputs diagnostic data with a preceding
> > hash gives a reader potentially more information about where the message
> > came from, and could be useful.
> 
> Right, so lines with a hash shouldn't be machine-parsed. (And this is a
> mistake I made in the current kselftest output for the LAVA parser.) The
> good news for the kernel, is that nothing else in dmesg leads with a
> "#" currently. :)

Oh. That's actually nice to know.  Interesting.

> 
> > > No; diagnostic lines must have no meaning: they are for humans and nothing
> > > else. If other details are needed for machines, we should explicitly
> > > create new format lines. I made a mistake when I used a diagnostic line
> > > for embedding the test names. :( There is a need for parsers to discover
> > > the name of a given test, though, so we do likely need something for this.
> >
> > I agree it is handy to have the test name for sub-tests.  However, I don't
> > see any semantic difference whether we define sub-test declarations via
> > diagnostic lines with special syntax or by creating a new syntax separate
> > from diagnostic lines.  Either way, we have a new syntax for parsers to
> > understand.
> 
> Pragmatically, I agree, which is what lead me to my mistake. However, in
> reconsidering it, I realize this leads to a slippery slope of just
> dumping stuff into diagnostic lines and pretending nothing will read it.

Good point.  Whether it starts with a hash or not, TAP 13 parsers should
ignore anything that is not a result line (starts with 'ok' or 'not ok').
So I guess we are free to make a new syntax without upsetting any existing
parsers.

> 
> I suspect what we need is an optional "starting test" line, like:
> 
> test N DESC
> # diag...
> # diag...
> # diag...
> ok N
> 
> The nesting looks like:
> 
> TAP version 13
> 1..3
> test 1 seccomp
> # TAP version 13
> # 1..27
> # test 1 prctl
> # ok 1
> # test 2 strict
> # ok 2
> # ...
> # test 27 user_notif
> # # eek missing CONFIG_....
> # not ok 27 user_notif # ERROR can't use ioctl
> not ok 1 seccomp # ERROR non-zero exit code 1
> test 2 netfilter
> ok 2 netfilter
> test 3 bpf
> # exciting things happen
> ok 3 bpf
> 
> This is what we have now except basically just replacing "# $name" with
> "test $num $name"
> 
Actually, I like this idea.  Including the number is a nice touch.
I think that will help to distinguish such a line from other 'unknown line' output.
I don't know if I'd leave the description off of the result line, though. 
A lot of parsers are looking for that.

> > >
> > > > Diagnostic lines of the format "# Subtest:
> > > > <test_name>" indicate the start of a test with subtests. Also,
> > > > diagnostic lines of the format "# <test_name>: <description>" refer to
> > > > a specific test and tend to occur before the test result line of that
> > > > test but are optional.
> > >
> > > I don't think the above should be included in the spec -- diag lines
> > > should have no parseable meaning.
> > >
> > > >
> > > > Bail out line
> > > > -------------
> > > >
> > > > A bail out line can occur anywhere in the KTAP output and will
> > > > indicate that a test has crashed. The format of a bail out line is
> > > > "Bail out! [<description>]",  where the description can give
> > > > information on why the bail out occurred and can be any string.
> > >
> > > I'm not a fan of the Bail out line. It's not a problem, exactly, but I
> > > find it redundant. If we want an "end of test" line, let's make one.
> > > "Bail out" is a special case of exit. If we want to handle test exit,
> > > let's define it. E.g. make kselftest's test summary lines not
> > > diagnostic lines:
> > >
> > > # FAILED: 85 / 87 tests passed.
> > > # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0
> >
> > How consistent is it for selftests to produce test summary lines?
> 
> Any of the tools built with kselftest.h will spit it out. (Which is to
> say several of the "subtests".) The bulk of kselftest output is from the
> runner, which doesn't produce this summary.
> 
> > > Please no. There is no reason to force a nested test to suddenly have
> > > to know about its test execution depth/environment. A subtest is just a
> > > nested TAP result. That it is nested is only meaningful to the parser, not
> > > the executing test, and it must have the same output, nested or not. (e.g.
> > > running the test alone or running the test under a full kselftest run,
> > > the only difference is the indentation level.)
> > I agree that the sub-test should not know about this line.  I thought
> > the "parent" test wrote the line.
> 
> The suggestion was that the subtest should _not_ emit the "TAP" line,
> which is what I strongly disagree with: it should be a self-contained
> TAP-emitting test.

OK - agreed.  Yes.  The sub-test should not know whether they are running
standalone or nested (with the exception of some line prefix opaquely conveyed
to them)

> 
> > > Again, I see only downsides to this. Nesting for the spec is simple
> > > indentation-based recursion. Let's just keep what we already have:
> > >
> > > TAP version 13
> > > 1..1
> > > # TAP version 13
> > > # 1..2
> > > # ok 1 - test_1
> > > # ok 2 - test_2
> > > ok 1 - test_suite
> >
> > I prefer spaces, but whatever.  Is the prefix on the 'TAP version 13' line guaranteed
> > to be the prefix on all lines for the (sub-)test?
> 
> That's the current implementation, yes.
> 
> As for tabs vs spaces, I think it's fine to swap "# " for "  ". We just
> need to patch LAVA to deal with it.
> 
> > Just as a side note, in some Fuego tests, it was very useful to include an identifier
> > in thethe prefix nested tests.  The output looked like this:
> >
> > TAP version 13
> > 1..2
> > [batch_id 4] TAP version 13
> > [batch_id 4] 1..2
> > [batch_id 4] ok 1 - cyclictest with 1000 cycles
> > [batch_id 4] # problem setting CLOCK_REALTIME
> > [batch_id 4] not ok 2 - cyclictest with CLOCK_REALTIME
> > not ok 1 - check realtime
> > [batch_id 4] TAP version 13
> > [batch_id 4] 1..1
> > [batch_id 4] ok 1 - IOZone read/write 4k blocks
> > ok 2 - check I/O performance
> >
> > Can I propose that the prefix not be fixed by the spec, but that the spec indicates that
> > whatever the prefix is on the TAP version line, that prefix must be used with the output for
> > all lines from the test (with the exception of unknown lines)?
> 
> Oh, interesting. This would also allow parallel (unique) test execution
> to be parsable. That sounds workable. (Again, this needs LAVA patching
> again...)
> 
> > > > Major differences between TAP 14 and KTAP specification
> > > > -------------------------------------------------------
> > > >
> > > > Note that the major differences between TAP 14 and KTAP specification:
> > > > - yaml and json are not allowed in diagnostic messages
> > >
> > > Agreed -- these are overkill (and very difficult to implement as they
> > > _follow_ the test result line: anything generating them has already
> > > finished running).
> >
> > I thought that some tests put the diagnostic data before the result line.
> >
> > I don't care whether it is before or after (and indeed Fuego's parser
> > can handle either case).  But it can't handle it when some testcases
> > put diagnostic data before the result line, and some testcases put
> > it after, in the same test.  There's no delimiter for the separation
> > between the two different testcases diagnostic data.
> 
> Right. The current implementation puts it always before. TAP 13 (14?) has
> the YAML _after_ the results line, but that's totally unworkable for
> kselftest, which is designed to be first human readable and second machine
> readable. Waiting to see diagnostics from a slow test is terrible --
> we can't collect it and spit it out later. Also, a parser would have
> no idea about when a test is finished if diagnostics follow the last
> test result of a plan.
> 
> Making it "always before" is fine by me, and the motivation for making
> the "summary" lines be non-diagnostics because they technically followed
> the last test result.
> 
> > Could we possibly have a rule about diagnostic lines positioning relative
> > to the testcase result line?  Maybe a rule like this:  For a single test,
> > diagnostic lines should either always preceded the testcase they
> > are related to, or follow the testcase they are related to.
> 
> For the kernel, I would rather avoid anything after because that just
> confuses things.
> 
> > > Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
> > > be reported without the '#')
> >
> > Please No!  (Unless I'm misinterpreting that line).
> >
> > If you remove the '#', that makes the 'exit=127' part of the test description.
> > The test description should be invariant between success and failure cases.
> > The exit=127 (if I'm reading that line right) is part of data explaining or
> > describing the failure. It would be 'exit=0' if the test succeeded.
> >
> > In order to correlate test cases (ie show results over time), the description
> > should not change from one invocation of the test to the next.  At least Fuego
> > uses the test description this way.  That is, the testcase description is used as
> > a testcase identifier, and if it changes from run to run things get confusing.
> 
> Agreed, I think it should be part of an ERROR directive:
> 
> not ok 14 awesome test # ERROR exit=127

As long as the line above means that testcase '14 awesome test' could not be
run, and exit=127 helps diagnose that, then I agree.  If 'exit=127' means that
testcase '14 awesome test' was run and exit=127 was a failing result, I would
say that the result line should just be:
not ok 14 awesome test # exit=127

  -- Tim Bird


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: RFC - kernel test result specification (KTAP)
  2021-08-30 19:58     ` Kees Cook
  2021-08-30 22:04       ` Tim.Bird
@ 2021-08-30 22:21       ` Tim.Bird
  2022-02-04  2:44       ` Frank Rowand
  2 siblings, 0 replies; 14+ messages in thread
From: Tim.Bird @ 2021-08-30 22:21 UTC (permalink / raw)
  To: keescook
  Cc: rmoar, brendanhiggins, davidgow, shuah, dlatypov, kunit-dev,
	linux-kselftest, linux-kernel, kernelci, guillaume.tucker



> -----Original Message-----
> From: Kees Cook <keescook@chromium.org>
> 
> On Mon, Aug 30, 2021 at 05:48:07PM +0000, Tim.Bird@sony.com wrote:
> > From: Kees Cook <keescook@chromium.org>
...
> > > Yes, though the optional "- " is strictly part of the optional
> > > description.
> >
> > It's mildly annoying that "-" is optional.  It's trivial to deal with in the parser
> > to just ignore it if it's present.  But it has no semantic meaning whatsoever.
> > IMHO it would be nice to either mandate it or remove it, for consistency's sake.
> > This could be based solely on the consensus for whether it added or detracted
> > from human readability, since parsers don't care.
> 
> I have no strong opinion on the "-". I was surprised to encounter it
> as it's not part of the TAP 13 spec. I would prefer to drop it, if I had
> to choose.

The TAP 13 specification does not mention "-", but a dash on the result line
is used in most of the examples shown in the specification here:
http://testanything.org/tap-specification.html

In the top two examples on that page, the first one does not use dashes and
the second one does.  It's kind of irritating to have that kind of loosey-goosey
syntax in a specification.  IMHO the syntax should be more strictly specified
than this.  I don't have a strong opinion either on whether to use dashes or not.
But it would be nice to make it consistent.
 -- Tim Bird


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC - kernel test result specification (KTAP)
  2021-08-28  8:20 ` Kees Cook
  2021-08-30 17:34   ` Brendan Higgins
  2021-08-30 17:48   ` Tim.Bird
@ 2021-08-30 23:39   ` David Gow
  2021-08-31  2:14   ` Rae Moar
  3 siblings, 0 replies; 14+ messages in thread
From: David Gow @ 2021-08-30 23:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Rae Moar, Brendan Higgins, Bird, Tim, Shuah Khan, Daniel Latypov,
	KUnit Development, open list:KERNEL SELFTEST FRAMEWORK,
	Linux Kernel Mailing List, kernelci, Guillaume Tucker

On Sat, Aug 28, 2021 at 4:20 PM Kees Cook <keescook@chromium.org> wrote:
>
> [+kernelci, +Guillaume]
>
> Hi!
>
> Please keep me in CC for these kinds of proposals. And thanks for looking
> at this again! Please understand that while I may be coming across as
> rather negative here, I would like to have a rational and documented
> specification for the kernel's test output, too. My main objection here
> is that we already _have_ a specification, and it's already being parsed
> by machines, so making changes without strong justification is going to
> be met with resistance. :) So, with that in mind, on to my reply...
>
> On Tue, Aug 10, 2021 at 04:25:03PM -0700, Rae Moar wrote:
> > We are looking to further standardise the output format used by kernel
> > test frameworks like kselftest and KUnit. Thus far we have used the
> > TAP (Test Anything Protocol) specification, but it has been extended
> > in many different ways, so we would like to agree on a common "Kernel
> > TAP" (KTAP) format to resolve these differences. Thus, below is a
> > draft of a specification of KTAP. Note that this specification is
> > largely based on the current format of test results for KUnit tests.
>
> The kernel's largest producer of TAP is kselftest, and the largest
> consumer is LAVA[1]. I would want buy-in from at least those responsible
> for either side of those two things. (And I'm one of the people working
> on both sides of it.)
>
> The fundamental purpose of the kernel's TAP is to have many independent
> tests runnable and parseable, specifically without any embedded framework
> knowledge (or, even, any knowledge of TAP).
>
> The tests run by kselftest come from 2 different TAP-producing harnesses
> (kselftest.h for C, kselftest/runner.sh for TAP-agnostic tests) as well
> as several hand-rolled instances in Shell, Python, and C. There used to
> be more, but I have been steadily fixing their syntax[2] and merging[3]
> separate implementations for a while now.
>
> [1] https://github.com/Linaro/test-definitions/commit/8bd338bbcfa5a03efcf1d12e25b5d341d5a29cbc
> [2] https://git.kernel.org/linus/b0df366bbd701c45e93af0dcb87ce22398589d1d
> [3] https://git.kernel.org/linus/e80068be21824e4d2726eeea41cac6178d212415
>

Thanks a lot for doing this, by the way!

I think many of the issues here stem from the original TAP spec having
been insufficient for kernel stuff, and a bit of divergent evolution
having occurred between kselftest, KUnit, and the dormant TAP 14 spec.
This proposed spec does approach things more from the KUnit side, just
because that's what we're more familiar with, but I agree that
kselftest and LAVA are the bigger fish in this pond. KUnit's parser
has also been a bit stricter in what it accepts, and the TAP producing
code is shared between all of the KUnit tests, which makes prototyping
changes a bit easier.

Fortunately, most of these differences seem pretty minor in the grand
scheme of things, so I'm sure we can adapt this spec to fit what
kselftest is doing better, while still leaving enough of the structure
the KUnit tooling requires.

In any case, here are some of my initial thoughts:

> > Additionally, this specification was heavily inspired by the KTAP
> > specification draft by Tim Bird
> > (https://lore.kernel.org/linux-kselftest/CY4PR13MB1175B804E31E502221BC8163FD830@CY4PR13MB1175.namprd13.prod.outlook.com/T/).
> > However, there are some notable differences to his specification. One
> > such difference is the format of nested tests is more fully specified
> > in the following specification. However, they are specified in a way
> > which may not be compatible with many kselftest nested tests.
>
> I commented extensively on that thread. :)
>
> >
> > =====================
> > Specification of KTAP
> > =====================
> >
> > TAP, or the Test Anything Protocol is a format for specifying test
> > results used by a number of projects. It's website and specification
> > are found at: https://testanything.org/. The Linux Kernel uses TAP
> > output for test results. However, KUnit (and other Kernel testing
> > frameworks such as kselftest) have some special needs for test results
> > which don't gel perfectly with the original TAP specification. Thus, a
> > "Kernel TAP" (KTAP) format is specified to extend and alter TAP to
> > support these use-cases.
> >
> > KTAP Output consists of 5 major elements (all line-based):
> > - The version line
> > - Plan lines
> > - Test case result lines
>
> These are required.
>
> > - Diagnostic lines
>
> This is optional.
>
> > - A bail out line
>
> Bail out should be optional, and possibly not included at all. (See
> below.)
>

Yeah, I'd happily jettison this totally if no-one's using it.

> >
> > An important component in this specification of KTAP is the
> > specification of the format of nested tests. This can be found in the
> > section on nested tests below.
> >
> > The version line
> > ----------------
> >
> > The first line of KTAP output must be the version line. As this
> > specification documents the first version of KTAP,  the recommended
> > version line is "KTAP version 1". However, since all kernel testing
> > frameworks use TAP version lines, "TAP version 14" and "TAP version
> > 13" are all acceptable version lines. Version lines with other
> > versions of TAP or KTAP will not cause the parsing of the test results
> > to fail but it will produce an error.
>
> Maybe "TAP version Linux.1" ? I don't want to needlessly break existing
> parsers.
>

I'd be okay with this. I personally think "KTAP" better describes how
much of a deviation this is turning out to be from the official TAP
specs, but if this fixes more parsers than it breaks, it's worth
doing.

> >
> > Plan lines
> > ----------
> >
> > Plan lines must follow the format of "1..N" where N is the number of
> > subtests. The second line of KTAP output must be a plan line, which
> > indicates the number of tests at the highest level, such that the
> > tests do not have a parent. Also, in the instance of a test having
> > subtests, the second line of the test after the subtest header must be
> > a plan line which indicates the number of subtests within that test.
>
> I do not want "subtests" as a specification concept, only "nested
> tests".
>

I don't think there's fundamentally a distinction worth preserving
here. "Subtests" is the language used by the TAP 14 spec, but if we
want to reframe it as nested tests, that's not a problem (and I agree
'subtests' is a little misleading here).

> >
> > Test case result lines
> > ----------------------
> >
> > Test case result lines must have the format:
> >
> >     <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
>
> "[<diagnostic data>]" is not defined below. I think what you mean is
> "directive details".
>
> I suggest:
>
> <result> <number>[ [- ]<description>][# <directive> [<directive details>]]
>
> > The result can be either "ok", which indicates the test case passed,
> > or "not ok", which indicates that the test case failed.
>
> Yes.
>
> >
> > The number represents the number of the test case or suite being
> > performed. The first test case or suite must have the number 1 and the
> > number must increase by 1 for each additional test case or result at
> > the same level and within the same testing suite.
>
> Yes, though parsers should handle ordering failures and missed test
> results (this is the purpose of the "plan" line).
>

For the record, KUnit's parser prints an error for each test with the
"wrong number", but still reports its result.

> >
> > The "-" character is optional.
> >
> > The description is a description of the test, generally the name of
> > the test, and can be any string of words (can't include #). The
> > description is optional.
>
> Yes, though the optional "- " is strictly part of the optional
> description.
>

I think the goal of specifying the "- " explicitly is that it's very
ugly to have test names/descriptions start with a "-" if one is used.

> > The directive is used to indicate if a test was skipped. The format
>
> The directive is a single word -- currently only "SKIP" is recognized
> by TAP 14(?), but kselftest uses also XFAIL, XPASS, TIMEOUT, and
> error. (Though I would love to hear what "error" is intended to be used
> for, if different from regular "not ok".)
>
> (This could also be called "context" rather than "directive".)
>

The name here matches the TAP spec, but I don't think we need to be
too beholden to it.

Personally, I think it'd make more sense to have SKIP, XFAIL, etc be
different statuses like "ok" and "not ok", rather than a directive
after the name, but that's possibly too radical a departure from TAP.

> > for the directive is: "# SKIP [<skip_description>]". The
> > skip_description is optional and can be any string of words to
> > describe why the test was skipped.
>
> I would call this "directive details".
>

Agreed.

> > The result of the test case result
> > line can be either "ok" or "not ok" if the skip directive is used.
> > Finally, note that TAP 14 specification includes TODO directives but
> > these are not supported for KTAP.
> >
> > Examples of test case result lines:
> >
> > Test passed:
> >
> >     ok 1 - test_case_name
> >
> > Test was skipped:
> >
> >     not ok 1 - test_case_name # SKIP test_case_name should be skipped
> >
> > Test failed:
> >
> >     not_ok 1 - test_case_name
>
> This isn't valid. No "_" is recognized for "ok" vs "not ok".
>
> >
> > Diagnostic lines
> > ----------------
> >
> > Diagnostic lines are used for description of testing operations.
> > Diagnostic lines are generally formatted as "#
> > <diagnostic_description>", where the description can be any string.
> > However, in practice, diagnostic lines are all lines that don't follow
> > the format of any other KTAP line format.
>
> I still think there should be a distinction between "diagnostic lines"
> and "unknown lines". For example, if kselftest is running on a console,
> the dmesg output may be intermixed, and possibly temporally offset. Such
> lines may not be useful, and may not be strictly correlated, where-as
> the output from kselftest is at least self-consistent.
>

I'm in two minds about this. On one hand, I totally agree that there's
a useful distinction here, and that a lot of dmesg output can get
nondeterministically mixed in. On the other, there are some tests for
which the dmesg output which gets mixed in is important output which
shouldn't be stripped out (for example, the KUnit KASAN tests print
KASAN reports, which we want to treat as "diagnostics" because they're
useful to understand what's happening in the tests.

That being said, there's nothing actually stopping us from maintaining
this distinction, and just having the KUnit parser treat "unknown"
lines the same as "diagnostic" lines, while maintaining the
distinction in theory.

> > Diagnostic lines can be
> > anywhere in the test output after the first two lines.
>
> I don't see a reason for this strictness. They can be anywhere.
>

I think this was just there because because we wanted to have room for
the TAP version header (or the 'Subtest' line in this spec/TAP14) and
the test plan, which needed to be the first two lines.

> > There are a few
> > special diagnostic lines.
>
> No; diagnostic lines must have no meaning: they are for humans and nothing
> else. If other details are needed for machines, we should explicitly
> create new format lines. I made a mistake when I used a diagnostic line
> for embedding the test names. :( There is a need for parsers to discover
> the name of a given test, though, so we do likely need something for this.
>

Yeah: the problem here is that we do have a pretty random mix of lines
starting with "#" which need parsing, and those which don't. Some of
this is inherited from the TAP or TAP14 spec. Using "# " for nesting
makes this a bit confusing as well.

Maybe this just needs reframing to something like "lines starting with
'#' which are not parseable as any of the following are diagnostic
lines", rather than the current implication that it's "anything that
starts with '#'".

Or we could try to clean up the mess and get rid of the leading '#'
everywhere else, but that's also a fair bit of work.

> > Diagnostic lines of the format "# Subtest:
> > <test_name>" indicate the start of a test with subtests. Also,
> > diagnostic lines of the format "# <test_name>: <description>" refer to
> > a specific test and tend to occur before the test result line of that
> > test but are optional.
>
> I don't think the above should be included in the spec -- diag lines
> should have no parseable meaning.
>
> >
> > Bail out line
> > -------------
> >
> > A bail out line can occur anywhere in the KTAP output and will
> > indicate that a test has crashed. The format of a bail out line is
> > "Bail out! [<description>]",  where the description can give
> > information on why the bail out occurred and can be any string.
>
> I'm not a fan of the Bail out line. It's not a problem, exactly, but I
> find it redundant. If we want an "end of test" line, let's make one.
> "Bail out" is a special case of exit. If we want to handle test exit,
> let's define it. E.g. make kselftest's test summary lines not
> diagnostic lines:
>
> # FAILED: 85 / 87 tests passed.
> # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0

KUnit has just added a similar line, though it's optional, and is
mainly used as a way of providing sums of nested tests at different
levels:
https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/commit/?h=kunit&id=acd8e8407b8fcc3229d6d8558cac338bea801aed

It's currently optional, though, so we might want to change that if we
rely on it.

In general, though, I agree that "Bail out!" is pretty useless. I
suspect it's here mostly because "it was in the TAP spec" rather than
"it's something we actually want to use".

>
> Also, parsers should treat a new "TAP version..." line at the same
> nesting level as indication that the prior test has ended (and any tests
> without result lines should be considered failed).
>

I do like this idea as a way of making it possible to just cat test
results together and get a sensibly merged result.
I think we want to count any tests without result lines marked as
"error" (test execution didn't complete/crashed/etc), rather than
"failed", but that's just a detail.

> >
> > Nested tests
> > ------------
> >
> > The new specification for KTAP will support an arbitrary number of
> > nested subtests. Thus, tests can now have subtests and those subtests
> > can have subtests. This can be useful to further categorize tests and
> > organize test results.
> >
> > The new required format for a test with subtests consists of: a
> > subtest header line, a plan line, all subtests, and a final test
> > result line.
> >
> > The first line of the test must be the subtest header line with the
> > format: "# Subtest: <test_name>".
>
> Please no. There is no reason to force a nested test to suddenly have
> to know about its test execution depth/environment. A subtest is just a
> nested TAP result. That it is nested is only meaningful to the parser, not
> the executing test, and it must have the same output, nested or not. (e.g.
> running the test alone or running the test under a full kselftest run,
> the only difference is the indentation level.)
>

This originally came from the draft TAP 14 spec here:
https://github.com/TestAnything/testanything.github.io/pull/36/files

On the KUnit side, we were labouring under the impression TAP 14 would
be ratified and implemented, so a lot of TAP 14 things ended up in our
implementation, including this line. Given that it stalled a long time
ago and no-one seems to care, I don't think that compatibility with it
needs to be an explicit goal any more, but it is still something that
KUnit uses at the moment.

The other thing worth noting is that, in KUnit, while nested test
support is limited, the framework does need to keep track of the
nesting depth in order to print the indentation.

> > The second line of the test must be the plan line, which is formatted
> > as "1..N", where N is the number of subtests.
> >
> > Following the plan line, all lines pertaining to the subtests will follow.
> >
> > Finally, the last line of the test is a final test result line with
> > the format: "(ok|not ok) <number> [-] [<description>] [<directive>]
> > [<diagnostic data>]", which follows the same format as the general
> > test result lines described in this section. The result line should
> > indicate the result of the subtests. Thus, if one of the subtests
> > fail, the test should fail. The description in the final test result
> > line should match the name of the test in the subtest header.
> >
> > An example format:
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..2
> >     ok 1 - test_1
> >     ok 2 - test_2
> > ok 1 - test_suite
>
> Again, I see only downsides to this. Nesting for the spec is simple
> indentation-based recursion. Let's just keep what we already have:
>
> TAP version 13
> 1..1
> # TAP version 13
> # 1..2
> # ok 1 - test_1
> # ok 2 - test_2
> ok 1 - test_suite
>
>
> > An example format with multiple levels of nested testing:
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..2
> >         # Subtest: sub_test_suite
> >         1..2
> >         ok 1 - test_1
> >         ok 2 test_2
> >     ok 1 - sub_test_suite
> >     ok 2 - test
> > ok 1 - test_suite
> >
> > In the instance that the plan line is missing, the end of the test
> > will be denoted by the final result line containing a description that
> > matches the name of the test given in the subtest header. Note that
> > thus, if the plan line is missing and one of the subtests have a
> > matching name to the test suite this will cause errors.
>
> A plan line is required. No special cases are needed. :)
>
> If nesting level is lost, a parser will understand the nested test to
> have ended, but probably so did the test runner:
>
> TAP version 13
> 1..1
>     TAP version 13
>     1..2
>         TAP version 13
>         1..3
>         ok 1 - test_1
>         ok 2 test_2
>     not ok 1 - sub test unfinished plan
>     ok 2 - test
> not ok 1 - test_suite
>
> > Lastly, indentation is also recommended for improved readability.
>
> Indentation is _required_. :)
>
> Whether this is "# " or "  " I don't really care, as long as the change
> is coordinated. "  " is easier for humans to read, but may make parsing of
> "unknown" lines more difficult for machines.
>

As one of the people who has an irrational distaste for syntactically
significant indentation, I'd prefer using something like the TAP
version header or the test plan as a way of indicating nesting. The
TAP 14 spec says that they should be indented four spaces, but "# " is
probably better, even if it doesn't gel perfectly with my mental model
of "#" lines not needing parsing (which went out the window ages ago).

> >
> > Major differences between TAP 14 and KTAP specification
> > -------------------------------------------------------
> >
> > Note that the major differences between TAP 14 and KTAP specification:
> > - yaml and json are not allowed in diagnostic messages
>
> Agreed -- these are overkill (and very difficult to implement as they
> _follow_ the test result line: anything generating them has already
> finished running).
>
> > - TODO directive not allowed
>
> I would just say "unrecognized".
>

I think this is actually supported in at least some versions of
TAP...? So maybe this isn't a difference at all.

> > - KTAP allows for an arbitrary number of tests to be nested with
> > specified nested test format
>
> Yup.
>

Again, I think this is actually a difference between TAP14/KTAP and
KUnit's current logic. TAP 14 does support nested tests using this
format, but KUnit only supported one level of nesting (Suite->Test).

> >
> > Example of KTAP
> > ---------------
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..1
> >         # Subtest: sub_test_suite
> >         1..2
> >         ok 1 - test_1
> >         ok 2 test_2
> >     ok 1 - sub_test_suite
> > ok 1 - test_suite
>
> For a good example, please include all the possible combinations (SKIP,
> not ok, diagnostics, etc etc)
>
> >
> > =========================================
> > Note on incompatibilities with kselftests
> > =========================================
> >
> > To my knowledge, the above specification seems to generally accept the
> > TAP format of many non-nested test results of kselftests.
> >
> > An example of a common kselftests TAP format for non-nested test
> > results that are accepted by the above specification:
> >
> > TAP version 13
> > 1..2
> > # selftests: vDSO: vdso_test_gettimeofday
> > # The time is 1628024856.096879
> > ok 1 selftests: vDSO: vdso_test_gettimeofday
> > # selftests: vDSO: vdso_test_getcpu
> > # Could not find __vdso_getcpu
> > ok 2 selftests: vDSO: vdso_test_getcpu # SKIP
> >
> > However, one major difference noted with kselftests is the use of more
> > directives than the "# SKIP" directive. kselftest also supports XPASS
> > and XFAIL directives. Some additional examples found in kselftests:
> >
> >     not ok 5 selftests: netfilter: nft_concat_range.sh # TIMEOUT 45 seconds
> >
> >     not ok 45 selftests: kvm: kvm_binary_stats_test # exit=127
> >
> > Should the specification be expanded to include these directives?
>
> Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
> be reported without the '#')
>
> >
> > However, the general format for kselftests with nested test results
> > seems to differ from the above specification. It seems that a general
> > format for nested tests is as follows:
> >
> > TAP version 13
> > 1..2
> > # selftests: membarrier: membarrier_test_single_thread
> > # TAP version 13
> > # 1..2
> > # ok 1 sys_membarrier available
> > # ok 2 sys membarrier invalid command test: command = -1, flags = 0,
> > errno = 22. Failed as expected
> > ok 1 selftests: membarrier: membarrier_test_single_thread
> > # selftests: membarrier: membarrier_test_multi_thread
> > # TAP version 13
> > # 1..2
> > # ok 1 sys_membarrier available
> > # ok 2 sys membarrier invalid command test: command = -1, flags = 0,
> > errno = 22. Failed as expected
> > ok 2 selftests: membarrier: membarrier_test_multi_thread
> >
> > The major differences here, that do not match the above specification,
> > are use of "# " as an indentation and using a TAP version line to
> > denote a new test with subtests rather than the subtest header line
> > described above. If these are widely utilized formats in kselftests,
> > should we include both versions in the specification or should we
> > attempt to agree on a single format for nested tests? I personally
> > believe we should try to agree on a single format for nested tests.
> > This would allow for a cleaner specification of KTAP and would reduce
> > possible confusion.
>
> We already use "# " and the nested TAP lines to denote subtests. Without
> a good reason to change it, we should avoid the churn with the existing
> parsers.
>

Like with the "Subtests" line, this came from the draft TAP 14 spec, via KUnit.
Personally, I prefer the whitespace indent from a readability point of
view, and because it gels a bit better with my idea of what the "#"
was supposed to represent: a non-parseable additional diagnostic,
though that's not consistently been the case for ages, so I don't care
too much.

> > ====
> >
> > So what do people think about the above specification?
> > How should we handle the differences with kselftests?
>
> I'm probably a broken record by now, but kselftests _is_ the
> specification. ;) What about it needs changing, and why?
>

I think there's value in describing the spec more explicitly,
particularly since there's also the TAP spec, TAP 14 draft spec, and
KUnit's implementation, all of which are different in some ways. And
while a lot of the implicit spec can be gleaned from reading example
kselftest tests, that only tells you what is valid, not what isn't,
which makes writing parsers a bit trickier.

> > If this specification is accepted, where should the specification be documented?
>
> I imagine some .rst file under Documentation/dev-tools/, linked to from
> kselftest.rst and kunit/...rst
>
Agreed.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC - kernel test result specification (KTAP)
  2021-08-28  8:20 ` Kees Cook
                     ` (2 preceding siblings ...)
  2021-08-30 23:39   ` David Gow
@ 2021-08-31  2:14   ` Rae Moar
  3 siblings, 0 replies; 14+ messages in thread
From: Rae Moar @ 2021-08-31  2:14 UTC (permalink / raw)
  To: Kees Cook
  Cc: Brendan Higgins, David Gow, Tim.Bird, Shuah Khan, Daniel Latypov,
	kunit-dev, linux-kselftest, linux-kernel, kernelci,
	Guillaume Tucker

Hello everyone! Thank you for all of your comments! I am glad to see some
discussion on this email. First of all, my primary goal with this email is to
advocate for a documented specification for the kernel's test output.  I
presented my first email largely from the perspective of KUnit and thus, I
am not surprised there are points of contention in the proposed specification.
That being said my comments are as follows:

On Sat, Aug 28, 2021 at 4:20 AM Kees Cook <keescook@chromium.org> wrote:
>
> [+kernelci, +Guillaume]
>
> Hi!
>
> Please keep me in CC for these kinds of proposals.

Very sorry for not including you as a CC. I will be sure to do this in the
future.

> And thanks for looking
> at this again! Please understand that while I may be coming across as
> rather negative here, I would like to have a rational and documented
> specification for the kernel's test output, too. My main objection here
> is that we already _have_ a specification,

I totally agree that the documented specification should generally follow
the agreed upon formats for kernel test results. However, there seems to be
some prominent differences in the formats of test results between KUnit and
kselftests, as well as within kselftests (requiring test plan line, use of
different directives). So I am primarily advocating for the specification to be
agreed upon and documented.

> and it's already being parsed
> by machines, so making changes without strong justification is going to
> be met with resistance. :) So, with that in mind, on to my reply...
>
> On Tue, Aug 10, 2021 at 04:25:03PM -0700, Rae Moar wrote:
> > We are looking to further standardise the output format used by kernel
> > test frameworks like kselftest and KUnit. Thus far we have used the
> > TAP (Test Anything Protocol) specification, but it has been extended
> > in many different ways, so we would like to agree on a common "Kernel
> > TAP" (KTAP) format to resolve these differences. Thus, below is a
> > draft of a specification of KTAP. Note that this specification is
> > largely based on the current format of test results for KUnit tests.
>
> The kernel's largest producer of TAP is kselftest, and the largest
> consumer is LAVA[1]. I would want buy-in from at least those responsible
> for either side of those two things. (And I'm one of the people working
> on both sides of it.)
>
> The fundamental purpose of the kernel's TAP is to have many independent
> tests runnable and parseable, specifically without any embedded framework
> knowledge (or, even, any knowledge of TAP).
>
> The tests run by kselftest come from 2 different TAP-producing harnesses
> (kselftest.h for C, kselftest/runner.sh for TAP-agnostic tests) as well
> as several hand-rolled instances in Shell, Python, and C. There used to
> be more, but I have been steadily fixing their syntax[2] and merging[3]
> separate implementations for a while now.
>
> [1] https://github.com/Linaro/test-definitions/commit/8bd338bbcfa5a03efcf1d12e25b5d341d5a29cbc
> [2] https://git.kernel.org/linus/b0df366bbd701c45e93af0dcb87ce22398589d1d
> [3] https://git.kernel.org/linus/e80068be21824e4d2726eeea41cac6178d212415
>
> > Additionally, this specification was heavily inspired by the KTAP
> > specification draft by Tim Bird
> > (https://lore.kernel.org/linux-kselftest/CY4PR13MB1175B804E31E502221BC8163FD830@CY4PR13MB1175.namprd13.prod.outlook.com/T/).
> > However, there are some notable differences to his specification. One
> > such difference is the format of nested tests is more fully specified
> > in the following specification. However, they are specified in a way
> > which may not be compatible with many kselftest nested tests.
>
> I commented extensively on that thread. :)
>
> >
> > =====================
> > Specification of KTAP
> > =====================
> >
> > TAP, or the Test Anything Protocol is a format for specifying test
> > results used by a number of projects. It's website and specification
> > are found at: https://testanything.org/. The Linux Kernel uses TAP
> > output for test results. However, KUnit (and other Kernel testing
> > frameworks such as kselftest) have some special needs for test results
> > which don't gel perfectly with the original TAP specification. Thus, a
> > "Kernel TAP" (KTAP) format is specified to extend and alter TAP to
> > support these use-cases.
> >
> > KTAP Output consists of 5 major elements (all line-based):
> > - The version line
> > - Plan lines
> > - Test case result lines
>
> These are required.

Agreed.

>
> > - Diagnostic lines
>
> This is optional.

Agreed.

>
> > - A bail out line
>
> Bail out should be optional, and possibly not included at all. (See
> below.)

I am fine with the Bail out line being removed if it is not in use.

>
> >
> > An important component in this specification of KTAP is the
> > specification of the format of nested tests. This can be found in the
> > section on nested tests below.
> >
> > The version line
> > ----------------
> >
> > The first line of KTAP output must be the version line. As this
> > specification documents the first version of KTAP,  the recommended
> > version line is "KTAP version 1". However, since all kernel testing
> > frameworks use TAP version lines, "TAP version 14" and "TAP version
> > 13" are all acceptable version lines. Version lines with other
> > versions of TAP or KTAP will not cause the parsing of the test results
> > to fail but it will produce an error.
>
> Maybe "TAP version Linux.1" ? I don't want to needlessly break existing
> parsers.

I think it would be best to maintain a numerical version number. So I
would prefer to either maintain 'TAP version #' or 'KTAP version 1', to
indicate the test results follow the newly specified KTAP. However, I
am flexible on this.

>
> >
> > Plan lines
> > ----------
> >
> > Plan lines must follow the format of "1..N" where N is the number of
> > subtests. The second line of KTAP output must be a plan line, which
> > indicates the number of tests at the highest level, such that the
> > tests do not have a parent. Also, in the instance of a test having
> > subtests, the second line of the test after the subtest header must be
> > a plan line which indicates the number of subtests within that test.
>
> I do not want "subtests" as a specification concept, only "nested
> tests".
>

I am fine changing this concept. It might be slightly clearer.

> >
> > Test case result lines
> > ----------------------
> >
> > Test case result lines must have the format:
> >
> >     <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
>
> "[<diagnostic data>]" is not defined below. I think what you mean is
> "directive details".

I should have defined the [<diagnostic data>] more fully. My intention was
that the diagnostic data would accommodate including information in the
test result that is not a directive or the description of the test. For example,
this would accommodate the examples that Tim Bird mentioned:
ok 5 - check return code # rcode=0

Note that both the directive and diagnostic data would be after a # symbol.
So the more exact format I was intending would be as follows:

<result> <number> [-] [<description>] [# [<directive>][<diagnostic data>]]

>
> I suggest:
>
> <result> <number>[ [- ]<description>][# <directive> [<directive details>]]
>
> > The result can be either "ok", which indicates the test case passed,
> > or "not ok", which indicates that the test case failed.
>
> Yes.
>
> >
> > The number represents the number of the test case or suite being
> > performed. The first test case or suite must have the number 1 and the
> > number must increase by 1 for each additional test case or result at
> > the same level and within the same testing suite.
>
> Yes, though parsers should handle ordering failures and missed test
> results (this is the purpose of the "plan" line).
>
> >
> > The "-" character is optional.
> >
> > The description is a description of the test, generally the name of
> > the test, and can be any string of words (can't include #). The
> > description is optional.
>
> Yes, though the optional "- " is strictly part of the optional
> description.

Does this mean you would want the "-" parsed as part of
the description? I think having "-" at the beginning of test descriptions
might be confusing.

Additionally, I am flexible on whether the "-" is necessary. I personally
think it looks a bit better with the "-" but that is preference.

>
> > The directive is used to indicate if a test was skipped. The format
>
> The directive is a single word -- currently only "SKIP" is recognized
> by TAP 14(?), but kselftest uses also XFAIL, XPASS, TIMEOUT, and
> error. (Though I would love to hear what "error" is intended to be used
> for, if different from regular "not ok".)
>
> (This could also be called "context" rather than "directive".)
>
> > for the directive is: "# SKIP [<skip_description>]". The
> > skip_description is optional and can be any string of words to
> > describe why the test was skipped.
>
> I would call this "directive details".
>
> > The result of the test case result
> > line can be either "ok" or "not ok" if the skip directive is used.
> > Finally, note that TAP 14 specification includes TODO directives but
> > these are not supported for KTAP.
> >
> > Examples of test case result lines:
> >
> > Test passed:
> >
> >     ok 1 - test_case_name
> >
> > Test was skipped:
> >
> >     not ok 1 - test_case_name # SKIP test_case_name should be skipped
> >
> > Test failed:
> >
> >     not_ok 1 - test_case_name
>
> This isn't valid. No "_" is recognized for "ok" vs "not ok".

Oops. My typo here. I meant:
not ok 1 - test_case_name

>
> >
> > Diagnostic lines
> > ----------------
> >
> > Diagnostic lines are used for description of testing operations.
> > Diagnostic lines are generally formatted as "#
> > <diagnostic_description>", where the description can be any string.
> > However, in practice, diagnostic lines are all lines that don't follow
> > the format of any other KTAP line format.
>
> I still think there should be a distinction between "diagnostic lines"
> and "unknown lines". For example, if kselftest is running on a console,
> the dmesg output may be intermixed, and possibly temporally offset. Such
> lines may not be useful, and may not be strictly correlated, where-as
> the output from kselftest is at least self-consistent.

I would be fine with maintaining that distinction.

>
> > Diagnostic lines can be
> > anywhere in the test output after the first two lines.
>
> I don't see a reason for this strictness. They can be anywhere.
>

This was an attempt to ensure that the first two lines of the KTAP
results are the
version line and the test plan. However, looking back on this, it was a bit too
strict.

> > There are a few
> > special diagnostic lines.
>
> No; diagnostic lines must have no meaning: they are for humans and nothing
> else. If other details are needed for machines, we should explicitly
> create new format lines. I made a mistake when I used a diagnostic line
> for embedding the test names. :( There is a need for parsers to discover
> the name of a given test, though, so we do likely need something for this.
>
> > Diagnostic lines of the format "# Subtest:
> > <test_name>" indicate the start of a test with subtests. Also,
> > diagnostic lines of the format "# <test_name>: <description>" refer to
> > a specific test and tend to occur before the test result line of that
> > test but are optional.
>
> I don't think the above should be included in the spec -- diag lines
> should have no parseable meaning.
>
> >
> > Bail out line
> > -------------
> >
> > A bail out line can occur anywhere in the KTAP output and will
> > indicate that a test has crashed. The format of a bail out line is
> > "Bail out! [<description>]",  where the description can give
> > information on why the bail out occurred and can be any string.
>
> I'm not a fan of the Bail out line. It's not a problem, exactly, but I
> find it redundant. If we want an "end of test" line, let's make one.
> "Bail out" is a special case of exit. If we want to handle test exit,
> let's define it. E.g. make kselftest's test summary lines not
> diagnostic lines:
>
> # FAILED: 85 / 87 tests passed.
> # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0
>
> Also, parsers should treat a new "TAP version..." line at the same
> nesting level as indication that the prior test has ended (and any tests
> without result lines should be considered failed).
>
> >
> > Nested tests
> > ------------
> >
> > The new specification for KTAP will support an arbitrary number of
> > nested subtests. Thus, tests can now have subtests and those subtests
> > can have subtests. This can be useful to further categorize tests and
> > organize test results.
> >
> > The new required format for a test with subtests consists of: a
> > subtest header line, a plan line, all subtests, and a final test
> > result line.
> >
> > The first line of the test must be the subtest header line with the
> > format: "# Subtest: <test_name>".
>
> Please no. There is no reason to force a nested test to suddenly have
> to know about its test execution depth/environment. A subtest is just a
> nested TAP result. That it is nested is only meaningful to the parser, not
> the executing test, and it must have the same output, nested or not. (e.g.
> running the test alone or running the test under a full kselftest run,
> the only difference is the indentation level.)

To clarify if this was implemented in kselftests, I believe the parent test
would print this subtest header. So the executing test would not have
this knowledge about the execution depth and thus, if the test was not
nested, it would not print this subtest header. This is my understanding.

Additionally, this is the current format for nested tests for KUnit.
As David mentioned, KUnit adopted this format from the proposed
TAP 14 draft spec
(https://github.com/TestAnything/testanything.github.io/pull/36/files).
Currently KUnit relies on this Subtest header line to parse nested tests.
I hope this is taken into account with discussions.

However, I hope we can decide on a singular format for these nested
tests. I am flexible with this specification. I would be fine with including
the nested version lines.

>
> > The second line of the test must be the plan line, which is formatted
> > as "1..N", where N is the number of subtests.
> >
> > Following the plan line, all lines pertaining to the subtests will follow.
> >
> > Finally, the last line of the test is a final test result line with
> > the format: "(ok|not ok) <number> [-] [<description>] [<directive>]
> > [<diagnostic data>]", which follows the same format as the general
> > test result lines described in this section. The result line should
> > indicate the result of the subtests. Thus, if one of the subtests
> > fail, the test should fail. The description in the final test result
> > line should match the name of the test in the subtest header.
> >
> > An example format:
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..2
> >     ok 1 - test_1
> >     ok 2 - test_2
> > ok 1 - test_suite
>
> Again, I see only downsides to this. Nesting for the spec is simple
> indentation-based recursion. Let's just keep what we already have:
>
> TAP version 13
> 1..1
> # TAP version 13
> # 1..2
> # ok 1 - test_1
> # ok 2 - test_2
> ok 1 - test_suite
>
>
> > An example format with multiple levels of nested testing:
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..2
> >         # Subtest: sub_test_suite
> >         1..2
> >         ok 1 - test_1
> >         ok 2 test_2
> >     ok 1 - sub_test_suite
> >     ok 2 - test
> > ok 1 - test_suite
> >
> > In the instance that the plan line is missing, the end of the test
> > will be denoted by the final result line containing a description that
> > matches the name of the test given in the subtest header. Note that
> > thus, if the plan line is missing and one of the subtests have a
> > matching name to the test suite this will cause errors.
>
> A plan line is required. No special cases are needed. :)
>

I am good with this. This was an attempt to be flexible because
it does seem there are tests that leave out the test plan or
place the test plan at the end of the test.

> If nesting level is lost, a parser will understand the nested test to
> have ended, but probably so did the test runner:
>
> TAP version 13
> 1..1
>     TAP version 13
>     1..2
>         TAP version 13
>         1..3
>         ok 1 - test_1
>         ok 2 test_2
>     not ok 1 - sub test unfinished plan
>     ok 2 - test
> not ok 1 - test_suite
>
> > Lastly, indentation is also recommended for improved readability.
>
> Indentation is _required_. :)

I think this could be a little strict to require indentation. For example,
in the instance of a nested test with a prefix, there could be difficulties
with correctly parsing the indentation level:

TAP version 13
1..1
[batch_id 4] TAP version 13
[batch_id 4] 1..2
[batch_id 4] ok 1 - cyclictest with 1000 cycles
[batch_id 4] # problem setting CLOCK_REALTIME
[batch_id 4] not ok 2 - cyclictest with CLOCK_REALTIME
not ok 1 - check realtime

(using a truncated version of an example Tim Bird used - note
he used it for a different point)

This could mostly be corrected with the use of "# " as the indentation. But
I personally prefer "  " as an indentation because the use of "# " can be
confusing with the format of diagnostic lines also starting with "# ". Thus,
I think it is safer to use the version lines or a subtest header lines to parse
nested tests.

>
> Whether this is "# " or "  " I don't really care, as long as the change
> is coordinated. "  " is easier for humans to read, but may make parsing of
> "unknown" lines more difficult for machines.
>
> >
> > Major differences between TAP 14 and KTAP specification
> > -------------------------------------------------------
> >
> > Note that the major differences between TAP 14 and KTAP specification:
> > - yaml and json are not allowed in diagnostic messages
>
> Agreed -- these are overkill (and very difficult to implement as they
> _follow_ the test result line: anything generating them has already
> finished running).
>
> > - TODO directive not allowed
>
> I would just say "unrecognized".

I am good with this change.

>
> > - KTAP allows for an arbitrary number of tests to be nested with
> > specified nested test format
>
> Yup.
>
> >
> > Example of KTAP
> > ---------------
> >
> > KTAP version 1
> > 1..1
> >     # Subtest: test_suite
> >     1..1
> >         # Subtest: sub_test_suite
> >         1..2
> >         ok 1 - test_1
> >         ok 2 test_2
> >     ok 1 - sub_test_suite
> > ok 1 - test_suite
>
> For a good example, please include all the possible combinations (SKIP,
> not ok, diagnostics, etc etc)
>
> >
> > =========================================
> > Note on incompatibilities with kselftests
> > =========================================
> >
> > To my knowledge, the above specification seems to generally accept the
> > TAP format of many non-nested test results of kselftests.
> >
> > An example of a common kselftests TAP format for non-nested test
> > results that are accepted by the above specification:
> >
> > TAP version 13
> > 1..2
> > # selftests: vDSO: vdso_test_gettimeofday
> > # The time is 1628024856.096879
> > ok 1 selftests: vDSO: vdso_test_gettimeofday
> > # selftests: vDSO: vdso_test_getcpu
> > # Could not find __vdso_getcpu
> > ok 2 selftests: vDSO: vdso_test_getcpu # SKIP
> >
> > However, one major difference noted with kselftests is the use of more
> > directives than the "# SKIP" directive. kselftest also supports XPASS
> > and XFAIL directives. Some additional examples found in kselftests:
> >
> >     not ok 5 selftests: netfilter: nft_concat_range.sh # TIMEOUT 45 seconds
> >
> >     not ok 45 selftests: kvm: kvm_binary_stats_test # exit=127
> >
> > Should the specification be expanded to include these directives?
>
> Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
> be reported without the '#')
>

Sounds good. Are there specific directives that should or should not
be included?

> >
> > However, the general format for kselftests with nested test results
> > seems to differ from the above specification. It seems that a general
> > format for nested tests is as follows:
> >
> > TAP version 13
> > 1..2
> > # selftests: membarrier: membarrier_test_single_thread
> > # TAP version 13
> > # 1..2
> > # ok 1 sys_membarrier available
> > # ok 2 sys membarrier invalid command test: command = -1, flags = 0,
> > errno = 22. Failed as expected
> > ok 1 selftests: membarrier: membarrier_test_single_thread
> > # selftests: membarrier: membarrier_test_multi_thread
> > # TAP version 13
> > # 1..2
> > # ok 1 sys_membarrier available
> > # ok 2 sys membarrier invalid command test: command = -1, flags = 0,
> > errno = 22. Failed as expected
> > ok 2 selftests: membarrier: membarrier_test_multi_thread
> >
> > The major differences here, that do not match the above specification,
> > are use of "# " as an indentation and using a TAP version line to
> > denote a new test with subtests rather than the subtest header line
> > described above. If these are widely utilized formats in kselftests,
> > should we include both versions in the specification or should we
> > attempt to agree on a single format for nested tests? I personally
> > believe we should try to agree on a single format for nested tests.
> > This would allow for a cleaner specification of KTAP and would reduce
> > possible confusion.
>
> We already use "# " and the nested TAP lines to denote subtests. Without
> a good reason to change it, we should avoid the churn with the existing
> parsers.
>
> > ====
> >
> > So what do people think about the above specification?
> > How should we handle the differences with kselftests?
>
> I'm probably a broken record by now, but kselftests _is_ the
> specification. ;) What about it needs changing, and why?

I think the change that needs to happen is that the specification is
documented so it is accessible to everyone. Otherwise I hope this
proposed specification helps to incite needed discussion so we
hopefully can create this accepted specification. Again thank you
to everyone for their comments.

>
> > If this specification is accepted, where should the specification be documented?
>
> I imagine some .rst file under Documentation/dev-tools/, linked to from
> kselftest.rst and kunit/...rst
>
> --
> Kees Cook

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC - kernel test result specification (KTAP)
  2021-08-30 22:04       ` Tim.Bird
@ 2021-08-31  7:46         ` David Gow
  2021-09-01 12:01         ` Mark Brown
  1 sibling, 0 replies; 14+ messages in thread
From: David Gow @ 2021-08-31  7:46 UTC (permalink / raw)
  To: Bird, Tim
  Cc: Kees Cook, Rae Moar, Brendan Higgins, Shuah Khan, Daniel Latypov,
	KUnit Development, open list:KERNEL SELFTEST FRAMEWORK,
	Linux Kernel Mailing List, kernelci, Guillaume Tucker

On Tue, Aug 31, 2021 at 6:04 AM <Tim.Bird@sony.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Kees Cook <keescook@chromium.org>
> >
> > On Mon, Aug 30, 2021 at 05:48:07PM +0000, Tim.Bird@sony.com wrote:
> > > From: Kees Cook <keescook@chromium.org>
> > > > > Test case result lines
> > > > > ----------------------
> > > > >
> > > > > Test case result lines must have the format:
> > > > >
> > > > >     <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
> > > >
> > > > "[<diagnostic data>]" is not defined below. I think what you mean is
> > > > "directive details".
> > >
> > > IMHO it's useful to keep this called 'diagnostic data'.
> > >
> > > The directive is optional, and having this be diagnostic data means you can
> > > put explanatory text, or result details, that are independent of the description,
> > > on the same line as the result.  This is convenient in some cases.
> > >
> > > I think diagnostic data should always be prefixed by a hash sign.
> > > Or, in the case where diagnostic data appears on the result line,
> > > the hash serves as the delimiter.  This means that the description
> > > text must not include a hash character.
> > >
> > > For example:
> > > ok 5 - check return code # rcode=0
> > > or
> > > not ok 5 - check return code # rcode=17
> >
> > The directive is currently used to answer the question "why?" as in,
> > "why is this test not ok?" ("a timeout occurred") or "why is this test
> > ok?" ("a SKIP")
> >
> > The "supporting evidence" for the "why?" answer are specific details
> > about that: "the exit was was NNN" or "the timeout was YYY seconds".
> >
> > Test diagnostics are all the lines preceding that (currently prefixed
> > with a "#").
> >
> > > > Yes, though the optional "- " is strictly part of the optional
> > > > description.
> > >
> > > It's mildly annoying that "-" is optional.  It's trivial to deal with in the parser
> > > to just ignore it if it's present.  But it has no semantic meaning whatsoever.
> > > IMHO it would be nice to either mandate it or remove it, for consistency's sake.
> > > This could be based solely on the consensus for whether it added or detracted
> > > from human readability, since parsers don't care.
> >
> > I have no strong opinion on the "-". I was surprised to encounter it
> > as it's not part of the TAP 13 spec. I would prefer to drop it, if I had
> > to choose.
> >
> > > ERROR usually means that there was some kind of failure
> > > in the testing process that makes the testcase result indeterminate.
> > > (like loss of connection to a server, or crash of the program executing
> > > the testcase).  This can be due to a bug in the test or the test system.
> > > I'm, not sure if ERROR is used this way in kselftest.  It is, IMHO, a useful
> > > distinction to make.  TIMEOUT is a special case of ERROR, IMHO.
> >
> > Well, I think TIMEOUT is a "why not ok?" answer. "not ok" seems to be
> > the same as ERROR, but I guess it's really a catch-all, but must come
> > with the "evidence" portion:
>
> 'not ok' should not mean the same thing as ERROR.  'not ok' means the
> testcase was run successfully, and the result is that the test determined
> that the testcase failed.  ERROR means that the testcase was not
> run successfully, so no result can be assumed.
>
> But ERROR is confusing.  Maybe there's a better word for the concept
> that is less likely to be conflated with the result.
>

In KUnit, we've largely used ERROR to refer to test cases with
malformed or corrupted output (e.g. the test code crashed, or there
are missing results according to the test plan, etc). Otherwise, the
test fails (or is skipped, in some circumstances).
This means that it's not been something tests explicitly sets, it's
something the parser outputs. That's obviously less of a sensible
thing for kselftest, which doesn't provide its own TAP parser (though
the bits of test harness code could maybe try to output something if a
subprocess failed, or the like.)

> >
> > not ok 1 description goes here # TIMEOUT 45 seconds with no results
> > not ok 1 description goes here # ERROR lost connection to server
> > not ok 1 description goes here # ERROR non-zero exit code: NNN
> >
> > I might go so far as to say the directive is _required_ for the "not ok"
> > state.
> As I said above, I don't think they are the same.  I would reserve
> ERROR for the case where a test error occurred and the test case
> result could not be obtained.
>
> If we go to the trouble of making our own spec, maybe it would
> be better to support a third result value: 'unknown'.  This would
> break all the TAP parsers, but it's much clearer, IMHO.
>
> 'ok 1 description' = test case executed successfully, and the testcase condition passed
> 'not ok 1 description' = test case executed successfully, and the testcase condition failed
> 'unknown 1 description # SKIP # reason' = test case does not apply for some reason
> 'unknown 1 description # ERROR # reason' = test case could not be executed, for some reason
>
> In the case of SKIP and ERROR, the result should be ignored as it is either inapplicable
> or wasn't obtained.
>
> I remember we had a discussion about whether SKIP should have an "ok" or "not ok"
> result.  IMHO it doesn't matter, because the result has no meaning when a SKIP directive
> is used.  (Although it's useful for humans to see why something was skipped.)
>

My feeling is that if we deviate from TAP enough to add another result
value, let's just make "skip" (and maybe "error", "xfail", "xpass" if
they make sense) it's own value rather than writing "unknown".

> >
> > > As an aside, I've never liked XFAIL or XPASS.  Semantically, I believe they
> > > are quite confusing, and depend on the level of testcase execution.  XPASS is also
> > > redundant.  Isn't the default case always XPASS?
> >
> > Nothing currently uses XPASS, and we should remove it.
> >
> > XFAIL capture the condition of the test logic in that a test for
> > something arm64-specific on a non-arm64 machine is _not_ a "pass". It
> > will fail. But it's an expected failure. And, as a directive, must come
> > with the evidence bit:
> >
> > not ok 1 PAN catches bad accessg # XFAIL PAN CPU feature is arm64 only
> >
> > As a test author, I want to see this as distinct from failure and
> > success.
> I agree that this case is different from a normal failure.
>
> In the example you provide of a feature that applies to only specific
> architectures, I would use a SKIP instead.  It's not that the testcase failed.
)> It's that it does not even make sense for the device under test.
>
> Whether something is 1) a testcase failure or 2) a testcase that does not
> apply is a matter of perspective.  Is it the case that the feature should  exist
> on all architectures,  and it's just not been written yet?  I could call this a
> 'not ok' or an XFAIL. However, if it's a feature that is never intended to be
> written for arm32 or i386,  I would report that as a SKIP.
>
> But all of this differs from what I originally described, which is labs using
> XFAIL to annotate bugs that they knew about but didn't intend to fix.
>

IIRC, the TAP spec specifies a TODO which is basically the same thing
as this. We haven't implemented either it or XPASS/XFAIL in KUnit,
though.
(I tend to agree that this is a nicer thing to have in a separate
layer, rather than the tests themselves, but can see the argument for
not wanting another list somewhere.)

> >
> > > In the case of XFAIL, does it mean that an operation that was expected to fail
> > > all the time, failed, and so the test passed?  Or does it mean that this is a known
> > > bug that a particular developer or site has decided to ignore (as in "we know this
> > > is a bug, but we expected it to fail, so just ignore it for now).  The latter, I really
> > > don't think should be encoded in the tests themselves, but should be put
> > > into a layer above the test and parser, that is applied according to developer
> > > or testlab policy.
> >
> > I agree that "just ignore it for now" shouldn't be there.
> >
> > Other arguments are that xfail tests shouldn't be run at all, but I
> > don't think that's right because it makes it very hard to do large-scale
> > test result comparisons because random stuff is missing, depending on
> > various configs, archs, etc. It blocks a 1-to-1 comparison, and begs
> > questions like "why is the test here in one case and not in another?"
> I completely agree with this.  It really messes up tables of results when
> the list of testcases varies depending on some board
> or configuration attribute.  It's very hard for users to parse and compare
> results in that case.
>
> > where as an XFAIL will encode the reason it is an XFAIL in its output,
> > providing a self-documenting test result.
>
>
> >
> > > > I still think there should be a distinction between "diagnostic lines"
> > > > and "unknown lines". For example, if kselftest is running on a console,
> > > > the dmesg output may be intermixed, and possibly temporally offset. Such
> > > > lines may not be useful, and may not be strictly correlated, where-as
> > > > the output from kselftest is at least self-consistent.
> > >
> > > I think this would be a very good distinction to make.  You might have
> > > unknown lines that (inadvertently?) start with a hash, but trying to
> > > make sure that the test itself always outputs diagnostic data with a preceding
> > > hash gives a reader potentially more information about where the message
> > > came from, and could be useful.
> >
> > Right, so lines with a hash shouldn't be machine-parsed. (And this is a
> > mistake I made in the current kselftest output for the LAVA parser.) The
> > good news for the kernel, is that nothing else in dmesg leads with a
> > "#" currently. :)
>
> Oh. That's actually nice to know.  Interesting.
>

If this is the case, I'm tempted to want a "#" in front of all KTAP
lines, just so that we have an easy way to isolate test output in
general (including the machine-parsable bits).

> >
> > > > No; diagnostic lines must have no meaning: they are for humans and nothing
> > > > else. If other details are needed for machines, we should explicitly
> > > > create new format lines. I made a mistake when I used a diagnostic line
> > > > for embedding the test names. :( There is a need for parsers to discover
> > > > the name of a given test, though, so we do likely need something for this.
> > >
> > > I agree it is handy to have the test name for sub-tests.  However, I don't
> > > see any semantic difference whether we define sub-test declarations via
> > > diagnostic lines with special syntax or by creating a new syntax separate
> > > from diagnostic lines.  Either way, we have a new syntax for parsers to
> > > understand.
> >
> > Pragmatically, I agree, which is what lead me to my mistake. However, in
> > reconsidering it, I realize this leads to a slippery slope of just
> > dumping stuff into diagnostic lines and pretending nothing will read it.
>
> Good point.  Whether it starts with a hash or not, TAP 13 parsers should
> ignore anything that is not a result line (starts with 'ok' or 'not ok').
> So I guess we are free to make a new syntax without upsetting any existing
> parsers.
>

TAP 14 definitely parses a few things other than just the result line
(including some which start with '#'), but it does look like no-one
liked TAP 14 enough to finish the spec and use it (other than KUnit
stealing some ideas from it).

> >
> > I suspect what we need is an optional "starting test" line, like:
> >
> > test N DESC
> > # diag...
> > # diag...
> > # diag...
> > ok N
> >
> > The nesting looks like:
> >
> > TAP version 13
> > 1..3
> > test 1 seccomp
> > # TAP version 13
> > # 1..27
> > # test 1 prctl
> > # ok 1
> > # test 2 strict
> > # ok 2
> > # ...
> > # test 27 user_notif
> > # # eek missing CONFIG_....
> > # not ok 27 user_notif # ERROR can't use ioctl
> > not ok 1 seccomp # ERROR non-zero exit code 1
> > test 2 netfilter
> > ok 2 netfilter
> > test 3 bpf
> > # exciting things happen
> > ok 3 bpf
> >
> > This is what we have now except basically just replacing "# $name" with
> > "test $num $name"
> >
> Actually, I like this idea.  Including the number is a nice touch.
> I think that will help to distinguish such a line from other 'unknown line' output.
> I don't know if I'd leave the description off of the result line, though.
> A lot of parsers are looking for that.
>

Yeah, this is something I'd quite like, personally. If we're prepared
to deviate from TAP this much, having the name/description of the test
before getting the results makes it easier to read. I'd agree with
keeping the description on the result line, too: not only is it what
existing parsers are looking for, it makes reading the results
manually much nicer.

> > > >
> > > > > Diagnostic lines of the format "# Subtest:
> > > > > <test_name>" indicate the start of a test with subtests. Also,
> > > > > diagnostic lines of the format "# <test_name>: <description>" refer to
> > > > > a specific test and tend to occur before the test result line of that
> > > > > test but are optional.
> > > >
> > > > I don't think the above should be included in the spec -- diag lines
> > > > should have no parseable meaning.
> > > >
> > > > >
> > > > > Bail out line
> > > > > -------------
> > > > >
> > > > > A bail out line can occur anywhere in the KTAP output and will
> > > > > indicate that a test has crashed. The format of a bail out line is
> > > > > "Bail out! [<description>]",  where the description can give
> > > > > information on why the bail out occurred and can be any string.
> > > >
> > > > I'm not a fan of the Bail out line. It's not a problem, exactly, but I
> > > > find it redundant. If we want an "end of test" line, let's make one.
> > > > "Bail out" is a special case of exit. If we want to handle test exit,
> > > > let's define it. E.g. make kselftest's test summary lines not
> > > > diagnostic lines:
> > > >
> > > > # FAILED: 85 / 87 tests passed.
> > > > # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0
> > >
> > > How consistent is it for selftests to produce test summary lines?
> >
> > Any of the tools built with kselftest.h will spit it out. (Which is to
> > say several of the "subtests".) The bulk of kselftest output is from the
> > runner, which doesn't produce this summary.
> >
> > > > Please no. There is no reason to force a nested test to suddenly have
> > > > to know about its test execution depth/environment. A subtest is just a
> > > > nested TAP result. That it is nested is only meaningful to the parser, not
> > > > the executing test, and it must have the same output, nested or not. (e.g.
> > > > running the test alone or running the test under a full kselftest run,
> > > > the only difference is the indentation level.)
> > > I agree that the sub-test should not know about this line.  I thought
> > > the "parent" test wrote the line.
> >
> > The suggestion was that the subtest should _not_ emit the "TAP" line,
> > which is what I strongly disagree with: it should be a self-contained
> > TAP-emitting test.
>
> OK - agreed.  Yes.  The sub-test should not know whether they are running
> standalone or nested (with the exception of some line prefix opaquely conveyed
> to them)
>

Yeah, this all came from the TAP 14 spec. In KUnit it's not a problem,
as the harness does all of the printing of these headers, so the test
code itself never has to worry about it. It does make more sense for
kselftest to make tests more independent, though, so let's leave the
TAP header line in.

My only concern here is that this could complicate attempting to parse
multiple test suites in the same output slightly:
e.g., if we just grep the input naively for "TAP version blahdiblah",
then parse whatever we get, there's a possibility we could
double-parse nested results.

We don't hit this at the moment, though, and it could be worked around
pretty easily, so I don't think it's worth holding back something
which otherwise works.

> >
> > > > Again, I see only downsides to this. Nesting for the spec is simple
> > > > indentation-based recursion. Let's just keep what we already have:
> > > >
> > > > TAP version 13
> > > > 1..1
> > > > # TAP version 13
> > > > # 1..2
> > > > # ok 1 - test_1
> > > > # ok 2 - test_2
> > > > ok 1 - test_suite
> > >
> > > I prefer spaces, but whatever.  Is the prefix on the 'TAP version 13' line guaranteed
> > > to be the prefix on all lines for the (sub-)test?
> >
> > That's the current implementation, yes.
> >
> > As for tabs vs spaces, I think it's fine to swap "# " for "  ". We just
> > need to patch LAVA to deal with it.
> >
> > > Just as a side note, in some Fuego tests, it was very useful to include an identifier
> > > in thethe prefix nested tests.  The output looked like this:
> > >
> > > TAP version 13
> > > 1..2
> > > [batch_id 4] TAP version 13
> > > [batch_id 4] 1..2
> > > [batch_id 4] ok 1 - cyclictest with 1000 cycles
> > > [batch_id 4] # problem setting CLOCK_REALTIME
> > > [batch_id 4] not ok 2 - cyclictest with CLOCK_REALTIME
> > > not ok 1 - check realtime
> > > [batch_id 4] TAP version 13
> > > [batch_id 4] 1..1
> > > [batch_id 4] ok 1 - IOZone read/write 4k blocks
> > > ok 2 - check I/O performance
> > >
> > > Can I propose that the prefix not be fixed by the spec, but that the spec indicates that
> > > whatever the prefix is on the TAP version line, that prefix must be used with the output for
> > > all lines from the test (with the exception of unknown lines)?
> >
> > Oh, interesting. This would also allow parallel (unique) test execution
> > to be parsable. That sounds workable. (Again, this needs LAVA patching
> > again...)
> >
> > > > > Major differences between TAP 14 and KTAP specification
> > > > > -------------------------------------------------------
> > > > >
> > > > > Note that the major differences between TAP 14 and KTAP specification:
> > > > > - yaml and json are not allowed in diagnostic messages
> > > >
> > > > Agreed -- these are overkill (and very difficult to implement as they
> > > > _follow_ the test result line: anything generating them has already
> > > > finished running).
> > >
> > > I thought that some tests put the diagnostic data before the result line.
> > >
> > > I don't care whether it is before or after (and indeed Fuego's parser
> > > can handle either case).  But it can't handle it when some testcases
> > > put diagnostic data before the result line, and some testcases put
> > > it after, in the same test.  There's no delimiter for the separation
> > > between the two different testcases diagnostic data.
> >
> > Right. The current implementation puts it always before. TAP 13 (14?) has
> > the YAML _after_ the results line, but that's totally unworkable for
> > kselftest, which is designed to be first human readable and second machine
> > readable. Waiting to see diagnostics from a slow test is terrible --
> > we can't collect it and spit it out later. Also, a parser would have
> > no idea about when a test is finished if diagnostics follow the last
> > test result of a plan.
> >
> > Making it "always before" is fine by me, and the motivation for making
> > the "summary" lines be non-diagnostics because they technically followed
> > the last test result.
> >
> > > Could we possibly have a rule about diagnostic lines positioning relative
> > > to the testcase result line?  Maybe a rule like this:  For a single test,
> > > diagnostic lines should either always preceded the testcase they
> > > are related to, or follow the testcase they are related to.
> >
> > For the kernel, I would rather avoid anything after because that just
> > confuses things.
> >
> > > > Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
> > > > be reported without the '#')
> > >
> > > Please No!  (Unless I'm misinterpreting that line).
> > >
> > > If you remove the '#', that makes the 'exit=127' part of the test description.
> > > The test description should be invariant between success and failure cases.
> > > The exit=127 (if I'm reading that line right) is part of data explaining or
> > > describing the failure. It would be 'exit=0' if the test succeeded.
> > >
> > > In order to correlate test cases (ie show results over time), the description
> > > should not change from one invocation of the test to the next.  At least Fuego
> > > uses the test description this way.  That is, the testcase description is used as
> > > a testcase identifier, and if it changes from run to run things get confusing.
> >
> > Agreed, I think it should be part of an ERROR directive:
> >
> > not ok 14 awesome test # ERROR exit=127
>
> As long as the line above means that testcase '14 awesome test' could not be
> run, and exit=127 helps diagnose that, then I agree.  If 'exit=127' means that
> testcase '14 awesome test' was run and exit=127 was a failing result, I would
> say that the result line should just be:
> not ok 14 awesome test # exit=127

I think this is fine, though I don't think it gives us that much more
over just making the last diagnostic line "# exit=127".
If we make "skip" and the like first-class results, that makes this
even simpler, of course.


-- David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC - kernel test result specification (KTAP)
  2021-08-30 22:04       ` Tim.Bird
  2021-08-31  7:46         ` David Gow
@ 2021-09-01 12:01         ` Mark Brown
  1 sibling, 0 replies; 14+ messages in thread
From: Mark Brown @ 2021-09-01 12:01 UTC (permalink / raw)
  To: kernelci, Tim.Bird
  Cc: keescook, rmoar, brendanhiggins, davidgow, shuah, dlatypov,
	kunit-dev, linux-kselftest, linux-kernel, guillaume.tucker

[-- Attachment #1: Type: text/plain, Size: 1975 bytes --]

On Mon, Aug 30, 2021 at 10:04:24PM +0000, Tim.Bird@sony.com wrote:
> > -----Original Message-----
> > From: Kees Cook <keescook@chromium.org>
> > On Mon, Aug 30, 2021 at 05:48:07PM +0000, Tim.Bird@sony.com wrote:
> > > From: Kees Cook <keescook@chromium.org>

> > XFAIL capture the condition of the test logic in that a test for
> > something arm64-specific on a non-arm64 machine is _not_ a "pass". It
> > will fail. But it's an expected failure. And, as a directive, must come
> > with the evidence bit:

> > not ok 1 PAN catches bad accessg # XFAIL PAN CPU feature is arm64 only

> > As a test author, I want to see this as distinct from failure and
> > success.

> I agree that this case is different from a normal failure.

> In the example you provide of a feature that applies to only specific
> architectures, I would use a SKIP instead.  It's not that the testcase failed.
> It's that it does not even make sense for the device under test.

That's what all the arm64 tests are doing at the minute - they're
skipping tests on systems where the feature being tested isn't
supported.  We've determined that the test can't meaningfully be run so
we're going to skip it.  We also don't use XFAIL for tests that are
checking that an error of some kind is generated, the test is formulated
as "error is reported when we do X" meaning that PASS/FAIL makes sense
directly.

There's a couple of XFAIL uses in other selftests but it's pretty
infrequent.

> > This is what we have now except basically just replacing "# $name" with
> > "test $num $name"

> Actually, I like this idea.  Including the number is a nice touch.
> I think that will help to distinguish such a line from other 'unknown line' output.
> I don't know if I'd leave the description off of the result line, though. 
> A lot of parsers are looking for that.

It's also important that humans are able to understand the output and
the descriptions do help with that.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: RFC - kernel test result specification (KTAP)
  2021-08-30 19:58     ` Kees Cook
  2021-08-30 22:04       ` Tim.Bird
  2021-08-30 22:21       ` Tim.Bird
@ 2022-02-04  2:44       ` Frank Rowand
  2 siblings, 0 replies; 14+ messages in thread
From: Frank Rowand @ 2022-02-04  2:44 UTC (permalink / raw)
  To: Kees Cook, Tim.Bird
  Cc: rmoar, brendanhiggins, davidgow, shuah, dlatypov, kunit-dev,
	linux-kselftest, linux-kernel, kernelci, guillaume.tucker

On 8/30/21 2:58 PM, Kees Cook wrote:
> On Mon, Aug 30, 2021 at 05:48:07PM +0000, Tim.Bird@sony.com wrote:
>> From: Kees Cook <keescook@chromium.org>
>>>> Test case result lines
>>>> ----------------------
>>>>
>>>> Test case result lines must have the format:
>>>>
>>>>     <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
>>>
>>> "[<diagnostic data>]" is not defined below. I think what you mean is
>>> "directive details".
>>
>> IMHO it's useful to keep this called 'diagnostic data'.
>>
>> The directive is optional, and having this be diagnostic data means you can
>> put explanatory text, or result details, that are independent of the description,
>> on the same line as the result.  This is convenient in some cases.
>>
>> I think diagnostic data should always be prefixed by a hash sign.
>> Or, in the case where diagnostic data appears on the result line,
>> the hash serves as the delimiter.  This means that the description
>> text must not include a hash character.
>>
>> For example:
>> ok 5 - check return code # rcode=0
>> or
>> not ok 5 - check return code # rcode=17
> 
> The directive is currently used to answer the question "why?" as in,
> "why is this test not ok?" ("a timeout occurred") or "why is this test
> ok?" ("a SKIP")
> 
> The "supporting evidence" for the "why?" answer are specific details
> about that: "the exit was was NNN" or "the timeout was YYY seconds".
> 
> Test diagnostics are all the lines preceding that (currently prefixed
> with a "#").
> 
>>> Yes, though the optional "- " is strictly part of the optional
>>> description.
>>
>> It's mildly annoying that "-" is optional.  It's trivial to deal with in the parser
>> to just ignore it if it's present.  But it has no semantic meaning whatsoever.
>> IMHO it would be nice to either mandate it or remove it, for consistency's sake.
>> This could be based solely on the consensus for whether it added or detracted
>> from human readability, since parsers don't care.
> 
> I have no strong opinion on the "-". I was surprised to encounter it
> as it's not part of the TAP 13 spec. I would prefer to drop it, if I had
> to choose.
> 
>> ERROR usually means that there was some kind of failure
>> in the testing process that makes the testcase result indeterminate.
>> (like loss of connection to a server, or crash of the program executing
>> the testcase).  This can be due to a bug in the test or the test system.
>> I'm, not sure if ERROR is used this way in kselftest.  It is, IMHO, a useful
>> distinction to make.  TIMEOUT is a special case of ERROR, IMHO.
> 
> Well, I think TIMEOUT is a "why not ok?" answer. "not ok" seems to be
> the same as ERROR, but I guess it's really a catch-all, but must come
> with the "evidence" portion:
> 
> not ok 1 description goes here # TIMEOUT 45 seconds with no results
> not ok 1 description goes here # ERROR lost connection to server
> not ok 1 description goes here # ERROR non-zero exit code: NNN
> 
> I might go so far as to say the directive is _required_ for the "not ok"
> state.
> 
>> As an aside, I've never liked XFAIL or XPASS.  Semantically, I believe they
>> are quite confusing, and depend on the level of testcase execution.  XPASS is also
>> redundant.  Isn't the default case always XPASS?
> 
> Nothing currently uses XPASS, and we should remove it.
> 
> XFAIL capture the condition of the test logic in that a test for
> something arm64-specific on a non-arm64 machine is _not_ a "pass". It
> will fail. But it's an expected failure. And, as a directive, must come
> with the evidence bit:
> 
> not ok 1 PAN catches bad accessg # XFAIL PAN CPU feature is arm64 only
> 
> As a test author, I want to see this as distinct from failure and
> success.
> 
>> In the case of XFAIL, does it mean that an operation that was expected to fail
>> all the time, failed, and so the test passed?  Or does it mean that this is a known
>> bug that a particular developer or site has decided to ignore (as in "we know this
>> is a bug, but we expected it to fail, so just ignore it for now).  The latter, I really
>> don't think should be encoded in the tests themselves, but should be put
>> into a layer above the test and parser, that is applied according to developer
>> or testlab policy.
> 
> I agree that "just ignore it for now" shouldn't be there.
> 
> Other arguments are that xfail tests shouldn't be run at all, but I
> don't think that's right because it makes it very hard to do large-scale
> test result comparisons because random stuff is missing, depending on
> various configs, archs, etc. It blocks a 1-to-1 comparison, and begs
> questions like "why is the test here in one case and not in another?"
> where as an XFAIL will encode the reason it is an XFAIL in its output,
> providing a self-documenting test result.
> 
>>> I still think there should be a distinction between "diagnostic lines"
>>> and "unknown lines". For example, if kselftest is running on a console,
>>> the dmesg output may be intermixed, and possibly temporally offset. Such
>>> lines may not be useful, and may not be strictly correlated, where-as
>>> the output from kselftest is at least self-consistent.
>>
>> I think this would be a very good distinction to make.  You might have
>> unknown lines that (inadvertently?) start with a hash, but trying to 
>> make sure that the test itself always outputs diagnostic data with a preceding
>> hash gives a reader potentially more information about where the message
>> came from, and could be useful.
> 
> Right, so lines with a hash shouldn't be machine-parsed. (And this is a
> mistake I made in the current kselftest output for the LAVA parser.) The
> good news for the kernel, is that nothing else in dmesg leads with a
> "#" currently. :)

Actually, not correct.  Devicetree unittest (drivers/of/unittest.c) leads
messages with "### dt-test ### ".  But the good news is that this is
easily changeable.

-Frank

> 
>>> No; diagnostic lines must have no meaning: they are for humans and nothing
>>> else. If other details are needed for machines, we should explicitly
>>> create new format lines. I made a mistake when I used a diagnostic line
>>> for embedding the test names. :( There is a need for parsers to discover
>>> the name of a given test, though, so we do likely need something for this.
>>
>> I agree it is handy to have the test name for sub-tests.  However, I don't
>> see any semantic difference whether we define sub-test declarations via
>> diagnostic lines with special syntax or by creating a new syntax separate
>> from diagnostic lines.  Either way, we have a new syntax for parsers to
>> understand.
> 
> Pragmatically, I agree, which is what lead me to my mistake. However, in
> reconsidering it, I realize this leads to a slippery slope of just
> dumping stuff into diagnostic lines and pretending nothing will read it.
> 
> I suspect what we need is an optional "starting test" line, like:
> 
> test N DESC
> # diag...
> # diag...
> # diag...
> ok N
> 
> The nesting looks like:
> 
> TAP version 13
> 1..3
> test 1 seccomp
> # TAP version 13
> # 1..27
> # test 1 prctl
> # ok 1
> # test 2 strict
> # ok 2
> # ...
> # test 27 user_notif
> # # eek missing CONFIG_....
> # not ok 27 user_notif # ERROR can't use ioctl
> not ok 1 seccomp # ERROR non-zero exit code 1
> test 2 netfilter
> ok 2 netfilter
> test 3 bpf
> # exciting things happen
> ok 3 bpf
> 
> This is what we have now except basically just replacing "# $name" with
> "test $num $name"
> 
>>>
>>>> Diagnostic lines of the format "# Subtest:
>>>> <test_name>" indicate the start of a test with subtests. Also,
>>>> diagnostic lines of the format "# <test_name>: <description>" refer to
>>>> a specific test and tend to occur before the test result line of that
>>>> test but are optional.
>>>
>>> I don't think the above should be included in the spec -- diag lines
>>> should have no parseable meaning.
>>>
>>>>
>>>> Bail out line
>>>> -------------
>>>>
>>>> A bail out line can occur anywhere in the KTAP output and will
>>>> indicate that a test has crashed. The format of a bail out line is
>>>> "Bail out! [<description>]",  where the description can give
>>>> information on why the bail out occurred and can be any string.
>>>
>>> I'm not a fan of the Bail out line. It's not a problem, exactly, but I
>>> find it redundant. If we want an "end of test" line, let's make one.
>>> "Bail out" is a special case of exit. If we want to handle test exit,
>>> let's define it. E.g. make kselftest's test summary lines not
>>> diagnostic lines:
>>>
>>> # FAILED: 85 / 87 tests passed.
>>> # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0
>>
>> How consistent is it for selftests to produce test summary lines?
> 
> Any of the tools built with kselftest.h will spit it out. (Which is to
> say several of the "subtests".) The bulk of kselftest output is from the
> runner, which doesn't produce this summary.
> 
>>> Please no. There is no reason to force a nested test to suddenly have
>>> to know about its test execution depth/environment. A subtest is just a
>>> nested TAP result. That it is nested is only meaningful to the parser, not
>>> the executing test, and it must have the same output, nested or not. (e.g.
>>> running the test alone or running the test under a full kselftest run,
>>> the only difference is the indentation level.)
>> I agree that the sub-test should not know about this line.  I thought
>> the "parent" test wrote the line.
> 
> The suggestion was that the subtest should _not_ emit the "TAP" line,
> which is what I strongly disagree with: it should be a self-contained
> TAP-emitting test.
> 
>>> Again, I see only downsides to this. Nesting for the spec is simple
>>> indentation-based recursion. Let's just keep what we already have:
>>>
>>> TAP version 13
>>> 1..1
>>> # TAP version 13
>>> # 1..2
>>> # ok 1 - test_1
>>> # ok 2 - test_2
>>> ok 1 - test_suite
>>
>> I prefer spaces, but whatever.  Is the prefix on the 'TAP version 13' line guaranteed
>> to be the prefix on all lines for the (sub-)test?
> 
> That's the current implementation, yes.
> 
> As for tabs vs spaces, I think it's fine to swap "# " for "  ". We just
> need to patch LAVA to deal with it.
> 
>> Just as a side note, in some Fuego tests, it was very useful to include an identifier
>> in thethe prefix nested tests.  The output looked like this:
>>
>> TAP version 13
>> 1..2
>> [batch_id 4] TAP version 13
>> [batch_id 4] 1..2
>> [batch_id 4] ok 1 - cyclictest with 1000 cycles
>> [batch_id 4] # problem setting CLOCK_REALTIME
>> [batch_id 4] not ok 2 - cyclictest with CLOCK_REALTIME
>> not ok 1 - check realtime
>> [batch_id 4] TAP version 13
>> [batch_id 4] 1..1
>> [batch_id 4] ok 1 - IOZone read/write 4k blocks
>> ok 2 - check I/O performance
>>
>> Can I propose that the prefix not be fixed by the spec, but that the spec indicates that
>> whatever the prefix is on the TAP version line, that prefix must be used with the output for
>> all lines from the test (with the exception of unknown lines)?
> 
> Oh, interesting. This would also allow parallel (unique) test execution
> to be parsable. That sounds workable. (Again, this needs LAVA patching
> again...)
> 
>>>> Major differences between TAP 14 and KTAP specification
>>>> -------------------------------------------------------
>>>>
>>>> Note that the major differences between TAP 14 and KTAP specification:
>>>> - yaml and json are not allowed in diagnostic messages
>>>
>>> Agreed -- these are overkill (and very difficult to implement as they
>>> _follow_ the test result line: anything generating them has already
>>> finished running).
>>
>> I thought that some tests put the diagnostic data before the result line.
>>
>> I don't care whether it is before or after (and indeed Fuego's parser
>> can handle either case).  But it can't handle it when some testcases
>> put diagnostic data before the result line, and some testcases put
>> it after, in the same test.  There's no delimiter for the separation
>> between the two different testcases diagnostic data.
> 
> Right. The current implementation puts it always before. TAP 13 (14?) has
> the YAML _after_ the results line, but that's totally unworkable for
> kselftest, which is designed to be first human readable and second machine
> readable. Waiting to see diagnostics from a slow test is terrible --
> we can't collect it and spit it out later. Also, a parser would have
> no idea about when a test is finished if diagnostics follow the last
> test result of a plan.
> 
> Making it "always before" is fine by me, and the motivation for making
> the "summary" lines be non-diagnostics because they technically followed
> the last test result.
> 
>> Could we possibly have a rule about diagnostic lines positioning relative
>> to the testcase result line?  Maybe a rule like this:  For a single test,
>> diagnostic lines should either always preceded the testcase they
>> are related to, or follow the testcase they are related to.
> 
> For the kernel, I would rather avoid anything after because that just
> confuses things.
> 
>>> Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
>>> be reported without the '#')
>>
>> Please No!  (Unless I'm misinterpreting that line).
>>
>> If you remove the '#', that makes the 'exit=127' part of the test description.
>> The test description should be invariant between success and failure cases.
>> The exit=127 (if I'm reading that line right) is part of data explaining or
>> describing the failure. It would be 'exit=0' if the test succeeded.
>>
>> In order to correlate test cases (ie show results over time), the description
>> should not change from one invocation of the test to the next.  At least Fuego
>> uses the test description this way.  That is, the testcase description is used as
>> a testcase identifier, and if it changes from run to run things get confusing.
> 
> Agreed, I think it should be part of an ERROR directive:
> 
> not ok 14 awesome test # ERROR exit=127
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-02-04  2:45 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-10 23:25 RFC - kernel test result specification (KTAP) Rae Moar
2021-08-11 20:53 ` Brendan Higgins
2021-08-28  8:20 ` Kees Cook
2021-08-30 17:34   ` Brendan Higgins
2021-08-30 19:26     ` Kees Cook
2021-08-30 17:48   ` Tim.Bird
2021-08-30 19:58     ` Kees Cook
2021-08-30 22:04       ` Tim.Bird
2021-08-31  7:46         ` David Gow
2021-09-01 12:01         ` Mark Brown
2021-08-30 22:21       ` Tim.Bird
2022-02-04  2:44       ` Frank Rowand
2021-08-30 23:39   ` David Gow
2021-08-31  2:14   ` Rae Moar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.