Re: [Fuego] [PATCH] LTP: put fanotify07 in the skiplist

From: <Tim.Bird@sony.com>
To: dhinakar.k@samsung.com, daniel.sangorrin@toshiba.co.jp
Cc: fuego@lists.linuxfoundation.org
Subject: Re: [Fuego] [PATCH] LTP: put fanotify07 in the skiplist
Date: Thu, 10 May 2018 21:50:57 +0000
Message-ID: <ECADFF3FD767C149AD96A924E7EA6EAF7C132EA4@USCULXMSG01.am.sony.com>
In-Reply-To: <20180510161357epcms5p5a0db2529eae5bc76eb13e3e6c182e270@epcms5p5>

> -----Original Message-----
> From: Dhinakar Kalyanasundaram
> 
> Yes fanotify07 test is problematic if you use old kernel before 4.14 version.
> 
> We used kernel 4.9 and fanotify07 resulted in kernel panic when executed.
> 
> This issue has been fixed in kernel version 4.14.
> 
> Please refer to this link
> https://www.spinics.net/lists/linux-fsdevel/msg109131.html
> 
> fanotify07 executes without any issue in kernel 4.14.
> 

Dhinakar,

Thanks very much for this information.  I had been meaning to investigate
the issue with fanotify07 (which also fails on some of my boards here), but had
not gotten around to it yet.

Sorry for the long message below, but I'm now going to start talking about
more general Fuego issues...

This test raises some interesting issues that I'd like to make Fuego better at
handling.

This test causes a hang in the middle of an LTP test, which is really a pain.
Fuego doesn't handle this kind of thing very well.  In the case of other tests,
if the test hangs in the middle you can see from the console log where the
test left off.  The LTP output is too quiet, in my judgement, as it takes a lot
of effort to see where the test got to before the machine hung.  This is
something that would be good to change.

Also - once the machine hangs there is no mechanism to continue
where the test left off.  The result is that many LTP sub-testcases don't get run.
Right now the only mechanism we have to deal with this is to skip
the individual testcase (the fanotify07 program).
Luckily, Daniel has added the skiplist feature to LTP to support this,
which is nice. We might want to think about also implementing a test
re-start mechanism, or making LTP more fine-grained in its execution.

In general, LTP is a "big" test, and Fuego is really architected around
running "small" tests, that don't hang the DUT while they are in progress.
It might make more sense to structure LTP a little differently, with
testplans that use a series of specs that run smaller test sets from LTP,
rather than huge test sets (syscalls is notoriously long).

I'm not sure of all the pros and cons to having lots of specs, but specs
that ran smaller sets of programs would lose less data when the
machine hangs than the current configuration does.
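
As a rough sketch of the idea (this is not existing Fuego code), the
space-separated "tests" string from the "selection" spec in the patch
below could be split into several smaller specs.  The function name, the
per_spec parameter and the generated spec names are hypothetical; only
the "specs" and "tests" keys come from spec.json as quoted in this thread.

```python
import json

# The "tests" value of the "selection" spec, as quoted in the patch.
SELECTION = "syscalls fs pipes sched timers dio mm ipc pty AIO MSG SEM SIG THR TMR TPS"


def split_into_specs(tests, per_spec=1):
    """Split a space-separated LTP group list into smaller specs.

    Hypothetical helper: each generated spec is named after the groups
    it contains and carries only a "tests" entry.
    """
    groups = tests.split()
    specs = {}
    for i in range(0, len(groups), per_spec):
        chunk = groups[i:i + per_spec]
        specs["_".join(chunk)] = {"tests": " ".join(chunk)}
    return {"specs": specs}


if __name__ == "__main__":
    print(json.dumps(split_into_specs(SELECTION, per_spec=4), indent=4))
```

Note that this only splits at the LTP group level.  It does nothing to
break up syscalls itself, which is the long one, so that would need a
finer-grained mechanism.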

Fuego should be able to reboot a hung machine and continue to the
next scheduled Fuego test, when the BOARD_CONTROL feature is
being used. However, the BOARD_CONTROL feature needs more
work to support LAVA and other DUT control systems.  We might
want to prioritize adding BOARD_CONTROL support for more board
management systems in the near future.

Also, with regard to fanotify07:
This is one of the "active bugs", which some people will see and
some won't, depending on what kernel version they're running.  It
definitely indicates a bug (I think), and it's quite possible that an
LTS bugfix backport might fix it, so it's not good IMHO to remove it
from the test pool.  However, it is really problematic since it takes
down the test machine, and (under certain circumstances) causes
a Fuego failure cascade where all remaining queued tests in Jenkins
for that board fail as well.  Yuk.

Tests like these are why I wrote LTP_one_test, to isolate
these into individual units that could be tested independently from
the main LTP set of tests.  I can imagine making a board-specific
testplan_ltp_problem_tests file, listing LTP_one_test with
multiple different specs (indicating multiple different single LTP
test programs).  This could be run at a different frequency than
the full LTP, to check the status of these testcases.
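
To make that concrete, here is a minimal sketch that generates such a
testplan.  The JSON layout ("testPlanName", "tests", "testName", "spec")
and the full test name Functional.LTP_one_test are my assumptions about
what the file would look like, and using the LTP program name as the spec
name is purely illustrative.

```python
import json


def make_problem_testplan(ltp_programs, plan_name="testplan_ltp_problem_tests"):
    """Build a testplan that runs LTP_one_test once per problem program.

    Assumed layout: one entry per LTP program, each selecting a spec
    (hypothetically named after the program) of Functional.LTP_one_test.
    """
    return {
        "testPlanName": plan_name,
        "tests": [
            {"testName": "Functional.LTP_one_test", "spec": program}
            for program in ltp_programs
        ],
    }


if __name__ == "__main__":
    # fanotify07 is the only problem test named in this thread.
    print(json.dumps(make_problem_testplan(["fanotify07"]), indent=4))
```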

I don't know what most QA departments do with tests that fail, but
are known to work in future versions of the software.  Do you just
always skip them?  Do you just remember to ignore their failure
results in your log?  We can use criteria files to ignore results but there's
a similar issue there.  How do you know to turn a test back on (stop skipping
it) when your software upgrades?  Or how do you know when to stop
ignoring a test's fail status, in the criteria file?
It seems like if a tester is not careful, they will continue skipping or ignoring
tests indefinitely, when it would be better to occasionally re-check the
status of failing tests to see if they've been fixed.

I assume that if a kernel is 4.14 or later, we really want to run fanotify07,
to catch possible regressions.  That's why I declined Daniel's patch
to put fanotify07 in the skiplist for the default spec for LTP.
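
One possible direction, sketched here under assumptions rather than as
anything Fuego does today, is to make the skiplist depend on the kernel
version of the board.  The FIXED_IN table and both function names are
hypothetical; the 4.14 threshold is taken from Dhinakar's report above.

```python
import re

# Hypothetical table: testcase -> first kernel (major, minor) reported
# to run it without problems.  4.14 comes from Dhinakar's report.
FIXED_IN = {"fanotify07": (4, 14)}


def kernel_version(release):
    """Parse a 'uname -r' style string such as '4.9.88-yocto' into (4, 9)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def skiplist_for(release):
    """Return the testcases to skip on a board running this kernel."""
    version = kernel_version(release)
    return sorted(name for name, fixed in FIXED_IN.items() if version < fixed)
```

With this, skiplist_for("4.9.88") would return ["fanotify07"], while a
4.14 or later kernel would get an empty skiplist and so run the test.
This ignores LTS backports, so a 4.9 kernel that has picked up the fix
would still skip the test.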

Finally, I wrote the per-testcase-documentation system to capture
detailed information, like what Dhinakar has provided, so that other
users could avoid having to do the same research to figure out
what is going on with a test that fails or errors out.
Note that one of the files I already created was
Functional.LTP/docs/Functional.LTP.syscalls.fanotify07.ftmp, so
I could gather this information.  I've had problems with this test
myself, that I hadn't reported yet, because I didn't have time
to do the research into why it was misbehaving.  But hey, I made
the file to hold the information I wanted to gather!  :-)
I'll put Dhinakar's information there, but probably people won't
notice because the system is not well-known or visible yet.
But it's a start.

The per-testcase-documentation system is not completely
implemented, but is intended to eventually provide a combination
of static and dynamic information on a test, testset or testcase, accessible
from the Jenkins interface.

OK - I've talked far too long.  But feedback or thoughts on these
issues are welcome.  I'd like Fuego to handle these types of situations
better, and any ideas are appreciated.

Regards,
 -- Tim

> 
> Regards,
> 
> Dhinakar
> 
> 
> 
> 
> 
> --------- Original Message ---------
> 
> Sender : Daniel Sangorrin <daniel.sangorrin@toshiba.co.jp>
> 
> Date : 2018-05-10 13:07 (GMT+5:30)
> 
> Title : [Fuego] [PATCH] LTP: put fanotify07 in the skiplist
> 
> To : fuego@lists.linuxfoundation.org
> 
> 
> 
> This test seems problematic. I need to investigate it
> further but for now, it maybe a good idea to put it in the
> skip list.
> 
> tst_test.c:1015: INFO: Timeout per run is 0h 05m 00s
> Test timeouted, sending SIGKILL!
> Test timeouted, sending SIGKILL!
> Test timeouted, sending SIGKILL!
> Test timeouted, sending SIGKILL!
> Test timeouted, sending SIGKILL!
> Test timeouted, sending SIGKILL!
> Test timeouted, sending SIGKILL!
> Test timeouted, sending SIGKILL!
> Test timeouted, sending SIGKILL!
> Test timeouted, sending SIGKILL!
> Test timeouted, sending SIGKILL!
> Cannot kill test processes!
> Congratulation, likely test hit a kernel bug.
> Exitting uncleanly...
> 
> Signed-off-by: Daniel Sangorrin <daniel.sangorrin@toshiba.co.jp>
> ---
>  engine/tests/Functional.LTP/spec.json | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/engine/tests/Functional.LTP/spec.json
> b/engine/tests/Functional.LTP/spec.json
> index 3b48bdc..1f78afb 100644
> --- a/engine/tests/Functional.LTP/spec.json
> +++ b/engine/tests/Functional.LTP/spec.json
> @@ -3,6 +3,7 @@
>      "specs": {
>          "default": {
>              "tests": "syscalls SEM",
> +            "skiplist": "fanotify07",
>              "extra_success_links": {"xlsx": "results.xlsx", "skiplist": "skiplist.txt"},
>              "extra_fail_links": {"xlsx": "results.xlsx", "skiplist": "skiplist.txt"}
>          },
> @@ -14,6 +15,7 @@
>          },
>          "selection": {
>              "tests": "syscalls fs pipes sched timers dio mm ipc pty AIO MSG SEM
> SIG THR TMR TPS",
> +            "skiplist": "fanotify07",
>              "extra_success_links": {"xlsx": "results.xlsx", "skiplist": "skiplist.txt"},
>              "extra_fail_links": {"xlsx": "results.xlsx", "skiplist": "skiplist.txt"}
>          },
> --
> 2.7.4
> 
> 
> _______________________________________________
> Fuego mailing list
> Fuego@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/fuego
> 
> 
> 
> 