On Tue, Sep 30, 2014 at 10:27:40AM +0800, Fengguang Wu wrote: > On Fri, Sep 26, 2014 at 12:42:23AM -0700, Paul E. McKenney wrote: > > On Thu, Sep 18, 2014 at 09:17:51PM +0800, Fengguang Wu wrote: > > > Hi Paul, > > > > > > > > > > plymouth-upstart-bridge: ply-event-loop.c:497: ply_event_loop_new: Assertion `loop->epoll_fd >= 0' failed. > > > > > > > /etc/lsb-base-logging.sh: line 5: 2580 Aborted plymouth --ping > /dev/null 2>&1 > > > > > > > /etc/lsb-base-logging.sh: line 5: 2585 Aborted plymouth --ping > /dev/null 2>&1 > > > > > > > mount: proc has wrong device number or fs type proc not supported > > > > > > > /etc/lsb-base-logging.sh: line 5: 2601 Aborted plymouth --ping > /dev/null 2>&1 > > > > > > > /etc/rc6.d/S40umountfs: line 20: /proc/mounts: No such file or directory > > > > > > > cat: /proc/1/maps: No such file or directory > > > > > > > cat: /proc/1/maps: No such file or directory > > > > > > > cat: /proc/1/maps: No such file or directory > > > > > > > cat: /proc/1/maps: No such file or directory > > > > > > > cat: /proc/1/maps: No such file or directory > > > > > > > cat: /proc/1/maps: No such file or directory > > > > > > > umount: /var/run: not mounted > > > > > > > umount: /var/lock: not mounted > > > > > > > umount: /dev/shm: not mounted > > > > > > > mount: / is busy > > > > > > > * Will now restart > > > > > > > > Are these expected behavior? > > > > > > Yes, because it's randconfig boot tests, the user space may well > > > complain about random stuff and I'll ignore them all as long as it > > > will eventually call the shutdown command to finish the test in time. :) > > > > > > > So again, I can invoke this commit without losing much (sendkey > > > > alt-sysrq-z is after all my friend), but it is not clear to me that we > > > > have gotten to the root of this problem. > > > > > > Sorry about that! If you see any debug tricks that I can try, or > > > information I can collect, please let me know. > > > > Hmmm... > > > > Looks like rcutorture might be starting too soon. With all the selftests, > > it is taking 3-4 minutes to boot. > > Sorry my scripts reboot the machine quickly, there is no logic to wait for the > completion of rcutorture tests. Looking at the dmesg, the BUG shows up during > the shutdown stage: > > ==> Sending all processes the TERM signal... > mount: mounting proc on /proc failed: No such device > [ 121.930088] Dumping ftrace buffer: > [ 121.930453] --------------------------------- > [ 121.930865] BUG: unable to handle kernel NULL pointer dereference at (null) > [ 121.931644] IP: [] print_trace_line+0x2b0/0x38a Fair point. > > One approach would be to set > > rcutorture.stat_interval=200 or whatever the duration of boot is. > > Another would be to set rcutorture.torture_runnable=0, and to change: > > > > int rcutorture_runnable = RCUTORTURE_RUNNABLE_INIT; > > module_param(rcutorture_runnable, int, 0444); > > MODULE_PARM_DESC(rcutorture_runnable, "Start rcutorture at boot"); > > > > To: > > > > int rcutorture_runnable = RCUTORTURE_RUNNABLE_INIT; > > module_param(rcutorture_runnable, int, 0644); > > MODULE_PARM_DESC(rcutorture_runnable, "Start rcutorture at boot"); > > > > In kernel/rcu/rcutorture.c. > > > > Then have your scripts set rcutorture_runnable=1 from sysfs once boot > > completes. > > That looks suitable to run as a functional test case in LKP. > > I can enable the below options in the LKP test kernel (hope they will not > bring noticeable runtime overheads when not insmod): > > CONFIG_TORTURE_TEST=m > CONFIG_RCU_TORTURE_TEST=m > CONFIG_LOCK_TORTURE_TEST=m > > and write a simple test case for it. To start the torture test, > it should be as simple as > > modproble rcutorture "modprobe", but yes. ;-) > echo 1 > /proc/sys/kernel/rcutorture_runnable > > To determine if the test has finished, will this do the job in the > normal cases? > > dmesg | grep "--- End of test:" Well, rcutorture won't stop unless you tell it to. The usual way to tell it to stop is via rmmod. > Can I reasonably set the max test timeout to 5 or 10 or more minutes? > When exceeded, LKP will assume the machine is dead and need a force reboot. One thing would be something like this: ( sleep 300; rmmod rcutorture ) & Running both locktorture and rcutorture concurrently is not supported at the moment because both make the assumption that they have the entire system at their disposal. > > Alternatively, if poking sysfs is not reasonable (and it > > would not be in my test scripts), put a delay just after the > > rcutorture_record_test_transition() in rcu_torture_init(). For example, > > schedule_timeout_interruptible(200 * HZ) to delay 200 seconds. > > I'd prefer the boot test to complete sooner than later, which helps > improve the test efficiency. :) Good point... It would not be hard to make this a timeout-based delay, though. But your earlier point about this being during shutdown is a good one here, also. > The 0day boot tests works by running this in a init.d script > > run CPU hotplug tests in the background > enable tracing events [optional] > run trinity for 100 seconds > ### if feasible, we can add a wait-for-rcutorture-finish here > reboot > > > Another approach would be for me to figure out some way for rcutorture > > to figure out that boot was not far enough along for it to safely > > do much, probably enabled by a third value of rcutorture_runnable. > > Will it be *easy* to have a blockable syswrite to "stop rcutorture tests", > which will return when all tests have been stopped? I am not eager to handle the races between such a mechanism and rmmod. ;-) > Anyway, I'm already happy with the above two options (where the kernel > printk some test completion message for the user space to grep). OK, but rmmod already exists and already shuts down rcutorture and waits for it to complete. > > One more approach would be to replace DUMP_ALL with DUMP_NONE in > > kernel/rcu/rcutorture.c's rcutorture_trace_dump() function. Or > > to remove the ftrace_dump() statement entirely. (The question that > > this might help answer is which part of rcutorture_trace_dump() is > > causing the problem.) > > It should be unnecessary in testing POV. Unless if it is a bug. It might be that it is illegal to do ftrace_dump() too far into shutdown. I should check with Steven Rostedt. Thanx, Paul