From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jet Chen <jet.chen@intel.com>, Su Tao <tao.su@intel.com>,
	Yuanhan Liu <yuanhan.liu@intel.com>, LKP <lkp@01.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [torture] BUG: unable to handle kernel NULL pointer dereference at (null)
Date: Tue, 30 Sep 2014 02:58:42 -0700
Message-ID: <20140930095842.GR5015@linux.vnet.ibm.com>
In-Reply-To: <20140930022740.GA4723@wfg-t540p.sh.intel.com>

On Tue, Sep 30, 2014 at 10:27:40AM +0800, Fengguang Wu wrote:
> On Fri, Sep 26, 2014 at 12:42:23AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 18, 2014 at 09:17:51PM +0800, Fengguang Wu wrote:
> > > Hi Paul,
> > > 
> > > > > > > plymouth-upstart-bridge: ply-event-loop.c:497: ply_event_loop_new: Assertion `loop->epoll_fd >= 0' failed.
> > > > > > > /etc/lsb-base-logging.sh: line 5:  2580 Aborted                 plymouth --ping > /dev/null 2>&1
> > > > > > > /etc/lsb-base-logging.sh: line 5:  2585 Aborted                 plymouth --ping > /dev/null 2>&1
> > > > > > > mount: proc has wrong device number or fs type proc not supported
> > > > > > > /etc/lsb-base-logging.sh: line 5:  2601 Aborted                 plymouth --ping > /dev/null 2>&1
> > > > > > > /etc/rc6.d/S40umountfs: line 20: /proc/mounts: No such file or directory
> > > > > > > cat: /proc/1/maps: No such file or directory
> > > > > > > cat: /proc/1/maps: No such file or directory
> > > > > > > cat: /proc/1/maps: No such file or directory
> > > > > > > cat: /proc/1/maps: No such file or directory
> > > > > > > cat: /proc/1/maps: No such file or directory
> > > > > > > cat: /proc/1/maps: No such file or directory
> > > > > > > umount: /var/run: not mounted
> > > > > > > umount: /var/lock: not mounted
> > > > > > > umount: /dev/shm: not mounted
> > > > > > > mount: / is busy
> > > > > > >  * Will now restart
> > > > 
> > > > Are these expected behavior?
> > > 
> > > Yes, because it's randconfig boot tests, the user space may well
> > > complain about random stuff and I'll ignore them all as long as it
> > > will eventually call the shutdown command to finish the test in time.  :)
> > > 
> > > > So again, I can invoke this commit without losing much (sendkey
> > > > alt-sysrq-z is after all my friend), but it is not clear to me that we
> > > > have gotten to the root of this problem.
> > > 
> > > Sorry about that! If you see any debug tricks that I can try, or
> > > information I can collect, please let me know.
> > 
> > Hmmm...
> > 
> > Looks like rcutorture might be starting too soon.  With all the selftests,
> > it is taking 3-4 minutes to boot.
> 
> Sorry my scripts reboot the machine quickly, there is no logic to wait for the
> completion of rcutorture tests.  Looking at the dmesg, the BUG shows up during
> the shutdown stage:
> 
> ==>     Sending all processes the TERM signal...
>         mount: mounting proc on /proc failed: No such device
>         [  121.930088] Dumping ftrace buffer:
>         [  121.930453] ---------------------------------
>         [  121.930865] BUG: unable to handle kernel NULL pointer dereference at           (null)
>         [  121.931644] IP: [<ffffffff8959b074>] print_trace_line+0x2b0/0x38a

Fair point.

> > One approach would be to set
> > rcutorture.stat_interval=200 or whatever the duration of boot is.
> > Another would be to set rcutorture.torture_runnable=0, and to change:
> > 
> > 	int rcutorture_runnable = RCUTORTURE_RUNNABLE_INIT;
> > 	module_param(rcutorture_runnable, int, 0444);
> > 	MODULE_PARM_DESC(rcutorture_runnable, "Start rcutorture at boot");
> > 
> > To:
> > 
> > 	int rcutorture_runnable = RCUTORTURE_RUNNABLE_INIT;
> > 	module_param(rcutorture_runnable, int, 0644);
> > 	MODULE_PARM_DESC(rcutorture_runnable, "Start rcutorture at boot");
> > 
> > In kernel/rcu/rcutorture.c.
> > 
> > Then have your scripts set rcutorture_runnable=1 from sysfs once boot
> > completes.
> 
> That looks suitable to run as a functional test case in LKP.
> 
> I can enable the below options in the LKP test kernel (hope they will not
> bring noticeable runtime overheads when not insmod):
> 
> CONFIG_TORTURE_TEST=m
> CONFIG_RCU_TORTURE_TEST=m
> CONFIG_LOCK_TORTURE_TEST=m
> 
> and write a simple test case for it. To start the torture test,
> it should be as simple as
> 
>          modproble rcutorture

"modprobe", but yes.  ;-)

>          echo 1 > /proc/sys/kernel/rcutorture_runnable
> 
> To determine if the test has finished, will this do the job in the
> normal cases?
> 
>          dmesg | grep "--- End of test:"

Well, rcutorture won't stop unless you tell it to.  The usual way to
tell it to stop is via rmmod.

> Can I reasonably set the max test timeout to 5 or 10 or more minutes?
> When exceeded, LKP will assume the machine is dead and need a force reboot.

One thing would be something like this:

	( sleep 300; rmmod rcutorture ) &

Running both locktorture and rcutorture concurrently is not supported
at the moment because both make the assumption that they have the entire
system at their disposal.
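
A minimal sketch of how the pieces above might fit into one test case, under stated assumptions: RUN_SECS, torture_passed, and run_rcutorture are illustrative names, not part of rcutorture or LKP, and the "SUCCESS" suffix on the end-of-test line is assumed from typical rcutorture output rather than confirmed in this thread. The pattern is passed with -e because grep would otherwise try to parse a pattern starting with "---" as an option.

```shell
#!/bin/sh
# Sketch only: run rcutorture for a fixed time, stop it with rmmod,
# then look for the end-of-test line in the kernel log.

RUN_SECS=${RUN_SECS:-300}	# hypothetical knob: seconds to let the test run

# Succeeds if the log text on stdin contains a successful end-of-test line.
# -e keeps grep from treating the leading "---" as an option.
torture_passed() {
	grep -q -e '--- End of test: SUCCESS'
}

# Needs root and a kernel built with CONFIG_RCU_TORTURE_TEST=m.
run_rcutorture() {
	modprobe rcutorture || return 1
	sleep "$RUN_SECS"
	rmmod rcutorture || return 1	# rmmod waits for the test to shut down
	dmesg | torture_passed
}
```

Calling run_rcutorture from the test script then yields a zero exit status only when the module loaded, unloaded, and reported success, which gives the harness a bounded run time without needing a separate "wait for finish" interface.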

> > Alternatively, if poking sysfs is not reasonable (and it
> > would not be in my test scripts), put a delay just after the
> > rcutorture_record_test_transition() in rcu_torture_init().  For example,
> > schedule_timeout_interruptible(200 * HZ) to delay 200 seconds.
> 
> I'd prefer the boot test to complete sooner than later, which helps
> improve the test efficiency. :)

Good point...  It would not be hard to make this a timeout-based delay,
though.  But your earlier point about this being during shutdown is
a good one here, also.

> The 0day boot tests works by running this in a init.d script
> 
>         run CPU hotplug tests in the background
>         enable tracing events [optional]
>         run trinity for 100 seconds
>         ### if feasible, we can add a wait-for-rcutorture-finish here
>         reboot
> 
> > Another approach would be for me to figure out some way for rcutorture
> > to figure out that boot was not far enough along for it to safely
> > do much, probably enabled by a third value of rcutorture_runnable.
> 
> Will it be *easy* to have a blockable syswrite to "stop rcutorture tests",
> which will return when all tests have been stopped?

I am not eager to handle the races between such a mechanism and rmmod.  ;-)

> Anyway, I'm already happy with the above two options (where the kernel
> printk some test completion message for the user space to grep).

OK, but rmmod already exists and already shuts down rcutorture and waits
for it to complete.

> > One more approach would be to replace DUMP_ALL with DUMP_NONE in
> > kernel/rcu/rcutorture.c's rcutorture_trace_dump() function.  Or
> > to remove the ftrace_dump() statement entirely.  (The question that
> > this might help answer is which part of rcutorture_trace_dump() is
> > causing the problem.)
> 
> It should be unnecessary in testing POV. Unless if it is a bug.

It might be that it is illegal to do ftrace_dump() too far into shutdown.
I should check with Steven Rostedt.

							Thanx, Paul

