From: "Paul E. McKenney" <paulmck@kernel.org>
To: Pingfan Liu <kernelfans@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>,
	rcu@vger.kernel.org, David Woodhouse <dwmw@amazon.co.uk>,
	Neeraj Upadhyay <quic_neeraju@quicinc.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	"Jason A. Donenfeld" <Jason@zx2c4.com>
Subject: Re: [PATCHv2 3/3] rcu: coordinate tick dependency during concurrent offlining
Date: Fri, 30 Sep 2022 08:44:59 -0700
Message-ID: <20220930154459.GF4196@paulmck-ThinkPad-P17-Gen-1>
In-Reply-To: <CAFgQCTtLm-JXRyQfKo6-+P00SShVGujZGau+khmtCe1AiRodQA@mail.gmail.com>

On Thu, Sep 29, 2022 at 04:19:28PM +0800, Pingfan Liu wrote:
> On Tue, Sep 27, 2022 at 5:59 PM Pingfan Liu <kernelfans@gmail.com> wrote:
> >
> > On Mon, Sep 26, 2022 at 03:23:52PM -0700, Paul E. McKenney wrote:
> > > On Mon, Sep 26, 2022 at 02:34:17PM +0800, Pingfan Liu wrote:
> > > > Sorry for the late reply. I just realized that this e-mail was missing from my gmail.
> > > >
> > > > On Thu, Sep 22, 2022 at 06:54:42AM -0700, Paul E. McKenney wrote:
> > > > [...]
> > > > >
> > > > > If you have tools/.../rcutorture/bin on your path, yes.  This would default
> > > > > to a 30-minute run.  If you have at least 16 CPUs, you should add
> > > >                                             ^^^ TREE04 has CONFIG_NR_CPUS=8, so I think here the num is 8
> > >
> > > Yes, you will get some benefit from --allcpus on systems with from 9-15
> > > CPUs as well as for 16 and more.  At 8 CPUs, it wouldn't matter.
> > >
> > > > > > "--allcpus" to do concurrent runs.  For example, given 64 CPUs you could
> > > > > do this:
> > > > >
> > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10h --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "4*TREE04"
> > > > >
> > > >
> > > > I have tried to find a two socket system with 128 cpus and run
> > > >   sh kvm.sh --allcpus --duration 250h --bootargs rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30 --configs 16*TREE04
> > > >
> > > > Where 250*16=4000
> > >
> > > That would work.
> > >
> >
> > This job has successfully run 24+ hours. (But maybe I can only keep it
> > about 180 hours)
> >
> > > > > This would run four concurrent instances of the TREE04 scenario, each for
> > > > > 10 hours, for a total of 40 hours of test time.
> > > > >
> > > > > > > It does take some time to run.  I did 4,000 hours worth of TREE04
> > > > > >                                         ^^^ '--duration=4000h' can serve this purpose?
> > > > >
> > > > > You could, at least if you replace the "=" with a space character, but
> > > > > that really would run a six-month test, which is probably not what you
> > > > > want to do.  There being 8,760 hours in a year and all that.
> > > > >
> > > > > > Is it related with the cpu's freq?
> > > > >
> > > > > Not at all.  '--duration 10h' would run ten hours of wall-clock time
> > > > > regardless of the CPU frequencies.
> > > > >
> > > > > > > to confirm lack of bug.  But an 80-CPU dual-socket system can run
> > > > > > > 10 concurrent instances of TREE04, which gets things down to a more
> > > > > >
> > > > > > The total demanded hours H = 4000/(system_cpu_num/8)?
> > > > >
> > > > > Yes.  You can also use multiple systems, which is what kvm-remote.sh is
> > > > > intended for, again assuming 80 CPUs per system to keep the arithmetic
> > > > > simple:
> > > > >
> > > > > tools/testing/selftests/rcutorture/bin/kvm-remote.sh "sys1 sys2 ... sys20" --duration 20h --cpus 80 --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "200*TREE04"
> > > > >
> > > >
> > > > That is appealing.
> > > >
> > > > I will see whether I can get hold of a batch of machines to run the
> > > > test.
> > >
> > > Initial tests with smaller numbers of CPUs are also useful, for example,
> > > in case reversion causes some bug due to bad interaction with a later
> > > commit.
> > >
> > > Please let me know how it goes!
> > >
> >
> > I have managed to get hold of three two-socket machines, each with 256 CPUs.
> > The test has run for about 7 hours so far without any problem, using the following command:
> > tools/testing/selftests/rcutorture/bin/kvm-remote.sh "sys1 sys2 sys3" \
> > --duration 45h --cpus 256 --bootargs "rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30" --configs "96*TREE04"
> >
> > It seems promising.
> >
> 
> The test is against the v6.0-rc7 kernel with only 96926686deab ("rcu:
> Make CPU-hotplug removal operations enable tick") reverted. It was
> close to the end, but unfortunately it failed.
> Quote from remote-log
> "
> TREE04.57 ------- 4410955 GPs (27.2281/s) [rcu: g36045577 f0x0
> total-gps=9011687] n_max_cbs: 4111392
> TREE04.58 ------- 4368391 GPs (26.9654/s) [rcu: g35630093 f0x0
> total-gps=8907816] n_max_cbs: 2411104
> TREE04.59 ------- 800516 GPs (4.94146/s) n_max_cbs: 3634471
> QEMU killed
> TREE04.59 no success message, 10547 successful version messages
> WARNING: TREE04.59 GP HANG at 800516 torture stat 1925
> WARNING: Assertion failure in
> /home/linux/tools/testing/selftests/rcutorture/res/2022.09.26-23.33.34-remote/TREE04.59/console.log
> TREE04.59
> WARNING: Summary: Call Traces: 1 Stalls: 8615
> TREE04.6 ------- 4348443 GPs (26.8422/s) [rcu: g35341129 f0x0
> total-gps=8835575] n_max_cbs: 2329432

First, thank you for running this!

This is not the typical failure that we were seeing, which would show
up as a 2,199.0-second RCU CPU stall during which time there would be
no console messages.

But please do let me know how continuing tests go!

							Thanx, Paul

> ...
> ...
> TREE04.91 ------- 4895716 GPs (30.2205/s) [rcu: g39322065 f0x0
> total-gps=9830808] n_max_cbs: 2208839
> TREE04.92 ------- 4902696 GPs (30.2636/s) [rcu: g39113441 f0x0
> total-gps=9778652] n_max_cbs: 1412377
> TREE04.93 ------- 4891393 GPs (30.1938/s) [rcu: g39244749 f0x0
> total-gps=9811481] n_max_cbs: 1772653
> TREE04.94 ------- 4921510 GPs (30.3797/s) [rcu: g39187349 f0x0
> total-gps=9797129] n_max_cbs: 1120534
> TREE04.95 ------- 4885795 GPs (30.1592/s) [rcu: g39020985 f0x0
> total-gps=9755538] n_max_cbs: 1178416
> TREE04.96 ------- 4889097 GPs (30.1796/s) [rcu: g39097057 f0x0
> total-gps=9774556] n_max_cbs: 1861434
> 1 runs with runtime errors.
>  --- Done at Wed Sep 28 08:40:31 PM EDT 2022 (1d 21:06:57) exitcode 2
> "
> 
> Quote from  console.log of TREE04.59
> "
> .....
> [162001.696486] rcu-torture: rcu_torture_barrier_cbs is stopping
> [162001.697004] rcu-torture: Stopping rcu_torture_fwd_prog task
> [162001.697662] rcu_torture_fwd_prog n_max_cbs: 0
> [162001.698195] rcu_torture_fwd_prog: Starting forward-progress test 0
> [162001.698782] rcu_torture_fwd_prog_cr: Starting forward-progress test 0
> [162001.707571] rcu_torture_fwd_prog_cr: Waiting for CBs:
> rcu_barrier+0x0/0x3b0() 0
> [162002.738504] rcu_torture_fwd_prog_nr: Starting forward-progress test 0
> [162002.746491] rcu_torture_fwd_prog_nr: Waiting for CBs:
> rcu_barrier+0x0/0x3b0() 0
> [162002.850483] rcu_torture_fwd_prog: tested 2105 tested_tries 2107
> [162002.851008] rcu-torture: rcu_torture_fwd_prog is stopping
> [162002.851542] rcu-torture: Stopping rcu_torture_writer task
> [162004.530463] rcu-torture: rtc: 00000000ac003c99 ver: 800516 tfle: 0
> rta: 800517 rtaf: 0 rtf: 800507 rtmbe: 0 rtmbkf: 0/142699 rtbe: 0
> rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 205710931 onoff:
> 185194/185194:185196/185196 1,1860:1,3263 25610601:47601063 (HZ=1000)
> barrier: 773783/773783:0 read-exits: 184960 nocb-toggles: 0:0
> [162004.532583] rcu-torture: Reader Pipe:  343007605654 1359216 0 0 0
> 0 0 0 0 0 0
> [162004.533113] rcu-torture: Reader Batch:  342996212546 12752324 0 0
> 0 0 0 0 0 0 0
> [162004.533648] rcu-torture: Free-Block Circulation:  800516 800515
> 800514 800513 800512 800511 800510 800509 800508 800507 0
> [162004.534442] ??? Writer stall state RTWS_EXP_SYNC(4) g30755544 f0x0
> ->state 0x2 cpu 0
> [162004.535057] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1)
> ->state: 0x402 ->rt_priority 0 delta ->gp_start 1674 ->gp_activity
> 1670 ->gp_req_activity 1674 ->gp_wake_time 1674 ->gp_wake_seq 30755540
> ->gp_seq 30755544 ->gp_seq_needed 30755544 ->gp_max 989 ->gp_flags 0x0
> [162004.536805] rcu:    CB 1^0->2 KbclSW F2838 L2838 C0 ..... q0 S CPU 0
> [162004.537277] rcu:    CB 2^0->3 KbclSW F2911 L2911 C7 ..... q0 S CPU 0
> [162004.537742] rcu:    CB 3^0->-1 KbclSW F1686 L1686 C2 ..... q0 S CPU 0
> [162004.538217] rcu: nocb GP 4 KldtS W[..] ..:0 rnp 4:7 2176869 S CPU 0
> [162004.538729] rcu:    CB 4^4->5 KbclSW F2912 L2912 C7 ..... q0 S CPU 0
> [162004.539202] rcu:    CB 5^4->6 KbclSW F2871 L2872 C1 ..... q0 S CPU 0
> [162004.539667] rcu:    CB 6^4->7 KbclSW F4060 L4060 C0 ..... q0 S CPU 0
> [162004.540136] rcu:    CB 7^4->-1 KbclSW F5763 L5763 C1 ..... q0 S CPU 0
> [162004.540653] rcu: RCU callbacks invoked since boot: 1431149091
> [162004.541076] rcu-torture: rcu_torture_stats is stopping
> "
> 
> I have no idea whether this is related to the reverted commit.
> 
> 
> Thanks,
> 
> Pingfan
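
The test-hour arithmetic discussed in this thread (H = 4000/(system_cpu_num/8),
250*16=4000) can be sketched in Python as follows.  This is only an
illustration under stated assumptions: the function names are hypothetical
and are not part of the rcutorture scripts, and the figure of 8 CPUs per
instance comes from TREE04 having CONFIG_NR_CPUS=8, as noted above.

```python
# Hypothetical helpers for planning rcutorture runs; not part of kvm.sh
# or kvm-remote.sh.

CPUS_PER_TREE04 = 8  # Assumption taken from the thread: TREE04 has CONFIG_NR_CPUS=8.


def concurrent_instances(system_cpus, cpus_per_instance=CPUS_PER_TREE04):
    """Number of scenario instances that fit on one system at the same time."""
    return system_cpus // cpus_per_instance


def wall_clock_hours(target_test_hours, system_cpus, systems=1,
                     cpus_per_instance=CPUS_PER_TREE04):
    """Wall-clock hours needed to accumulate target_test_hours of testing."""
    instances = systems * concurrent_instances(system_cpus, cpus_per_instance)
    return target_test_hours / instances


def total_test_hours(instances, duration_hours):
    """Accumulated test hours when `instances` scenarios each run for duration_hours."""
    return instances * duration_hours


if __name__ == "__main__":
    # One 80-CPU system: 10 concurrent instances of TREE04.
    print(wall_clock_hours(4000, 80))
    # One 128-CPU system: 16 instances, matching --configs 16*TREE04.
    print(wall_clock_hours(4000, 128))
    # Twenty 80-CPU systems: 200 instances, matching --configs "200*TREE04".
    print(wall_clock_hours(4000, 80, systems=20))
    # Three 256-CPU systems at --duration 45h: 96 instances in total.
    print(total_test_hours(3 * concurrent_instances(256), 45))
```

With these helpers, the three-machine run quoted above (96 instances for
45 hours each) accumulates somewhat more than the 4,000-hour target.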


