* CFS scheduler OLTP perforamnce
@ 2008-12-11 23:25 Ma, Chinang
  2008-12-12 12:12 ` Peter Zijlstra
  2008-12-12 12:37 ` Gilles.Carry
  0 siblings, 2 replies; 14+ messages in thread
From: Ma, Chinang @ 2008-12-11 23:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Wilcox, Matthew R, Van De Ven, Arjan, Styner,
	Douglas W, Chilukuri, Harita, Wang, Peter Xihong, Nueckel,
	Hubert

We are evaluating CFS OLTP performance with the 2.6.28-rc7 kernel. In this workload, once a database foreground process commits a transaction, it signals the log writer process to write to the log file. Foreground processes wait until the log writer finishes writing and wakes them up. With hundreds of foreground processes running in the system, it is important that the log writer gets to run as soon as data is available.

Here are the experiments we have done with 2.6.28-rc7.
1. Increase the log writer priority with "renice -20 <log writer pid>" while keeping all other processes at the default CFS priority. This gives a baseline performance with log latency (scheduling + i/o) at 7 ms.

2. To reduce log latency, we set the log writer to SCHED_RR with a higher priority. We tried "chrt -p 49 <log writer pid>" and got a 0.7% boost in performance, with log latency reduced to 6.4 ms.

It seems that in this case renicing to a higher priority under CFS did not reduce scheduling latency as well as SCHED_RR did.
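
For readers who want to script the two settings instead of using renice and chrt, here is a minimal Python sketch of roughly equivalent calls on Linux. The helper names (policy_name, renice, set_round_robin) are hypothetical and not part of the benchmark setup; the values -20 and 49 are the ones quoted above. Actually applying either change normally needs root or CAP_SYS_NICE.

```python
import os

# Linux only: the os.SCHED_* constants and os.sched_* calls are not
# available on every platform.

def policy_name(policy: int) -> str:
    """Map a scheduling policy constant to a readable name."""
    names = {
        os.SCHED_OTHER: "SCHED_OTHER",  # the default (CFS) policy
        os.SCHED_FIFO: "SCHED_FIFO",
        os.SCHED_RR: "SCHED_RR",
    }
    return names.get(policy, f"unknown({policy})")

def renice(pid: int, nice: int = -20) -> None:
    """Rough equivalent of `renice -20 <pid>` (experiment 1)."""
    os.setpriority(os.PRIO_PROCESS, pid, nice)

def set_round_robin(pid: int, rt_priority: int = 49) -> None:
    """Rough equivalent of `chrt -p 49 <pid>` with SCHED_RR (experiment 2)."""
    os.sched_setscheduler(pid, os.SCHED_RR, os.sched_param(rt_priority))

if __name__ == "__main__":
    # Read-only check: report the policy of the current process (pid 0 = self).
    print(policy_name(os.sched_getscheduler(0)))
```

Calling renice(pid) or set_round_robin(pid) without sufficient privileges raises PermissionError, so a wrapper script would want to catch that.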

-Chinang


* Re: CFS scheduler OLTP perforamnce
  2008-12-11 23:25 CFS scheduler OLTP perforamnce Ma, Chinang
@ 2008-12-12 12:12 ` Peter Zijlstra
  2008-12-12 13:38   ` Peter Zijlstra
                     ` (2 more replies)
  2008-12-12 12:37 ` Gilles.Carry
  1 sibling, 3 replies; 14+ messages in thread
From: Peter Zijlstra @ 2008-12-12 12:12 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Ingo Molnar, linux-kernel, Wilcox, Matthew R, Van De Ven, Arjan,
	Styner, Douglas W, Chilukuri, Harita, Wang, Peter Xihong,
	Nueckel, Hubert

On Thu, 2008-12-11 at 16:25 -0700, Ma, Chinang wrote:
> We are evaluating the CFS OLTP performance with 2.6.28-c7 kernel. In
> this workload once a database foreground process commit a transaction
> it will signal the log writer process to write to the log file.
> Foreground processes will wait until log writer finish writing and
> wake them up. With hundreds of foreground process running in the
> system, it is important that the log writer get to run as soon as data
> is available. 
> 
> Here are the experiments we have done with 2.6.28-rc7.
> 1. Increase log writer priority "renice -20 <log writer pid>" while
> keeping all other processes running in default CFS priority. We get a
> baseline performance with log latency (scheduling + i/o) at 7 ms.

Is this better than, or the same as, nice-0?

> 2. To reduce log latency, we set log writer to SCHED_RR with higher
> priority. We tried "chrt -p 49  <log writer pid>" and got 0.7% boost
> in performance with log latency reduced to 6.4 ms.
> 
> It seems that in this case renice to higher priority with CFS did not
> reduce scheduling latency as well as SCHED_RR.

Is there a question in this email?




* Re: CFS scheduler OLTP perforamnce
  2008-12-11 23:25 CFS scheduler OLTP perforamnce Ma, Chinang
  2008-12-12 12:12 ` Peter Zijlstra
@ 2008-12-12 12:37 ` Gilles.Carry
  1 sibling, 0 replies; 14+ messages in thread
From: Gilles.Carry @ 2008-12-12 12:37 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Ingo Molnar, linux-kernel, Wilcox, Matthew R, Van De Ven, Arjan,
	Styner, Douglas W, Chilukuri, Harita, Wang, Peter Xihong,
	Nueckel, Hubert

Ma, Chinang wrote:
> We are evaluating the CFS OLTP performance with 2.6.28-c7 kernel. In this workload once a database foreground process commit a transaction it will signal the log writer process to write to the log file. Foreground processes will wait until log writer finish writing and wake them up. With hundreds of foreground process running in the system, it is important that the log writer get to run as soon as data is available. 
> 
> Here are the experiments we have done with 2.6.28-rc7.
> 1. Increase log writer priority "renice -20 <log writer pid>" while keeping all other processes running in default CFS priority. We get a baseline performance with log latency (scheduling + i/o) at 7 ms.
> 
> 2. To reduce log latency, we set log writer to SCHED_RR with higher priority. We tried "chrt -p 49  <log writer pid>" and got 0.7% boost in performance with log latency reduced to 6.4 ms.
> 
> It seems that in this case renice to higher priority with CFS did not reduce scheduling latency as well as SCHED_RR.

Maybe you should try an RT kernel.

Gilles.


* Re: CFS scheduler OLTP perforamnce
  2008-12-12 12:12 ` Peter Zijlstra
@ 2008-12-12 13:38   ` Peter Zijlstra
  2008-12-12 14:04     ` Gilles.Carry
  2008-12-12 21:45     ` Ma, Chinang
  2008-12-12 14:15   ` Andi Kleen
  2008-12-12 17:25   ` Ma, Chinang
  2 siblings, 2 replies; 14+ messages in thread
From: Peter Zijlstra @ 2008-12-12 13:38 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Ingo Molnar, linux-kernel, Wilcox, Matthew R, Van De Ven, Arjan,
	Styner, Douglas W, Chilukuri, Harita, Wang, Peter Xihong,
	Nueckel, Hubert

On Fri, 2008-12-12 at 13:12 +0100, Peter Zijlstra wrote:
> On Thu, 2008-12-11 at 16:25 -0700, Ma, Chinang wrote:
> > We are evaluating the CFS OLTP performance with 2.6.28-c7 kernel. In
> > this workload once a database foreground process commit a transaction
> > it will signal the log writer process to write to the log file.
> > Foreground processes will wait until log writer finish writing and
> > wake them up. With hundreds of foreground process running in the
> > system, it is important that the log writer get to run as soon as data
> > is available. 
> > 
> > Here are the experiments we have done with 2.6.28-rc7.
> > 1. Increase log writer priority "renice -20 <log writer pid>" while
> > keeping all other processes running in default CFS priority. We get a
> > baseline performance with log latency (scheduling + i/o) at 7 ms.
> 
> Is this better or the same than nice-0 ?
> 
> > 2. To reduce log latency, we set log writer to SCHED_RR with higher
> > priority. We tried "chrt -p 49  <log writer pid>" and got 0.7% boost
> > in performance with log latency reduced to 6.4 ms.

BTW, 6.4ms schedule latency sounds insanely long for a RR task, are you
running a PREEMPT=n kernel or something?

How would you characterize the log task's behaviour?

 - does it run long/short (any quantization)
 - does it sleep long/short - how does it compare to its runtime?
 - does it wake others?
   - if so, always the one who woke it, or multiple others?




* Re: CFS scheduler OLTP perforamnce
  2008-12-12 13:38   ` Peter Zijlstra
@ 2008-12-12 14:04     ` Gilles.Carry
  2008-12-12 21:45     ` Ma, Chinang
  1 sibling, 0 replies; 14+ messages in thread
From: Gilles.Carry @ 2008-12-12 14:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ma, Chinang, Ingo Molnar, linux-kernel, Wilcox, Matthew R,
	Van De Ven, Arjan, Styner, Douglas W, Chilukuri, Harita, Wang,
	Peter Xihong, Nueckel, Hubert

Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 13:12 +0100, Peter Zijlstra wrote:
>> On Thu, 2008-12-11 at 16:25 -0700, Ma, Chinang wrote:
>>> We are evaluating the CFS OLTP performance with 2.6.28-c7 kernel. In
>>> this workload once a database foreground process commit a transaction
>>> it will signal the log writer process to write to the log file.
>>> Foreground processes will wait until log writer finish writing and
>>> wake them up. With hundreds of foreground process running in the
>>> system, it is important that the log writer get to run as soon as data
>>> is available. 
>>>
>>> Here are the experiments we have done with 2.6.28-rc7.
>>> 1. Increase log writer priority "renice -20 <log writer pid>" while
>>> keeping all other processes running in default CFS priority. We get a
>>> baseline performance with log latency (scheduling + i/o) at 7 ms.
>> Is this better or the same than nice-0 ?
>>
>>> 2. To reduce log latency, we set log writer to SCHED_RR with higher
>>> priority. We tried "chrt -p 49  <log writer pid>" and got 0.7% boost
>>> in performance with log latency reduced to 6.4 ms.
> 
> BTW, 6.4ms schedule latency sounds insanely long for a RR task, are you
> running a PREEMPT=n kernel or something?
> 
> How would you characterize the log tasks behaviour?
> 
>  - does it run long/short (any quantization)
>  - does it sleep long/short - how does it compare to its runtime?
>  - does it wake others?
>    - if so, always the one who woke it, or multiple others?
> 

To this, I would add these questions:
- Do you have the source code of the DBMS? Are you sure that the
timestamp you get corresponds to the very beginning of the wakeup?
- If many foreground tasks write data into a buffer which is later read
by the log writer, there may be locks (mutexes?) involved, and the log
writer would have to wait for lower-priority tasks to release the lock.

Gilles.


* Re: CFS scheduler OLTP perforamnce
  2008-12-12 12:12 ` Peter Zijlstra
  2008-12-12 13:38   ` Peter Zijlstra
@ 2008-12-12 14:15   ` Andi Kleen
  2008-12-12 14:22     ` Peter Zijlstra
  2008-12-12 17:25   ` Ma, Chinang
  2 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2008-12-12 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ma, Chinang, Ingo Molnar, linux-kernel, Wilcox, Matthew R,
	Van De Ven, Arjan, Styner, Douglas W, Chilukuri, Harita, Wang,
	Peter Xihong, Nueckel, Hubert

Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> It seems that in this case renice to higher priority with CFS did not
>> reduce scheduling latency as well as SCHED_RR.
>
> Is there a question in this email?

The question is how to make nice perform as well as SCHED_RR.

-Andi
-- 
ak@linux.intel.com


* Re: CFS scheduler OLTP perforamnce
  2008-12-12 14:15   ` Andi Kleen
@ 2008-12-12 14:22     ` Peter Zijlstra
  2008-12-12 14:39       ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2008-12-12 14:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ma, Chinang, Ingo Molnar, linux-kernel, Wilcox, Matthew R,
	Van De Ven, Arjan, Styner, Douglas W, Chilukuri, Harita, Wang,
	Peter Xihong, Nueckel, Hubert

On Fri, 2008-12-12 at 15:15 +0100, Andi Kleen wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> >> 
> >> It seems that in this case renice to higher priority with CFS did not
> >> reduce scheduling latency as well as SCHED_RR.
> >
> > Is there a question in this email?
> 
> The question is how to make nice perform as well as SCHED_RR.

Depending on the circumstances, you can't - SCHED_RR doesn't bother with
fairness.

That is not to say there is no room for improvement, but that really
needs more information.


* Re: CFS scheduler OLTP perforamnce
  2008-12-12 14:22     ` Peter Zijlstra
@ 2008-12-12 14:39       ` Andi Kleen
  0 siblings, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2008-12-12 14:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Ma, Chinang, Ingo Molnar, linux-kernel, Wilcox,
	Matthew R, Van De Ven, Arjan, Styner, Douglas W, Chilukuri,
	Harita, Wang, Peter Xihong, Nueckel, Hubert

On Fri, Dec 12, 2008 at 03:22:15PM +0100, Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 15:15 +0100, Andi Kleen wrote:
> > Peter Zijlstra <peterz@infradead.org> writes:
> > >> 
> > >> It seems that in this case renice to higher priority with CFS did not
> > >> reduce scheduling latency as well as SCHED_RR.
> > >
> > > Is there a question in this email?
> > 
> > The question is how to make nice perform as well as SCHED_RR.
> 
> Depending on the circumstances, you can't - SCHED_RR doesn't bother with
> fairness.

When the spread between nice levels (negative/positive) is large enough
at least the log writer should be able to schedule soon most of the
time, no?

At least that doesn't seem to work.

Also in general there seems to be a starvation issue here between
producer and consumer.

-Andi


* RE: CFS scheduler OLTP perforamnce
  2008-12-12 12:12 ` Peter Zijlstra
  2008-12-12 13:38   ` Peter Zijlstra
  2008-12-12 14:15   ` Andi Kleen
@ 2008-12-12 17:25   ` Ma, Chinang
  2 siblings, 0 replies; 14+ messages in thread
From: Ma, Chinang @ 2008-12-12 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Wilcox, Matthew R, Van De Ven, Arjan,
	Styner, Douglas W, Chilukuri, Harita, Wang, Peter Xihong,
	Nueckel, Hubert



>-----Original Message-----
>From: Peter Zijlstra [mailto:peterz@infradead.org]
>Sent: Friday, December 12, 2008 4:12 AM
>To: Ma, Chinang
>Cc: Ingo Molnar; linux-kernel@vger.kernel.org; Wilcox, Matthew R; Van De
>Ven, Arjan; Styner, Douglas W; Chilukuri, Harita; Wang, Peter Xihong;
>Nueckel, Hubert
>Subject: Re: CFS scheduler OLTP perforamnce
>
>On Thu, 2008-12-11 at 16:25 -0700, Ma, Chinang wrote:
>> We are evaluating the CFS OLTP performance with 2.6.28-c7 kernel. In
>> this workload once a database foreground process commit a transaction
>> it will signal the log writer process to write to the log file.
>> Foreground processes will wait until log writer finish writing and
>> wake them up. With hundreds of foreground process running in the
>> system, it is important that the log writer get to run as soon as data
>> is available.
>>
>> Here are the experiments we have done with 2.6.28-rc7.
>> 1. Increase log writer priority "renice -20 <log writer pid>" while
>> keeping all other processes running in default CFS priority. We get a
>> baseline performance with log latency (scheduling + i/o) at 7 ms.
>
>Is this better or the same than nice-0 ?
>
>> 2. To reduce log latency, we set log writer to SCHED_RR with higher
>> priority. We tried "chrt -p 49  <log writer pid>" and got 0.7% boost
>> in performance with log latency reduced to 6.4 ms.
>>
>> It seems that in this case renice to higher priority with CFS did not
>> reduce scheduling latency as well as SCHED_RR.
>
>Is there a question in this email?
>

Can renice perform as well as SCHED_RR?

-Chinang


* RE: CFS scheduler OLTP perforamnce
  2008-12-12 13:38   ` Peter Zijlstra
  2008-12-12 14:04     ` Gilles.Carry
@ 2008-12-12 21:45     ` Ma, Chinang
  2008-12-14 14:43       ` Henrik Austad
  1 sibling, 1 reply; 14+ messages in thread
From: Ma, Chinang @ 2008-12-12 21:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Wilcox, Matthew R, Van De Ven, Arjan,
	Styner, Douglas W, Chilukuri, Harita, Wang, Peter Xihong,
	Nueckel, Hubert, Chris Mason



>-----Original Message-----
>From: Peter Zijlstra [mailto:peterz@infradead.org]
>Sent: Friday, December 12, 2008 5:38 AM
>To: Ma, Chinang
>Cc: Ingo Molnar; linux-kernel@vger.kernel.org; Wilcox, Matthew R; Van De
>Ven, Arjan; Styner, Douglas W; Chilukuri, Harita; Wang, Peter Xihong;
>Nueckel, Hubert
>Subject: Re: CFS scheduler OLTP perforamnce
>
>On Fri, 2008-12-12 at 13:12 +0100, Peter Zijlstra wrote:
>> On Thu, 2008-12-11 at 16:25 -0700, Ma, Chinang wrote:
>> > We are evaluating the CFS OLTP performance with 2.6.28-c7 kernel. In
>> > this workload once a database foreground process commit a transaction
>> > it will signal the log writer process to write to the log file.
>> > Foreground processes will wait until log writer finish writing and
>> > wake them up. With hundreds of foreground process running in the
>> > system, it is important that the log writer get to run as soon as data
>> > is available.
>> >
>> > Here are the experiments we have done with 2.6.28-rc7.
>> > 1. Increase log writer priority "renice -20 <log writer pid>" while
>> > keeping all other processes running in default CFS priority. We get a
>> > baseline performance with log latency (scheduling + i/o) at 7 ms.
>>
>> Is this better or the same than nice-0 ?

I left out one detail of the database processes. There are also data writers that are responsible for writing back dirty buffers to free up buffers for new transactions. These processes also need to be reniced to a higher priority (-19). When the data writers are left at nice-0, the workload is throttled by the limited number of free buffers and we cannot even fully utilize the system. I had to renice both the data writer and log writer processes.

>>
>> > 2. To reduce log latency, we set log writer to SCHED_RR with higher
>> > priority. We tried "chrt -p 49  <log writer pid>" and got 0.7% boost
>> > in performance with log latency reduced to 6.4 ms.
>
>BTW, 6.4ms schedule latency sounds insanely long for a RR task, are you
>running a PREEMPT=n kernel or something?

The 6.4ms log write latency was measured in the foreground process. It went like this:
1. The foreground process gets the start time and posts the log writer, then waits and sleeps.
2. The log writer is scheduled and collects the log data.
3. The log writer writes to the log file and waits for i/o.
4. The write completes. The log writer uses a vector post to wake up all the waiting foreground processes.
5. The foreground process wakes up and gets the end time.

>
>How would you characterize the log tasks behaviour?
>
> - does it run long/short (any quantization)

There were 371 log writes per second, so ~2.7ms per log writer execution. Of this, we know ~2.13 ms was spent waiting for log file i/o, so the log writer was running for (2.7ms - 2.13ms) = 0.57ms.
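
The arithmetic above can be re-checked with a few lines of Python. This is only a sketch of the calculation: the constants are the figures quoted in this message, and the function names are made up for illustration.

```python
# Figures quoted in the message above.
LOG_WRITES_PER_SECOND = 371
IO_WAIT_MS = 2.13  # time per write spent waiting for log file i/o

def interval_ms(writes_per_second: float) -> float:
    """Average time available per log write, in milliseconds."""
    return 1000.0 / writes_per_second

def running_ms(writes_per_second: float, io_wait_ms: float) -> float:
    """Part of each interval not spent waiting for i/o."""
    return interval_ms(writes_per_second) - io_wait_ms

if __name__ == "__main__":
    print(f"interval: {interval_ms(LOG_WRITES_PER_SECOND):.2f} ms")
    print(f"running:  {running_ms(LOG_WRITES_PER_SECOND, IO_WAIT_MS):.2f} ms")
```

Note that this treats the whole non-i/o remainder as run time, which only holds if the log writer rarely sleeps between writes.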
	
> - does it sleep long/short - how does it compare to its runtime?


At the current throughput, the log writer should be constantly writing the log and rarely sleeping.

> - does it wake others?
>   - if so, always the one who woke it, or multiple others?
>

The log writer wakes multiple foreground processes using a vector post.

-Chinang


* Re: CFS scheduler OLTP perforamnce
  2008-12-12 21:45     ` Ma, Chinang
@ 2008-12-14 14:43       ` Henrik Austad
  2008-12-15 15:32         ` Ma, Chinang
  0 siblings, 1 reply; 14+ messages in thread
From: Henrik Austad @ 2008-12-14 14:43 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Wilcox, Matthew R,
	Van De Ven, Arjan, Styner, Douglas W, Chilukuri, Harita, Wang,
	Peter Xihong, Nueckel, Hubert, Chris Mason

On Friday 12 December 2008 22:45:11 Ma, Chinang wrote:

*snip*

> >> > We are evaluating the CFS OLTP performance with 2.6.28-c7 kernel. In
> >> > this workload once a database foreground process commit a transaction
> >> > it will signal the log writer process to write to the log file.
> >> > Foreground processes will wait until log writer finish writing and
> >> > wake them up. With hundreds of foreground process running in the
> >> > system, it is important that the log writer get to run as soon as data
> >> > is available.
> >> >
> >> > Here are the experiments we have done with 2.6.28-rc7.
> >> > 1. Increase log writer priority "renice -20 <log writer pid>" while
> >> > keeping all other processes running in default CFS priority. We get a
> >> > baseline performance with log latency (scheduling + i/o) at 7 ms.
> >>
> >> Is this better or the same than nice-0 ?
>
> I left out one detail of the database processes. There are also data
> writers that responsible for write back dirty buffers to free up buffer for
> new transactions. These processes also need to be renice to higher priority
> (-19). When data writers are left at nice-0, the workload was throttle by
> the limited number of free buffers and we cannot even fully utilize the
> system. I had to renice data writer and log writer process.
>
> >> > 2. To reduce log latency, we set log writer to SCHED_RR with higher
> >> > priority. We tried "chrt -p 49  <log writer pid>" and got 0.7% boost
> >> > in performance with log latency reduced to 6.4 ms.

What is the time needed to actually write the data to disk?

> >
> >BTW, 6.4ms schedule latency sounds insanely long for a RR task, are you
> >running a PREEMPT=n kernel or something?
>
> The 6.4ms log write latency was measured in the foreground process. It went
> like this: 
> 1. Foreground progress get start time and post log writer, 
> foreground wait and sleep. 
> 2. log writer was scheduled and collect log 
> data.
> 3. log writer write to log file and wait for i/o.
> 4. Write completed. log writer use vector post to wake up all the waiting
> foreground process. 
> 5. Foreground process wake up and get end time. 

OK, let me see if I got this right:
- You have a foreground process that runs with normal priority (i.e. +19 
to -20)
- This process appends data to a buffer, records the time and signals the 
log-writer to flush the buffer to disk as soon as possible
- The log-writer is awoken, writes the buffer to disk, signals the foreground 
process that the job is done and exits.
- The foreground process records the time when it is awoken.

Is this really a kernel-scheduler problem? Or is it an error in the way the
timestamps are recorded?

Does not the time recorded then depend upon how the foreground process is 
scheduled, and not the log-writer? What happens if you log the time at the 
start and end of the log-writer function? Then you would get the time-delta 
between signaling and rr-wakeup, as well as time spent writing buffer to 
disk. By using the final timestamp in the foreground process, you'd get the 
latency for the last foreground-process wakeup as well.
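
To illustrate the timestamp placement suggested above, here is a small Python sketch using two threads as stand-ins for the foreground process and the log writer. It is not the DBMS code: the function name measure, the use of threading.Event for the post/wake steps, and the sleep that stands in for the log write are all assumptions made for the example.

```python
import threading
import time

def measure(io_time_s: float = 0.002) -> dict:
    """Split one commit round-trip into three intervals (seconds)."""
    request = threading.Event()  # foreground -> log writer ("post")
    done = threading.Event()     # log writer -> foreground ("wake up")
    stamps = {}

    def log_writer():
        request.wait()
        stamps["writer_start"] = time.monotonic()  # writer is now running
        time.sleep(io_time_s)                      # stand-in for the log write
        stamps["writer_end"] = time.monotonic()
        done.set()

    writer = threading.Thread(target=log_writer)
    writer.start()

    stamps["fg_post"] = time.monotonic()  # foreground start time
    request.set()
    done.wait()
    stamps["fg_wake"] = time.monotonic()  # foreground end time
    writer.join()

    return {
        "writer_wakeup": stamps["writer_start"] - stamps["fg_post"],
        "write": stamps["writer_end"] - stamps["writer_start"],
        "fg_wakeup": stamps["fg_wake"] - stamps["writer_end"],
        "total": stamps["fg_wake"] - stamps["fg_post"],
    }

if __name__ == "__main__":
    for name, seconds in measure().items():
        print(f"{name}: {seconds * 1000:.3f} ms")
```

With only the two foreground timestamps, just "total" is visible; the two extra timestamps in the writer are what separate the writer's wakeup delay from the write itself and from the foreground wakeup.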

Or am I completely missing the point here?

> >How would you characterize the log tasks behaviour?
> >
> > - does it run long/short (any quantization)
>
> There were 371 log writes per second so ~2.7ms per log writer execution.
> Out of this we know ~2.13 ms was spent waiting for log file i/o. Log writer
> was running for (2.7ms - 2.13ms) = 0.57ms
>
> > - does it sleep long/short - how does it compare to its runtime?
>
> With the current throughput, log writer should be constantly writing log
> and rarely sleep.
>
> > - does it wake others?
> >   - if so, always the one who woke it, or multiple others?
>
> Log writer wake multiple foreground processes using vector post.

So, you don't really know if the initial process that recorded the timestamp 
is the one who is awoken - so the time taken could be *very* long?

henrik


* RE: CFS scheduler OLTP perforamnce
  2008-12-14 14:43       ` Henrik Austad
@ 2008-12-15 15:32         ` Ma, Chinang
  2008-12-15 16:57           ` Henrik Austad
  0 siblings, 1 reply; 14+ messages in thread
From: Ma, Chinang @ 2008-12-15 15:32 UTC (permalink / raw)
  To: Henrik Austad
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Wilcox, Matthew R,
	Van De Ven, Arjan, Styner, Douglas W, Chilukuri, Harita, Wang,
	Peter Xihong, Nueckel, Hubert, Chris Mason, Tripathi, Sharad C



>-----Original Message-----
>From: Henrik Austad [mailto:henrik@austad.us]
>Sent: Sunday, December 14, 2008 6:43 AM
>To: Ma, Chinang
>Cc: Peter Zijlstra; Ingo Molnar; linux-kernel@vger.kernel.org; Wilcox,
>Matthew R; Van De Ven, Arjan; Styner, Douglas W; Chilukuri, Harita; Wang,
>Peter Xihong; Nueckel, Hubert; Chris Mason
>Subject: Re: CFS scheduler OLTP perforamnce
>
>On Friday 12 December 2008 22:45:11 Ma, Chinang wrote:
>
>*snip*
>
>> >> > We are evaluating the CFS OLTP performance with 2.6.28-c7 kernel. In
>> >> > this workload once a database foreground process commit a
>transaction
>> >> > it will signal the log writer process to write to the log file.
>> >> > Foreground processes will wait until log writer finish writing and
>> >> > wake them up. With hundreds of foreground process running in the
>> >> > system, it is important that the log writer get to run as soon as
>data
>> >> > is available.
>> >> >
>> >> > Here are the experiments we have done with 2.6.28-rc7.
>> >> > 1. Increase log writer priority "renice -20 <log writer pid>" while
>> >> > keeping all other processes running in default CFS priority. We get
>a
>> >> > baseline performance with log latency (scheduling + i/o) at 7 ms.
>> >>
>> >> Is this better or the same than nice-0 ?
>>
>> I left out one detail of the database processes. There are also data
>> writers that responsible for write back dirty buffers to free up buffer
>for
>> new transactions. These processes also need to be renice to higher
>priority
>> (-19). When data writers are left at nice-0, the workload was throttle by
>> the limited number of free buffers and we cannot even fully utilize the
>> system. I had to renice data writer and log writer process.
>>
>> >> > 2. To reduce log latency, we set log writer to SCHED_RR with higher
>> >> > priority. We tried "chrt -p 49  <log writer pid>" and got 0.7% boost
>> >> > in performance with log latency reduced to 6.4 ms.
>
>What is the time needed to actually write the data to disk?

From iostat, the log file write latency is 1.62ms. The DBMS has a stat for log file write of 2.17ms, which includes the latency of io_submit, physical i/o and io_getevents.
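
As a rough cross-check, the figures quoted in this thread can be subtracted from each other. The Python sketch below does only that; it assumes the averages are comparable and that the same i/o time applies to both the renice and the SCHED_RR runs, neither of which is established here. The function and variable names are made up for the example.

```python
# Figures quoted in this thread, in milliseconds.
IOSTAT_WRITE_MS = 1.62   # device-level log write latency from iostat
DBMS_WRITE_MS = 2.17     # io_submit + physical i/o + io_getevents
LATENCY_NICE_MS = 7.0    # log latency with renice -20
LATENCY_RR_MS = 6.4      # log latency with SCHED_RR priority 49

def submit_overhead_ms() -> float:
    """Write time seen by the DBMS beyond what iostat reports."""
    return DBMS_WRITE_MS - IOSTAT_WRITE_MS

def non_io_ms(total_latency_ms: float) -> float:
    """Part of the foreground-measured latency not explained by the write."""
    return total_latency_ms - DBMS_WRITE_MS

if __name__ == "__main__":
    print(f"submit overhead:   {submit_overhead_ms():.2f} ms")
    print(f"non-i/o (nice):    {non_io_ms(LATENCY_NICE_MS):.2f} ms")
    print(f"non-i/o (SCHED_RR): {non_io_ms(LATENCY_RR_MS):.2f} ms")
```

Under those assumptions, most of the foreground-measured latency lies outside the write itself in both configurations, which covers scheduling of both the log writer and the foreground processes.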

>
>> >
>> >BTW, 6.4ms schedule latency sounds insanely long for a RR task, are you
>> >running a PREEMPT=n kernel or something?
>>
>> The 6.4ms log write latency was measured in the foreground process. It
>went
>> like this:
>> 1. Foreground progress get start time and post log writer,
>> foreground wait and sleep.
>> 2. log writer was scheduled and collect log
>> data.
>> 3. log writer write to log file and wait for i/o.
>> 4. Write completed. log writer use vector post to wake up all the waiting
>> foreground process.
>> 5. Foreground process wake up and get end time.
>
>OK, let me see if I got this right:
>- You have a foreground process that runs with normal priority (i.e. +19
>to -20)
>- This process appends data to a buffer, records the time and signals the
>log-writer to flush the buffer to disk as soon as possible
>- The log-writer is awoken, writes the buffer to disk, signals the
>foreground
>process that the job is done and exits.
>- The foregorund process records the time when it is awoken.
>
>Is this really a kernel-scheduler problem? Or is it an error in the way the
>timestamps are recoreded?
>
>Does not the time recorded then depend upon how the foreground process is
>scheduled, and not the log-writer? What happens if you log the time at the
>start and end of the log-writer function? Then you would get the time-delta
>between signaling and rr-wakeup, as well as time spent writing buffer to
>disk. By using the final timestamp in the foreground process, you'd get the
>latency for the last foreground-process wakeup as well.
>
>Or am I completely missing the point here?

Since I don't have access to the source code, I cannot make the above change to the foreground process or the log writer.

>
>> >How would you characterize the log tasks behaviour?
>> >
>> > - does it run long/short (any quantization)
>>
>> There were 371 log writes per second so ~2.7ms per log writer execution.
>> Out of this we know ~2.13 ms was spent waiting for log file i/o. Log
>writer
>> was running for (2.7ms - 2.13ms) = 0.57ms
>>
>> > - does it sleep long/short - how does it compare to its runtime?
>>
>> With the current throughput, log writer should be constantly writing log
>> and rarely sleep.
>>
>> > - does it wake others?
>> >   - if so, always the one who woke it, or multiple others?
>>
>> Log writer wake multiple foreground processes using vector post.
>
>So, you don't really know if the initial process that recorded the
>timestamp
>is the one who is awoken - so the time taken could be *very* long?
>
Yes, if we only look at the timestamp in just one foreground process. The reported log latency is an average over all the foreground stats. Since this average went down with the log writer set to SCHED_RR, I took that to mean the log writer gets to do its job sooner.

-Chinang 

>henrik


* Re: CFS scheduler OLTP perforamnce
  2008-12-15 15:32         ` Ma, Chinang
@ 2008-12-15 16:57           ` Henrik Austad
  2008-12-15 20:49             ` Ma, Chinang
  0 siblings, 1 reply; 14+ messages in thread
From: Henrik Austad @ 2008-12-15 16:57 UTC (permalink / raw)
  To: Ma, Chinang
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Wilcox, Matthew R,
	Van De Ven, Arjan, Styner, Douglas W, Chilukuri, Harita, Wang,
	Peter Xihong, Nueckel, Hubert, Chris Mason, Tripathi, Sharad C

On Mon, Dec 15, 2008 at 08:32:41AM -0700, Ma, Chinang wrote:
> 
> 
> >-----Original Message-----
> >From: Henrik Austad [mailto:henrik@austad.us]
> >Sent: Sunday, December 14, 2008 6:43 AM
> >To: Ma, Chinang
> >Cc: Peter Zijlstra; Ingo Molnar; linux-kernel@vger.kernel.org; Wilcox,
> >Matthew R; Van De Ven, Arjan; Styner, Douglas W; Chilukuri, Harita; Wang,
> >Peter Xihong; Nueckel, Hubert; Chris Mason
> >Subject: Re: CFS scheduler OLTP perforamnce
> >
> >On Friday 12 December 2008 22:45:11 Ma, Chinang wrote:
> >
> >*snip*
> >
> >> >> > We are evaluating the CFS OLTP performance with 2.6.28-c7 kernel. In
> >> >> > this workload once a database foreground process commit a
> >transaction
> >> >> > it will signal the log writer process to write to the log file.
> >> >> > Foreground processes will wait until log writer finish writing and
> >> >> > wake them up. With hundreds of foreground process running in the
> >> >> > system, it is important that the log writer get to run as soon as
> >data
> >> >> > is available.
> >> >> >
> >> >> > Here are the experiments we have done with 2.6.28-rc7.
> >> >> > 1. Increase log writer priority "renice -20 <log writer pid>" while
> >> >> > keeping all other processes running in default CFS priority. We get
> >a
> >> >> > baseline performance with log latency (scheduling + i/o) at 7 ms.
> >> >>
> >> >> Is this better or the same than nice-0 ?
> >>
> >> I left out one detail of the database processes. There are also data
> >> writers that responsible for write back dirty buffers to free up buffer
> >for
> >> new transactions. These processes also need to be renice to higher
> >priority
> >> (-19). When data writers are left at nice-0, the workload was throttle by
> >> the limited number of free buffers and we cannot even fully utilize the
> >> system. I had to renice data writer and log writer process.
> >>
> >> >> > 2. To reduce log latency, we set log writer to SCHED_RR with higher
> >> >> > priority. We tried "chrt -p 49  <log writer pid>" and got 0.7% boost
> >> >> > in performance with log latency reduced to 6.4 ms.
> >
> >What is the time needed to actually write the data to disk?
> From the iostat, the log file write latency is 1.62ms. DBMD has stats for log file write of 2.17ms, that is including latency of io_submit, physical i/o and io_getevents.
> 
> >
> >> >
> >> >BTW, 6.4ms schedule latency sounds insanely long for a RR task, are you
> >> >running a PREEMPT=n kernel or something?
> >>
> >> The 6.4ms log write latency was measured in the foreground process. It
> >went
> >> like this:
> >> 1. Foreground progress get start time and post log writer,
> >> foreground wait and sleep.
> >> 2. log writer was scheduled and collect log
> >> data.
> >> 3. log writer write to log file and wait for i/o.
> >> 4. Write completed. log writer use vector post to wake up all the waiting
> >> foreground process.
> >> 5. Foreground process wake up and get end time.
> >
> >OK, let me see if I got this right:
> >- You have a foreground process that runs with normal priority (i.e. +19
> >to -20)
> >- This process appends data to a buffer, records the time and signals the
> >log-writer to flush the buffer to disk as soon as possible
> >- The log-writer is awoken, writes the buffer to disk, signals the
> >foreground
> >process that the job is done and exits.
> >- The foreground process records the time when it is awoken.
> >
> >Is this really a kernel-scheduler problem? Or is it an error in the way the
> >timestamps are recorded?
> >
> >Does not the time recorded then depend upon how the foreground process is
> >scheduled, and not the log-writer? What happens if you log the time at the
> >start and end of the log-writer function? Then you would get the time-delta
> >between signaling and rr-wakeup, as well as time spent writing buffer to
> >disk. By using the final timestamp in the foreground process, you'd get the
> >latency for the last foreground-process wakeup as well.
> >
> >Or am I completely missing the point here?
> 
> Since I don't have access to the source code, I cannot make the above change to the foreground or log writer. 

Aha, then it makes sense :-)

What about, as a test, setting the foreground process to rt-prio? (Not a good
thing for a production environment, but as a test it should give you an
indication of how long the wakeup time is.)
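
To make the test concrete, here is a minimal sketch of moving a process to
SCHED_RR from a script, which is roughly what "chrt -p 49 <pid>" does. The
helper name set_round_robin is made up for illustration; it needs Linux and
enough privilege (root or CAP_SYS_NICE), and it returns False instead of
failing when that privilege is missing.

```python
import os


def set_round_robin(pid: int, priority: int = 49) -> bool:
    """Try to switch `pid` to SCHED_RR at `priority` (pid 0 = calling process).

    Hypothetical helper for this test only. Returns True on success and
    False if the kernel refuses the change for lack of privilege.
    """
    lowest = os.sched_get_priority_min(os.SCHED_RR)
    highest = os.sched_get_priority_max(os.SCHED_RR)
    if not lowest <= priority <= highest:
        raise ValueError(f"priority must be within {lowest}..{highest}")
    try:
        os.sched_setscheduler(pid, os.SCHED_RR, os.sched_param(priority))
    except PermissionError:
        return False
    return True
```

Calling it once per foreground pid (at a priority below the log writer's)
would set up the experiment; os.sched_setscheduler with os.SCHED_OTHER and
priority 0 puts a process back under CFS afterwards.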

> 
> >
> >> >How would you characterize the log tasks behaviour?
> >> >
> >> > - does it run long/short (any quantization)
> >>
> >> There were 371 log writes per second so ~2.7ms per log writer execution.
> >> Out of this we know ~2.13 ms was spent waiting for log file i/o. Log
> >writer
> >> was running for (2.7ms - 2.13ms) = 0.57ms
> >>
> >> > - does it sleep long/short - how does it compare to its runtime?
> >>
> >> With the current throughput, log writer should be constantly writing log
> >> and rarely sleep.
> >>
> >> > - does it wake others?
> >> >   - if so, always the one who woke it, or multiple others?
> >>
> >> Log writer wake multiple foreground processes using vector post.
> >
> >So, you don't really know if the initial process that recorded the
> >timestamp
> >is the one who is awoken - so the time taken could be *very* long?
> >
> Yes, if we only look at the timestamp in just one foreground process. But the reported log latency is an average over all foreground stats. Since this average went down with the log writer set to SCHED_RR, I took that to mean the log writer gets to do its job sooner. 

Ah, I guess the time went down because the writer was scheduled much earlier,
and this led to decreased time, but as the foreground process needs to be 
woken up and scheduled *after* the writer has finished, you will still see 
some latencies here.

I'm suspecting you record a lot of time waiting for the foreground process 
to be scheduled again.
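
Without source access this cannot be done in the real DBMS, but the split I
have in mind can be sketched with two threads standing in for the foreground
process and the log writer. Everything here is an illustration: measure_once
is a made-up name, the events stand in for the post/vector-post mechanism,
and time.sleep stands in for the log file i/o.

```python
import threading
import time


def measure_once(write_seconds: float = 0.002) -> dict:
    """Split one commit's log latency into three parts (toy model)."""
    request = threading.Event()   # foreground posts the log writer
    done = threading.Event()      # log writer wakes the foreground
    stamps = {}

    def log_writer():
        request.wait()
        stamps["writer_start"] = time.monotonic()
        time.sleep(write_seconds)  # stand-in for the log file write
        stamps["writer_end"] = time.monotonic()
        done.set()

    writer = threading.Thread(target=log_writer)
    writer.start()

    stamps["post"] = time.monotonic()
    request.set()
    done.wait()
    stamps["foreground_awake"] = time.monotonic()
    writer.join()

    return {
        "writer_wakeup": stamps["writer_start"] - stamps["post"],
        "write": stamps["writer_end"] - stamps["writer_start"],
        "foreground_wakeup": stamps["foreground_awake"] - stamps["writer_end"],
    }
```

The foreground-side measurement you have today is the sum of all three
parts; only the first one is directly affected by the log writer's priority.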


henrik


* RE: CFS scheduler OLTP perforamnce
  2008-12-15 16:57           ` Henrik Austad
@ 2008-12-15 20:49             ` Ma, Chinang
  0 siblings, 0 replies; 14+ messages in thread
From: Ma, Chinang @ 2008-12-15 20:49 UTC (permalink / raw)
  To: Henrik Austad
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Wilcox, Matthew R,
	Van De Ven, Arjan, Styner, Douglas W, Chilukuri, Harita, Wang,
	Peter Xihong, Nueckel, Hubert, Chris Mason, Tripathi, Sharad C

>-----Original Message-----
>From: Henrik Austad [mailto:henrik@austad.us]
>Sent: Monday, December 15, 2008 8:57 AM
>To: Ma, Chinang
>Cc: Peter Zijlstra; Ingo Molnar; linux-kernel@vger.kernel.org; Wilcox,
>Matthew R; Van De Ven, Arjan; Styner, Douglas W; Chilukuri, Harita; Wang,
>Peter Xihong; Nueckel, Hubert; Chris Mason; Tripathi, Sharad C
>Subject: Re: CFS scheduler OLTP perforamnce
>

*snip*

>On Mon, Dec 15, 2008 at 08:32:41AM -0700, Ma, Chinang wrote:
>>
>> >Is this really a kernel-scheduler problem? Or is it an error in the way
>the
>> >timestamps are recorded?
>> >
>> >Does not the time recorded then depend upon how the foreground process
>is
>> >scheduled, and not the log-writer? What happens if you log the time at
>the
>> >start and end of the log-writer function? Then you would get the time-
>delta
>> >between signaling and rr-wakeup, as well as time spent writing buffer to
>> >disk. By using the final timestamp in the foreground process, you'd get
>the
>> >latency for the last foreground-process wakeup as well.
>> >
>> >Or am I completely missing the point here?
>>
>> Since I don't have access to the source code, I cannot make the above
>change to the foreground or log writer.
>
>Aha, then it makes sense :-)
>
>What about, as a test, setting the foreground process to rt-prio? (Not a good
>thing
>for a production environment, but as a test it should give you an
>indication
>of how long the wakeup time is.)
>

When setting both foreground and log writer to rt-prio, the log latency dropped to 4.8ms. Performance is about 1.5% higher than the CFS result.
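
As a rough sanity check on these numbers, the sketch below subtracts one log writer cycle from each measured latency. It assumes the 371 log writes per second quoted further down applies approximately to all three runs (throughput differed by less than 2% between them); the names are only for illustration.

```python
LOG_WRITES_PER_SECOND = 371   # from the CFS run quoted below
IO_WAIT_MS = 2.13             # log file i/o wait per write, same source

cycle_ms = 1000 / LOG_WRITES_PER_SECOND     # about 2.7 ms per log write
writer_cpu_ms = cycle_ms - IO_WAIT_MS       # about 0.57 ms of running time

# Log latency as seen by the foreground processes, in ms.
measured_ms = {
    "CFS, renice -20": 7.0,
    "log writer SCHED_RR": 6.4,
    "foreground + log writer rt-prio": 4.8,
}


def outside_cycle_ms(latency_ms: float) -> float:
    """Latency not explained by a single log writer cycle."""
    return latency_ms - cycle_ms


for name, latency in measured_ms.items():
    print(f"{name}: {outside_cycle_ms(latency):.1f} ms outside one cycle")
```

Whatever is left after the subtraction is some mix of waiting for an in-progress log write, log writer wakeup and foreground wakeup; these numbers alone cannot separate the three.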
  
On a side note, we had been using rt-prio on all DBMS processes, with the log writer at a higher priority, for the best OLTP performance. That worked pretty well until 2.6.25, when the new rt scheduler introduced push/pull task balancing to lower scheduling latency for rt tasks. That has a negative impact on this workload, probably due to the more elaborate load calculation/balancing across hundreds of foreground rt-prio processes. Also, no production environment would run a DBMS with rt-prio. That is why I am going back to explore CFS, to see whether I can drop rt-prio for good.

>>
>> >
>> >> >How would you characterize the log tasks behaviour?
>> >> >
>> >> > - does it run long/short (any quantization)
>> >>
>> >> There were 371 log writes per second so ~2.7ms per log writer
>execution.
>> >> Out of this we know ~2.13 ms was spent waiting for log file i/o. Log
>> >writer
>> >> was running for (2.7ms - 2.13ms) = 0.57ms
>> >>
>> >> > - does it sleep long/short - how does it compare to its runtime?
>> >>
>> >> With the current throughput, log writer should be constantly writing
>log
>> >> and rarely sleep.
>> >>
>> >> > - does it wake others?
>> >> >   - if so, always the one who woke it, or multiple others?
>> >>
>> >> Log writer wake multiple foreground processes using vector post.
>> >
>> >So, you don't really know if the initial process that recorded the
>> >timestamp
>> >is the one who is awoken - so the time taken could be *very* long?
>> >
>> Yes, if we only look at the timestamp in just one foreground process. But the
>reported log latency is an average over all foreground stats.
>Since this average went down with the log writer set to SCHED_RR, I took
>that to mean the log writer gets to do its job sooner.
>
>Ah, I guess the time went down because the writer was scheduled much
>earlier,
>and this led to decreased time, but as the foreground process needs to be
>woken up and scheduled *after* the writer has finished, you will still see
>some latencies here.
>
>I'm suspecting you record a lot of time waiting for the foreground process
>to be scheduled again.
>
>
>henrik


end of thread, other threads:[~2008-12-15 20:50 UTC | newest]

Thread overview: 14+ messages
2008-12-11 23:25 CFS scheduler OLTP perforamnce Ma, Chinang
2008-12-12 12:12 ` Peter Zijlstra
2008-12-12 13:38   ` Peter Zijlstra
2008-12-12 14:04     ` Gilles.Carry
2008-12-12 21:45     ` Ma, Chinang
2008-12-14 14:43       ` Henrik Austad
2008-12-15 15:32         ` Ma, Chinang
2008-12-15 16:57           ` Henrik Austad
2008-12-15 20:49             ` Ma, Chinang
2008-12-12 14:15   ` Andi Kleen
2008-12-12 14:22     ` Peter Zijlstra
2008-12-12 14:39       ` Andi Kleen
2008-12-12 17:25   ` Ma, Chinang
2008-12-12 12:37 ` Gilles.Carry
