linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] Gang scheduling in CFS
@ 2011-12-19  8:33 Nikunj A. Dadhania
  2011-12-19  8:34 ` [RFC PATCH 1/4] sched: Adding cpu.gang file to cpu cgroup Nikunj A. Dadhania
                   ` (5 more replies)
  0 siblings, 6 replies; 75+ messages in thread
From: Nikunj A. Dadhania @ 2011-12-19  8:33 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: nikunj, vatsa, bharata

    The following patches implement gang scheduling. These patches
    are *highly* experimental in nature and are not proposed for
    inclusion at this time.

    Gang scheduling is an approach where we make an effort to run
    related tasks (the gang) at the same time on a number of CPUs.

    Gang scheduling can be helpful in virtualization scenarios. It
    helps avoid the lock-holder-preemption[1] problem, and other
    benefits include improved lock-acquisition times. This feature
    will also help address some limitations of KVM on Power.

    On Power, we have an interesting hardware restriction on guests
    running across SMT threads: on any single core, we can only run
    one mm context at any given time.  That means that we can run 4
    threads from one guest, but we cannot mix and match threads from
    different guests or the host.  In KVM's case, QEMU also counts as
    another mm context, so any VM exits or hypercalls that trap into
    QEMU will stop all the other threads on the core except the one
    making the call.

    The gang scheduling problem can be broken into two parts:
    a) Placement of the tasks to be gang scheduled 
    b) Synchronized scheduling of the tasks across a set of cpus.
   
    This patch series takes care of point "b"; the placement part
    (pinning) is currently handled manually in user space.
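
    For illustration only (not part of these patches), that manual
    placement step can be done from user space by pinning each vcpu
    thread with sched_setaffinity(); vcpu_tid and target_cpu below
    are placeholder names:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <sys/types.h>

	/* Sketch: pin one vcpu thread (vcpu_tid) to one host cpu (target_cpu) */
	int pin_vcpu(pid_t vcpu_tid, int target_cpu)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(target_cpu, &mask);

		if (sched_setaffinity(vcpu_tid, sizeof(mask), &mask)) {
			perror("sched_setaffinity");
			return -1;
		}
		return 0;
	}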

Approach:

    Whenever a task that is supposed to be gang scheduled is picked,
    we do some post_schedule magic. On its first invocation, the
    post_schedule magic decides whether this cpu is the gang_leader
    or not.

    So what is this gang_leader?  

    We need one of the cpus to start the gang on behalf of the set of
    cpus, IOW the gang granularity. The gang_leader sends IPIs to its
    fellow cpus, as per the gang granularity. This granularity can
    also be decided depending upon the architecture.

    On receiving the IPI, each fellow cpu does the following: if its
    runqueue has a task belonging to the gang initiated by the
    gang_leader, favour that task to be picked next and set
    need_resched.

    The favouring of the task can be done in different ways. I have
    tried two options here (patch 3 and patch 4) and have results
    from both.
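
    To summarise, here is a condensed pseudo-C sketch of what patches
    2-4 do on this path (leader election, locking and error handling
    omitted; the real code is in the patches below):

	/* leader side: runs from post_schedule() after a task is picked */
	gang_sched(tg, rq):
		if (rq->gang_leader)
			/* kick the fellow cpus in this gang's span */
			smp_call_function_many(rq->gang_cpumask,
					       gang_sched_member, tg, 0);
		else
			/* first invocation: elect the gang leader for the span */

	/* member side: runs on every fellow cpu from the IPI */
	gang_sched_member(tg):
		if (tg->cfs_rq[this_cpu]->nr_running) {
			set_gang_buddy(tg->se[this_cpu]); /* favour the gang task */
			set_tsk_need_resched(current);    /* force a reschedule   */
		}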

    Interface to invoke a gang for a task group: 
    echo 1 > /cgroup/test/cpu.gang
    
    patch 1: Implements the interface for enabling/disabling gang
             scheduling using the cpu cgroup.
    
    patch 2: Infrastructure to invoke gang scheduling. A gang leader
             is elected once, depending on the gang scheduling
             granularity, IOW how many cpus to gang across
             (gang_cpus). From then on, the gang leader sends gang
             initiations to the gang_cpus.
    
    patch 3: Uses set_next_buddy to favour gang tasks to be picked up
    
    patch 4: Introduces set_gang_buddy to favour gang tasks
    	     unconditionally.
    
    I have rebased the patches on top of the latest scheduler changes
    (3.2-rc4-tip_93e44306).

    PLE - Test Setup:
    - x3850x5 machine - PLE enabled
    - 8 CPUs (HT disabled)
    - 264GB memory
    - VM details:
       - Guest kernel: 2.6.32 based enterprise kernel
       - 4096MB memory
       - 8 VCPUs
    - During gang runs, vcpus are pinned
    
    Results:
     * Below numbers are averages across 2 runs
     * GangVsBase - Gang vs Baseline kernel
     * GangVsPin  - Gang vs Baseline kernel + vcpus pinned
     * V1 - patch 1, 2 and 3
     * V2 - V1 + patch4
     * Results are % improvement/degradation

    +-------------+---------------------------+-------------------------+
    |             |            V1 (%)         |             V2 (%)      |
    + Benchmarks  +-------------+-------------+-------------------------+
    |             | GangVsBase  |   GangVsPin |  GangVsBase | GangVsPin |
    +-------------+-------------+-------------+-------------------------+
    | kbench  2vm |       -1    |        1    |        1    |        3  |
    | kbench  4vm |      -10    |      -14    |       11    |        7  |
    | kbench  8vm |      -10    |      -13    |        8    |        6  |
    +-------------+-------------+-------------+-------------------------+
    | ebizzy  2vm |        0    |        3    |        2    |        5  |
    | ebizzy  4vm |        1    |        0    |        4    |        3  |
    | ebizzy  8vm |        0    |        1    |       23    |       26  |
    +-------------+-------------+-------------+-------------------------+
    | specjbb 2vm |       -3    |       -3    |      -17    |      -18  |
    | specjbb 4vm |       -9    |      -10    |      -33    |      -34  |
    | specjbb 8vm |      -19    |       -2    |       28    |       55  |
    +-------------+-------------+-------------+-------------------------+
    | hbench  2vm |        3    |      -14    |       28    |       15  |
    | hbench  4vm |      -66    |      -55    |      -20    |      -12  |
    | hbench  8vm |     -239    |      -92    |     -189    |      -64  |
    +-------------+-------------+-------------+-------------------------+
    | dbench  2vm |       -3    |       -3    |       -3    |       -3  |
    | dbench  4vm |      -11    |        3    |      -13    |        0  |
    | dbench  8vm |       25    |       -1    |       12    |      -12  |
    +-------------+-------------+-------------+-------------------------+

    Here is some additional data for the best and worst cases in
    V2 (GangVsBase). I am not able to pin down one or two data points
    that consistently explain why gang scheduling improved or
    degraded a given workload.

    specjbb 8VM (improved 28%)
    +------------+--------------------+--------------------+----------+
    |                              SPECJBB                            |
    +------------+--------------------+--------------------+----------+
    | Parameter  |       Baseline     |         gang:V2    | % imprv  |
    +------------+--------------------+--------------------+----------+
    |       Score|            4173.19 |            5343.69 |       28 |
    |     BwUsage|   5745105989024.00 |   6566955369442.00 |       14 |
    |    HostIdle|              63.00 |              79.00 |      -25 |
    |     kvmExit|        31611242.00 |        52477831.00 |      -66 |
    |     UsrTime|              13.00 |              20.00 |       53 |
    |     SysTime|              16.00 |              12.00 |       25 |
    |      IOWait|               7.00 |               4.00 |       42 |
    |    IdleTime|              63.00 |              61.00 |       -3 |
    |         TPS|               7.00 |               6.00 |      -14 |
    | CacheMisses|     14272997833.00 |     14800182895.00 |       -3 |
    |   CacheRefs|     58143057220.00 |     69914413043.00 |       20 |
    |Instructions|   4397381980479.00 |   4572303159016.00 |       -3 |
    |      Cycles|   5884437898653.00 |   6489379310428.00 |      -10 |
    |   ContextSW|        10008378.00 |        14705944.00 |      -46 |
    |   CPUMigrat|           10501.00 |           21705.00 |     -106 |
    +-----------------------------------------------------------------+

    hbench 8VM (degraded 189%)
    +------------+--------------------+--------------------+----------+
    |                            Hackbench                            |
    +------------+--------------------+--------------------+----------+
    | Parameter  |        Baseline    |         gang:V2    | % imprv  |
    +------------+--------------------+--------------------+----------+
    |   HbenchAvg|              28.27 |              81.75 |     -189 |
    |     BwUsage|   1278656649466.00 |   2352504202657.00 |       83 |
    |    HostIdle|              82.00 |              80.00 |        2 |
    |     kvmExit|         6859301.00 |        31853895.00 |     -364 |
    |     UsrTime|              11.00 |              17.00 |       54 |
    |     SysTime|              17.00 |              13.00 |       23 |
    |      IOWait|               7.00 |               5.00 |       28 |
    |    IdleTime|              63.00 |              62.00 |       -1 |
    |         TPS|               8.00 |               7.00 |      -12 |
    | CacheMisses|       194565014.00 |       140098020.00 |       27 |
    |   CacheRefs|      4793875790.00 |     15942118793.00 |      232 |
    |Instructions|    430356490646.00 |   1006560006432.00 |     -133 |
    |      Cycles|    559463222878.00 |   1578421826236.00 |     -182 |
    |   ContextSW|         2587635.00 |         8110060.00 |     -213 |
    |   CPUMigrat|             967.00 |            3844.00 |     -297 |
    +-----------------------------------------------------------------+

    non-PLE - Test Setup:
    - x3650 M2 machine
    - 8 CPUs (HT disabled)
    - 64GB memory
    - VM details:
       - Guest kernel: 2.6.32 based enterprise kernel
       - 1024MB memory
       - 8 VCPUs
    - During gang runs, vcpus are pinned
    
    Results:
     * GangVsBase - Gang vs Baseline kernel
     * GangVsPin  - Gang vs Baseline kernel + vcpus pinned
     * V1 - patch 1, 2 and 3
     * V2 - V1 + patch4
     * Results are % improvement/degradation
    +-------------+---------------------------+-------------------------+
    |             |            V1             |             V2          |
    +-------------+-------------+-------------+-------------------------+
    |             | GangVsBase  |   GangVsPin |  GangVsBase | GangVsPin |
    +-------------+-------------+-------------+-------------------------+
    | kbench  2vm |       -3    |      -42    |       22    |     -6    |
    | kbench  4vm |        4    |      -11    |      -11    |    -29    |
    | kbench  8vm |       -4    |      -11    |       12    |      6    |
    +-------------+-------------+-------------+-------------------------+
    | ebizzy  2vm |     1333    |      772    |     1520    |    885    |
    | ebizzy  4vm |      525    |      423    |      930    |    761    |
    | ebizzy  8vm |      373    |      281    |      771    |    602    |
    +-------------+-------------+-------------+-------------------------+
    | specjbb 2vm |       -2    |       -1    |        0    |      0    |
    | specjbb 4vm |       -4    |       -7    |        2    |      0    |
    | specjbb 8vm |      -14    |      -17    |       -8    |    -11    |
    +-------------+-------------+-------------+-------------------------+
    | hbench  2vm |       12    |        0    |      -32    |    -49    |
    | hbench  4vm |     -234    |      -95    |       12    |     48    |
    | hbench  8vm |     -364    |      -69    |       -7    |     60    |
    +-------------+-------------+-------------+-------------------------+
    | dbench  2vm |      -13    |        3    |      -17    |     -1    |
    | dbench  4vm |       38    |       45    |       -2    |      1    |
    | dbench  8vm |      -36    |      -10    |       44    |    102    |
    +-------------+-------------+-------------+-------------------------+

    Similar data for the best and worst cases in V2 (GangVsBase).

    ebizzy 2vm (improved 15 times, i.e. 1520%)
    +------------+--------------------+--------------------+----------+
    |                               Ebizzy                            |
    +------------+--------------------+--------------------+----------+
    | Parameter  |        Baseline    |         gang:V2    | % imprv  |
    +------------+--------------------+--------------------+----------+
    | EbzyRecords|            1709.50 |           27701.00 |     1520 |
    |    EbzyUser|              20.48 |             376.64 |     1739 |
    |     EbzySys|            1384.65 |            1071.40 |       22 |
    |    EbzyReal|             300.00 |             300.00 |        0 |
    |     BwUsage|   2456114173416.00 |   2483447784640.00 |        1 |
    |    HostIdle|              34.00 |              35.00 |       -2 |
    |     UsrTime|               6.00 |              14.00 |      133 |
    |     SysTime|              30.00 |              24.00 |       20 |
    |      IOWait|              10.00 |               9.00 |       10 |
    |    IdleTime|              51.00 |              51.00 |        0 |
    |         TPS|              25.00 |              24.00 |       -4 |
    | CacheMisses|       766543805.00 |      8113721819.00 |     -958 |
    |   CacheRefs|      9420204706.00 |    136290854100.00 |     1346 |
    |BranchMisses|      1191336154.00 |     11336436452.00 |     -851 |
    |    Branches|    618882621656.00 |    459161727370.00 |      -25 |
    |Instructions|   2517045997661.00 |   2325227247092.00 |        7 |
    |      Cycles|   7642374654922.00 |   7657626973214.00 |        0 |
    |     PageFlt|           23779.00 |           22195.00 |        6 |
    |   ContextSW|         1517241.00 |         1786319.00 |      -17 |
    |   CPUMigrat|             537.00 |             241.00 |       55 |
    +-----------------------------------------------------------------+

    hbench 2vm (degraded 32%)
    +------------+--------------------+--------------------+----------+
    |                            Hackbench                            |
    +------------+--------------------+--------------------+----------+
    | Parameter  |        Baseline    |          gang:V2   | % imprv  |
    +------------+--------------------+--------------------+----------+
    |   HbenchAvg|               8.95 |              11.84 |      -32 |
    |     BwUsage|    140751454716.00 |    188528508986.00 |       33 |
    |    HostIdle|              46.00 |              41.00 |       10 |
    |     UsrTime|               6.00 |              13.00 |      116 |
    |     SysTime|              30.00 |              24.00 |       20 |
    |      IOWait|              10.00 |               9.00 |       10 |
    |    IdleTime|              52.00 |              52.00 |        0 |
    |         TPS|              24.00 |              23.00 |       -4 |
    | CacheMisses|       536001007.00 |       555837077.00 |       -3 |
    |   CacheRefs|      1388722056.00 |      1737837668.00 |       25 |
    |BranchMisses|       260102092.00 |       580784727.00 |     -123 |
    |    Branches|     25083812102.00 |     34960032641.00 |       39 |
    |Instructions|    136018192623.00 |    190522959512.00 |      -40 |
    |      Cycles|    232524382438.00 |    320669938332.00 |      -37 |
    |     PageFlt|            9562.00 |           10461.00 |       -9 |
    |   ContextSW|           78095.00 |          103097.00 |      -32 |
    |   CPUMigrat|             237.00 |             155.00 |       34 |
    +-----------------------------------------------------------------+

    For reference here are the benchmark parameters
    Kernbench: kernbench -f -M -H -o 16
    ebizzy: ebizzy -S 300 -t 16
    hbench: hackbench 8 (10000 loops)
    dbench: dbench 8 -t 120
    specjbb: 8 & 16 warehouses, 512MB heap, 120secs run

Thanks,
Nikunj

1. http://xen.org/files/xensummitboston08/LHP.pdf

---

Nikunj A. Dadhania (4):
      sched:Implement set_gang_buddy
      sched: Gang using set_next_buddy
      sched: Adding gang scheduling infrastrucure
      sched: Adding cpu.gang file to cpu cgroup


 kernel/sched/core.c  |   28 ++++++++++
 kernel/sched/fair.c  |  143 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    8 ++-
 3 files changed, 178 insertions(+), 1 deletions(-)


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [RFC PATCH 1/4] sched: Adding cpu.gang file to cpu cgroup
  2011-12-19  8:33 [RFC PATCH 0/4] Gang scheduling in CFS Nikunj A. Dadhania
@ 2011-12-19  8:34 ` Nikunj A. Dadhania
  2011-12-19  8:34 ` [RFC PATCH 2/4] sched: Adding gang scheduling infrastrucure Nikunj A. Dadhania
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 75+ messages in thread
From: Nikunj A. Dadhania @ 2011-12-19  8:34 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: nikunj, vatsa, bharata

Introduce a cpu.gang file in the cpu controller; it will be used for enabling
and disabling gang scheduling for the tasks belonging to this cgroup.  This
does not take the cpu controller hierarchy into account while scheduling.

Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---

 kernel/sched/core.c  |   19 +++++++++++++++++++
 kernel/sched/fair.c  |   20 ++++++++++++++++++++
 kernel/sched/sched.h |    2 ++
 3 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c5b21e..e96f861 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6862,6 +6862,7 @@ void __init sched_init(void)
 		init_rt_rq(&rq->rt, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
+		root_task_group.gang = 0;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
 		/*
 		 * How much cpu bandwidth does root_task_group get?
@@ -7585,6 +7586,19 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 	return (u64) scale_load_down(tg->shares);
 }
 
+static int cpu_gang_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+			      u64 shareval)
+{
+	return sched_group_set_gang(cgroup_tg(cgrp), shareval);
+}
+
+static u64 cpu_gang_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+
+	return (u64) tg->gang;
+}
+
 #ifdef CONFIG_CFS_BANDWIDTH
 static DEFINE_MUTEX(cfs_constraints_mutex);
 
@@ -7851,6 +7865,11 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_shares_read_u64,
 		.write_u64 = cpu_shares_write_u64,
 	},
+	{
+		.name = "gang",
+		.read_u64 = cpu_gang_read_u64,
+		.write_u64 = cpu_gang_write_u64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a4d2b7a..b95575f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5484,6 +5484,26 @@ done:
 	mutex_unlock(&shares_mutex);
 	return 0;
 }
+
+static DEFINE_MUTEX(gang_mutex);
+
+int sched_group_set_gang(struct task_group *tg, unsigned long gang)
+{
+	/*
+	 * root cgroup cannot be gang scheduled
+	 */
+	if (!tg->se[0])
+		return -EINVAL;
+
+	if (gang != 1 && gang != 0)
+		return -EINVAL;
+
+	mutex_lock(&gang_mutex);
+	tg->gang = gang;
+	mutex_unlock(&gang_mutex);
+	return 0;
+}
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 void free_fair_sched_group(struct task_group *tg) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d8d3613..f1a85e3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -114,6 +114,7 @@ struct task_group {
 	/* runqueue "owned" by this group on each cpu */
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
+	bool gang; /* should the tg be gang scheduled */
 
 	atomic_t load_weight;
 #endif
@@ -185,6 +186,7 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *parent);
 extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
+extern int sched_group_set_gang(struct task_group *tg, unsigned long gang);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [RFC PATCH 2/4] sched: Adding gang scheduling infrastrucure
  2011-12-19  8:33 [RFC PATCH 0/4] Gang scheduling in CFS Nikunj A. Dadhania
  2011-12-19  8:34 ` [RFC PATCH 1/4] sched: Adding cpu.gang file to cpu cgroup Nikunj A. Dadhania
@ 2011-12-19  8:34 ` Nikunj A. Dadhania
  2011-12-19 15:51   ` Peter Zijlstra
  2011-12-19  8:34 ` [RFC PATCH 3/4] sched: Gang using set_next_buddy Nikunj A. Dadhania
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 75+ messages in thread
From: Nikunj A. Dadhania @ 2011-12-19  8:34 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: nikunj, vatsa, bharata

The patch introduces the concept of gang_leader and gang_cpumask.  The first
time, when the gang_leader is not set, the gang leader is elected. The
election depends on the number of cpus that we have to gang, aka the gang
granularity. ATM, the gang granularity is set to 8 cpus; it could be made
configurable via a sysctl if required.

TODO: This still does not take care of cpu offlining and re-electing the
gang leader.

Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---

 kernel/sched/core.c  |    9 +++++
 kernel/sched/fair.c  |   91 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    4 ++
 3 files changed, 104 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e96f861..f3ae29c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1968,6 +1968,12 @@ static inline void post_schedule(struct rq *rq)
 
 		rq->post_schedule = 0;
 	}
+	if (rq->gang_schedule == 1) {
+		struct task_group *tg = task_group(rq->curr);
+
+		gang_sched(tg, rq);
+	}
+
 }
 
 #else
@@ -6903,6 +6909,9 @@ void __init sched_init(void)
 		rq->rd = NULL;
 		rq->cpu_power = SCHED_POWER_SCALE;
 		rq->post_schedule = 0;
+		rq->gang_schedule = 0;
+		rq->gang_leader = -1;
+		rq->gang_cpumask = NULL;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
 		rq->push_cpu = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b95575f..c03efd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3020,6 +3020,7 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
 	struct task_struct *p;
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
+	struct task_group *tg;
 
 	if (!cfs_rq->nr_running)
 		return NULL;
@@ -3030,6 +3031,13 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
+	tg = se->cfs_rq->tg;
+
+	if (tg->gang) {
+		if (!rq->gang_schedule && rq->gang_leader)
+			rq->gang_schedule = tg->gang;
+	}
+
 	p = task_of(se);
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
@@ -3533,6 +3541,15 @@ struct sg_lb_stats {
 };
 
 /**
+ * domain_first_cpu - Returns the first cpu in the cpumask of a sched_domain.
+ * @domain: The domain whose first cpu is to be returned.
+ */
+static inline unsigned int domain_first_cpu(struct sched_domain *sd)
+{
+	return cpumask_first(sched_domain_span(sd));
+}
+
+/**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
  * @idle: The Idle status of the CPU for whose sd load_icx is obtained.
@@ -5485,6 +5502,80 @@ done:
 	return 0;
 }
 
+static void gang_sched_member(void *info)
+{
+	struct task_group *tg = (struct task_group *) info;
+	struct cfs_rq *cfs_rq;
+	struct rq *rq;
+	int cpu;
+	unsigned long flags;
+
+	cpu  = smp_processor_id();
+	cfs_rq = tg->cfs_rq[cpu];
+	rq = cfs_rq->rq;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+
+	/* Check if the runqueue has runnable tasks */
+	if (cfs_rq->nr_running) {
+		/* Favour this task group and set need_resched flag,
+		 * added by following patches */
+	}
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+#define GANG_SCHED_GRANULARITY 8
+
+void gang_sched(struct task_group *tg, struct rq *rq)
+{
+	/* We do not gang sched here */
+	if (rq->gang_leader == 0 || !tg || tg->gang == 0)
+		return;
+
+	/* Yes thats the leader */
+	if (rq->gang_leader == 1) {
+
+		if (!in_interrupt() && !irqs_disabled()) {
+			smp_call_function_many(rq->gang_cpumask,
+					gang_sched_member, tg, 0);
+
+			rq->gang_schedule = 0;
+		}
+
+	} else {
+		/*
+		 * find the gang leader according to the span,
+		 * currently we have it as 8cpu, this can be made
+		 * dynamic
+		 */
+		struct sched_domain *sd;
+		unsigned int count;
+		int i;
+
+		for_each_domain(cpu_of(rq), sd) {
+			count = 0;
+			for_each_cpu(i, sched_domain_span(sd))
+				count++;
+
+			if (count >= GANG_SCHED_GRANULARITY)
+				break;
+		}
+
+		if (sd && cpu_of(rq) == domain_first_cpu(sd)) {
+			printk(KERN_INFO "Selected CPU %d as gang leader\n",
+				cpu_of(rq));
+			rq->gang_leader = 1;
+			rq->gang_cpumask = sched_domain_span(sd);
+		} else if (sd) {
+			/*
+			 * A fellow cpu, it will receive gang
+			 * initiations from the gang leader now
+			 */
+			rq->gang_leader = 0;
+		}
+	}
+}
+
 static DEFINE_MUTEX(gang_mutex);
 
 int sched_group_set_gang(struct task_group *tg, unsigned long gang)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f1a85e3..db8369f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -187,6 +187,7 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 extern int sched_group_set_gang(struct task_group *tg, unsigned long gang);
+extern void gang_sched(struct task_group *tg, struct rq *rq);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
@@ -419,6 +420,9 @@ struct rq {
 	unsigned char idle_balance;
 	/* For active balancing */
 	int post_schedule;
+	int gang_schedule;
+	int gang_leader;
+	struct cpumask *gang_cpumask;
 	int active_balance;
 	int push_cpu;
 	struct cpu_stop_work active_balance_work;


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [RFC PATCH 3/4] sched: Gang using set_next_buddy
  2011-12-19  8:33 [RFC PATCH 0/4] Gang scheduling in CFS Nikunj A. Dadhania
  2011-12-19  8:34 ` [RFC PATCH 1/4] sched: Adding cpu.gang file to cpu cgroup Nikunj A. Dadhania
  2011-12-19  8:34 ` [RFC PATCH 2/4] sched: Adding gang scheduling infrastrucure Nikunj A. Dadhania
@ 2011-12-19  8:34 ` Nikunj A. Dadhania
  2011-12-19  8:35 ` [RFC PATCH 4/4] sched:Implement set_gang_buddy Nikunj A. Dadhania
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 75+ messages in thread
From: Nikunj A. Dadhania @ 2011-12-19  8:34 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: nikunj, vatsa, bharata

The gang task group is favoured to be picked up using the set_next_buddy api,
in the hope that the scheduler gives it priority.

Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---

 kernel/sched/fair.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c03efd2..9a2f291 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5518,8 +5518,11 @@ static void gang_sched_member(void *info)
 
 	/* Check if the runqueue has runnable tasks */
 	if (cfs_rq->nr_running) {
-		/* Favour this task group and set need_resched flag,
-		 * added by following patches */
+		struct sched_entity *se = tg->se[cpu];
+
+		/* Make the parent favourable */
+		set_next_buddy(se);
+		set_tsk_need_resched(current);
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [RFC PATCH 4/4] sched:Implement set_gang_buddy
  2011-12-19  8:33 [RFC PATCH 0/4] Gang scheduling in CFS Nikunj A. Dadhania
                   ` (2 preceding siblings ...)
  2011-12-19  8:34 ` [RFC PATCH 3/4] sched: Gang using set_next_buddy Nikunj A. Dadhania
@ 2011-12-19  8:35 ` Nikunj A. Dadhania
  2011-12-19 15:51   ` Peter Zijlstra
  2011-12-19 11:23 ` [RFC PATCH 0/4] Gang scheduling in CFS Ingo Molnar
  2011-12-19 15:51 ` Peter Zijlstra
  5 siblings, 1 reply; 75+ messages in thread
From: Nikunj A. Dadhania @ 2011-12-19  8:35 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: nikunj, vatsa, bharata

set_next_buddy does not guarantee the pickup of the gang task because of the
preempt check. This sometimes hurts gang scheduling. Introduce a
set_gang_buddy api to pick up gang tasks unconditionally.

Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
---

 kernel/sched/fair.c  |   31 ++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |    2 +-
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a2f291..38f97b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1165,6 +1165,17 @@ static void __clear_buddies_skip(struct sched_entity *se)
 	}
 }
 
+static void __clear_buddies_gang(struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->gang == se)
+			cfs_rq->gang = NULL;
+		else
+			break;
+	}
+}
+
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (cfs_rq->last == se)
@@ -1175,6 +1186,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	if (cfs_rq->skip == se)
 		__clear_buddies_skip(se);
+
+	if (cfs_rq->gang == se)
+		__clear_buddies_gang(se);
 }
 
 static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -1331,6 +1345,12 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 		se = cfs_rq->next;
 
+	/*
+	 * Gang buddy, lets be unfair here
+	 */
+	if (cfs_rq->gang)
+		se = cfs_rq->gang;
+
 	clear_buddies(cfs_rq, se);
 
 	return se;
@@ -2929,6 +2949,15 @@ static void set_skip_buddy(struct sched_entity *se)
 		cfs_rq_of(se)->skip = se;
 }
 
+static void set_gang_buddy(struct sched_entity *se)
+{
+	if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
+		return;
+
+	for_each_sched_entity(se)
+		cfs_rq_of(se)->gang = se;
+}
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -5521,7 +5550,7 @@ static void gang_sched_member(void *info)
 		struct sched_entity *se = tg->se[cpu];
 
 		/* Make the parent favourable */
-		set_next_buddy(se);
+		set_gang_buddy(se);
 		set_tsk_need_resched(current);
 	}
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db8369f..a96731f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -226,7 +226,7 @@ struct cfs_rq {
 	 * 'curr' points to currently running entity on this cfs_rq.
 	 * It is set to NULL otherwise (i.e when none are currently running).
 	 */
-	struct sched_entity *curr, *next, *last, *skip;
+	struct sched_entity *curr, *next, *last, *skip, *gang;
 
 #ifdef	CONFIG_SCHED_DEBUG
 	unsigned int nr_spread_over;


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19  8:33 [RFC PATCH 0/4] Gang scheduling in CFS Nikunj A. Dadhania
                   ` (3 preceding siblings ...)
  2011-12-19  8:35 ` [RFC PATCH 4/4] sched:Implement set_gang_buddy Nikunj A. Dadhania
@ 2011-12-19 11:23 ` Ingo Molnar
  2011-12-19 11:44   ` Avi Kivity
                     ` (2 more replies)
  2011-12-19 15:51 ` Peter Zijlstra
  5 siblings, 3 replies; 75+ messages in thread
From: Ingo Molnar @ 2011-12-19 11:23 UTC (permalink / raw)
  To: Nikunj A. Dadhania; +Cc: peterz, linux-kernel, vatsa, bharata


* Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> wrote:

>     The following patches implements gang scheduling. These 
>     patches are *highly* experimental in nature and are not 
>     proposed for inclusion at this time.
> 
>     Gang scheduling is an approach where we make an effort to 
>     run related tasks (the gang) at the same time on a number 
>     of CPUs.

The thing is, the (non-)scalability consequences are awful, gang 
scheduling is a true scalability nightmare. Things like this in 
gang_sched():

+               for_each_domain(cpu_of(rq), sd) {
+      	                count = 0;
+                       for_each_cpu(i, sched_domain_span(sd))
+                               count++;

makes me shudder.

So could we please approach this from the benchmarked workload 
angle first? The highest improvement is in ebizzy:

>     ebizzy 2vm (improved 15 times, i.e. 1520%)
>     +------------+--------------------+--------------------+----------+
>     |                               Ebizzy                            |
>     +------------+--------------------+--------------------+----------+
>     | Parameter  |        Basline     |         gang:V2    | % imprv  |
>     +------------+--------------------+--------------------+----------+
>     | EbzyRecords|            1709.50 |           27701.00 |     1520 |
>     |    EbzyUser|              20.48 |             376.64 |     1739 |
>     |     EbzySys|            1384.65 |            1071.40 |       22 |
>     |    EbzyReal|             300.00 |             300.00 |        0 |
>     |     BwUsage|   2456114173416.00 |   2483447784640.00 |        1 |
>     |    HostIdle|              34.00 |              35.00 |       -2 |
>     |     UsrTime|               6.00 |              14.00 |      133 |
>     |     SysTime|              30.00 |              24.00 |       20 |
>     |      IOWait|              10.00 |               9.00 |       10 |
>     |    IdleTime|              51.00 |              51.00 |        0 |
>     |         TPS|              25.00 |              24.00 |       -4 |
>     | CacheMisses|       766543805.00 |      8113721819.00 |     -958 |
>     |   CacheRefs|      9420204706.00 |    136290854100.00 |     1346 |
>     |BranchMisses|      1191336154.00 |     11336436452.00 |     -851 |
>     |    Branches|    618882621656.00 |    459161727370.00 |      -25 |
>     |Instructions|   2517045997661.00 |   2325227247092.00 |        7 |
>     |      Cycles|   7642374654922.00 |   7657626973214.00 |        0 |
>     |     PageFlt|           23779.00 |           22195.00 |        6 |
>     |   ContextSW|         1517241.00 |         1786319.00 |      -17 |
>     |   CPUMigrat|             537.00 |             241.00 |       55 |
>     +-----------------------------------------------------------------+

What's behind this huge speedup? Does ebizzy use user-space 
spinlocks perhaps? Could we do something on the user-space side 
to get a similar speedup?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 11:23 ` [RFC PATCH 0/4] Gang scheduling in CFS Ingo Molnar
@ 2011-12-19 11:44   ` Avi Kivity
  2011-12-19 11:50     ` Nikunj A Dadhania
  2011-12-19 11:45   ` Nikunj A Dadhania
  2011-12-21 10:39   ` Nikunj A Dadhania
  2 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2011-12-19 11:44 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nikunj A. Dadhania, peterz, linux-kernel, vatsa, bharata

On 12/19/2011 01:23 PM, Ingo Molnar wrote:
> What's behind this huge speedup? Does ebizzy use user-space 
> spinlocks perhaps? Could we do something on the user-space side 
> to get a similar speedup?

kvm tries to detect spinlocks (by trapping repeated executions of PAUSE)
and yield to a related vcpu.  It's far from perfect however, and relies
on the spinlock code using PAUSE.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 11:23 ` [RFC PATCH 0/4] Gang scheduling in CFS Ingo Molnar
  2011-12-19 11:44   ` Avi Kivity
@ 2011-12-19 11:45   ` Nikunj A Dadhania
  2011-12-19 13:22     ` Nikunj A Dadhania
  2011-12-21 10:39   ` Nikunj A Dadhania
  2 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-19 11:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: peterz, linux-kernel, vatsa, bharata

On Mon, 19 Dec 2011 12:23:26 +0100, Ingo Molnar <mingo@elte.hu> wrote:
> 
> * Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> 
> >     The following patches implements gang scheduling. These 
> >     patches are *highly* experimental in nature and are not 
> >     proposed for inclusion at this time.
> > 
> >     Gang scheduling is an approach where we make an effort to 
> >     run related tasks (the gang) at the same time on a number 
> >     of CPUs.
> 
> The thing is, the (non-)scalability consequences are awful, gang 
> scheduling is a true scalability nightmare. Things like this in 
> gang_sched():
> 
> +               for_each_domain(cpu_of(rq), sd) {
> +      	                count = 0;
> +                       for_each_cpu(i, sched_domain_span(sd))
> +                               count++;
> 
> makes me shudder.
>
One point to note here is that this happens only once, for electing the
gang_leader; it could be done at boot time as well, and again when a cpu
is offlined/onlined.
 
> So could we please approach this from the benchmarked workload 
> angle first? The highest improvement is in ebizzy:
>
<snip> 
> >     ebizzy 2vm (improved 15 times, i.e. 1520%)
> >     +------------+--------------------+--------------------+----------+
> >     |                               Ebizzy                            |
> >     +------------+--------------------+--------------------+----------+
> >     | Parameter  |        Basline     |         gang:V2    | % imprv  |
> >     +------------+--------------------+--------------------+----------+
> >     | EbzyRecords|            1709.50 |           27701.00 |     1520 |
> >     |    EbzyUser|              20.48 |             376.64 | 1739 |
> 
It is getting more user time.

> >     |     EbzySys|            1384.65 |            1071.40 |       22 |
> >     |    EbzyReal|             300.00 |             300.00 |        0 |
> >     |     BwUsage|   2456114173416.00 |   2483447784640.00 |        1 |
> >     |    HostIdle|              34.00 |              35.00 |       -2 |
> >     |     UsrTime|               6.00 |              14.00 | 133 |
>
Even the guest numbers say so (gathered using iostat in the guest).

> 
> What's behind this huge speedup? Does ebizzy use user-space 
> spinlocks perhaps? Could we do something on the user-space side 
> to get a similar speedup?
> 
Some more oprofile data here for the above ebizzy-2VM run:

ebizzy: gang top callers(2 VMs)
             2147208                     total                    0
              357627       ____pagevec_lru_add                 1064
              297518   native_flush_tlb_others                 1328
              245478    get_page_from_freelist                  174
              219277 default_send_IPI_mask_logical                  978
              168287           __do_page_fault                  159
              156154             release_pages                  336
               73961          handle_pte_fault                   20
               68923         down_read_trylock                 2153
               60094    __alloc_pages_nodemask                   29
ebizzy: nogang top callers(2 VMs)
             2771869                     total                    0
             2653732   native_flush_tlb_others                11847
               16004    get_page_from_freelist                   11
               15977       ____pagevec_lru_add                   47
               13125 default_send_IPI_mask_logical                   58
               10739           __do_page_fault                   10
                9379             release_pages                   20
                5330          handle_pte_fault                    1
                4727         down_read_trylock                  147
                3770    __alloc_pages_nodemask                    1

Regards,
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 11:44   ` Avi Kivity
@ 2011-12-19 11:50     ` Nikunj A Dadhania
  2011-12-19 11:59       ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-19 11:50 UTC (permalink / raw)
  To: Avi Kivity, Ingo Molnar; +Cc: peterz, linux-kernel, vatsa, bharata

On Mon, 19 Dec 2011 13:44:02 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/19/2011 01:23 PM, Ingo Molnar wrote:
> > What's behind this huge speedup? Does ebizzy use user-space 
> > spinlocks perhaps? Could we do something on the user-space side 
> > to get a similar speedup?
> 
> kvm tries to detect spinlocks (by trapping repeated executions of PAUSE)
> and yield to a related vcpu.  It's far from perfect however, and relies
> on the spinlock code using PAUSE.
> 
Avi, is this soft-PLE kind of thing?

Regards
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 11:50     ` Nikunj A Dadhania
@ 2011-12-19 11:59       ` Avi Kivity
  2011-12-19 12:06         ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2011-12-19 11:59 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/19/2011 01:50 PM, Nikunj A Dadhania wrote:
> On Mon, 19 Dec 2011 13:44:02 +0200, Avi Kivity <avi@redhat.com> wrote:
> > On 12/19/2011 01:23 PM, Ingo Molnar wrote:
> > > What's behind this huge speedup? Does ebizzy use user-space 
> > > spinlocks perhaps? Could we do something on the user-space side 
> > > to get a similar speedup?
> > 
> > kvm tries to detect spinlocks (by trapping repeated executions of PAUSE)
> > and yield to a related vcpu.  It's far from perfect however, and relies
> > on the spinlock code using PAUSE.
> > 
> Avi, is this soft-PLE kind of thing?

No, hard PLE, see the calls to yield_to.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 11:59       ` Avi Kivity
@ 2011-12-19 12:06         ` Nikunj A Dadhania
  2011-12-19 12:50           ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-19 12:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Mon, 19 Dec 2011 13:59:37 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/19/2011 01:50 PM, Nikunj A Dadhania wrote:
> > On Mon, 19 Dec 2011 13:44:02 +0200, Avi Kivity <avi@redhat.com> wrote:
> > > On 12/19/2011 01:23 PM, Ingo Molnar wrote:
> > > > What's behind this huge speedup? Does ebizzy use user-space 
> > > > spinlocks perhaps? Could we do something on the user-space side 
> > > > to get a similar speedup?
> > > 
> > > kvm tries to detect spinlocks (by trapping repeated executions of PAUSE)
> > > and yield to a related vcpu.  It's far from perfect however, and relies
> > > on the spinlock code using PAUSE.
> > > 
> > Avi, is this soft-PLE kind of thing?
> 
> No, hard PLE, see the calls to yield_to.
> 
The above ebizzy result is from a non-PLE machine, will yield_to come
in to picture here?

I have two sets of results, one for a PLE machine and the other for a
non-PLE machine.

Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 12:06         ` Nikunj A Dadhania
@ 2011-12-19 12:50           ` Avi Kivity
  2011-12-19 13:09             ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2011-12-19 12:50 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/19/2011 02:06 PM, Nikunj A Dadhania wrote:
> On Mon, 19 Dec 2011 13:59:37 +0200, Avi Kivity <avi@redhat.com> wrote:
> > On 12/19/2011 01:50 PM, Nikunj A Dadhania wrote:
> > > On Mon, 19 Dec 2011 13:44:02 +0200, Avi Kivity <avi@redhat.com> wrote:
> > > > On 12/19/2011 01:23 PM, Ingo Molnar wrote:
> > > > > What's behind this huge speedup? Does ebizzy use user-space 
> > > > > spinlocks perhaps? Could we do something on the user-space side 
> > > > > to get a similar speedup?
> > > > 
> > > > kvm tries to detect spinlocks (by trapping repeated executions of PAUSE)
> > > > and yield to a related vcpu.  It's far from perfect however, and relies
> > > > on the spinlock code using PAUSE.
> > > > 
> > > Avi, is this soft-PLE kind of thing?
> > 
> > No, hard PLE, see the calls to yield_to.
> > 
> The above ebizzy result is from a non-PLE machine, will yield_to come
> in to picture here?

No.  I have a soft-PLE patchset but I don't think it's any good.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 12:50           ` Avi Kivity
@ 2011-12-19 13:09             ` Nikunj A Dadhania
  0 siblings, 0 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-19 13:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Mon, 19 Dec 2011 14:50:31 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/19/2011 02:06 PM, Nikunj A Dadhania wrote:
> > The above ebizzy result is from a non-PLE machine, will yield_to come
> > in to picture here?
> 
> No.  I have a soft-PLE patchset but I don't think it's any good.
> 
I had tried your soft-PLE patchset some time back; gang sched was doing
pretty well compared to soft-PLE.

Regards,
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 11:45   ` Nikunj A Dadhania
@ 2011-12-19 13:22     ` Nikunj A Dadhania
  2011-12-19 16:28       ` Ingo Molnar
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-19 13:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: peterz, linux-kernel, vatsa, bharata

On Mon, 19 Dec 2011 17:15:05 +0530, Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > 
> > What's behind this huge speedup? Does ebizzy use user-space 
> > spinlocks perhaps? Could we do something on the user-space side 
> > to get a similar speedup?
> > 
> Some more oprofile data here for the above ebizzy-2VM run:
> 
That is readprofile data, kernel booted with profile=2

Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19  8:33 [RFC PATCH 0/4] Gang scheduling in CFS Nikunj A. Dadhania
                   ` (4 preceding siblings ...)
  2011-12-19 11:23 ` [RFC PATCH 0/4] Gang scheduling in CFS Ingo Molnar
@ 2011-12-19 15:51 ` Peter Zijlstra
  2011-12-19 16:09   ` Alan Cox
                     ` (3 more replies)
  5 siblings, 4 replies; 75+ messages in thread
From: Peter Zijlstra @ 2011-12-19 15:51 UTC (permalink / raw)
  To: Nikunj A. Dadhania
  Cc: mingo, linux-kernel, vatsa, bharata, Benjamin Herrenschmidt, paulus

On Mon, 2011-12-19 at 14:03 +0530, Nikunj A. Dadhania wrote:
> The following patches implements gang scheduling. These patches
>     are *highly* experimental in nature and are not proposed for
>     inclusion at this time.

Nor will they ever be, I've always strongly opposed the whole concept
and I'm not about to change my mind. Gang scheduling is a scalability
nightmare. 

>     Gang scheduling can be helpful in virtualization scenario. It will
>     help in avoiding the lock-holder-preemption[1] problem and other
>     benefits include improved lock-acquisition times. This feature
>     will help address some limitations of KVM on Power

Use paravirt ticket locks or a pause-loop-filter like thing.

>     On Power, we have an interesting hardware restriction on guests
>     running across SMT theads: on any single core, we can only run one
>     mm context at any given time. 

OMFG are your hardware engineers insane?

Anyway, I had a look at your patches and I don't see how this could ever
work. You gang-schedule cgroup entities, but there's no guarantee the
load-balancer will have at least one task for each group on every cpu.



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 2/4] sched: Adding gang scheduling infrastrucure
  2011-12-19  8:34 ` [RFC PATCH 2/4] sched: Adding gang scheduling infrastrucure Nikunj A. Dadhania
@ 2011-12-19 15:51   ` Peter Zijlstra
  2011-12-19 16:51     ` Peter Zijlstra
  2011-12-20  1:39     ` Nikunj A Dadhania
  0 siblings, 2 replies; 75+ messages in thread
From: Peter Zijlstra @ 2011-12-19 15:51 UTC (permalink / raw)
  To: Nikunj A. Dadhania; +Cc: mingo, linux-kernel, vatsa, bharata

On Mon, 2011-12-19 at 14:04 +0530, Nikunj A. Dadhania wrote:

> +static void gang_sched_member(void *info)
> +{
> +	struct task_group *tg = (struct task_group *) info;
> +	struct cfs_rq *cfs_rq;
> +	struct rq *rq;
> +	int cpu;
> +	unsigned long flags;
> +
> +	cpu  = smp_processor_id();
> +	cfs_rq = tg->cfs_rq[cpu];
> +	rq = cfs_rq->rq;
> +
> +	raw_spin_lock_irqsave(&rq->lock, flags);
> +
> +	/* Check if the runqueue has runnable tasks */
> +	if (cfs_rq->nr_running) {
> +		/* Favour this task group and set need_resched flag,
> +		 * added by following patches */

That's just plain insanity, patch 3 is all of 4 lines, why split that
and have an incomplete patch here?

> +	}
> +	raw_spin_unlock_irqrestore(&rq->lock, flags);
> +}
> +
> +#define GANG_SCHED_GRANULARITY 8

Why have this magical number to begin with?

> +void gang_sched(struct task_group *tg, struct rq *rq)
> +{
> +	/* We do not gang sched here */
> +	if (rq->gang_leader == 0 || !tg || tg->gang == 0)
> +		return;
> +
> +	/* Yes thats the leader */
> +	if (rq->gang_leader == 1) {
> +
> +		if (!in_interrupt() && !irqs_disabled()) {

How can this ever happen, schedule() can't be called from interrupt
context and post_schedule() ensures interrupts are enabled.

> +			smp_call_function_many(rq->gang_cpumask,
> +					gang_sched_member, tg, 0);

See this is just not going to happen..

> +			rq->gang_schedule = 0;
> +		}
> +
> +	} else {
> +		/*
> +		 * find the gang leader according to the span,
> +		 * currently we have it as 8cpu, this can be made
> +		 * dynamic
> +		 */
> +		struct sched_domain *sd;
> +		unsigned int count;
> +		int i;
> +
> +		for_each_domain(cpu_of(rq), sd) {
> +			count = 0;
> +			for_each_cpu(i, sched_domain_span(sd))
> +				count++;

That's just incompetent; there's cpumask_weight(), also that's called
sd->span_weight.

> +			if (count >= GANG_SCHED_GRANULARITY)
> +				break;
> +		}
> +
> +		if (sd && cpu_of(rq) == domain_first_cpu(sd)) {
> +			printk(KERN_INFO "Selected CPU %d as gang leader\n",
> +				cpu_of(rq));
> +			rq->gang_leader = 1;
> +			rq->gang_cpumask = sched_domain_span(sd);
> +		} else if (sd) {
> +			/*
> +			 * A fellow cpu, it will receive gang
> +			 * initiations from the gang leader now
> +			 */
> +			rq->gang_leader = 0;
> +		}
> +	}
> +}

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 4/4] sched:Implement set_gang_buddy
  2011-12-19  8:35 ` [RFC PATCH 4/4] sched:Implement set_gang_buddy Nikunj A. Dadhania
@ 2011-12-19 15:51   ` Peter Zijlstra
  2011-12-20  1:43     ` Nikunj A Dadhania
  2011-12-26  2:30     ` Nikunj A Dadhania
  0 siblings, 2 replies; 75+ messages in thread
From: Peter Zijlstra @ 2011-12-19 15:51 UTC (permalink / raw)
  To: Nikunj A. Dadhania; +Cc: mingo, linux-kernel, vatsa, bharata

On Mon, 2011-12-19 at 14:05 +0530, Nikunj A. Dadhania wrote:
> +       /*
> +        * Gang buddy, lets be unfair here
> +        */ 

And why would you think that's an option?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 15:51 ` Peter Zijlstra
@ 2011-12-19 16:09   ` Alan Cox
  2011-12-19 22:10   ` Benjamin Herrenschmidt
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 75+ messages in thread
From: Alan Cox @ 2011-12-19 16:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikunj A. Dadhania, mingo, linux-kernel, vatsa, bharata,
	Benjamin Herrenschmidt, paulus

On Mon, 19 Dec 2011 16:51:41 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, 2011-12-19 at 14:03 +0530, Nikunj A. Dadhania wrote:
> > The following patches implements gang scheduling. These patches
> >     are *highly* experimental in nature and are not proposed for
> >     inclusion at this time.
> 
> Nor will they ever be, I've always strongly opposed the whole concept
> and I'm not about to change my mind. Gang scheduling is a scalability
> nightmare. 

For most situations: I think the question is whether you can write a
clean gang scheduling option which has no impact on "normal" users. 

Yes, gang scheduling is insane, but for some insane workloads it's the
right thing to do.

Alan

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 13:22     ` Nikunj A Dadhania
@ 2011-12-19 16:28       ` Ingo Molnar
  0 siblings, 0 replies; 75+ messages in thread
From: Ingo Molnar @ 2011-12-19 16:28 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: peterz, linux-kernel, vatsa, bharata


* Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:

> On Mon, 19 Dec 2011 17:15:05 +0530, Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > > 
> > > What's behind this huge speedup? Does ebizzy use user-space 
> > > spinlocks perhaps? Could we do something on the user-space side 
> > > to get a similar speedup?
> > > 
> > Some more oprofile data here for the above ebizzy-2VM run:
> > 
> That is readprofile data, kernel booted with profile=2

Btw., you could try this newfangled tools/perf/ thing to profile 
the kernel ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 2/4] sched: Adding gang scheduling infrastrucure
  2011-12-19 15:51   ` Peter Zijlstra
@ 2011-12-19 16:51     ` Peter Zijlstra
  2011-12-20  1:43       ` Nikunj A Dadhania
  2011-12-20  1:39     ` Nikunj A Dadhania
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2011-12-19 16:51 UTC (permalink / raw)
  To: Nikunj A. Dadhania; +Cc: mingo, linux-kernel, vatsa, bharata

On Mon, 2011-12-19 at 16:51 +0100, Peter Zijlstra wrote:
> > +                     smp_call_function_many(rq->gang_cpumask,
> > +                                     gang_sched_member, tg, 0);
> 
> See this is just not going to happen. 

Furthermore its racy against task_group destruction.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 15:51 ` Peter Zijlstra
  2011-12-19 16:09   ` Alan Cox
@ 2011-12-19 22:10   ` Benjamin Herrenschmidt
  2011-12-20  1:56   ` Nikunj A Dadhania
  2011-12-20  8:52   ` Jeremy Fitzhardinge
  3 siblings, 0 replies; 75+ messages in thread
From: Benjamin Herrenschmidt @ 2011-12-19 22:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikunj A. Dadhania, mingo, linux-kernel, vatsa, bharata, paulus

On Mon, 2011-12-19 at 16:51 +0100, Peter Zijlstra wrote:
> On Mon, 2011-12-19 at 14:03 +0530, Nikunj A. Dadhania wrote:
> > The following patches implements gang scheduling. These patches
> >     are *highly* experimental in nature and are not proposed for
> >     inclusion at this time.
> 
> Nor will they ever be, I've always strongly opposed the whole concept
> and I'm not about to change my mind. Gang scheduling is a scalability
> nightmare. 
> 
> >     Gang scheduling can be helpful in virtualization scenario. It will
> >     help in avoiding the lock-holder-preemption[1] problem and other
> >     benefits include improved lock-acquisition times. This feature
> >     will help address some limitations of KVM on Power
> 
> Use paravirt ticket locks or a pause-loop-filter like thing.
> 
> >     On Power, we have an interesting hardware restriction on guests
> >     running across SMT theads: on any single core, we can only run one
> >     mm context at any given time. 
> 
> OMFG are your hardware engineers insane?

No, we can run separate mm contexts, but we can only run one -partition-
at a time. Sadly, the host kernel is also a partition for the MMU, so that
means that all 4 threads must be running the same guest and enter/exit
the guest at the same time.

> Anyway, I had a look at your patches and I don't see how could ever
> work. You gang-schedule cgroup entities, but there's no guarantee the
> load-balancer will have at least one task for each group on every cpu.

Cheers,
Ben.




^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 2/4] sched: Adding gang scheduling infrastrucure
  2011-12-19 15:51   ` Peter Zijlstra
  2011-12-19 16:51     ` Peter Zijlstra
@ 2011-12-20  1:39     ` Nikunj A Dadhania
  1 sibling, 0 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-20  1:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, vatsa, bharata

On Mon, 19 Dec 2011 16:51:44 +0100, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-12-19 at 14:04 +0530, Nikunj A. Dadhania wrote:
> 
> > +	raw_spin_lock_irqsave(&rq->lock, flags);
> > +
> > +	/* Check if the runqueue has runnable tasks */
> > +	if (cfs_rq->nr_running) {
> > +		/* Favour this task group and set need_resched flag,
> > +		 * added by following patches */
> 
> That's just plain insanity, patch 3 is all of 4 lines, why split that
> and have an incomplete patch here?
> 
I will fold that into this patch.

> > +	}
> > +	raw_spin_unlock_irqrestore(&rq->lock, flags);
> > +}
> > +
> > +#define GANG_SCHED_GRANULARITY 8
> 
> Why have this magical number to begin with?
> 
We do not want to gang across the complete machine, say 128 cpus; breaking
it into 16 independent gangs of 8 cpus each lets this scale.

This can be a sysctl or an architecture-specific define.
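
For illustration, the kind of thing that could replace the magic number (a
sketch only, not from the posted patches; the sysctl variable name and the
arch override are invented here):

    /* Default gang width; an architecture could override this, e.g. to
     * match its SMT core size, and a sysctl could expose it at runtime. */
    #ifndef GANG_SCHED_GRANULARITY
    #define GANG_SCHED_GRANULARITY 8
    #endif

    unsigned int sysctl_sched_gang_cpus = GANG_SCHED_GRANULARITY;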

> > +void gang_sched(struct task_group *tg, struct rq *rq)
> > +{
> > +	/* We do not gang sched here */
> > +	if (rq->gang_leader == 0 || !tg || tg->gang == 0)
> > +		return;
> > +
> > +	/* Yes thats the leader */
> > +	if (rq->gang_leader == 1) {
> > +
> > +		if (!in_interrupt() && !irqs_disabled()) {
> 
> How can this ever happen, schedule() can't be called from interrupt
> context and post_schedule() ensures interrupts are enabled.
> 
Ah... I thought that schedule() could get called from interrupt
context. Some time back I had a crash without this check; let me remove it
and verify.

Also, smp_call_function_many() requires exactly that, hence those
conditions. From the function header:

 * You must not call this function with disabled interrupts or from a
 * hardware interrupt handler or from a bottom half handler. Preemption
 * must be disabled when calling this function.
 */
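
For what it's worth, if post_schedule() really does always run with
interrupts enabled and preemption disabled, as Peter says above, the check
could arguably become an assertion instead of a silent bail-out. A sketch
only, not part of the posted patches:

    /* Document smp_call_function_many()'s requirements instead of
     * silently skipping the gang kick when they are violated. */
    WARN_ON_ONCE(in_interrupt() || irqs_disabled());
    smp_call_function_many(rq->gang_cpumask, gang_sched_member, tg, 0);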

> > +			smp_call_function_many(rq->gang_cpumask,
> > +					gang_sched_member, tg, 0);
> 
> See this is just not going to happen..
> 
Why do you say that? I had trace functions in my debug code and I was
hitting gang_sched_member on the other cpus.

> > +
> > +		for_each_domain(cpu_of(rq), sd) {
> > +			count = 0;
> > +			for_each_cpu(i, sched_domain_span(sd))
> > +				count++;
> 
> That's just incompetent; there's cpumask_weight(), also that's called
> sd->span_weight.
> 
Let me go and check those out; I will use them. It will definitely reduce
the code here.
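
For reference, the simplification being suggested: sd->span_weight is the
precomputed weight of sched_domain_span(sd), so the open-coded counting
loop collapses to a single read (sketch, mirroring the quoted fragment):

    for_each_domain(cpu_of(rq), sd) {
            /* instead of counting with for_each_cpu() */
            count = sd->span_weight; /* == cpumask_weight(sched_domain_span(sd)) */
            ...
    }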

Regards
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 4/4] sched: Implement set_gang_buddy
  2011-12-19 15:51   ` Peter Zijlstra
@ 2011-12-20  1:43     ` Nikunj A Dadhania
  2011-12-26  2:30     ` Nikunj A Dadhania
  1 sibling, 0 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-20  1:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, vatsa, bharata

On Mon, 19 Dec 2011 16:51:48 +0100, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-12-19 at 14:05 +0530, Nikunj A. Dadhania wrote:
> > +       /*
> > +        * Gang buddy, lets be unfair here
> > +        */ 
> 
> And why would you think that's an option?
> 
Experimenting ;)

/me runs


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 2/4] sched: Adding gang scheduling infrastructure
  2011-12-19 16:51     ` Peter Zijlstra
@ 2011-12-20  1:43       ` Nikunj A Dadhania
  0 siblings, 0 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-20  1:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, vatsa, bharata

On Mon, 19 Dec 2011 17:51:39 +0100, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-12-19 at 16:51 +0100, Peter Zijlstra wrote:
> > > +                     smp_call_function_many(rq->gang_cpumask,
> > > +                                     gang_sched_member, tg, 0);
> > 
> > See this is just not going to happen. 
> 
> Furthermore its racy against task_group destruction.
> 
Yes, I definitely need to check that part.
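
For what it's worth, one way the destruction race could be closed (a rough
sketch, not something in the posted patches; it assumes pinning the group
via a reference on tg->css around the IPI):

    rcu_read_lock();
    if (css_tryget(&tg->css)) {
            smp_call_function_many(rq->gang_cpumask, gang_sched_member, tg, 0);
            /* Note: with wait == 0 the members can still dereference tg
             * after the reference is dropped, so in practice either
             * wait == 1 or an RCU-delayed release of the task_group
             * would be needed. */
            css_put(&tg->css);
    }
    rcu_read_unlock();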

Thanks
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 15:51 ` Peter Zijlstra
  2011-12-19 16:09   ` Alan Cox
  2011-12-19 22:10   ` Benjamin Herrenschmidt
@ 2011-12-20  1:56   ` Nikunj A Dadhania
  2011-12-20  8:52   ` Jeremy Fitzhardinge
  3 siblings, 0 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-20  1:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, vatsa, bharata, Benjamin Herrenschmidt, paulus

On Mon, 19 Dec 2011 16:51:41 +0100, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> Anyway, I had a look at your patches and I don't see how it could ever
> work. You gang-schedule cgroup entities, but there's no guarantee the
> load-balancer will have at least one task for each group on every cpu.
>
As stated earlier:

    The gang scheduling problem can be broken into two parts:
    a) Placement of the tasks to be gang scheduled 
    b) Synchronized scheduling of the tasks across a subset of cpu.

In the patches, I have implemented (b); placement is done by pinning
the vcpus of a VM in userspace. Yes, that's not the right way.

Effectively, there is no trouble for the load-balancer here.
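
For concreteness, the userspace placement side is nothing more than this
kind of affinity call per vcpu thread (a minimal sketch; pin_vcpu() is a
hypothetical helper, and the vcpu thread ids would come from e.g. the qemu
monitor's "info cpus"):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    /* Pin one vcpu thread to one physical cpu so that all vcpus of a VM
     * land on the cpus that form a single gang. */
    static int pin_vcpu(pid_t vcpu_tid, int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            return sched_setaffinity(vcpu_tid, sizeof(set), &set);
    }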

Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 15:51 ` Peter Zijlstra
                     ` (2 preceding siblings ...)
  2011-12-20  1:56   ` Nikunj A Dadhania
@ 2011-12-20  8:52   ` Jeremy Fitzhardinge
  3 siblings, 0 replies; 75+ messages in thread
From: Jeremy Fitzhardinge @ 2011-12-20  8:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikunj A. Dadhania, mingo, linux-kernel, vatsa, bharata,
	Benjamin Herrenschmidt, paulus

On 12/19/2011 07:51 AM, Peter Zijlstra wrote:
> Use paravirt ticket locks

I guess it's time I reposted that series.

    J

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-19 11:23 ` [RFC PATCH 0/4] Gang scheduling in CFS Ingo Molnar
  2011-12-19 11:44   ` Avi Kivity
  2011-12-19 11:45   ` Nikunj A Dadhania
@ 2011-12-21 10:39   ` Nikunj A Dadhania
  2011-12-21 10:43     ` Avi Kivity
  2 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-21 10:39 UTC (permalink / raw)
  To: Ingo Molnar, Avi Kivity; +Cc: peterz, linux-kernel, vatsa, bharata

On Mon, 19 Dec 2011 12:23:26 +0100, Ingo Molnar <mingo@elte.hu> wrote:
> 
> * Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> 
> So could we please approach this from the benchmarked workload 
> angle first? The highest improvement is in ebizzy:
> 
> >     ebizzy 2vm (improved 15 times, i.e. 1520%)
> 
> What's behind this huge speedup? Does ebizzy use user-space 
> spinlocks perhaps? Could we do something on the user-space side 
> to get a similar speedup?
> 
This is from the perf run on the host:

Baseline:

16.22%         qemu-kvm  [kvm_intel]    [k] free_kvm_area
 8.27%         qemu-kvm  [kvm]          [k] start_apic_timer
 7.53%         qemu-kvm  [kvm]          [k] kvm_put_guest_fpu

Gang:

24.44%         qemu-kvm  [kvm_intel]    [k] free_kvm_area
13.42%         qemu-kvm  [kvm]          [k] start_apic_timer 
 9.91%         qemu-kvm  [kvm]          [k] kvm_put_guest_fpu

Ingo, Avi, I am not getting anything obvious from this. Any ideas?

Regards
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-21 10:39   ` Nikunj A Dadhania
@ 2011-12-21 10:43     ` Avi Kivity
  2011-12-23  3:20       ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2011-12-21 10:43 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/21/2011 12:39 PM, Nikunj A Dadhania wrote:
> On Mon, 19 Dec 2011 12:23:26 +0100, Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > * Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > 
> > So could we please approach this from the benchmarked workload 
> > angle first? The highest improvement is in ebizzy:
> > 
> > >     ebizzy 2vm (improved 15 times, i.e. 1520%)
> > 
> > What's behind this huge speedup? Does ebizzy use user-space 
> > spinlocks perhaps? Could we do something on the user-space side 
> > to get a similar speedup?
> > 
> This is from the perf run on the host:
>
> Baseline:
>
> 16.22%         qemu-kvm  [kvm_intel]    [k] free_kvm_area
>  8.27%         qemu-kvm  [kvm]          [k] start_apic_timer
>  7.53%         qemu-kvm  [kvm]          [k] kvm_put_guest_fpu
>
> Gang:
>
> 24.44%         qemu-kvm  [kvm_intel]    [k] free_kvm_area
> 13.42%         qemu-kvm  [kvm]          [k] start_apic_timer 
>  9.91%         qemu-kvm  [kvm]          [k] kvm_put_guest_fpu
>
> Ingo, Avi, I am not getting anything obvious from this. Any ideas?
>

Looks like perf is confused, this sometimes happens if you rebuild the
kernel but only rmmod/insmod kvm.  Try a clean build + boot.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-21 10:43     ` Avi Kivity
@ 2011-12-23  3:20       ` Nikunj A Dadhania
  2011-12-23 10:36         ` Ingo Molnar
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-23  3:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Wed, 21 Dec 2011 12:43:43 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/21/2011 12:39 PM, Nikunj A Dadhania wrote:
> > On Mon, 19 Dec 2011 12:23:26 +0100, Ingo Molnar <mingo@elte.hu> wrote:
> > > 
> > > * Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > > 
> > > So could we please approach this from the benchmarked workload 
> > > angle first? The highest improvement is in ebizzy:
> > > 
> > > >     ebizzy 2vm (improved 15 times, i.e. 1520%)
> > > 
> > > What's behind this huge speedup? Does ebizzy use user-space 
> > > spinlocks perhaps? Could we do something on the user-space side 
> > > to get a similar speedup?
> > > 
> > This is from the perf run on the host:
> >
> > Baseline:
> >
> > 16.22%         qemu-kvm  [kvm_intel]    [k] free_kvm_area
> >  8.27%         qemu-kvm  [kvm]          [k] start_apic_timer
> >  7.53%         qemu-kvm  [kvm]          [k] kvm_put_guest_fpu
> >
> > Gang:
> >
> > 24.44%         qemu-kvm  [kvm_intel]    [k] free_kvm_area
> > 13.42%         qemu-kvm  [kvm]          [k] start_apic_timer 
> >  9.91%         qemu-kvm  [kvm]          [k] kvm_put_guest_fpu
> >
> > Ingo, Avi, I am not getting anything obvious from this. Any ideas?
> >

Here some interesting perf reports from inside the guest:

Baseline:
  29.79%   ebizzy  [kernel.kallsyms]   [k] native_flush_tlb_others
  18.70%   ebizzy  libc-2.12.so        [.] __GI_memcpy
   7.23%   ebizzy  [kernel.kallsyms]   [k] get_page_from_freelist
   5.38%   ebizzy  [kernel.kallsyms]   [k] __do_page_fault
   4.50%   ebizzy  [kernel.kallsyms]   [k] ____pagevec_lru_add
   3.58%   ebizzy  [kernel.kallsyms]   [k] default_send_IPI_mask_logical
   3.26%   ebizzy  [kernel.kallsyms]   [k] native_flush_tlb_single
   2.82%   ebizzy  [kernel.kallsyms]   [k] handle_pte_fault
   2.16%   ebizzy  [kernel.kallsyms]   [k] kunmap_atomic
   2.10%   ebizzy  [kernel.kallsyms]   [k] _spin_unlock_irqrestore
   1.90%   ebizzy  [kernel.kallsyms]   [k] down_read_trylock
   1.65%   ebizzy  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge.clone.4
   1.60%   ebizzy  [kernel.kallsyms]   [k] up_read
   1.24%   ebizzy  [kernel.kallsyms]   [k] __alloc_pages_nodemask

Gang:
  22.53%   ebizzy  libc-2.12.so       [.] __GI_memcpy
   9.73%   ebizzy  [kernel.kallsyms]  [k] ____pagevec_lru_add
   8.22%   ebizzy  [kernel.kallsyms]  [k] get_page_from_freelist
   7.80%   ebizzy  [kernel.kallsyms]  [k] default_send_IPI_mask_logical
   7.68%   ebizzy  [kernel.kallsyms]  [k] native_flush_tlb_others
   6.22%   ebizzy  [kernel.kallsyms]  [k] __do_page_fault
   5.54%   ebizzy  [kernel.kallsyms]  [k] native_flush_tlb_single
   4.44%   ebizzy  [kernel.kallsyms]  [k] _spin_unlock_irqrestore
   2.90%   ebizzy  [kernel.kallsyms]  [k] kunmap_atomic
   2.78%   ebizzy  [kernel.kallsyms]  [k] __mem_cgroup_commit_charge.clone.4
   2.76%   ebizzy  [kernel.kallsyms]  [k] handle_pte_fault
   2.16%   ebizzy  [kernel.kallsyms]  [k] __mem_cgroup_uncharge_common
   1.59%   ebizzy  [kernel.kallsyms]  [k] down_read_trylock
   1.43%   ebizzy  [kernel.kallsyms]  [k] up_read

I see the main difference between both the reports is:
native_flush_tlb_others.

Regards
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-23  3:20       ` Nikunj A Dadhania
@ 2011-12-23 10:36         ` Ingo Molnar
  2011-12-25 10:58           ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Ingo Molnar @ 2011-12-23 10:36 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Avi Kivity, peterz, linux-kernel, vatsa, bharata


* Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:

> Here some interesting perf reports from inside the guest:
> 
> Baseline:
>   29.79%   ebizzy  [kernel.kallsyms]   [k] native_flush_tlb_others
>   18.70%   ebizzy  libc-2.12.so        [.] __GI_memcpy
>    7.23%   ebizzy  [kernel.kallsyms]   [k] get_page_from_freelist
>    5.38%   ebizzy  [kernel.kallsyms]   [k] __do_page_fault
>    4.50%   ebizzy  [kernel.kallsyms]   [k] ____pagevec_lru_add
>    3.58%   ebizzy  [kernel.kallsyms]   [k] default_send_IPI_mask_logical
>    3.26%   ebizzy  [kernel.kallsyms]   [k] native_flush_tlb_single
>    2.82%   ebizzy  [kernel.kallsyms]   [k] handle_pte_fault
>    2.16%   ebizzy  [kernel.kallsyms]   [k] kunmap_atomic
>    2.10%   ebizzy  [kernel.kallsyms]   [k] _spin_unlock_irqrestore
>    1.90%   ebizzy  [kernel.kallsyms]   [k] down_read_trylock
>    1.65%   ebizzy  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge.clone.4
>    1.60%   ebizzy  [kernel.kallsyms]   [k] up_read
>    1.24%   ebizzy  [kernel.kallsyms]   [k] __alloc_pages_nodemask
> 
> Gang:
>   22.53%   ebizzy  libc-2.12.so       [.] __GI_memcpy
>    9.73%   ebizzy  [kernel.kallsyms]  [k] ____pagevec_lru_add
>    8.22%   ebizzy  [kernel.kallsyms]  [k] get_page_from_freelist
>    7.80%   ebizzy  [kernel.kallsyms]  [k] default_send_IPI_mask_logical
>    7.68%   ebizzy  [kernel.kallsyms]  [k] native_flush_tlb_others
>    6.22%   ebizzy  [kernel.kallsyms]  [k] __do_page_fault
>    5.54%   ebizzy  [kernel.kallsyms]  [k] native_flush_tlb_single
>    4.44%   ebizzy  [kernel.kallsyms]  [k] _spin_unlock_irqrestore
>    2.90%   ebizzy  [kernel.kallsyms]  [k] kunmap_atomic
>    2.78%   ebizzy  [kernel.kallsyms]  [k] __mem_cgroup_commit_charge.clone.4
>    2.76%   ebizzy  [kernel.kallsyms]  [k] handle_pte_fault
>    2.16%   ebizzy  [kernel.kallsyms]  [k] __mem_cgroup_uncharge_common
>    1.59%   ebizzy  [kernel.kallsyms]  [k] down_read_trylock
>    1.43%   ebizzy  [kernel.kallsyms]  [k] up_read
> 
> I see the main difference between both the reports is:
> native_flush_tlb_others.

So it would be important to figure out why ebizzy gets into so 
many TLB flushes and why gang scheduling makes it go away.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-23 10:36         ` Ingo Molnar
@ 2011-12-25 10:58           ` Avi Kivity
  2011-12-25 15:45             ` Avi Kivity
                               ` (2 more replies)
  0 siblings, 3 replies; 75+ messages in thread
From: Avi Kivity @ 2011-12-25 10:58 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nikunj A Dadhania, peterz, linux-kernel, vatsa, bharata

On 12/23/2011 12:36 PM, Ingo Molnar wrote:
> * Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
>
> > Here some interesting perf reports from inside the guest:
> > 
> > Baseline:
> >   29.79%   ebizzy  [kernel.kallsyms]   [k] native_flush_tlb_others
> >   18.70%   ebizzy  libc-2.12.so        [.] __GI_memcpy
> >    7.23%   ebizzy  [kernel.kallsyms]   [k] get_page_from_freelist
> >    5.38%   ebizzy  [kernel.kallsyms]   [k] __do_page_fault
> >    4.50%   ebizzy  [kernel.kallsyms]   [k] ____pagevec_lru_add
> >    3.58%   ebizzy  [kernel.kallsyms]   [k] default_send_IPI_mask_logical
> >    3.26%   ebizzy  [kernel.kallsyms]   [k] native_flush_tlb_single
> >    2.82%   ebizzy  [kernel.kallsyms]   [k] handle_pte_fault
> >    2.16%   ebizzy  [kernel.kallsyms]   [k] kunmap_atomic
> >    2.10%   ebizzy  [kernel.kallsyms]   [k] _spin_unlock_irqrestore
> >    1.90%   ebizzy  [kernel.kallsyms]   [k] down_read_trylock
> >    1.65%   ebizzy  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge.clone.4
> >    1.60%   ebizzy  [kernel.kallsyms]   [k] up_read
> >    1.24%   ebizzy  [kernel.kallsyms]   [k] __alloc_pages_nodemask
> > 
> > Gang:
> >   22.53%   ebizzy  libc-2.12.so       [.] __GI_memcpy
> >    9.73%   ebizzy  [kernel.kallsyms]  [k] ____pagevec_lru_add
> >    8.22%   ebizzy  [kernel.kallsyms]  [k] get_page_from_freelist
> >    7.80%   ebizzy  [kernel.kallsyms]  [k] default_send_IPI_mask_logical
> >    7.68%   ebizzy  [kernel.kallsyms]  [k] native_flush_tlb_others
> >    6.22%   ebizzy  [kernel.kallsyms]  [k] __do_page_fault
> >    5.54%   ebizzy  [kernel.kallsyms]  [k] native_flush_tlb_single
> >    4.44%   ebizzy  [kernel.kallsyms]  [k] _spin_unlock_irqrestore
> >    2.90%   ebizzy  [kernel.kallsyms]  [k] kunmap_atomic
> >    2.78%   ebizzy  [kernel.kallsyms]  [k] __mem_cgroup_commit_charge.clone.4
> >    2.76%   ebizzy  [kernel.kallsyms]  [k] handle_pte_fault
> >    2.16%   ebizzy  [kernel.kallsyms]  [k] __mem_cgroup_uncharge_common
> >    1.59%   ebizzy  [kernel.kallsyms]  [k] down_read_trylock
> >    1.43%   ebizzy  [kernel.kallsyms]  [k] up_read
> > 
> > I see the main difference between both the reports is:
> > native_flush_tlb_others.
>
> So it would be important to figure out why ebizzy gets into so 
> many TLB flushes and why gang scheduling makes it go away.

The second part is easy - a remote tlb flush involves IPIs to many other
vcpus (possible waking them up and scheduling them), then busy-waiting
until they acknowledge the flush.  Gang scheduling is really good here
since it shortens the busy wait, would be even better if we schedule
halted vcpus (see the yield_on_hlt module parameter, set to 0). 
Directed yield on PLE should provide intermediate results between doing
nothing and gang sched.
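
The busy wait in question is the one in flush_tlb_others_ipi() in
arch/x86/mm/tlb.c of that era (quoted roughly, from memory): the sender
does not return until every target cpu has cleared itself from the flush
mask, which can take arbitrarily long if a target vcpu is preempted on the
host.

    apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
                        INVALIDATE_TLB_VECTOR_START + sender);
    while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
            cpu_relax();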

The first part appears to be unrelated to ebizzy itself - it's the
kunmap_atomic() flushing ptes.  It could be eliminated by switching to a
non-highmem kernel, or by allocating more PTEs for kmap_atomic() and
batching the flush.

btw you can get an additional speedup by enabling x2apic, for
default_send_IPI_mask_logical().

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-25 10:58           ` Avi Kivity
@ 2011-12-25 15:45             ` Avi Kivity
  2011-12-26  3:14             ` Nikunj A Dadhania
  2011-12-30  9:51             ` Ingo Molnar
  2 siblings, 0 replies; 75+ messages in thread
From: Avi Kivity @ 2011-12-25 15:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nikunj A Dadhania, peterz, linux-kernel, vatsa, bharata

On 12/25/2011 12:58 PM, Avi Kivity wrote:
> On 12/23/2011 12:36 PM, Ingo Molnar wrote:
> > * Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> >
> > > Here some interesting perf reports from inside the guest:
> > > 
> > > Baseline:
> > >   29.79%   ebizzy  [kernel.kallsyms]   [k] native_flush_tlb_others
> > >   18.70%   ebizzy  libc-2.12.so        [.] __GI_memcpy
> > >    7.23%   ebizzy  [kernel.kallsyms]   [k] get_page_from_freelist
> > >    5.38%   ebizzy  [kernel.kallsyms]   [k] __do_page_fault
> > >    4.50%   ebizzy  [kernel.kallsyms]   [k] ____pagevec_lru_add
> > >    3.58%   ebizzy  [kernel.kallsyms]   [k] default_send_IPI_mask_logical
> > >    3.26%   ebizzy  [kernel.kallsyms]   [k] native_flush_tlb_single
> > >    2.82%   ebizzy  [kernel.kallsyms]   [k] handle_pte_fault
> > >    2.16%   ebizzy  [kernel.kallsyms]   [k] kunmap_atomic
> > >    2.10%   ebizzy  [kernel.kallsyms]   [k] _spin_unlock_irqrestore
> > >    1.90%   ebizzy  [kernel.kallsyms]   [k] down_read_trylock
> > >    1.65%   ebizzy  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge.clone.4
> > >    1.60%   ebizzy  [kernel.kallsyms]   [k] up_read
> > >    1.24%   ebizzy  [kernel.kallsyms]   [k] __alloc_pages_nodemask
> > > 
> > > Gang:
> > >   22.53%   ebizzy  libc-2.12.so       [.] __GI_memcpy
> > >    9.73%   ebizzy  [kernel.kallsyms]  [k] ____pagevec_lru_add
> > >    8.22%   ebizzy  [kernel.kallsyms]  [k] get_page_from_freelist
> > >    7.80%   ebizzy  [kernel.kallsyms]  [k] default_send_IPI_mask_logical
> > >    7.68%   ebizzy  [kernel.kallsyms]  [k] native_flush_tlb_others
> > >    6.22%   ebizzy  [kernel.kallsyms]  [k] __do_page_fault
> > >    5.54%   ebizzy  [kernel.kallsyms]  [k] native_flush_tlb_single
> > >    4.44%   ebizzy  [kernel.kallsyms]  [k] _spin_unlock_irqrestore
> > >    2.90%   ebizzy  [kernel.kallsyms]  [k] kunmap_atomic
> > >    2.78%   ebizzy  [kernel.kallsyms]  [k] __mem_cgroup_commit_charge.clone.4
> > >    2.76%   ebizzy  [kernel.kallsyms]  [k] handle_pte_fault
> > >    2.16%   ebizzy  [kernel.kallsyms]  [k] __mem_cgroup_uncharge_common
> > >    1.59%   ebizzy  [kernel.kallsyms]  [k] down_read_trylock
> > >    1.43%   ebizzy  [kernel.kallsyms]  [k] up_read
> > > 
> > > I see the main difference between both the reports is:
> > > native_flush_tlb_others.
> >
> > So it would be important to figure out why ebizzy gets into so 
> > many TLB flushes and why gang scheduling makes it go away.

<snip>

> The first part appears to be unrelated to ebizzy itself - it's the
> kunmap_atomic() flushing ptes.  It could be eliminated by switching to a
> non-highmem kernel, or by allocating more PTEs for kmap_atomic() and
> batching the flush.

Um, that makes no sense.  I was reading the profile as if it was a
backtrace.

Anyway, google says that ebizzy does a lot of large allocations - and
presumably deallocations - to simulate a database workload, which
explains the large number of tlb flushes.
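
Roughly what the ebizzy inner loop being described looks like (paraphrased,
not the exact source; chunk_size, chunks[] and num_chunks stand in for the
benchmark's own variables): each worker keeps allocating a large chunk,
copying a record into it and freeing it again, and with glibc those large
allocations are served by mmap(), so every free() becomes an munmap() and
therefore a TLB shootdown to the other threads' cpus.

    for (;;) {
            char *copy = malloc(chunk_size);        /* large -> mmap() */
            memcpy(copy, chunks[rand() % num_chunks], chunk_size);
            free(copy);                 /* munmap() -> flush_tlb_others() */
    }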

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 4/4] sched: Implement set_gang_buddy
  2011-12-19 15:51   ` Peter Zijlstra
  2011-12-20  1:43     ` Nikunj A Dadhania
@ 2011-12-26  2:30     ` Nikunj A Dadhania
  1 sibling, 0 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-26  2:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, vatsa, bharata

On Mon, 19 Dec 2011 16:51:48 +0100, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-12-19 at 14:05 +0530, Nikunj A. Dadhania wrote:
> > +       /*
> > +        * Gang buddy, lets be unfair here
> > +        */ 
> 
> And why would you think that's an option?
> 
Long answer: my previous experiments with set_next_buddy showed that the
gang groups were getting less cpu bandwidth than the baseline. So I added
a new helper (set_gang_buddy) that gives gang-scheduled tasks a better
chance of being picked. This only affects the follower cpus; on the cpu
that has gang_leader set, the code does not give the gang task any
undue advantage.
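
Roughly, the difference being described, in terms of pick_next_entity()
(a simplified sketch; cfs_rq->gang is a hypothetical field, and the real
patch is not quoted here):

    static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
    {
            struct sched_entity *se = __pick_first_entity(cfs_rq);

            /* gang buddy: picked unconditionally on the follower cpus */
            if (cfs_rq->gang)
                    return cfs_rq->gang;

            /* next buddy (set_next_buddy()): only a preference, dropped
             * if picking it would be too unfair to the leftmost task */
            if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1)
                    se = cfs_rq->next;

            return se;
    }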

Regards,
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-25 10:58           ` Avi Kivity
  2011-12-25 15:45             ` Avi Kivity
@ 2011-12-26  3:14             ` Nikunj A Dadhania
  2011-12-26  9:05               ` Avi Kivity
  2011-12-27  3:15               ` Nikunj A Dadhania
  2011-12-30  9:51             ` Ingo Molnar
  2 siblings, 2 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-26  3:14 UTC (permalink / raw)
  To: Avi Kivity, Ingo Molnar; +Cc: peterz, linux-kernel, vatsa, bharata

On Sun, 25 Dec 2011 12:58:15 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/23/2011 12:36 PM, Ingo Molnar wrote:
> > * Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> >
[...]
> > > 
> > > I see the main difference between both the reports is:
> > > native_flush_tlb_others.
> >
> > So it would be important to figure out why ebizzy gets into so 
> > many TLB flushes and why gang scheduling makes it go away.
> 
> The second part is easy - a remote tlb flush involves IPIs to many other
> vcpus (possible waking them up and scheduling them), then busy-waiting
> until they acknowledge the flush.  Gang scheduling is really good here
> since it shortens the busy wait, would be even better if we schedule
> halted vcpus (see the yield_on_hlt module parameter, set to 0). 
I will check this.

> Directed yield on PLE should provide intermediate results between doing
> nothing and gang sched.
>
Yes, that's true. I have pasted the results from my first mail to
highlight this:

    +-------------+-------------+-------------+-------------+-----------+
    |             |            V1 (%)         |             V2 (%)      |
    + Benchmarks  +-------------+-------------+-------------+-----------+
    |             | GangVsBase  |   GangVsPin |  GangVsBase | GangVsPin |
    +-------------+-------------+-------------+-------------+-----------+
    | ebizzy  2vm |        0    |        3    |        2    |        5  |
    | ebizzy  4vm |        1    |        0    |        4    |        3  |
    | ebizzy  8vm |        0    |        1    |       23    |       26  |
    +-------------+-------------+-------------+-------------+-----------+
 
> 
> btw you can get an additional speedup by enabling x2apic, for
> default_send_IPI_mask_logical().
> 
In the host?

Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-26  3:14             ` Nikunj A Dadhania
@ 2011-12-26  9:05               ` Avi Kivity
  2011-12-26 11:33                 ` Nikunj A Dadhania
  2011-12-27  3:15               ` Nikunj A Dadhania
  1 sibling, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2011-12-26  9:05 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/26/2011 05:14 AM, Nikunj A Dadhania wrote:
> > 
> > btw you can get an additional speedup by enabling x2apic, for
> > default_send_IPI_mask_logical().
> > 
> In the host?
>

In the host, for the guest:

 qemu -cpu ...,+x2apic

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-26  9:05               ` Avi Kivity
@ 2011-12-26 11:33                 ` Nikunj A Dadhania
  2011-12-26 11:41                   ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-26 11:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Mon, 26 Dec 2011 11:05:01 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/26/2011 05:14 AM, Nikunj A Dadhania wrote:
> > > 
> > > btw you can get an additional speedup by enabling x2apic, for
> > > default_send_IPI_mask_logical().
> > > 
> > In the host?
> >
> 
> In the host, 
>
The machine (IBM x3650 M2, Nehalem) does not seem to support x2apic.

I have enabled the following in my config;

[root@krm1 linux-tip]# grep X2APIC .config
CONFIG_X86_X2APIC=y
[root@krm1 linux-tip]#

And I booted the kernel with the "apic=verbose" parameter.

[root@krm1 linux-tip]# dmesg | grep -i x2apic
[root@krm1 linux-tip]# 

That does not show anything, so I assumed that x2apic is not
supported. Is there something to do in the BIOS to enable it?

> for the guest:
> 
>  qemu -cpu ...,+x2apic
> 
I need to try this; I have a libvirt setup. Let me dig into how to enable
this through the xml file.

Regards
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-26 11:33                 ` Nikunj A Dadhania
@ 2011-12-26 11:41                   ` Avi Kivity
  2011-12-27  1:47                     ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2011-12-26 11:41 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/26/2011 01:33 PM, Nikunj A Dadhania wrote:
> On Mon, 26 Dec 2011 11:05:01 +0200, Avi Kivity <avi@redhat.com> wrote:
> > On 12/26/2011 05:14 AM, Nikunj A Dadhania wrote:
> > > > 
> > > > btw you can get an additional speedup by enabling x2apic, for
> > > > default_send_IPI_mask_logical().
> > > > 
> > > In the host?
> > >
> > 
> > In the host, 
> >
> The machine(IBM x3650 M2, Nehalem) does not seem to support x2apic. 

It's emulated.

> I have enabled the following in my config;
>
> [root@krm1 linux-tip]# grep X2APIC .config
> CONFIG_X86_X2APIC=y
> [root@krm1 linux-tip]#
>
> And booted the kernel with "apic=verbose" command. 
>
> [root@krm1 linux-tip]# dmesg | grep -i x2apic
> [root@krm1 linux-tip]# 
>
> Does not give anything. I safely assumed that x2apic is not
> supported. Is there something to do in the bios to enable this?

Sorry, I was imprecise.  Boot the guest, it will recognize the emulated
x2apic.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-26 11:41                   ` Avi Kivity
@ 2011-12-27  1:47                     ` Nikunj A Dadhania
  2011-12-27  9:15                       ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-27  1:47 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Mon, 26 Dec 2011 13:41:04 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/26/2011 01:33 PM, Nikunj A Dadhania wrote:
> 
> > I have enabled the following in my config;
> >
> > [root@krm1 linux-tip]# grep X2APIC .config
> > CONFIG_X86_X2APIC=y
> > [root@krm1 linux-tip]#
> >
> > And booted the kernel with "apic=verbose" command. 
> >
> > [root@krm1 linux-tip]# dmesg | grep -i x2apic
> > [root@krm1 linux-tip]# 
> >
> > Does not give anything. I safely assumed that x2apic is not
> > supported. Is there something to do in the bios to enable this?
> 
> Sorry, I was imprecise.  Boot the guest, it will recognize the emulated
> x2apic.
>
I booted the guest with -cpu ...,+x2apic (using libvirt) and verified
that the qemu command line does contain x2apic.

There is a log saying this:

Using CPU model
"Nehalem,+rdtscp,+x2apic,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme"
 
Though in the guest dmesg I still do not see any x2apic logs. I must be
missing something obvious here.

Regards
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-26  3:14             ` Nikunj A Dadhania
  2011-12-26  9:05               ` Avi Kivity
@ 2011-12-27  3:15               ` Nikunj A Dadhania
  2011-12-27  9:17                 ` Avi Kivity
  1 sibling, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-27  3:15 UTC (permalink / raw)
  To: Avi Kivity, Ingo Molnar; +Cc: peterz, linux-kernel, vatsa, bharata

On Mon, 26 Dec 2011 08:44:58 +0530, Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> On Sun, 25 Dec 2011 12:58:15 +0200, Avi Kivity <avi@redhat.com> wrote:
> > On 12/23/2011 12:36 PM, Ingo Molnar wrote:
> > > * Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > >
> [...]
> > > > 
> > > > I see the main difference between both the reports is:
> > > > native_flush_tlb_others.
> > >
> > > So it would be important to figure out why ebizzy gets into so 
> > > many TLB flushes and why gang scheduling makes it go away.
> > 
> > The second part is easy - a remote tlb flush involves IPIs to many other
> > vcpus (possible waking them up and scheduling them), then busy-waiting
> > until they acknowledge the flush.  Gang scheduling is really good here
> > since it shortens the busy wait, would be even better if we schedule
> > halted vcpus (see the yield_on_hlt module parameter, set to 0). 
> I will check this.
> 
I am seeing a drop of ~44% when setting yield_on_hlt = 0

Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-27  1:47                     ` Nikunj A Dadhania
@ 2011-12-27  9:15                       ` Avi Kivity
  2011-12-27 10:24                         ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2011-12-27  9:15 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/27/2011 03:47 AM, Nikunj A Dadhania wrote:
> On Mon, 26 Dec 2011 13:41:04 +0200, Avi Kivity <avi@redhat.com> wrote:
> > On 12/26/2011 01:33 PM, Nikunj A Dadhania wrote:
> > 
> > > I have enabled the following in my config;
> > >
> > > [root@krm1 linux-tip]# grep X2APIC .config
> > > CONFIG_X86_X2APIC=y
> > > [root@krm1 linux-tip]#
> > >
> > > And booted the kernel with "apic=verbose" command. 
> > >
> > > [root@krm1 linux-tip]# dmesg | grep -i x2apic
> > > [root@krm1 linux-tip]# 
> > >
> > > Does not give anything. I safely assumed that x2apic is not
> > > supported. Is there something to do in the bios to enable this?
> > 
> > Sorry, I was imprecise.  Boot the guest, it will recognize the emulated
> > x2apic.
> >
> I booted the guest with -cpu ...,+x2apic (using libvirt), and verified
> that qemu command-line does contain x2apic.
>
> There is a log saying this:
>
> Using CPU model
> "Nehalem,+rdtscp,+x2apic,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme"
>  
> Though in the guest dmesg, I still do not see any x2apic logs. I am
> missing something obvious here. 
>

Check that X2APIC is enabled in the guest .config, as well as
CONFIG_KVM_GUEST.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-27  3:15               ` Nikunj A Dadhania
@ 2011-12-27  9:17                 ` Avi Kivity
  2011-12-27  9:44                   ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2011-12-27  9:17 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/27/2011 05:15 AM, Nikunj A Dadhania wrote:
> On Mon, 26 Dec 2011 08:44:58 +0530, Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > On Sun, 25 Dec 2011 12:58:15 +0200, Avi Kivity <avi@redhat.com> wrote:
> > > On 12/23/2011 12:36 PM, Ingo Molnar wrote:
> > > > * Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > > >
> > [...]
> > > > > 
> > > > > I see the main difference between both the reports is:
> > > > > native_flush_tlb_others.
> > > >
> > > > So it would be important to figure out why ebizzy gets into so 
> > > > many TLB flushes and why gang scheduling makes it go away.
> > > 
> > > The second part is easy - a remote tlb flush involves IPIs to many other
> > > vcpus (possible waking them up and scheduling them), then busy-waiting
> > > until they acknowledge the flush.  Gang scheduling is really good here
> > > since it shortens the busy wait, would be even better if we schedule
> > > halted vcpus (see the yield_on_hlt module parameter, set to 0). 
> > I will check this.
> > 
> I am seeing a drop of ~44% when setting yield_on_hlt = 0
>

A drop of 44% of what?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-27  9:17                 ` Avi Kivity
@ 2011-12-27  9:44                   ` Nikunj A Dadhania
  2011-12-27  9:51                     ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-27  9:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Tue, 27 Dec 2011 11:17:29 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/27/2011 05:15 AM, Nikunj A Dadhania wrote:
> > On Mon, 26 Dec 2011 08:44:58 +0530, Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > > On Sun, 25 Dec 2011 12:58:15 +0200, Avi Kivity <avi@redhat.com> wrote:
> > > > On 12/23/2011 12:36 PM, Ingo Molnar wrote:
> > > > > * Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > > > >
> > > [...]
> > > > > > 
> > > > > > I see the main difference between both the reports is:
> > > > > > native_flush_tlb_others.
> > > > >
> > > > > So it would be important to figure out why ebizzy gets into so 
> > > > > many TLB flushes and why gang scheduling makes it go away.
> > > > 
> > > > The second part is easy - a remote tlb flush involves IPIs to many other
> > > > vcpus (possible waking them up and scheduling them), then busy-waiting
> > > > until they acknowledge the flush.  Gang scheduling is really good here
> > > > since it shortens the busy wait, would be even better if we schedule
> > > > halted vcpus (see the yield_on_hlt module parameter, set to 0). 
> > > I will check this.
> > > 
> > I am seeing a drop of ~44% when setting yield_on_hlt = 0
> >
> 
> A drop of 44% of what?
> 
records/sec

Ebizzy - 2VM running in parallel, both having gang scheduling enabled.

                    yield_on_hlt=1         yield_on_hlt=0
 EbzyRecords/sec      41955.50               23285.00            -44 

Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-27  9:44                   ` Nikunj A Dadhania
@ 2011-12-27  9:51                     ` Avi Kivity
  2011-12-27 10:10                       ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2011-12-27  9:51 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/27/2011 11:44 AM, Nikunj A Dadhania wrote:
> > > I am seeing a drop of ~44% when setting yield_on_hlt = 0
> > >
> > 
> > A drop of 44% of what?
> > 
> records/sec
>
> Ebizzy - 2VM running in parallel, both having gang scheduling enabled.
>
>                     yield_on_hlt=1         yield_on_hlt=0
>  EbzyRecords/sec      41955.50               23285.00            -44 
>
>

Interesting, are you overcommitting vcpus?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-27  9:51                     ` Avi Kivity
@ 2011-12-27 10:10                       ` Nikunj A Dadhania
  2011-12-27 10:34                         ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-27 10:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Tue, 27 Dec 2011 11:51:46 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/27/2011 11:44 AM, Nikunj A Dadhania wrote:
> > > > I am seeing a drop of ~44% when setting yield_on_hlt = 0
> > > >
> > > 
> > > A drop of 44% of what?
> > > 
> > records/sec
> >
> > Ebizzy - 2VM running in parallel, both having gang scheduling enabled.
> >
> >                     yield_on_hlt=1         yield_on_hlt=0
> >  EbzyRecords/sec      41955.50               23285.00            -44 
> >
> >
> 
> Interesting, are you overcommitting vcpus?
> 
Yes (1:2). On an 8-cpu machine, I have 2 VMs running, with 8 vcpus each.

Regards
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-27  9:15                       ` Avi Kivity
@ 2011-12-27 10:24                         ` Nikunj A Dadhania
  0 siblings, 0 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-27 10:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Tue, 27 Dec 2011 11:15:32 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/27/2011 03:47 AM, Nikunj A Dadhania wrote:
[...]
> > Though in the guest dmesg, I still do not see any x2apic logs. I am
> > missing something obvious here. 
> >
> 
> Check that X2APIC is enabled in the guest .config, as well as
> CONFIG_KVM_GUEST.
> 
X2APIC is not enabled.
CONFIG_KVM_GUEST is enabled.
It's a 2.6.32-based distro kernel.

Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-27 10:10                       ` Nikunj A Dadhania
@ 2011-12-27 10:34                         ` Avi Kivity
  2011-12-27 10:43                           ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2011-12-27 10:34 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/27/2011 12:10 PM, Nikunj A Dadhania wrote:
> On Tue, 27 Dec 2011 11:51:46 +0200, Avi Kivity <avi@redhat.com> wrote:
> > On 12/27/2011 11:44 AM, Nikunj A Dadhania wrote:
> > > > > I am seeing a drop of ~44% when setting yield_on_hlt = 0
> > > > >
> > > > 
> > > > A drop of 44% of what?
> > > > 
> > > records/sec
> > >
> > > Ebizzy - 2VM running in parallel, both having gang scheduling enabled.
> > >
> > >                     yield_on_hlt=1         yield_on_hlt=0
> > >  EbzyRecords/sec      41955.50               23285.00            -44 
> > >
> > >
> > 
> > Interesting, are you overcommitting vcpus?
> > 
> Yes(1:2). On a 8cpu machine, have 2VMs running, which is 8vcpus each.

Ah, yield_on_hlt=0 isn't any good with overcommit.  The vcpu stays
scheduled when idle, so other vcpus don't get to use its cpu time.
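
For reference, what the knob does, roughly (paraphrased from the kvm-intel
code of that era): with yield_on_hlt=0 the HLT-exiting VMX control is
cleared, so a guest HLT no longer traps to the host, and the idle vcpu
keeps its physical cpu instead of letting the scheduler run another vcpu
there.

    if (!yield_on_hlt)
            exec_control &= ~CPU_BASED_HLT_EXITING; /* HLT stays in the guest */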

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-27 10:34                         ` Avi Kivity
@ 2011-12-27 10:43                           ` Nikunj A Dadhania
  2011-12-27 10:53                             ` Avi Kivity
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-27 10:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Tue, 27 Dec 2011 12:34:34 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/27/2011 12:10 PM, Nikunj A Dadhania wrote:
> > On Tue, 27 Dec 2011 11:51:46 +0200, Avi Kivity <avi@redhat.com> wrote:
> > > On 12/27/2011 11:44 AM, Nikunj A Dadhania wrote:
> > > > > > I am seeing a drop of ~44% when setting yield_on_hlt = 0
> > > > > >
> > > > > 
> > > > > A drop of 44% of what?
> > > > > 
> > > > records/sec
> > > >
> > > > Ebizzy - 2VM running in parallel, both having gang scheduling enabled.
> > > >
> > > >                     yield_on_hlt=1         yield_on_hlt=0
> > > >  EbzyRecords/sec      41955.50               23285.00            -44 
> > > >
> > > >
> > > 
> > > Interesting, are you overcommitting vcpus?
> > > 
> > Yes(1:2). On a 8cpu machine, have 2VMs running, which is 8vcpus each.
> 
> Ah, yield_on_hlt=0 isn't any good with overcommit.  The vcpu stays
> scheduled when idle, so other vcpus don't get to use its cpu time.
> 
This is similar to booting the guest with the idle=poll kernel parameter?


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-27 10:43                           ` Nikunj A Dadhania
@ 2011-12-27 10:53                             ` Avi Kivity
  0 siblings, 0 replies; 75+ messages in thread
From: Avi Kivity @ 2011-12-27 10:53 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/27/2011 12:43 PM, Nikunj A Dadhania wrote:
> > > > Interesting, are you overcommitting vcpus?
> > > > 
> > > Yes(1:2). On a 8cpu machine, have 2VMs running, which is 8vcpus each.
> > 
> > Ah, yield_on_hlt=0 isn't any good with overcommit.  The vcpu stays
> > scheduled when idle, so other vcpus don't get to use its cpu time.
> > 
> This is similar to booting the guest with idle=poll kernel parameter?

Yes, only with lower power consumption.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-25 10:58           ` Avi Kivity
  2011-12-25 15:45             ` Avi Kivity
  2011-12-26  3:14             ` Nikunj A Dadhania
@ 2011-12-30  9:51             ` Ingo Molnar
  2011-12-30 10:10               ` Nikunj A Dadhania
  2 siblings, 1 reply; 75+ messages in thread
From: Ingo Molnar @ 2011-12-30  9:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nikunj A Dadhania, peterz, linux-kernel, vatsa, bharata


* Avi Kivity <avi@redhat.com> wrote:

> [...]
> 
> The first part appears to be unrelated to ebizzy itself - it's 
> the kunmap_atomic() flushing ptes.  It could be eliminated by 
> switching to a non-highmem kernel, or by allocating more PTEs 
> for kmap_atomic() and batching the flush.

Nikunj, please only run pure 64-bit/64-bit combinations - by the 
time any fix goes upstream and trickles down to distros 32-bit 
guests will be even less relevant than they are today.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-30  9:51             ` Ingo Molnar
@ 2011-12-30 10:10               ` Nikunj A Dadhania
  2011-12-31  2:21                 ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-30 10:10 UTC (permalink / raw)
  To: Ingo Molnar, Avi Kivity; +Cc: peterz, linux-kernel, vatsa, bharata

On Fri, 30 Dec 2011 10:51:47 +0100, Ingo Molnar <mingo@elte.hu> wrote:
> 
> * Avi Kivity <avi@redhat.com> wrote:
> 
> > [...]
> > 
> > The first part appears to be unrelated to ebizzy itself - it's 
> > the kunmap_atomic() flushing ptes.  It could be eliminated by 
> > switching to a non-highmem kernel, or by allocating more PTEs 
> > for kmap_atomic() and batching the flush.
> 
> Nikunj, please only run pure 64-bit/64-bit combinations - by the 
> time any fix goes upstream and trickles down to distros 32-bit 
> guests will be even less relevant than they are today.
> 
Sure, Ingo. I got a 64-bit guest working yesterday and I am in the process
of getting the benchmark numbers for it.

Regards,
Nikunj



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-30 10:10               ` Nikunj A Dadhania
@ 2011-12-31  2:21                 ` Nikunj A Dadhania
  2012-01-02  4:20                   ` Nikunj A Dadhania
  2012-01-02  9:37                   ` Avi Kivity
  0 siblings, 2 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2011-12-31  2:21 UTC (permalink / raw)
  To: Ingo Molnar, Avi Kivity; +Cc: peterz, linux-kernel, vatsa, bharata

On Fri, 30 Dec 2011 15:40:06 +0530, Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> On Fri, 30 Dec 2011 10:51:47 +0100, Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > * Avi Kivity <avi@redhat.com> wrote:
> > 
> > > [...]
> > > 
> > > The first part appears to be unrelated to ebizzy itself - it's 
> > > the kunmap_atomic() flushing ptes.  It could be eliminated by 
> > > switching to a non-highmem kernel, or by allocating more PTEs 
> > > for kmap_atomic() and batching the flush.
> > 
> > Nikunj, please only run pure 64-bit/64-bit combinations - by the 
> > time any fix goes upstream and trickles down to distros 32-bit 
> > guests will be even less relevant than they are today.
> > 
> Sure Ingo, got a 64bit guest working yesterday and I am in process of
> getting the benchmark numbers for the same.
> 
Here are the results collected from the 64-bit VM runs.

Avi, x2apic is enabled in both the guest and the host.

One more change in the test setup: I am now creating and destroying the
VMs for each benchmark run. Earlier, I used to create 2/4/8 VMs and run
the 5 benchmarks one by one (so the VMs were not fresh for some of the
benchmarks).

    PLE - Test Setup:
    =================
    - x3850x5 machine - PLE enabled
    - 8 CPUs (HT disabled)
    - 264GB memory
    - VM details:
       - Guest kernel: 2.6.32 based enterprise kernel
       - 1024MB memory
       - 8 VCPUs
    - During gang runs, vcpus are pinned

    Results:
     * GangVsBase - Gang vs Baseline kernel
     * GangVsPin  - Gang vs Baseline kernel + vcpus pinned
     * V1 - Using set_next_buddy
     * V2 - Using set_gang_buddy
     * Results are % improvement/degradation
    +-------------+-----------------------+----------------------+
    |             |          V1           |           V2         |
    +  Benchmarks +-----------+-----------+-----------+----------+
    |             | GngVsBase | GngVsPin  | GngVsBase | GngVsPin |
    +-------------+-----------+-----------+-----------+----------+
    |  kbench-2vm |       -4  |       -5  |       -1  |       -1 |
    |  kbench-4vm |      -13  |       -3  |        3  |       12 |
    |  kbench-8vm |      -11  |        0  |       -5  |        5 |
    +-------------+-----------+-----------+-----------+----------+
    |  ebizzy-2vm |       -1  |       -2  |       17  |       16 |
    |  ebizzy-4vm |        4  |        6  |       58  |       61 |
    |  ebizzy-8vm |        3  |       25  |       68  |      103 |
    +-------------+-----------+-----------+-----------+----------+
    | specjbb-2vm |       -7  |        0  |       -6  |        1 |
    | specjbb-4vm |       19  |       30  |       -5  |        3 |
    | specjbb-8vm |       -6  |        1  |        5  |       15 |
    +-------------+-----------+-----------+-----------+----------+
    |  hbench-2vm |       -1  |       -6  |       18  |       14 |
    |  hbench-4vm |      -64  |       -9  |       -2  |       31 |
    |  hbench-8vm |      -28  |       10  |       32  |       53 |
    +-------------+-----------+-----------+-----------+----------+
    |  dbench-2vm |       -3  |       -5  |       -2  |       -3 |
    |  dbench-4vm |        9  |        0  |        3  |       -5 |
    |  dbench-8vm |       -3  |      -23  |       -8  |      -26 |
    +-------------+-----------+-----------+-----------+----------+

    The best and worst cases in V2 (GangVsBase):

    ebizzy 8vm (improved 68%)
    +------------+--------------------+--------------------+----------+
    |                               Ebizzy                            |
    +------------+--------------------+--------------------+----------+
    | Parameter  | GangBase           |   Gang V2          | % imprv  |
    +------------+--------------------+--------------------+----------+
    |      ebizzy|            2531.75 |            4268.12 |       68 |
    |    EbzyUser|              32.60 |              60.70 |       86 |
    |     EbzySys|             165.48 |             171.05 |       -3 |
    |    EbzyReal|              60.00 |              60.00 |        0 |
    |     BwUsage|    568645533105.00 |    767186043286.00 |       34 |
    |    HostIdle|              89.00 |              89.00 |        0 |
    |     UsrTime|               2.00 |               4.00 |      100 |
    |     SysTime|              12.00 |              13.00 |       -8 |
    |      IOWait|               3.00 |               4.00 |      -33 |
    |    IdleTime|              81.00 |              77.00 |       -4 |
    |         TPS|              12.00 |              12.00 |        0 |
    +-----------------------------------------------------------------+

    GangV2:
    27.45%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
    12.12%       ebizzy  [kernel.kallsyms]       [k] clear_page
     9.22%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault
     6.91%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
     4.06%       ebizzy  [kernel.kallsyms]       [k] get_page_from_freelist
     4.04%       ebizzy  [kernel.kallsyms]       [k] ____pagevec_lru_add

    GangBase:
    45.08%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
    15.38%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
     7.00%       ebizzy  [kernel.kallsyms]       [k] clear_page
     4.88%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault

    dbench 8vm (degraded -8%)
    +------------+--------------------+--------------------+----------+
    |                               Dbench                            |
    +------------+--------------------+--------------------+----------+
    | Parameter  | GangBase           |   Gang V2          | % imprv  |
    +------------+--------------------+--------------------+----------+
    |      dbench|               2.27 |               2.09 |       -8 |
    |     BwUsage|    138973336762.00 |    187382519973.00 |       34 |
    |    HostIdle|              95.00 |              93.00 |        2 |
    |      IOWait|              20.00 |              19.00 |        5 |
    |    IdleTime|              78.00 |              78.00 |        0 |
    |         TPS|              13.00 |              14.00 |        7 |
    | CacheMisses|        81611667.00 |        72959014.00 |       10 |
    |   CacheRefs|      4990591975.00 |      4624251595.00 |       -7 |
    |BranchMisses|       812569051.00 |      1162137278.00 |      -43 |
    |    Branches|     20196543212.00 |     30318934960.00 |       50 |
    |Instructions|     99519592926.00 |    152169154440.00 |      -52 |
    |      Cycles|    265699995531.00 |    330718402913.00 |      -24 |
    |     PageFlt|           36083.00 |           35897.00 |        0 |
    |   ContextSW|         3170710.00 |         8304284.00 |     -161 |
    |   CPUMigrat|           63387.00 |          155521.00 |     -145 |
    +-----------------------------------------------------------------+
    dbench needs some more love; I will get the perf top callers for
    that.

    non-PLE - Test Setup:
    =====================
    - x3650 M2 machine
    - 8 CPUs (HT disabled)
    - 64GB memory
    - VM details:
       - Guest kernel: 2.6.32 based enterprise kernel
       - 1024MB memory
       - 8 VCPUs
    - During gang runs, vcpus are pinned

    Results:
     * GangVsBase - Gang vs Baseline kernel
     * GangVsPin  - Gang vs Baseline kernel + vcpus pinned
     * V1 - using set_next_buddy
     * V2 - using set_gang_buddy
     * Results are % improvement/degradation
    +-------------+-----------------------+----------------------+
    |             |          V1           |           V2         |
    +  Benchmarks +-----------+-----------+-----------+----------+
    |             | GngVsBase | GngVsPin  | GngVsBase | GngVsPin |
    +-------------+-----------+-----------+-----------+----------+
    |  kbench-2vm |        0  |        2  |       -7  |       -5 |
    |  kbench-4vm |        2  |       -3  |        7  |        2 |
    |  kbench-8vm |        0  |       -1  |       -1  |       -3 |
    +-------------+-----------+-----------+-----------+----------+
    |  ebizzy-2vm |      221  |      109  |      241  |      122 |
    |  ebizzy-4vm |      215  |      173  |      366  |      304 |
    |  ebizzy-8vm |      225  |       88  |      331  |      149 |
    +-------------+-----------+-----------+-----------+----------+
    | specjbb-2vm |       -5  |       -3  |       -7  |       -5 |
    | specjbb-4vm |       29  |       -4  |        3  |      -23 |
    | specjbb-8vm |        6  |       -6  |       16  |        2 |
    +-------------+-----------+-----------+-----------+----------+
    |  hbench-2vm |      -16  |        2  |       15  |       29 |
    |  hbench-4vm |      -25  |        2  |       32  |       47 |
    |  hbench-8vm |      -46  |      -19  |       35  |       47 |
    +-------------+-----------+-----------+-----------+----------+
    |  dbench-2vm |        0  |        1  |       -5  |       -3 |
    |  dbench-4vm |       -9  |       -4  |       -2  |        2 |
    |  dbench-8vm |      -52  |       17  |      -30  |       69 |
    +-------------+-----------+-----------+-----------+----------+

    The best and worst cases in V2 (GangVsBase):

    ebizzy 8vm (improved 331%)
    +------------+--------------------+--------------------+----------+
    |                               Ebizzy                            |
    +------------+--------------------+--------------------+----------+
    | Parameter  | GangBase           |   Gang V2          | % imprv  |
    +------------+--------------------+--------------------+----------+
    |      ebizzy|             719.50 |            3101.38 |      331 |
    |    EbzyUser|               3.79 |              58.04 |     1432 |
    |     EbzySys|              66.61 |             140.04 |     -110 |
    |    EbzyReal|              60.00 |              60.00 |        0 |
    |     BwUsage|    526550032993.00 |    652012141757.00 |       23 |
    |    HostIdle|              59.00 |              62.00 |       -5 |
    |     SysTime|               5.00 |              11.00 |     -120 |
    |      IOWait|               4.00 |               4.00 |        0 |
    |    IdleTime|              89.00 |              79.00 |      -11 |
    |         TPS|              11.00 |              12.00 |        9 |
    +-----------------------------------------------------------------+

    GangV2:
    27.96%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
    12.13%       ebizzy  [kernel.kallsyms]       [k] clear_page
    11.66%       ebizzy  [kernel.kallsyms]       [k] __bitmap_empty
    11.54%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
     5.93%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault

    GangBase:
    36.34%       ebizzy  [kernel.kallsyms]  [k] __bitmap_empty
    35.95%       ebizzy  [kernel.kallsyms]  [k] flush_tlb_others_ipi
     8.52%       ebizzy  libc-2.12.so       [.] __memcpy_ssse3_back

    dbench 8vm (degraded -30%)
    +------------+--------------------+--------------------+----------+
    |                               Dbench                            |
    +------------+--------------------+--------------------+----------+
    | Parameter  | GangBase           |   Gang V2          | % imprv  |
    +------------+--------------------+--------------------+----------+
    |      dbench|               2.01 |               1.38 |      -30 |
    |     BwUsage|    100408068913.00 |    176095548113.00 |       75 |
    |    HostIdle|              82.00 |              74.00 |        9 |
    |      IOWait|              25.00 |              23.00 |        8 |
    |    IdleTime|              74.00 |              71.00 |       -4 |
    |         TPS|              13.00 |              13.00 |        0 |
    | CacheMisses|       137351386.00 |       267116184.00 |      -94 |
    |   CacheRefs|      4347880250.00 |      5830408064.00 |       34 |
    |BranchMisses|       602120546.00 |      1110592466.00 |      -84 |
    |    Branches|     22275747114.00 |     39163309805.00 |       75 |
    |Instructions|    107942079625.00 |    195313721170.00 |      -80 |
    |      Cycles|    271014283494.00 |    481886203993.00 |      -77 |
    |     PageFlt|           44373.00 |           47679.00 |       -7 |
    |   ContextSW|         3318033.00 |        11598234.00 |     -249 |
    |   CPUMigrat|           82475.00 |          423066.00 |     -412 |
    +-----------------------------------------------------------------+

Regards
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-31  2:21                 ` Nikunj A Dadhania
@ 2012-01-02  4:20                   ` Nikunj A Dadhania
  2012-01-02  9:39                     ` Avi Kivity
  2012-01-02  9:37                   ` Avi Kivity
  1 sibling, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2012-01-02  4:20 UTC (permalink / raw)
  To: Ingo Molnar, Avi Kivity; +Cc: peterz, linux-kernel, vatsa, bharata

On Sat, 31 Dec 2011 07:51:15 +0530, Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> On Fri, 30 Dec 2011 15:40:06 +0530, Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > On Fri, 30 Dec 2011 10:51:47 +0100, Ingo Molnar <mingo@elte.hu> wrote:
> > > 
> > > * Avi Kivity <avi@redhat.com> wrote:
> > > 
> > > > [...]
> > > > 
> > > > The first part appears to be unrelated to ebizzy itself - it's 
> > > > the kunmap_atomic() flushing ptes.  It could be eliminated by 
> > > > switching to a non-highmem kernel, or by allocating more PTEs 
> > > > for kmap_atomic() and batching the flush.
> > > 
> > > Nikunj, please only run pure 64-bit/64-bit combinations - by the 
> > > time any fix goes upstream and trickles down to distros 32-bit 
> > > guests will be even less relevant than they are today.
> > > 
> > Sure Ingo, got a 64bit guest working yesterday and I am in the process of
> > getting the benchmark numbers for the same.
> > 
> Here are the results collected from the 64bit VM runs. 
> 
[...]

PLE worst case:

>      
>     dbench 8vm (degraded -8%)
>     |      dbench|               2.27 |               2.09 |       -8 |
[...]
>     dbench needs some more love, i will get the perf top caller for
>     that.
>

    Baseline:
    75.18%         init  [kernel.kallsyms]  [k] native_safe_halt
    23.32%      swapper  [kernel.kallsyms]  [k] native_safe_halt

    Gang V2:
    73.21%         init  [kernel.kallsyms]       [k] native_safe_halt
    25.74%      swapper  [kernel.kallsyms]       [k] native_safe_halt

That does not give much of a clue :(
Comments?

>     non-PLE - Test Setup:
> 
>     dbench 8vm (degraded -30%)
>     |      dbench|               2.01 |               1.38 |      -30 |


    Baseline:
    57.75%         init  [kernel.kallsyms]  [k] native_safe_halt
    40.88%      swapper  [kernel.kallsyms]  [k] native_safe_halt

    Gang V2:
    56.25%         init  [kernel.kallsyms]  [k] native_safe_halt
    42.84%      swapper  [kernel.kallsyms]  [k] native_safe_halt

Similar comparison here.

Regards
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2011-12-31  2:21                 ` Nikunj A Dadhania
  2012-01-02  4:20                   ` Nikunj A Dadhania
@ 2012-01-02  9:37                   ` Avi Kivity
  2012-01-02 10:30                     ` Nikunj A Dadhania
  2012-01-04 10:52                     ` Nikunj A Dadhania
  1 sibling, 2 replies; 75+ messages in thread
From: Avi Kivity @ 2012-01-02  9:37 UTC (permalink / raw)
  To: Nikunj A Dadhania, Rik van Riel
  Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 12/31/2011 04:21 AM, Nikunj A Dadhania wrote:
> Here are the results collected from the 64bit VM runs. 

Thanks, the data is clearer now.

> Avi, x2apic is enabled in both the guest and the host. 
>
> One more change in the test setup: I am now creating and destroying the VMs
> for each benchmark run. Earlier, I used to create 2/4/8 VMs and run the 5
> benchmarks one by one (so the VM was not fresh for some benchmarks).
>
>     PLE - Test Setup:
>     =================
>     - x3850x5 machine - PLE enabled
>     - 8 CPUs (HT disabled)
>     - 264GB memory
>     - VM details:
>        - Guest kernel: 2.6.32 based enterprise kernel
>        - 1024MB memory
>        - 8 VCPUs
>     - During gang runs, vcpus are pinned
>
>     Results:
>      * GangVsBase - Gang vs Baseline kernel
>      * GangVsPin  - Gang vs Baseline kernel + vcpus pinned
>      * V1 - Using set_next_buddy
>      * V2 - Using set_gang_buddy
>      * Results are % improvement/degradation
>     +-------------+-----------------------+----------------------+
>     |             |          V1           |           V2         |
>     +  Benchmarks +-----------+-----------+-----------+----------+
>     |             | GngVsBase | GngVsPin  | GngVsBase | GngVsPin |
>     +-------------+-----------+-----------+-----------+----------+
>     |  kbench-2vm |       -4  |       -5  |       -1  |       -1 |
>     |  kbench-4vm |      -13  |       -3  |        3  |       12 |
>     |  kbench-8vm |      -11  |        0  |       -5  |        5 |
>     +-------------+-----------+-----------+-----------+----------+
>     |  ebizzy-2vm |       -1  |       -2  |       17  |       16 |
>     |  ebizzy-4vm |        4  |        6  |       58  |       61 |
>     |  ebizzy-8vm |        3  |       25  |       68  |      103 |
>     +-------------+-----------+-----------+-----------+----------+
>     | specjbb-2vm |       -7  |        0  |       -6  |        1 |
>     | specjbb-4vm |       19  |       30  |       -5  |        3 |
>     | specjbb-8vm |       -6  |        1  |        5  |       15 |
>     +-------------+-----------+-----------+-----------+----------+
>     |  hbench-2vm |       -1  |       -6  |       18  |       14 |
>     |  hbench-4vm |      -64  |       -9  |       -2  |       31 |
>     |  hbench-8vm |      -28  |       10  |       32  |       53 |
>     +-------------+-----------+-----------+-----------+----------+
>     |  dbench-2vm |       -3  |       -5  |       -2  |       -3 |
>     |  dbench-4vm |        9  |        0  |        3  |       -5 |
>     |  dbench-8vm |       -3  |      -23  |       -8  |      -26 |
>     +-------------+-----------+-----------+-----------+----------+
>
>     The best and worst case in V2(GangVsBase). 
>
>     ebizzy 8vm (improved 68%)
>     +------------+--------------------+--------------------+----------+
>     |                               Ebizzy                            |
>     +------------+--------------------+--------------------+----------+
>     | Parameter  | GangBase           |   Gang V2          | % imprv  |
>     +------------+--------------------+--------------------+----------+
>     |      ebizzy|            2531.75 |            4268.12 |       68 |
>     |    EbzyUser|              32.60 |              60.70 |       86 |
>     |     EbzySys|             165.48 |             171.05 |       -3 |
>     |    EbzyReal|              60.00 |              60.00 |        0 |
>     |     BwUsage|    568645533105.00 |    767186043286.00 |       34 |
>     |    HostIdle|              89.00 |              89.00 |        0 |
>     |     UsrTime|               2.00 |               4.00 |      100 |
>     |     SysTime|              12.00 |              13.00 |       -8 |
>     |      IOWait|               3.00 |               4.00 |      -33 |
>     |    IdleTime|              81.00 |              77.00 |       -4 |
>     |         TPS|              12.00 |              12.00 |        0 |
>     +-----------------------------------------------------------------+
>
>     GangV2:
>     27.45%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
>     12.12%       ebizzy  [kernel.kallsyms]       [k] clear_page
>      9.22%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault
>      6.91%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
>      4.06%       ebizzy  [kernel.kallsyms]       [k] get_page_from_freelist
>      4.04%       ebizzy  [kernel.kallsyms]       [k] ____pagevec_lru_add
>
>     GangBase:
>     45.08%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
>     15.38%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
>      7.00%       ebizzy  [kernel.kallsyms]       [k] clear_page
>      4.88%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault

Looping in flush_tlb_others().  Rik, what trace can we run to find out
why PLE directed yield isn't working as expected?

>
>     dbench 8vm (degraded -8%)
>     +------------+--------------------+--------------------+----------+
>     |                               Dbench                            |
>     +------------+--------------------+--------------------+----------+
>     | Parameter  | GangBase           |   Gang V2          | % imprv  |
>     +------------+--------------------+--------------------+----------+
>     |      dbench|               2.27 |               2.09 |       -8 |
>     |     BwUsage|    138973336762.00 |    187382519973.00 |       34 |
>     |    HostIdle|              95.00 |              93.00 |        2 |
>     |      IOWait|              20.00 |              19.00 |        5 |
>     |    IdleTime|              78.00 |              78.00 |        0 |
>     |         TPS|              13.00 |              14.00 |        7 |
>     | CacheMisses|        81611667.00 |        72959014.00 |       10 |
>     |   CacheRefs|      4990591975.00 |      4624251595.00 |       -7 |
>     |BranchMisses|       812569051.00 |      1162137278.00 |      -43 |
>     |    Branches|     20196543212.00 |     30318934960.00 |       50 |
>     |Instructions|     99519592926.00 |    152169154440.00 |      -52 |
>     |      Cycles|    265699995531.00 |    330718402913.00 |      -24 |
>     |     PageFlt|           36083.00 |           35897.00 |        0 |
>     |   ContextSW|         3170710.00 |         8304284.00 |     -161 |
>     |   CPUMigrat|           63387.00 |          155521.00 |     -145 |
>     +-----------------------------------------------------------------+
>     dbench needs some more love, i will get the perf top caller for
>     that.
>
>     non-PLE - Test Setup:
>     =====================
>     - x3650 M2 machine
>     - 8 CPUs (HT disabled)
>     - 64GB memory
>     - VM details:
>        - Guest kernel: 2.6.32 based enterprise kernel
>        - 1024MB memory
>        - 8 VCPUs
>     - During gang runs, vcpus are pinned
>
>     Results:
>      * GangVsBase - Gang vs Baseline kernel
>      * GangVsPin  - Gang vs Baseline kernel + vcpus pinned
>      * V1 - using set_next_buddy
>      * V2 - using set_gang_buddy
>      * Results are % improvement/degradation
>     +-------------+-----------------------+----------------------+
>     |             |          V1           |           V2         |
>     +  Benchmarks +-----------+-----------+-----------+----------+
>     |             | GngVsBase | GngVsPin  | GngVsBase | GngVsPin |
>     +-------------+-----------+-----------+-----------+----------+
>     |  kbench-2vm |        0  |        2  |       -7  |       -5 |
>     |  kbench-4vm |        2  |       -3  |        7  |        2 |
>     |  kbench-8vm |        0  |       -1  |       -1  |       -3 |
>     +-------------+-----------+-----------+-----------+----------+
>     |  ebizzy-2vm |      221  |      109  |      241  |      122 |
>     |  ebizzy-4vm |      215  |      173  |      366  |      304 |
>     |  ebizzy-8vm |      225  |       88  |      331  |      149 |
>     +-------------+-----------+-----------+-----------+----------+
>     | specjbb-2vm |       -5  |       -3  |       -7  |       -5 |
>     | specjbb-4vm |       29  |       -4  |        3  |      -23 |
>     | specjbb-8vm |        6  |       -6  |       16  |        2 |
>     +-------------+-----------+-----------+-----------+----------+
>     |  hbench-2vm |      -16  |        2  |       15  |       29 |
>     |  hbench-4vm |      -25  |        2  |       32  |       47 |
>     |  hbench-8vm |      -46  |      -19  |       35  |       47 |
>     +-------------+-----------+-----------+-----------+----------+
>     |  dbench-2vm |        0  |        1  |       -5  |       -3 |
>     |  dbench-4vm |       -9  |       -4  |       -2  |        2 |
>     |  dbench-8vm |      -52  |       17  |      -30  |       69 |
>     +-------------+-----------+-----------+-----------+----------+
>
>     The best and worst case in V2(GangVsBase). 
>
>     ebizzy 8vm (improved 331%)
>     +------------+--------------------+--------------------+----------+
>     |                               Ebizzy                            |
>     +------------+--------------------+--------------------+----------+
>     | Parameter  | GangBase           |   Gang V2          | % imprv  |
>     +------------+--------------------+--------------------+----------+
>     |      ebizzy|             719.50 |            3101.38 |      331 |
>     |    EbzyUser|               3.79 |              58.04 |     1432 |
>     |     EbzySys|              66.61 |             140.04 |     -110 |
>     |    EbzyReal|              60.00 |              60.00 |        0 |
>     |     BwUsage|    526550032993.00 |    652012141757.00 |       23 |
>     |    HostIdle|              59.00 |              62.00 |       -5 |
>     |     SysTime|               5.00 |              11.00 |     -120 |
>     |      IOWait|               4.00 |               4.00 |        0 |
>     |    IdleTime|              89.00 |              79.00 |      -11 |
>     |         TPS|              11.00 |              12.00 |        9 |
>     +-----------------------------------------------------------------+
>
>     GangV2:
>     27.96%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
>     12.13%       ebizzy  [kernel.kallsyms]       [k] clear_page
>     11.66%       ebizzy  [kernel.kallsyms]       [k] __bitmap_empty
>     11.54%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
>      5.93%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault
>
>     GangBase:
>     36.34%       ebizzy  [kernel.kallsyms]  [k] __bitmap_empty
>     35.95%       ebizzy  [kernel.kallsyms]  [k] flush_tlb_others_ipi
>      8.52%       ebizzy  libc-2.12.so       [.] __memcpy_ssse3_back

Same thing.  __bitmap_empty() is likely the cpumask_empty() called from
flush_tlb_others_ipi(), so 70% of the time is spent in this loop.

Xen works around this particular busy loop by having a hypercall for
flushing the tlb, but this is very fragile (and broken wrt
get_user_pages_fast() IIRC).

>
>     dbench 8vm (degraded -30%)
>     +------------+--------------------+--------------------+----------+
>     |                               Dbench                            |
>     +------------+--------------------+--------------------+----------+
>     | Parameter  | GangBase           |   Gang V2          | % imprv  |
>     +------------+--------------------+--------------------+----------+
>     |      dbench|               2.01 |               1.38 |      -30 |
>     |     BwUsage|    100408068913.00 |    176095548113.00 |       75 |
>     |    HostIdle|              82.00 |              74.00 |        9 |
>     |      IOWait|              25.00 |              23.00 |        8 |
>     |    IdleTime|              74.00 |              71.00 |       -4 |
>     |         TPS|              13.00 |              13.00 |        0 |
>     | CacheMisses|       137351386.00 |       267116184.00 |      -94 |
>     |   CacheRefs|      4347880250.00 |      5830408064.00 |       34 |
>     |BranchMisses|       602120546.00 |      1110592466.00 |      -84 |
>     |    Branches|     22275747114.00 |     39163309805.00 |       75 |
>     |Instructions|    107942079625.00 |    195313721170.00 |      -80 |
>     |      Cycles|    271014283494.00 |    481886203993.00 |      -77 |
>     |     PageFlt|           44373.00 |           47679.00 |       -7 |
>     |   ContextSW|         3318033.00 |        11598234.00 |     -249 |
>     |   CPUMigrat|           82475.00 |          423066.00 |     -412 |
>     +-----------------------------------------------------------------+
>

Rik, what's going on?  ContextSW is relatively low in the base load,
looks like PLE is asleep on the wheel.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-02  4:20                   ` Nikunj A Dadhania
@ 2012-01-02  9:39                     ` Avi Kivity
  2012-01-02 10:22                       ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2012-01-02  9:39 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 01/02/2012 06:20 AM, Nikunj A Dadhania wrote:
> On Sat, 31 Dec 2011 07:51:15 +0530, Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > On Fri, 30 Dec 2011 15:40:06 +0530, Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:
> > > On Fri, 30 Dec 2011 10:51:47 +0100, Ingo Molnar <mingo@elte.hu> wrote:
> > > > 
> > > > * Avi Kivity <avi@redhat.com> wrote:
> > > > 
> > > > > [...]
> > > > > 
> > > > > The first part appears to be unrelated to ebizzy itself - it's 
> > > > > the kunmap_atomic() flushing ptes.  It could be eliminated by 
> > > > > switching to a non-highmem kernel, or by allocating more PTEs 
> > > > > for kmap_atomic() and batching the flush.
> > > > 
> > > > Nikunj, please only run pure 64-bit/64-bit combinations - by the 
> > > > time any fix goes upstream and trickles down to distros 32-bit 
> > > > guests will be even less relevant than they are today.
> > > > 
> > > Sure Ingo, got a 64bit guest working yesterday and I am in the process of
> > > getting the benchmark numbers for the same.
> > > 
> > Here are the results collected from the 64bit VM runs. 
> > 
> [...]
>
> PLE worst case:
>
> >      
> >     dbench 8vm (degraded -8%)
> >     |      dbench|               2.27 |               2.09 |       -8 |
> [...]
> >     dbench needs some more love, i will get the perf top caller for
> >     that.
> >
>
>     Baseline:
>     75.18%         init  [kernel.kallsyms]  [k] native_safe_halt
>     23.32%      swapper  [kernel.kallsyms]  [k] native_safe_halt
>
>     Gang V2:
>     73.21%         init  [kernel.kallsyms]       [k] native_safe_halt
>     25.74%      swapper  [kernel.kallsyms]       [k] native_safe_halt
>
> That does not give much clue :(
> Comments?
>
> >     non-PLE - Test Setup:
> > 
> >     dbench 8vm (degraded -30%)
> >     |      dbench|               2.01 |               1.38 |      -30 |
>
>
>     Baseline:
>     57.75%         init  [kernel.kallsyms]  [k] native_safe_halt
>     40.88%      swapper  [kernel.kallsyms]  [k] native_safe_halt
>
>     Gang V2:
>     56.25%         init  [kernel.kallsyms]  [k] native_safe_halt
>     42.84%      swapper  [kernel.kallsyms]  [k] native_safe_halt
>
> Similar comparison here.
>

Weird, looks like a mismeasurement... what happens if you add a bash
busy loop?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-02  9:39                     ` Avi Kivity
@ 2012-01-02 10:22                       ` Nikunj A Dadhania
  0 siblings, 0 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2012-01-02 10:22 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Mon, 02 Jan 2012 11:39:00 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 01/02/2012 06:20 AM, Nikunj A Dadhania wrote:
[...]
> > >     non-PLE - Test Setup:
> > > 
> > >     dbench 8vm (degraded -30%)
> > >     |      dbench|               2.01 |               1.38 |      -30 |
> >
> >
> >     Baseline:
> >     57.75%         init  [kernel.kallsyms]  [k] native_safe_halt
> >     40.88%      swapper  [kernel.kallsyms]  [k] native_safe_halt
> >
> >     Gang V2:
> >     56.25%         init  [kernel.kallsyms]  [k] native_safe_halt
> >     42.84%      swapper  [kernel.kallsyms]  [k] native_safe_halt
> >
> > Similar comparison here.
> >
> 
> Weird, looks like a mismeasurement... 
>
Getting similar numbers across different runs/reboots with dbench.

> what happens if you add a bash
> busy loop?
> 
Perf output for bash busy loops inside the guest:

     9.93%           sh  libc-2.12.so       [.] _int_free
     8.37%           sh  libc-2.12.so       [.] _int_malloc
     6.14%           sh  libc-2.12.so       [.] __GI___libc_malloc
     6.03%           sh  bash               [.] 0x480e6         


loop.sh
----------------------------------
#!/bin/bash
# spawn 8 busy-loop shells in the background
for i in `seq 1 8`
do
	while :; do :; done &
	pid[$i]=$!;
done
# let them run for a minute, then kill them off
sleep 60

for i in `seq 1 8`
do
	kill -9 ${pid[$i]}
done
----------------------------------

Used the following command to capture the perf events inside the guest:

ssh root@192.168.123.11 'perf record -a -o loop-perf.out -- /root/loop.sh'

Regards,
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-02  9:37                   ` Avi Kivity
@ 2012-01-02 10:30                     ` Nikunj A Dadhania
  2012-01-02 13:33                       ` Avi Kivity
  2012-01-04 10:52                     ` Nikunj A Dadhania
  1 sibling, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2012-01-02 10:30 UTC (permalink / raw)
  To: Avi Kivity, Rik van Riel
  Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Mon, 02 Jan 2012 11:37:22 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/31/2011 04:21 AM, Nikunj A Dadhania wrote:

> >
> >     non-PLE - Test Setup:
> >     =====================

> >
> >     ebizzy 8vm (improved 331%)
[...]
> >     GangV2:
> >     27.96%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
> >     12.13%       ebizzy  [kernel.kallsyms]       [k] clear_page
> >     11.66%       ebizzy  [kernel.kallsyms]       [k] __bitmap_empty
> >     11.54%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
> >      5.93%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault
> >
> >     GangBase:
> >     36.34%       ebizzy  [kernel.kallsyms]  [k] __bitmap_empty
> >     35.95%       ebizzy  [kernel.kallsyms]  [k] flush_tlb_others_ipi
> >      8.52%       ebizzy  libc-2.12.so       [.] __memcpy_ssse3_back
> 
> Same thing.  __bitmap_empty() is likely the cpumask_empty() called from
> flush_tlb_others_ipi(), so 70% of time is spent in this loop.
> 
> Xen works around this particular busy loop by having a hypercall for
> flushing the tlb, but this is very fragile (and broken wrt
> get_user_pages_fast() IIRC).
> 
> >
> >     dbench 8vm (degraded -30%)
> >     +------------+--------------------+--------------------+----------+
> >     |                               Dbench                            |
> >     +------------+--------------------+--------------------+----------+
> >     | Parameter  | GangBase           |   Gang V2          | % imprv  |
> >     +------------+--------------------+--------------------+----------+
> >     |      dbench|               2.01 |               1.38 |      -30 |
> >     |     BwUsage|    100408068913.00 |    176095548113.00 |       75 |
> >     |    HostIdle|              82.00 |              74.00 |        9 |
> >     |      IOWait|              25.00 |              23.00 |        8 |
> >     |    IdleTime|              74.00 |              71.00 |       -4 |
> >     |         TPS|              13.00 |              13.00 |        0 |
> >     | CacheMisses|       137351386.00 |       267116184.00 |      -94 |
> >     |   CacheRefs|      4347880250.00 |      5830408064.00 |       34 |
> >     |BranchMisses|       602120546.00 |      1110592466.00 |      -84 |
> >     |    Branches|     22275747114.00 |     39163309805.00 |       75 |
> >     |Instructions|    107942079625.00 |    195313721170.00 |      -80 |
> >     |      Cycles|    271014283494.00 |    481886203993.00 |      -77 |
> >     |     PageFlt|           44373.00 |           47679.00 |       -7 |
> >     |   ContextSW|         3318033.00 |        11598234.00 |     -249 |
> >     |   CPUMigrat|           82475.00 |          423066.00 |     -412 |
> >     +-----------------------------------------------------------------+
> >
> 
> Rik, what's going on?  ContextSW is relatively low in the base load,
> looks like PLE is asleep on the wheel.
> 
Avi, the above dbench result is from a non-PLE machine. So PLE will not
come into the picture here.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-02 10:30                     ` Nikunj A Dadhania
@ 2012-01-02 13:33                       ` Avi Kivity
  0 siblings, 0 replies; 75+ messages in thread
From: Avi Kivity @ 2012-01-02 13:33 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Rik van Riel, Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 01/02/2012 12:30 PM, Nikunj A Dadhania wrote:
> > 
> > Rik, what's going on?  ContextSW is relatively low in the base load,
> > looks like PLE is asleep on the wheel.
> > 
> Avi, the above dbench result is from a non-PLE machine. So PLE will not
> come into picture here.

Ah, sorry, read too fast.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-02  9:37                   ` Avi Kivity
  2012-01-02 10:30                     ` Nikunj A Dadhania
@ 2012-01-04 10:52                     ` Nikunj A Dadhania
  2012-01-04 14:41                       ` Avi Kivity
  1 sibling, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2012-01-04 10:52 UTC (permalink / raw)
  To: Avi Kivity, Rik van Riel
  Cc: Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Mon, 02 Jan 2012 11:37:22 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 12/31/2011 04:21 AM, Nikunj A Dadhania wrote:
> >
> >     GangV2:
> >     27.45%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
> >     12.12%       ebizzy  [kernel.kallsyms]       [k] clear_page
> >      9.22%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault
> >      6.91%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
> >      4.06%       ebizzy  [kernel.kallsyms]       [k] get_page_from_freelist
> >      4.04%       ebizzy  [kernel.kallsyms]       [k] ____pagevec_lru_add
> >
> >     GangBase:
> >     45.08%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
> >     15.38%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
> >      7.00%       ebizzy  [kernel.kallsyms]       [k] clear_page
> >      4.88%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault
> 
> > Looping in flush_tlb_others().  Rik, what trace can we run to find out
> why PLE directed yield isn't working as expected?
> 
I tried some experiments by adding a pause_loop_exits stat in the
kvm_vcpu_stat.

Here are some observations related to the Baseline-only (8 VM) case:

              | ple_gap=128 | ple_gap=64 | ple_gap=256 | ple_window=2048
--------------+-------------+------------+-------------+----------------
EbzyRecords/s |    2247.50  |    2132.75 |    2086.25  |      1835.62
PauseExits    | 7928154.00  | 6696342.00 | 7365999.00  |  50319582.00

With ple_window = 2048, PauseExits is more than 6 times the default case.
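
(ple_gap and ple_window are module parameters of kvm_intel, so changing them
amounts to reloading the module with the new values -- roughly the following,
with all guests shut down first; the values are only examples from the table:)

    rmmod kvm_intel
    modprobe kvm_intel ple_gap=64          # or ple_window=2048, etc.
    # the active values can be read back from sysfs
    cat /sys/module/kvm_intel/parameters/ple_gap
    cat /sys/module/kvm_intel/parameters/ple_window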

-----

    From: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>

    Add Pause-loop-exit stats to kvm_vcpu_stat

    Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>


diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b4973f4..be2e7f2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -539,6 +539,7 @@ struct kvm_vcpu_stat {
        u32 hypercalls;
        u32 irq_injections;
        u32 nmi_injections;
+       u32 pause_loop_exits;
 };
 
 struct x86_instruction_info;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 579a0b5..29e90b7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4897,6 +4897,8 @@ out:
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
        skip_emulated_instruction(vcpu);
+       ++vcpu->stat.pause_loop_exits;
        kvm_vcpu_on_spin(vcpu);
 
        return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c38efd7..87433a8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -149,6 +149,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
        { "mmu_unsync", VM_STAT(mmu_unsync) },
        { "remote_tlb_flush", VM_STAT(remote_tlb_flush) },
        { "largepages", VM_STAT(lpages) },
+       { "pause_loop_exits", VCPU_STAT(pause_loop_exits) },
        { NULL }
 };
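
With this applied, the counter shows up next to the other kvm stats in
debugfs (assuming debugfs is mounted at the usual place):

    cat /sys/kernel/debug/kvm/pause_loop_exits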


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 10:52                     ` Nikunj A Dadhania
@ 2012-01-04 14:41                       ` Avi Kivity
  2012-01-04 14:56                         ` Srivatsa Vaddagiri
                                           ` (2 more replies)
  0 siblings, 3 replies; 75+ messages in thread
From: Avi Kivity @ 2012-01-04 14:41 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Rik van Riel, Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 01/04/2012 12:52 PM, Nikunj A Dadhania wrote:
> On Mon, 02 Jan 2012 11:37:22 +0200, Avi Kivity <avi@redhat.com> wrote:
> > On 12/31/2011 04:21 AM, Nikunj A Dadhania wrote:
> > >
> > >     GangV2:
> > >     27.45%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
> > >     12.12%       ebizzy  [kernel.kallsyms]       [k] clear_page
> > >      9.22%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault
> > >      6.91%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
> > >      4.06%       ebizzy  [kernel.kallsyms]       [k] get_page_from_freelist
> > >      4.04%       ebizzy  [kernel.kallsyms]       [k] ____pagevec_lru_add
> > >
> > >     GangBase:
> > >     45.08%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
> > >     15.38%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
> > >      7.00%       ebizzy  [kernel.kallsyms]       [k] clear_page
> > >      4.88%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault
> > 
> > Looping in flush_tlb_others().  Rik, what trace can we run to find out
> > why PLE directed yield isn't working as expected?
> > 
> I tried some experiments by adding a pause_loop_exits stat in the
> kvm_vcpu_stat.

(that's deprecated, we use tracepoints these days for stats)

> Here are some observation related to Baseline-only(8vm case)
>
>               | ple_gap=128 | ple_gap=64 | ple_gap=256 | ple_window=2048
> --------------+-------------+------------+-------------+----------------
> EbzyRecords/s |    2247.50  |    2132.75 |    2086.25  |      1835.62
> PauseExits    | 7928154.00  | 6696342.00 | 7365999.00  |  50319582.00
>
> With ple_window = 2048, PauseExits is more than 6times the default case

So it looks like the default is optimal, at least wrt the cases you
tested and your test workload.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 14:41                       ` Avi Kivity
@ 2012-01-04 14:56                         ` Srivatsa Vaddagiri
  2012-01-04 17:13                           ` Avi Kivity
  2012-01-04 16:47                         ` Rik van Riel
  2012-01-05  2:10                         ` Nikunj A Dadhania
  2 siblings, 1 reply; 75+ messages in thread
From: Srivatsa Vaddagiri @ 2012-01-04 14:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A Dadhania, Rik van Riel, Ingo Molnar, peterz,
	linux-kernel, bharata

* Avi Kivity <avi@redhat.com> [2012-01-04 16:41:58]:

> > Here are some observation related to Baseline-only(8vm case)
> >
> >               | ple_gap=128 | ple_gap=64 | ple_gap=256 | ple_window=2048
> > --------------+-------------+------------+-------------+----------------
> > EbzyRecords/s |    2247.50  |    2132.75 |    2086.25  |      1835.62
> > PauseExits    | 7928154.00  | 6696342.00 | 7365999.00  |  50319582.00
> >
> > With ple_window = 2048, PauseExits is more than 6times the default case
> 
> So it looks like the default is optimal, at least wrt the cases you
> tested and your test workload.

The default case still lags considerably behind the results we are seeing with
gang scheduling. One more interesting data point would be to see how
many PLE exits we are seeing when the vcpu is spinning in
flush_tlb_others_ipi(). Is there any easy way to determine that?

- vatsa


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 14:41                       ` Avi Kivity
  2012-01-04 14:56                         ` Srivatsa Vaddagiri
@ 2012-01-04 16:47                         ` Rik van Riel
  2012-01-04 17:16                           ` Avi Kivity
  2012-01-05  2:10                         ` Nikunj A Dadhania
  2 siblings, 1 reply; 75+ messages in thread
From: Rik van Riel @ 2012-01-04 16:47 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A Dadhania, Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 01/04/2012 09:41 AM, Avi Kivity wrote:
> On 01/04/2012 12:52 PM, Nikunj A Dadhania wrote:
>> On Mon, 02 Jan 2012 11:37:22 +0200, Avi Kivity<avi@redhat.com>  wrote:
>>> On 12/31/2011 04:21 AM, Nikunj A Dadhania wrote:
>>>>
>>>>      GangV2:
>>>>      27.45%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
>>>>      12.12%       ebizzy  [kernel.kallsyms]       [k] clear_page
>>>>       9.22%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault
>>>>       6.91%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
>>>>       4.06%       ebizzy  [kernel.kallsyms]       [k] get_page_from_freelist
>>>>       4.04%       ebizzy  [kernel.kallsyms]       [k] ____pagevec_lru_add
>>>>
>>>>      GangBase:
>>>>      45.08%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi
>>>>      15.38%       ebizzy  libc-2.12.so            [.] __memcpy_ssse3_back
>>>>       7.00%       ebizzy  [kernel.kallsyms]       [k] clear_page
>>>>       4.88%       ebizzy  [kernel.kallsyms]       [k] __do_page_fault
>>>
>>> Looping in flush_tlb_others().  Rik, what trace can we run to find out
>>> why PLE directed yield isn't working as expected?
>>>
>> I tried some experiments by adding a pause_loop_exits stat in the
>> kvm_vcpu_stat.
>
> (that's deprecated, we use tracepoints these days for stats)
>
>> Here are some observation related to Baseline-only(8vm case)
>>
>>                | ple_gap=128 | ple_gap=64 | ple_gap=256 | ple_window=2048
>> --------------+-------------+------------+-------------+----------------
>> EbzyRecords/s |    2247.50  |    2132.75 |    2086.25  |      1835.62
>> PauseExits    | 7928154.00  | 6696342.00 | 7365999.00  |  50319582.00
>>
>> With ple_window = 2048, PauseExits is more than 6times the default case
>
> So it looks like the default is optimal, at least wrt the cases you
> tested and your test workload.

It depends on the workload.

I believe ebizzy synchronously bounces messages around between
userland threads, and may benefit from lower latency preemption
and re-scheduling.

Workloads like AMQP do asynchronous messaging, and are likely
to benefit from having a lower number of switches.

I do not know which kind of workload is more prevalent.

Another worry with gang scheduling is scalability.  One of
the reasons Linux scales well to larger systems is that a
lot of things are done CPU local, without communicating
things with other CPUs.  Making the scheduling algorithm
system-global has the potential to add in a lot of overhead.

Likewise, removing the ability to migrate workloads to idle
CPUs is likely to hurt a lot of real world workloads.

Benchmarks don't care, because they run full-out. However,
users do not run benchmarks nearly as much as they run
actual workloads...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 14:56                         ` Srivatsa Vaddagiri
@ 2012-01-04 17:13                           ` Avi Kivity
  2012-01-05  6:57                             ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2012-01-04 17:13 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Nikunj A Dadhania, Rik van Riel, Ingo Molnar, peterz,
	linux-kernel, bharata

On 01/04/2012 04:56 PM, Srivatsa Vaddagiri wrote:
> * Avi Kivity <avi@redhat.com> [2012-01-04 16:41:58]:
>
> > > Here are some observation related to Baseline-only(8vm case)
> > >
> > >               | ple_gap=128 | ple_gap=64 | ple_gap=256 | ple_window=2048
> > > --------------+-------------+------------+-------------+----------------
> > > EbzyRecords/s |    2247.50  |    2132.75 |    2086.25  |      1835.62
> > > PauseExits    | 7928154.00  | 6696342.00 | 7365999.00  |  50319582.00
> > >
> > > With ple_window = 2048, PauseExits is more than 6times the default case
> > 
> > So it looks like the default is optimal, at least wrt the cases you
> > tested and your test workload.
>
> The default case still lags considerably behind the results we are seeing with
> gang scheduling. One more interesting data point would be to see how
> many PLE exits we are seeing when the vcpu is spinning in
> flush_tlb_others_ipi(). Is there any easy way to determine that?
>

You could get an exit trace (trace-cmd -e kvm:kvm_exit) and filter on
PLE exits; the trace contains the guest %rip, so you could match it
against flush_tlb_others_ipi().
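
Roughly (untested sketch; on Intel the PLE exit is reported under the
PAUSE_INSTRUCTION exit reason, and the symbol range comes from the guest):

    # host: record kvm_exit events while the benchmark runs
    trace-cmd record -e kvm:kvm_exit sleep 60

    # keep only the PLE exits; each line carries the guest rip
    trace-cmd report | grep 'reason PAUSE_INSTRUCTION' > ple-exits.txt

    # guest: address range of flush_tlb_others_ipi, to match the rip against
    grep -A1 ' flush_tlb_others_ipi$' /proc/kallsyms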

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 16:47                         ` Rik van Riel
@ 2012-01-04 17:16                           ` Avi Kivity
  2012-01-04 20:56                             ` Rik van Riel
  2012-01-04 21:31                             ` Peter Zijlstra
  0 siblings, 2 replies; 75+ messages in thread
From: Avi Kivity @ 2012-01-04 17:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nikunj A Dadhania, Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 01/04/2012 06:47 PM, Rik van Riel wrote:
>> So it looks like the default is optimal, at least wrt the cases you
>> tested and your test workload.
>
>
> It depends on the workload.
>
> I believe ebizzy synchronously bounces messages around between
> userland threads, and may benefit from lower latency preemption
> and re-scheduling.
>
> Workloads like AMQP do asynchronous messaging, and are likely
> to benefit from having a lower number of switches.
>
> I do not know which kind of workload is more prevalent.
>
> Another worry with gang scheduling is scalability.  One of
> the reasons Linux scales well to larger systems is that a
> lot of things are done CPU local, without communicating
> things with other CPUs.  Making the scheduling algorithm
> system-global has the potential to add in a lot of overhead.
>
> Likewise, removing the ability to migrate workloads to idle
> CPUs is likely to hurt a lot of real world workloads.
>
> Benchmarks don't care, because they run full-out. However,
> users do not run benchmarks nearly as much as they run
> actual workloads...
>

I think we can solve it at the guest level.  The paravirt ticketlock
stuff introduces wait/wake calls (actually wait is just a HLT
instruction); we could spin for a while, then HLT until the other side
wakes us.  We should do this for all sites that busy wait.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 17:16                           ` Avi Kivity
@ 2012-01-04 20:56                             ` Rik van Riel
  2012-01-04 21:31                             ` Peter Zijlstra
  1 sibling, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2012-01-04 20:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nikunj A Dadhania, Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On 01/04/2012 12:16 PM, Avi Kivity wrote:
> On 01/04/2012 06:47 PM, Rik van Riel wrote:
>>> So it looks like the default is optimal, at least wrt the cases you
>>> tested and your test workload.
>>
>>
>> It depends on the workload.
>>
>> I believe ebizzy synchronously bounces messages around between
>> userland threads, and may benefit from lower latency preemption
>> and re-scheduling.
>>
>> Workloads like AMQP do asynchronous messaging, and are likely
>> to benefit from having a lower number of switches.
>>
>> I do not know which kind of workload is more prevalent.
>>
>> Another worry with gang scheduling is scalability.  One of
>> the reasons Linux scales well to larger systems is that a
>> lot of things are done CPU local, without communicating
>> things with other CPUs.  Making the scheduling algorithm
>> system-global has the potential to add in a lot of overhead.
>>
>> Likewise, removing the ability to migrate workloads to idle
>> CPUs is likely to hurt a lot of real world workloads.
>>
>> Benchmarks don't care, because they run full-out. However,
>> users do not run benchmarks nearly as much as they run
>> actual workloads...
>>
>
> I think we can solve it at the guest level.  The paravirt ticketlock
> stuff introduces wait/wake calls (actually wait is just a HLT
> instruction); we could spin for a while, then HLT until the other side
> wakes us.  We should do this for all sites that busy wait.

Agreed, that would probably be the best (and nicest) solution.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 17:16                           ` Avi Kivity
  2012-01-04 20:56                             ` Rik van Riel
@ 2012-01-04 21:31                             ` Peter Zijlstra
  2012-01-04 21:41                               ` Avi Kivity
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2012-01-04 21:31 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, Nikunj A Dadhania, Ingo Molnar, linux-kernel,
	vatsa, bharata

On Wed, 2012-01-04 at 19:16 +0200, Avi Kivity wrote:
> 
> 
> I think we can solve it at the guest level.  The paravirt ticketlock
> stuff introduces wait/wake calls (actually wait is just a HLT
> instruction); we could spin for a while, then HLT until the other side
> wakes us.  We should do this for all sites that busy wait.
> 
This is all TLB invalidates, right?

So why wait for non-running vcpus at all? That is, why not paravirt the
TLB flush such that the invalidate marks the non-running VCPU's state so
that on resume it will first flush its TLBs. That way you don't have to
wake it up and wait for it to invalidate its TLBs.

Or am I like totally missing the point (I am after all reading the
thread backwards and I haven't yet fully paged the kernel stuff back
into my brain).

I guess tagging remote VCPU state like that might be somewhat tricky...
but it seems worth considering; the whole wake-and-wait-for-flush thing
seems daft.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 21:31                             ` Peter Zijlstra
@ 2012-01-04 21:41                               ` Avi Kivity
  2012-01-05  9:10                                 ` Ingo Molnar
  0 siblings, 1 reply; 75+ messages in thread
From: Avi Kivity @ 2012-01-04 21:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Nikunj A Dadhania, Ingo Molnar, linux-kernel,
	vatsa, bharata

On 01/04/2012 11:31 PM, Peter Zijlstra wrote:
> On Wed, 2012-01-04 at 19:16 +0200, Avi Kivity wrote:
> > 
> > 
> > I think we can solve it at the guest level.  The paravirt ticketlock
> > stuff introduces wait/wake calls (actually wait is just a HLT
> > instruction); we could spin for a while, then HLT until the other side
> > wakes us.  We should do this for all sites that busy wait.
> > 
> This is all TLB invalidates, right?
>
> So why wait for non-running vcpus at all? That is, why not paravirt the
> TLB flush such that the invalidate marks the non-running VCPU's state so
> that on resume it will first flush its TLBs. That way you don't have to
> wake it up and wait for it to invalidate its TLBs.

That's what Xen does, but it's tricky.  For example
get_user_pages_fast() depends on the IPI to hold off page freeing; if we
paravirt it we have to take that into consideration.

> Or am I like totally missing the point (I am after all reading the
> thread backwards and I haven't yet fully paged the kernel stuff back
> into my brain).

You aren't, and I bet those kernel pages are unswappable anyway.

> I guess tagging remote VCPU state like that might be somewhat tricky..
> but it seems worth considering, the whole wake and wait for flush thing
> seems daft.

It's nasty, but then so is paravirt.  It's hard to get right, and it has
a tendency to cause performance regressions as hardware improves.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 14:41                       ` Avi Kivity
  2012-01-04 14:56                         ` Srivatsa Vaddagiri
  2012-01-04 16:47                         ` Rik van Riel
@ 2012-01-05  2:10                         ` Nikunj A Dadhania
  2 siblings, 0 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2012-01-05  2:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rik van Riel, Ingo Molnar, peterz, linux-kernel, vatsa, bharata

On Wed, 04 Jan 2012 16:41:58 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 01/04/2012 12:52 PM, Nikunj A Dadhania wrote:
[...]
> > I tried some experiments by adding a pause_loop_exits stat in the
> > kvm_vcpu_stat.
> 
> (that's deprecated, we use tracepoints these days for stats)
> 
Ah ok, did not notice that.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 17:13                           ` Avi Kivity
@ 2012-01-05  6:57                             ` Nikunj A Dadhania
  0 siblings, 0 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2012-01-05  6:57 UTC (permalink / raw)
  To: Avi Kivity, Srivatsa Vaddagiri
  Cc: Rik van Riel, Ingo Molnar, peterz, linux-kernel, bharata

On Wed, 04 Jan 2012 19:13:15 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 01/04/2012 04:56 PM, Srivatsa Vaddagiri wrote:
> > * Avi Kivity <avi@redhat.com> [2012-01-04 16:41:58]:
> >
> > > > Here are some observation related to Baseline-only(8vm case)
> > > >
> > > >               | ple_gap=128 | ple_gap=64 | ple_gap=256 | ple_window=2048
> > > > --------------+-------------+------------+-------------+----------------
> > > > EbzyRecords/s |    2247.50  |    2132.75 |    2086.25  |      1835.62
> > > > PauseExits    | 7928154.00  | 6696342.00 | 7365999.00  |  50319582.00
> > > >
> > > > With ple_window = 2048, PauseExits is more than 6times the default case
> > > 
> > > So it looks like the default is optimal, at least wrt the cases you
> > > tested and your test workload.
> >
> > The default case still lags considerably behind the results we are seeing with
> > gang scheduling. One more interesting data point would be to see how
> > many PLE exits we are seeing when the vcpu is spinning in
> > flush_tlb_others_ipi(). Is there any easy way to determine that?
> >
> 
> You could get an exit trace (trace-cmd -e kvm:kvm_exit) and filter on
> PLE exits; the trace contains the guest %rip, so you could match it
> against flush_tlb_others_ipi().
> 
Cool, this is much easier; I had to write a small awk script to extract the PLE
exits that land in flush_tlb_others_ipi:

Matched 9382616(86%), Not matched 1453845(14%)
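
The matching is basically comparing the rip of each PLE exit against the
address range of flush_tlb_others_ipi; a minimal sketch of that kind of
script (not the exact one I used -- the start/end addresses below are
placeholders taken from the guest's /proc/kallsyms, and the exit-reason
name assumes Intel):

    trace-cmd report | awk -v start=0xffffffff81029a40 \
                           -v end=0xffffffff81029bf0 '
      # kernel text addresses are equal-width hex strings, so a plain
      # string comparison of the rip field against the boundaries works
      /reason PAUSE_INSTRUCTION/ {
          for (i = 1; i < NF; i++)
              if ($i == "rip")
                  rip = $(i + 1)
          if (rip >= start && rip < end)
              matched++
          else
              unmatched++
      }
      END {
          total = matched + unmatched
          if (total)
              printf("Matched %d(%d%%), Not matched %d(%d%%)\n",
                     matched, 100 * matched / total,
                     unmatched, 100 * unmatched / total)
      }'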

So a considerable share of the PLE exits come from flush_tlb_others_ipi, and
even then we see:

    35.01%       ebizzy  [kernel.kallsyms]       [k] flush_tlb_others_ipi

Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-04 21:41                               ` Avi Kivity
@ 2012-01-05  9:10                                 ` Ingo Molnar
  2012-02-20  8:08                                   ` Nikunj A Dadhania
  0 siblings, 1 reply; 75+ messages in thread
From: Ingo Molnar @ 2012-01-05  9:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Rik van Riel, Nikunj A Dadhania, linux-kernel,
	vatsa, bharata


* Avi Kivity <avi@redhat.com> wrote:

> > So why wait for non-running vcpus at all? That is, why not 
> > paravirt the TLB flush such that the invalidate marks the 
> > non-running VCPU's state so that on resume it will first 
> > flush its TLBs. That way you don't have to wake it up and 
> > wait for it to invalidate its TLBs.
> 
> That's what Xen does, but it's tricky.  For example 
> get_user_pages_fast() depends on the IPI to hold off page 
> freeing, if we paravirt it we have to take that into 
> consideration.
> 
> > Or am I like totally missing the point (I am after all 
> > reading the thread backwards and I haven't yet fully paged 
> > the kernel stuff back into my brain).
> 
> You aren't, and I bet those kernel pages are unswappable 
> anyway.
> 
> > I guess tagging remote VCPU state like that might be 
> > somewhat tricky.. but it seems worth considering, the whole 
> > wake and wait for flush thing seems daft.
> 
> It's nasty, but then so is paravirt.  It's hard to get right, 
> and it has a tendency to cause performance regressions as 
> hardware improves.

Here it would massively improve performance - without regressing 
the scheduler code massively.

Or you accept that the hardware does not support intelligent TLB 
flushing yet, hope for future hw to fix it, and live with the 
performance impact for now.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-01-05  9:10                                 ` Ingo Molnar
@ 2012-02-20  8:08                                   ` Nikunj A Dadhania
  2012-02-20  8:14                                     ` Ingo Molnar
  2012-02-20 10:51                                     ` Peter Zijlstra
  0 siblings, 2 replies; 75+ messages in thread
From: Nikunj A Dadhania @ 2012-02-20  8:08 UTC (permalink / raw)
  To: Ingo Molnar, Avi Kivity
  Cc: Peter Zijlstra, Rik van Riel, linux-kernel, vatsa, bharata

On Thu, 5 Jan 2012 10:10:59 +0100, Ingo Molnar <mingo@elte.hu> wrote:
> 
> * Avi Kivity <avi@redhat.com> wrote:
> 
> > > So why wait for non-running vcpus at all? That is, why not 
> > > paravirt the TLB flush such that the invalidate marks the 
> > > non-running VCPU's state so that on resume it will first 
> > > flush its TLBs. That way you don't have to wake it up and 
> > > wait for it to invalidate its TLBs.
> > 
> > That's what Xen does, but it's tricky.  For example 
> > get_user_pages_fast() depends on the IPI to hold off page 
> > freeing, if we paravirt it we have to take that into 
> > consideration.
> > 
> > > Or am I like totally missing the point (I am after all 
> > > reading the thread backwards and I haven't yet fully paged 
> > > the kernel stuff back into my brain).
> > 
> > You aren't, and I bet those kernel pages are unswappable 
> > anyway.
> > 
> > > I guess tagging remote VCPU state like that might be 
> > > somewhat tricky.. but it seems worth considering, the whole 
> > > wake and wait for flush thing seems daft.
> > 
> > It's nasty, but then so is paravirt.  It's hard to get right, 
> > and it has a tendency to cause performance regressions as 
> > hardware improves.
> 
> Here it would massively improve performance - without regressing 
> the scheduler code massively.
> 
I tried an experiment with flush_tlb_others_ipi. This depends
on Raghu's "kvm : Paravirt-spinlock support for KVM guests"
(https://lkml.org/lkml/2012/1/14/66), which adds a new hypercall for
kicking another vcpu out of halt.

  Here are the results from non-PLE hardware, running the ebizzy
  workload inside the VMs. The table shows the ebizzy score in
  records/sec.

  8CPU Intel Xeon, HT disabled, 64 bit VM(8vcpu, 1G RAM)

  +--------+------------+------------+-------------+
  |        |  baseline  |   gang     |   pv_flush  |
  +--------+------------+------------+-------------+
  |   2VM  |   3979.50  |   8818.00  |   11002.50  |
  |   4VM  |   1817.50  |   6236.50  |    6196.75  |
  |   8VM  |    922.12  |   4043.00  |    4001.38  |
  +--------+------------+------------+-------------+

I will be posting the results for PLE hardware as well.

Here is the patch; this still needs to be hooked up with the pv_mmu_ops, so:

Not-yet-Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>

Index: linux-tip-f4ab688-pv/arch/x86/mm/tlb.c
===================================================================
--- linux-tip-f4ab688-pv.orig/arch/x86/mm/tlb.c	2012-02-14 18:26:21.000000000 +0800
+++ linux-tip-f4ab688-pv/arch/x86/mm/tlb.c	2012-02-20 15:23:10.242576314 +0800
@@ -43,6 +43,7 @@ union smp_flush_state {
 		struct mm_struct *flush_mm;
 		unsigned long flush_va;
 		raw_spinlock_t tlbstate_lock;
+		int sender_cpu;
 		DECLARE_BITMAP(flush_cpumask, NR_CPUS);
 	};
 	char pad[INTERNODE_CACHE_BYTES];
@@ -116,6 +117,9 @@ EXPORT_SYMBOL_GPL(leave_mm);
  *
  * Interrupts are disabled.
  */
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+extern void kvm_kick_cpu(int cpu);
+#endif
 
 /*
  * FIXME: use of asmlinkage is not consistent.  On x86_64 it's noop
@@ -166,6 +170,10 @@ out:
 	smp_mb__before_clear_bit();
 	cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
 	smp_mb__after_clear_bit();
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+	if (cpumask_empty(to_cpumask(f->flush_cpumask)))
+		kvm_kick_cpu(f->sender_cpu);
+#endif
 	inc_irq_stat(irq_tlb_count);
 }
 
@@ -184,7 +192,10 @@ static void flush_tlb_others_ipi(const s
 
 	f->flush_mm = mm;
 	f->flush_va = va;
+	f->sender_cpu = smp_processor_id();
 	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
+		int loop = 1024;
+
 		/*
 		 * We have to send the IPI only to
 		 * CPUs affected.
@@ -192,8 +203,15 @@ static void flush_tlb_others_ipi(const s
 		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
 			      INVALIDATE_TLB_VECTOR_START + sender);
 
+#ifdef CONFIG_PARAVIRT_FLUSH_TLB
+		while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
+			cpu_relax();
+		if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
+			halt();
+#else
 		while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
 			cpu_relax();
+#endif
 	}
 
 	f->flush_mm = NULL;
Index: linux-tip-f4ab688-pv/arch/x86/kernel/kvm.c
===================================================================
--- linux-tip-f4ab688-pv.orig/arch/x86/kernel/kvm.c	2012-02-14 18:26:55.000000000 +0800
+++ linux-tip-f4ab688-pv/arch/x86/kernel/kvm.c	2012-02-14 18:26:55.178450933 +0800
@@ -653,16 +653,17 @@ out:
 PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
 
 /* Kick a cpu by its apicid*/
-static inline void kvm_kick_cpu(int apicid)
+void kvm_kick_cpu(int cpu)
 {
+	int apicid = per_cpu(x86_cpu_to_apicid, cpu);
 	kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
 }
+EXPORT_SYMBOL_GPL(kvm_kick_cpu);
 
 /* Kick vcpu waiting on @lock->head to reach value @ticket */
 static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 {
 	int cpu;
-	int apicid;
 
 	add_stats(RELEASED_SLOW, 1);
 
@@ -671,8 +672,7 @@ static void kvm_unlock_kick(struct arch_
 		if (ACCESS_ONCE(w->lock) == lock &&
 		    ACCESS_ONCE(w->want) == ticket) {
 			add_stats(RELEASED_SLOW_KICKED, 1);
-			apicid = per_cpu(x86_cpu_to_apicid, cpu);
-			kvm_kick_cpu(apicid);
+			kvm_kick_cpu(cpu);
 			break;
 		}
 	}
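
A minimal sketch of what the missing pv_mmu_ops hook-up might look like
(not part of the patch above; kvm_pv_flush_tlb_others(), kvm_setup_pv_flush()
and the KVM_FEATURE_PV_UNHALT check are assumed names, only
pv_mmu_ops.flush_tlb_others and native_flush_tlb_others() exist in the tree
as-is):

/*
 * Hypothetical hook-up via pv_mmu_ops, so the kick/halt-aware flush is
 * only used when running as a KVM guest with the kick hypercall.
 */
#include <linux/kvm_para.h>
#include <asm/paravirt.h>
#include <asm/tlbflush.h>

static void kvm_pv_flush_tlb_others(const struct cpumask *cpumask,
				    struct mm_struct *mm, unsigned long va)
{
	/*
	 * The real body would be the kick/halt-aware variant of
	 * flush_tlb_others_ipi() from the hunk above; falling back to
	 * the native path keeps this sketch self-contained.
	 */
	native_flush_tlb_others(cpumask, mm, va);
}

static void __init kvm_setup_pv_flush(void)
{
	/* assumed feature bit, taken from the pv-spinlock series */
	if (kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
		pv_mmu_ops.flush_tlb_others = kvm_pv_flush_tlb_others;
}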


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-02-20  8:08                                   ` Nikunj A Dadhania
@ 2012-02-20  8:14                                     ` Ingo Molnar
  2012-02-20 10:51                                     ` Peter Zijlstra
  1 sibling, 0 replies; 75+ messages in thread
From: Ingo Molnar @ 2012-02-20  8:14 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Avi Kivity, Peter Zijlstra, Rik van Riel, linux-kernel, vatsa, bharata


* Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> wrote:

> > Here it would massively improve performance - without 
> > regressing the scheduler code massively.
> 
> I tried an experiment with flush_tlb_others_ipi(). This depends 
> on Raghu's "kvm : Paravirt-spinlock support for KVM guests" 
> series (https://lkml.org/lkml/2012/1/14/66), which adds a new 
> hypercall for kicking another vcpu out of halt.
> 
>   Here are the results from non-PLE hardware. Running ebizzy 
>   workload inside the VMs. The table shows the ebizzy score - 
>   Records/sec.
> 
>   8CPU Intel Xeon, HT disabled, 64 bit VM(8vcpu, 1G RAM)
> 
>   +--------+------------+------------+-------------+
>   |        |  baseline  |   gang     |   pv_flush  |
>   +--------+------------+------------+-------------+
>   |   2VM  |   3979.50  |   8818.00  |   11002.50  |
>   |   4VM  |   1817.50  |   6236.50  |    6196.75  |
>   |   8VM  |    922.12  |   4043.00  |    4001.38  |
>   +--------+------------+------------+-------------+

Very nice results!

Seems like the PV approach is massively faster on 2 VMs than 
even the gang scheduling hack, because it attacks the problem
at its root, not just the symptom.

The patch is also an order of magnitude simpler. Gang 
scheduling, R.I.P.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-02-20  8:08                                   ` Nikunj A Dadhania
  2012-02-20  8:14                                     ` Ingo Molnar
@ 2012-02-20 10:51                                     ` Peter Zijlstra
  2012-02-20 11:53                                       ` Nikunj A Dadhania
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2012-02-20 10:51 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Ingo Molnar, Avi Kivity, Rik van Riel, linux-kernel, vatsa, bharata

On Mon, 2012-02-20 at 13:38 +0530, Nikunj A Dadhania wrote:
> +#ifdef CONFIG_PARAVIRT_FLUSH_TLB
> +               while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> +                       cpu_relax();
> +               if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
> +                       halt();


That's just vile, you don't need to wait for it, all you need to make
sure is that when that vcpu wakes up it does the flush.

But yeah, the results are a good hint that you're on the right track.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-02-20 10:51                                     ` Peter Zijlstra
@ 2012-02-20 11:53                                       ` Nikunj A Dadhania
  2012-02-20 12:02                                         ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 75+ messages in thread
From: Nikunj A Dadhania @ 2012-02-20 11:53 UTC (permalink / raw)
  To: Peter Zijlstra, Avi Kivity
  Cc: Ingo Molnar, Rik van Riel, linux-kernel, vatsa, bharata

On Mon, 20 Feb 2012 11:51:13 +0100, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2012-02-20 at 13:38 +0530, Nikunj A Dadhania wrote:
> > +#ifdef CONFIG_PARAVIRT_FLUSH_TLB
> > +               while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> > +                       cpu_relax();
> > +               if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
> > +                       halt();
> 
> 
> That's just vile, you don't need to wait for it, all you need to make
> sure is that when that vcpu wakes up it does the flush.
>
Yes, but we are not sure whether the vcpu will be sleeping or running.
In the case where the vcpus are running, it might be beneficial to
wait a while.

For example: if it's a remote flush to only one of the vcpus and it's
already running, is it worth halting and coming back?

Regards,
Nikunj


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-02-20 11:53                                       ` Nikunj A Dadhania
@ 2012-02-20 12:02                                         ` Srivatsa Vaddagiri
  2012-02-20 12:14                                           ` Peter Zijlstra
  0 siblings, 1 reply; 75+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-20 12:02 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: Peter Zijlstra, Avi Kivity, Ingo Molnar, Rik van Riel,
	linux-kernel, bharata

* Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> [2012-02-20 17:23:16]:

> On Mon, 20 Feb 2012 11:51:13 +0100, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, 2012-02-20 at 13:38 +0530, Nikunj A Dadhania wrote:
> > > +#ifdef CONFIG_PARAVIRT_FLUSH_TLB
> > > +               while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> > > +                       cpu_relax();
> > > +               if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
> > > +                       halt();
> > 
> > 
> > That's just vile, you don't need to wait for it, all you need to make
> > sure is that when that vcpu wakes up it does the flush.
> >
> Yes, but we are not sure whether the vcpu will be sleeping or running.
> In the case where the vcpus are running, it might be beneficial to
> wait a while.

I guess one possibility is for the host scheduler to export run/preempt
information to the guest OS, as was discussed in the context of this
thread as well:

http://lkml.org/lkml/2010/4/6/223

Essentially, the guest OS will know (using such exported information)
which of its vcpus are currently running and thus busy-wait only for them.
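
As a rough illustration, such exported state could be a per-vcpu flag in
guest/host shared memory, loosely modelled on the MSR_KVM_STEAL_TIME area.
The structure, the registration mechanism and all names below are
assumptions, not an existing interface:

#include <linux/percpu.h>
#include <linux/compiler.h>
#include <linux/types.h>

/* hypothetical guest/host shared per-vcpu run state */
struct kvm_vcpu_run_state {
	__u32 running;		/* written by the host on vm-entry/exit or preempt */
	__u32 pad[15];		/* keep the shared area 64 bytes */
};

static DEFINE_PER_CPU(struct kvm_vcpu_run_state, vcpu_run_state);

/* guest-side check: is the vcpu backing @cpu currently running? */
static bool vcpu_is_running(int cpu)
{
	return ACCESS_ONCE(per_cpu(vcpu_run_state, cpu).running) != 0;
}

The guest would register the area with the host once per vcpu (as the
steal-time code does), and flush_tlb_others_ipi() could then busy-wait
only for vcpus whose flag is set.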

- vatsa


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH 0/4] Gang scheduling in CFS
  2012-02-20 12:02                                         ` Srivatsa Vaddagiri
@ 2012-02-20 12:14                                           ` Peter Zijlstra
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Zijlstra @ 2012-02-20 12:14 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Nikunj A Dadhania, Avi Kivity, Ingo Molnar, Rik van Riel,
	linux-kernel, bharata

On Mon, 2012-02-20 at 17:32 +0530, Srivatsa Vaddagiri wrote:
> * Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> [2012-02-20 17:23:16]:
> 
> > On Mon, 20 Feb 2012 11:51:13 +0100, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Mon, 2012-02-20 at 13:38 +0530, Nikunj A Dadhania wrote:
> > > > +#ifdef CONFIG_PARAVIRT_FLUSH_TLB
> > > > +               while (!cpumask_empty(to_cpumask(f->flush_cpumask)) && --loop)
> > > > +                       cpu_relax();
> > > > +               if (!loop && !cpumask_empty(to_cpumask(f->flush_cpumask)))
> > > > +                       halt();
> > > 
> > > 
> > > That's just vile, you don't need to wait for it, all you need to make
> > > sure is that when that vcpu wakes up it does the flush.
> > >
> > Yes, but we are not sure whether the vcpu will be sleeping or running.
> > In the case where the vcpus are running, it might be beneficial to
> > wait a while.
> 
> I guess one possibility is for the host scheduler to export run/preempt
> information to the guest OS, as was discussed in the context of this
> thread as well:
> 
> http://lkml.org/lkml/2010/4/6/223

Doesn't need to be the host scheduler, KVM itself can do that just fine
on guest entry/exit.

> Essentially, the guest OS will know (using such exported information)
> which of its vcpus are currently running and thus busy-wait only for them.

Right, do something like:

again:
  for_each_cpu_in_mask(cpu, flush_cpumask) {
    if !vcpu-running {
      set_flush-on-enter(cpu)
      if !vcpu-running
        cpumask_clear(flush_cpumask, cpu); // vm-enter will do
    }
  }
  wait-a-while-for-mask-to-clear
  if (!cpumask_empty)
    goto again;

with the appropriate memory barriers and atomic instructions, that way
you can skip waiting for vcpus that are not in guest mode and vm-enter
will fixup.
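
A rough C rendering of that loop, only to make the ordering explicit;
vcpu_is_running() and set_flush_on_enter() are assumed helpers backed by
such exported state, and the real version would replace the wait loop in
flush_tlb_others_ipi(), which already provides the cpumask and barrier
primitives used here:

static void wait_for_remote_flushes(struct cpumask *mask)
{
	unsigned int cpu;

	for (;;) {
		for_each_cpu(cpu, mask) {
			if (vcpu_is_running(cpu))
				continue;	/* it will take the IPI and clear its bit */

			set_flush_on_enter(cpu);	/* assumed: next vm-entry flushes */
			smp_mb();	/* flag must be visible before the re-check */

			if (!vcpu_is_running(cpu))
				cpumask_clear_cpu(cpu, mask);	/* vm-enter will do it */
		}
		if (cpumask_empty(mask))
			break;
		cpu_relax();	/* wait a while, then scan again */
	}
}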

^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2012-02-20 12:14 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-19  8:33 [RFC PATCH 0/4] Gang scheduling in CFS Nikunj A. Dadhania
2011-12-19  8:34 ` [RFC PATCH 1/4] sched: Adding cpu.gang file to cpu cgroup Nikunj A. Dadhania
2011-12-19  8:34 ` [RFC PATCH 2/4] sched: Adding gang scheduling infrastrucure Nikunj A. Dadhania
2011-12-19 15:51   ` Peter Zijlstra
2011-12-19 16:51     ` Peter Zijlstra
2011-12-20  1:43       ` Nikunj A Dadhania
2011-12-20  1:39     ` Nikunj A Dadhania
2011-12-19  8:34 ` [RFC PATCH 3/4] sched: Gang using set_next_buddy Nikunj A. Dadhania
2011-12-19  8:35 ` [RFC PATCH 4/4] sched:Implement set_gang_buddy Nikunj A. Dadhania
2011-12-19 15:51   ` Peter Zijlstra
2011-12-20  1:43     ` Nikunj A Dadhania
2011-12-26  2:30     ` Nikunj A Dadhania
2011-12-19 11:23 ` [RFC PATCH 0/4] Gang scheduling in CFS Ingo Molnar
2011-12-19 11:44   ` Avi Kivity
2011-12-19 11:50     ` Nikunj A Dadhania
2011-12-19 11:59       ` Avi Kivity
2011-12-19 12:06         ` Nikunj A Dadhania
2011-12-19 12:50           ` Avi Kivity
2011-12-19 13:09             ` Nikunj A Dadhania
2011-12-19 11:45   ` Nikunj A Dadhania
2011-12-19 13:22     ` Nikunj A Dadhania
2011-12-19 16:28       ` Ingo Molnar
2011-12-21 10:39   ` Nikunj A Dadhania
2011-12-21 10:43     ` Avi Kivity
2011-12-23  3:20       ` Nikunj A Dadhania
2011-12-23 10:36         ` Ingo Molnar
2011-12-25 10:58           ` Avi Kivity
2011-12-25 15:45             ` Avi Kivity
2011-12-26  3:14             ` Nikunj A Dadhania
2011-12-26  9:05               ` Avi Kivity
2011-12-26 11:33                 ` Nikunj A Dadhania
2011-12-26 11:41                   ` Avi Kivity
2011-12-27  1:47                     ` Nikunj A Dadhania
2011-12-27  9:15                       ` Avi Kivity
2011-12-27 10:24                         ` Nikunj A Dadhania
2011-12-27  3:15               ` Nikunj A Dadhania
2011-12-27  9:17                 ` Avi Kivity
2011-12-27  9:44                   ` Nikunj A Dadhania
2011-12-27  9:51                     ` Avi Kivity
2011-12-27 10:10                       ` Nikunj A Dadhania
2011-12-27 10:34                         ` Avi Kivity
2011-12-27 10:43                           ` Nikunj A Dadhania
2011-12-27 10:53                             ` Avi Kivity
2011-12-30  9:51             ` Ingo Molnar
2011-12-30 10:10               ` Nikunj A Dadhania
2011-12-31  2:21                 ` Nikunj A Dadhania
2012-01-02  4:20                   ` Nikunj A Dadhania
2012-01-02  9:39                     ` Avi Kivity
2012-01-02 10:22                       ` Nikunj A Dadhania
2012-01-02  9:37                   ` Avi Kivity
2012-01-02 10:30                     ` Nikunj A Dadhania
2012-01-02 13:33                       ` Avi Kivity
2012-01-04 10:52                     ` Nikunj A Dadhania
2012-01-04 14:41                       ` Avi Kivity
2012-01-04 14:56                         ` Srivatsa Vaddagiri
2012-01-04 17:13                           ` Avi Kivity
2012-01-05  6:57                             ` Nikunj A Dadhania
2012-01-04 16:47                         ` Rik van Riel
2012-01-04 17:16                           ` Avi Kivity
2012-01-04 20:56                             ` Rik van Riel
2012-01-04 21:31                             ` Peter Zijlstra
2012-01-04 21:41                               ` Avi Kivity
2012-01-05  9:10                                 ` Ingo Molnar
2012-02-20  8:08                                   ` Nikunj A Dadhania
2012-02-20  8:14                                     ` Ingo Molnar
2012-02-20 10:51                                     ` Peter Zijlstra
2012-02-20 11:53                                       ` Nikunj A Dadhania
2012-02-20 12:02                                         ` Srivatsa Vaddagiri
2012-02-20 12:14                                           ` Peter Zijlstra
2012-01-05  2:10                         ` Nikunj A Dadhania
2011-12-19 15:51 ` Peter Zijlstra
2011-12-19 16:09   ` Alan Cox
2011-12-19 22:10   ` Benjamin Herrenschmidt
2011-12-20  1:56   ` Nikunj A Dadhania
2011-12-20  8:52   ` Jeremy Fitzhardinge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).