From: Ian Jackson
Subject: [OSSTEST PATCH 13/13] Planner: ms-queuedaemon: Restart planning when resources become free
Date: Wed, 2 Sep 2015 16:45:19 +0100
Message-ID: <1441208719-31336-14-git-send-email-ian.jackson@eu.citrix.com>
In-Reply-To: <1441208719-31336-1-git-send-email-ian.jackson@eu.citrix.com>
References: <1441208719-31336-1-git-send-email-ian.jackson@eu.citrix.com>
To: xen-devel@lists.xenproject.org
Cc: Ian Jackson, Ian Campbell
List-Id: xen-devel@lists.xenproject.org

This solves a performance problem with the existing planner.

The problem is that with a large installation, and a big queue, a
full plan can take a long time to prepare.  (In our current
installation, perhaps as long as half an hour.)  Any resource which
becomes free during one plan run cannot be allocated to a new job
until the next plan run starts.  This means resources (test machines)
are often sitting around idle.

Fix this by restarting the planning process as soon as any new
resource becomes free.  This means that jobs at the front of the
queue get a chance to allocate the newly freed resource right away,
so it will probably be allocated soon.  If it is only interesting to
jobs later in the queue, then there may be a delay in reallocating
it, but presumably the resource is not much in demand and those later
jobs will allocate it when they get a bit closer to the head.

But there is a problem with this: it means that the plan is generally
never completed.  So we no longer have an overview of when the
various flights will finish and what the overall queue looks like.

We solve this problem by running a second instance of the planner
algorithm, all the way to completion, in a `dummy' mode where no
actual resource allocation takes place.  This second `projection'
instance comes into being whenever the main `plan' instance is
restarted, and it inherits the planning state from the main `plan'
instance.

Global livelock (where we keep restarting the plan but never manage
to allocate anything) is not possible, because each restart involves
a new resource becoming free.  If nothing gets allocated because we
can't get that far before being restarted, then eventually there will
be nothing left allocated to become newly free.

Starvation, of a form, is possible: a late-in-queue job which wants a
resource available right now might have difficulty allocating it,
because the planner is spending its effort rescheduling
early-in-queue jobs which want resources which are in greater demand,
so that the late-in-queue job is never reached.  Arguably this is an
appropriate allocation of planning time.

With this arrangement we can generate two reports: a `plan' report
containing the short-term plan which was used for actual resource
allocation, and which is frequently restarted and therefore not
necessarily complete; and a `projection' report which contains a
complete plan for all the work the system is currently aware of, but
which is updated less frequently.
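
For intuition only, here is a minimal, self-contained Tcl sketch of
that split (the procedure and job names are hypothetical; this is not
the osstest code): the `plan' walk allocates and may be abandoned
part-way down the queue, while the `projection' walk is the same
traversal in a no-allocation mode and so always runs to the end.

    proc plan-walk {queue freedvar} {
        # Walk the queue, allocating as we go; abandon the walk as
        # soon as the caller reports that a resource has become free.
        upvar 1 $freedvar freed
        set planned {}
        foreach job $queue {
            if {$freed} { return [list restarted $planned] }
            lappend planned [list $job allocate]
        }
        return [list complete $planned]
    }

    proc projection-walk {queue} {
        # The same traversal in `dummy' mode: nothing is allocated,
        # so it can always run to the end and give a full overview.
        set projected {}
        foreach job $queue {
            lappend projected [list $job noalloc]
        }
        return [list complete $projected]
    }

    set freed 1   ;# pretend a host was freed during the plan walk
    puts [plan-walk {job-a job-b job-c} freed]   ;# -> restarted {}
    puts [projection-walk {job-a job-b job-c}]   ;# complete projection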

Because planner clients do not contain the planning algorithm state,
the only client change needed is the ability to run in a `dummy' mode
without actual allocation; this is the `noalloc' feature earlier in
this series.

The main work is in ms-queuedaemon.  We have prepared the ground for
multiple instances of the planning algorithm; from the point of view
of ms-queuedaemon, an instance of the planning algorithm is mainly a
walk over the job queue.  So we call them `walkers'.

Therefore, what we do here is introduce a new `projection' walker, as
follows:

 * Add `projection' to the global list of possible walkers.

 * Invent a new section of code, the `restarter', which is
   responsible for managing the relationship between the two walkers.
   (It uses direct knowledge of the queue state data structures,
   etc., to avoid having to invent a complete formal interface to a
   walker.)

 * If we ever finish the plan walker's queue, we update both the
   projection report output and the plan report output, from the same
   plan.  Finishing the projection walker's queue means we have a
   complete projection, but we don't touch the plan.

In principle the plan walker might overtake the projection walker,
complete, and write out a complete and up-to-date plan as the
projection, and the projection walker might then complete and
overwrite the projection with less up-to-date information.  We don't
explicitly exclude this.  Of course such a result will be rectified
soon enough by another planning run.

The restarter can ask the database for the list of
currently-available resources, and can therefore detect when
resources become newly free.

The rest of the code remains largely ignorant of the operation of the
restarter.  There are a few hooks:

 * runneeded-perhaps-start notifies the restarter when we start the
   plan; the restarter uses this to record the set of free resources
   at the start of a planning run, so that it can see later whether
   any /new/ resources have become free.

 * restarter-maybe-provoke-restart is called when we get a
   notification from the owner daemon that resources may have become
   idle.  We look for newly-idle resources, and if there are any, and
   we are running the plan walker, we directly edit the plan walker's
   queue to put RESTART at the front.

 * queuerun-perhaps-step spots the special entry RESTART in its queue
   and calls back into the restarter when it finds it.  This deferred
   approach is necessary because we can't do the restart operation
   while a client is thinking: we would have to change that client's
   cogitation from the `live, can allocate' mode to the `dummy,
   cannot allocate' mode, and that would make the code more complex.

 * The main work is done in the restarter-restart-now hook.  It
   reports the current (incomplete) plan, and then checks to see if a
   projection walker is running.  If it is, it leaves it alone, and
   simply abandons the current plan run and arranges for a new run to
   be started.  If a projection walker is not running, it copies all
   the plan walker's state (including the data-plan.pl disk file
   containing the plan-in-progress) to the projection walker, and
   sets the projection walker going.
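
As a rough, self-contained Tcl illustration of that
snapshot-and-restart mechanism (simplified stand-ins, not the actual
osstest hooks): free resources are recorded when a walk starts, a
later notification is compared against that snapshot, and any
newly-free resource pushes the RESTART sentinel onto the running
queue for the stepper to act on at the next safe point.

    proc snapshot-free {freelist} {
        # Record which resources were free when the plan walk started.
        global wasfree
        array unset wasfree
        foreach r $freelist { set wasfree($r) 1 }
    }

    proc maybe-provoke-restart {freelist} {
        # Anything free now that was not free at the snapshot causes
        # the RESTART sentinel to be put at the front of the queue.
        global wasfree queue_running
        set newly {}
        foreach r $freelist {
            if {[info exists wasfree($r)]} continue
            lappend newly $r
            set wasfree($r) 1
        }
        if {![llength $newly]} { return {} }
        if {[lindex $queue_running 0] ne "RESTART"} {
            set queue_running [linsert $queue_running 0 RESTART]
        }
        return $newly
    }

    proc step {} {
        # The stepper notices the sentinel at the next safe point,
        # i.e. when no client is in the middle of thinking.
        global queue_running
        set next [lindex $queue_running 0]
        set queue_running [lrange $queue_running 1 end]
        if {$next eq "RESTART"} { return restart-now }
        return [list consider $next]
    }

    set queue_running {job-1 job-2}
    snapshot-free {host-a/0}
    maybe-provoke-restart {host-a/0 host-b/0}   ;# host-b/0 newly free
    puts [step]                                  ;# -> restart-now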

Signed-off-by: Ian Jackson
---
 README.planner |    8 +++++
 ms-queuedaemon |   98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/README.planner b/README.planner
index 24185ce..c1b2bf6 100644
--- a/README.planner
+++ b/README.planner
@@ -76,6 +76,14 @@ that newly-freed resources are properly offered first to the tasks at the
 front of the queue.  ms-ownerdaemon sets all idle resources to
 allocatable at the start of each planning cycle.
 
+The planner actually sometimes runs two planning cycles: if resources
+become free while the planner is running, it will restart the planning
+cycle in an effort to get those resources into service.  But, it will
+leave the existing planning run going in a projection-only mode (where
+no resources actually get allocated), so that there is a report for
+the administrator showing an idea of what the system thinks may happen
+in the more distant future.
+
 ms-ownerdaemon and `ownd' tasks
 -------------------------------
 
diff --git a/ms-queuedaemon b/ms-queuedaemon
index d2aabf4..425b98f 100755
--- a/ms-queuedaemon
+++ b/ms-queuedaemon
@@ -21,7 +21,7 @@
 
 source ./tcl/daemonlib.tcl
 
-set walkers {plan}
+set walkers {plan projection}
 
 proc walker-globals {w} {
     # introduces queue_running, thinking[_after] for the specific walker
@@ -169,12 +169,19 @@ proc runneeded-perhaps-start {} {
     log "runneeded-perhaps-start starting cleaned=$cleaned"
 
     runneeded-2-requeue
+    restarter-starting-plan-hook
     queuerun-start plan
 }
 
 proc queuerun-finished/plan {} {
     runneeded-ensure-will 0
     report-plan plan plan
+    report-plan plan projection
+}
+
+proc queuerun-finished/projection {} {
+    runneeded-ensure-will 0
+    report-plan projection projection
 }
 
 proc runneeded-ensure-polling {} {
@@ -255,6 +262,11 @@ proc queuerun-perhaps-step {w} {
     }
 
     set next [lindex $queue_running 0]
+    if {![string compare RESTART $next]} {
+        lshift queue_running
+        restarter-restart-now
+    }
+
     set already [we-are-thinking $next]
     if {[llength $already]} {
         # $already will wake us via walkers-perhaps-queue-steps
@@ -378,9 +390,90 @@ proc cmd/unwait {chan desc} {
     puts-chan $chan "OK unwait $res"
 }
 
+#---------- special magic for restarting the plan ----------
+
+proc for-free-resources {varname body} {
+    jobdb::transaction resources {
+        pg_execute -array free_resources_row dbh {
+            SELECT (restype || '/' || resname || '/' || shareix) AS r
+              FROM resources
+             WHERE NOT (SELECT live FROM tasks WHERE taskid=owntaskid)
+             ORDER BY restype, resname
+        } [list uplevel 1 \
+           "[list upvar #0 free_resources_row(r) $varname]; $body"]
+    }
+}
+
+proc restarter-starting-plan-hook {} {
+    global wasfree
+    catch { unset wasfree }
+    for-free-resources freeres {
+        set wasfree($freeres) 1
+    }
+}
+
+proc restarter-maybe-provoke-restart {} {
+    set newly_free {}
+    global wasfree
+    for-free-resources freeres {
+        if {[info exists wasfree($freeres)]} continue
+        lappend newly_free $freeres
+        set wasfree($freeres) 1
+    }
+    if {![llength $newly_free]} {
+        log-event "restarter-maybe-provoke-restart nothing"
+        return
+    }
+
+    walker-runvars plan
+
+    if {!([info exists queue_running] && [llength $queue_running])} {
+        log-event "restarter-maybe-provoke-restart not-running ($newly_free)"
+        return
+    }
+
+    log-event "restarter-maybe-provoke-restart provoked ($newly_free)"
+
+    if {[string compare RESTART [lindex $queue_running 0]]} {
+        set queue_running [concat RESTART $queue_running]
+    }
+    after idle queuerun-perhaps-step plan
+}
+
+proc restarter-restart-now {} {
+    # We restart the `plan' walker.  Well, actually, if the
+    # `projection' walker is not running, we transfer the `plan'
+    # walker to it.  At this stage the plan walker is not thinking so
+    # there are no outstanding callbacks to worry about.
+
+    report-plan plan plan
+
+    global projection/queue_running
+    global plan/queue_running
+
+    if {![info exists projection/queue_running]} {
+        log-event "queuerun-restart-now projection-idle continue-as"
+        set projection/queue_running [set plan/queue_running]
+        file rename -force data-plan.pl data-projection.pl
+        after idle queuerun-perhaps-step projection
+    } else {
+        log-event "queuerun-restart-now projection-running"
+    }
+    unset plan/queue_running
+    runneeded-ensure-will 2
+}
+
 proc notify-to-think {w thinking} {
     for-chan $thinking {
-        puts-chan $thinking "!OK think"
+        switch -glob $w.[info exists info(feature-noalloc)] {
+            plan.* { puts-chan $thinking "!OK think" }
+            projection.1 { puts-chan $thinking "!OK think noalloc" }
+            projection.0 {
+                # oh well, can't include it in the projection; too bad
+                queuerun-step-done $w "!feature-noalloc"
+            }
+        }
     }
 }
 
@@ -519,6 +612,7 @@ proc await-endings-notified {} {
             error "$owndchan eof"
         }
         runneeded-ensure-will 2
+        restarter-maybe-provoke-restart
     }
 }
 
-- 
1.7.10.4