Yi-Cheng Huang, Bart Selman, Henry Kautz:

Learning Declarative Control Rules for State-Space Planning

 

 

1.    Introduction

2.    Learning Framework

3.    An Example

4.    Experimental Results

 

 


1.            Introduction

 

Main problem: Deterministic state-space planning is generally PSPACE-complete.

But: for some domains, polynomial-time algorithms exist.

► Create efficient planners by automatically learning domain-specific rules (or examples) to control a general search engine.

 

Limited success due to

·       poor scalability of traditional domain-independent systems

·       focus on a particular planner: no reusability

 

 

 

 

New Hope:

Planners that treat planning as large constraint-solving problems (e.g., Graphplan, SATPLAN)

 

Pros:

·       we can add control knowledge in purely declarative form

·       this knowledge can be reused across planners

 

Next Question:

Can this control knowledge be learned automatically?

It seems so! (‘initial positive results’)

 

 

 

 

 

Basic Approach:

·       Small number of training examples

·       Rules generated with an ILP algorithm based on FOIL

·       No explicit background knowledge

 

Domain:

·       Goal: move packages to cities

·       Means: trucks or airplanes

·       Actions: move truck or airplane, unload vehicle, load package into vehicle

·       Plan quality measure: minimise # of time steps, then # of actions

 


 

2.            Learning Framework


How does the learning module work?

1.     Extraction of training examples from optimal or near-optimal plan

2.     ILP algorithm run on training examples (with help of inferred types)

3.     Verification of the learned rules against the next plan (see the sketch below)
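
A rough Python sketch of this three-step loop (the planner and learner interfaces, i.e. solve, extract_examples, ilp_learn, infer_types, consistent_with, are hypothetical names, not the paper's actual code):

def learning_module(training_problems, planner):
    rules = []
    for problem, next_problem in zip(training_problems, training_problems[1:]):
        plan = planner.solve(problem)                            # optimal or near-optimal plan
        examples = extract_examples(plan)                        # 1. extract training examples
        candidates = ilp_learn(examples, infer_types(problem))   # 2. FOIL-style rule induction
        next_plan = planner.solve(next_problem)
        rules += [r for r in candidates
                  if consistent_with(r, next_plan)]              # 3. verify against the next plan
    return rules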

 

 

What is learned?

·       Prolog-like rules (in simple temporal logic) that define whether an action should be selected or rejected.

► There are select rules and reject rules.

 

Both come in two different flavours:

·      Static rules: depending only on the initial and goal states

·      Dynamic rules: depending on the current state

 

Examples (which type of rule is each?):

·       Unload a package from a truck at the package’s goal location.

·       Never unload a package from an airplane at an airport not in the package’s goal city.

·       Do not move an airplane if it contains a package that needs to be unloaded at that airport.

 

Which type of rule is missing (and why)?

 

Extraction of training examples is done heuristically:

If an action appears in an optimal solution, it must be selected; if not, it must be rejected.

 

 

This is quite noisy, so we need techniques to reduce the noise.

 

Categorising training examples (i.e., actions at time steps)

At time step i, an action is either

·       Real,

·       Virtual, or

·       Mutex virtual

 

                Select rule                        Reject rule
                Positive         Negative          Positive         Negative
                example          example           example          example

Static          real             virtual           virtual          real

Dynamic         real             mutex virtual     mutex virtual    real

 

Why is the set of training examples restricted for dynamic rules?

·       It is hard to learn good dynamic rules as there are more of them that are consistent with a given plan.

·       Mutex virtual actions tend to be more relevant than other virtual actions.
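
The sketch below (Python, with assumed names and data structures) encodes the heuristic and the table above: actions are first labelled real, virtual, or mutex virtual, and the labels are then mapped to positive and negative examples depending on the kind of rule being learned.

POLARITY = {   # row/column entries of the table above
    ("static",  "select"): {"real": "+", "virtual": "-"},
    ("static",  "reject"): {"virtual": "+", "real": "-"},
    ("dynamic", "select"): {"real": "+", "mutex virtual": "-"},
    ("dynamic", "reject"): {"mutex virtual": "+", "real": "-"},
}

def training_examples(labelled_actions, rule_kind, rule_polarity):
    """labelled_actions: iterable of ((time, action), label) pairs;
    labels not used for the chosen rule type are simply dropped."""
    mapping = POLARITY[(rule_kind, rule_polarity)]
    return [((time, action), mapping[label])
            for (time, action), label in labelled_actions
            if label in mapping]

With labels computed for a plan, training_examples(labels, "static", "reject") yields examples of the kind shown in the table of Section 3.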

 

Rule Induction

 

Algorithm based on Quinlan’s FOIL:

 

Rules = {}
Remaining = examples for the target concept R
while Remaining ≠ {}
     rule = R ←        (i.e., R with an empty body)
     while rule covers negative examples
            Add to the body of rule the literal L that maximises gain
     Remove from Remaining the examples covered by rule
     Add rule to Rules
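
A minimal Python sketch of this covering loop; the rule representation and the helpers (empty_rule, extend, candidate_literals, covers, gain) are passed in as assumptions rather than taken from the paper:

def induce_rules(positives, negatives, empty_rule, extend,
                 candidate_literals, covers, gain):
    """Greedy FOIL-style covering: learn rules until all positives are covered."""
    rules = []
    remaining = set(positives)                           # positives not yet covered by any rule
    while remaining:
        rule = empty_rule                                # the clause "R <-" with no body yet
        covered_neg = {e for e in negatives if covers(rule, e)}
        while covered_neg:                               # specialise until no negatives are covered
            best = max(candidate_literals(rule),
                       key=lambda lit: gain(extend(rule, lit), remaining, covered_neg))
            rule = extend(rule, best)                    # add the literal that maximises gain
            covered_neg = {e for e in covered_neg if covers(rule, e)}
        remaining = {e for e in remaining if not covers(rule, e)}
        rules.append(rule)
    return rules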

 

 

Separate learning of static and dynamic rules:

·       Static rules may contain equality or disequality literals, static predicate literals, and goal literals, but no fluent literals.

·       Dynamic rules must contain at least one fluent literal.

 

Criteria for choosing body literals (see the sketch after this list):

·       The literal with greatest gain if this gain is close to the maximum possible; otherwise

·       all determinate literals found; otherwise

·       the literal with the highest gain; otherwise

·       the first literal considered that introduces a new variable.
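
A hedged sketch of this selection policy; gain, extend, is_determinate, and introduces_new_variable are assumed helpers, and the 80% "close to the maximum" threshold is an illustrative choice, not a value from the paper:

def choose_body_literals(rule, candidates, pos, neg,
                         gain, extend, is_determinate, introduces_new_variable):
    current = gain(rule, pos, neg)
    scored = [(gain(extend(rule, lit), pos, neg), lit) for lit in candidates]
    best_gain, best = max(scored, key=lambda pair: pair[0])
    max_possible = (len(pos) + 1) / (len(pos) + 2)       # best achievable score under the
                                                         # Laplace gain defined below
    if best_gain >= 0.8 * max_possible:                  # (1) gain close to the maximum possible
        return [best]
    determinate = [lit for lit in candidates if is_determinate(rule, lit)]
    if determinate:                                      # (2) all determinate literals found
        return determinate
    if best_gain > current:                              # (3) the highest-gain literal
        return [best]                                    #     (fall-through condition assumed)
    return [next(lit for lit in candidates               # (4) first literal introducing
                 if introduces_new_variable(rule, lit))] #     a new variable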

 

Gain function:

Gain(r) = (p + 1) / (p + n + 2)   (Laplace estimate),

where r is the candidate rule, p the number of positive examples covered by r, and n the number of negative examples covered by r.
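
For example (illustrative numbers): a rule covering p = 2 positives and n = 0 negatives scores (2 + 1) / (2 + 0 + 2) = 0.75, while a rule covering p = 20 and n = 1 scores (20 + 1) / (20 + 1 + 2) ≈ 0.91, so the broader rule is preferred even though it admits one negative example.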

 

Advantage: rules with low coverage are penalised, which makes the learner more robust against noise in the training set.

 

 


3.   An Example

 

Initial:   (at o1 apt-A), (at o2 apt-B), (at pln apt-A),
(at trk-C apt-C), (in-city apt-A A), (in-city po-A A), ...

Goal:     (at o1 po-C), (at o2 po-C)

Plan:   1) Load-Airplane (o1 pln apt-A)
2) Fly-Airplane (pln apt-A apt-B)
3) Load-Airplane (o2 pln apt-B)
4) Fly-Airplane (pln apt-B apt-C)
5) Unload-Airplane (o1 pln apt-C)
5) Unload-Airplane (o2 pln apt-C)
6) Load-Truck (trk-C o1 apt-C)
6) Load-Truck (trk-C o2 apt-C)
7) Drive-Truck (trk-C apt-C po-C)
8) Unload-Truck (trk-C o1 po-C)
8) Unload-Truck (trk-C o2 po-C)

 

 

Suppose we want to learn a static reject rule for the action
Unload-Airplane(o p a).

 

Extracted Training Examples:

 

 +/-   time   o     p     a       c    l
  +     2     o1    pln   apt-A   A    po-C
  +     3     o1    pln   apt-B   B    po-C
  +     4     o1    pln   apt-B   B    po-C
  +     4     o2    pln   apt-B   B    po-C
  -     5     o1    pln   apt-C   C    po-C
  -     5     o2    pln   apt-C   C    po-C

 

 

Learning process:

1.     Add to the rule body the determinate literals (in-city a c) and
(goal (at o l)).

2.     Next, a literal is found to give the highest possible gain: ¬(in-city l c).

3.     Now the rule covers only positive examples and no negative ones: the algorithm terminates with the rule:

 

¬ Unload-Airplane (o p a) ←
     (in-city a c) ∧ (goal (at o l)) ∧ ¬(in-city l c)

 

Translated into natural language, this rule reads: “do not unload a package from an airplane at an airport that is not in the package’s goal city”.
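
As an illustrative check (Python, with a hypothetical encoding of the in-city facts, not the paper's code), the learned rule covers exactly the positive examples in the table above:

IN_CITY = {"apt-A": "A", "apt-B": "B", "apt-C": "C", "po-C": "C"}   # (in-city loc city)

def rejects_unload(airport, goal_location):
    # rule body: (in-city a c) and (goal (at o l)) and not (in-city l c)
    return IN_CITY[goal_location] != IN_CITY[airport]

examples = [  # (airport a, goal location l, class) rows from the table
    ("apt-A", "po-C", "+"), ("apt-B", "po-C", "+"), ("apt-B", "po-C", "+"),
    ("apt-B", "po-C", "+"), ("apt-C", "po-C", "-"), ("apt-C", "po-C", "-"),
]
assert all(rejects_unload(a, l) == (cls == "+") for a, l, cls in examples)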

 

 

4.    Experimental Results

 

(See the original paper. Table 5: learning and verification times for the rules in several domains; Table 6: results of running Blackbox with and without the learned rules on some larger and harder problems from those domains.)

·       With the learned rules, planning is sped up, sometimes by a few orders of magnitude.

·       Sometimes even better-quality plans are generated.

·       Compared with hand-coded rules, the system found all the static reject rules used in TLPlan, but failed to find some of the dynamic reject rules.