Announcements Homework 3 due today (grace period through Friday) Midterm Friday! Fundamental question for this lecture (and really this whole class!): How do you turn a real-world problem into an AI solution? AI Agents and Environments Much (though not all!) of AI is concerned with agents operating in environments. Agent an entity that perceives and acts Environment the problem setting Fleshing it out

Performance measuring desired outcomes Environment what populates the task's world? Actuators what can the agent act with? Sensors how can the agent perceive the world? What makes an Agent? Agent an entity that perceives its environment through sensors, and acts on it with actuators. Actions are constrained by Actuators + Environment. Percepts are constrained by Sensors + Environment. Agent Function how does it choose the action? What have we done so far? State-based search Determining an optimal sequence of

actions to reach the goal Choose actions using knowledge about the goal Assumes a deterministic problem with known rules Single agent only Uninformed search: BFS/DFS/UCS Breadth-first search Good: optimal, works well when many options, but not many actions required Bad: assumes all actions have equal cost Depth-first search Good: memory-efficient, works well when few options, but lots of actions required Bad: not optimal, can run infinitely, assumes all actions have equal cost Uniform-cost search Good: optimal, handles variable-cost actions Bad: explores all options, no information about goal location
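The three uninformed strategies above differ only in how the fringe is ordered. A minimal sketch of UCS (the graph, state names, and the `uniform_cost_search` helper are illustrative, not from the lecture; BFS and DFS are the same loop with a FIFO queue or LIFO stack in place of the heap):

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """UCS: fringe is a priority queue keyed on cumulative path cost.
    BFS/DFS use the same loop with a FIFO queue / LIFO stack instead."""
    fringe = [(0, start, [start])]          # (cost so far, state, path)
    visited = set()
    while fringe:
        cost, state, path = heapq.heappop(fringe)
        if state == goal:
            return path, cost
        if state in visited:
            continue
        visited.add(state)
        for nxt, step_cost in graph.get(state, []):
            if nxt not in visited:
                heapq.heappush(fringe, (cost + step_cost, nxt, path + [nxt]))
    return None, float("inf")

# Hypothetical state space: S -> G directly costs 10; via a, b it costs 3.
graph = {"S": [("a", 1), ("G", 10)], "a": [("b", 1)], "b": [("G", 1)]}
print(uniform_cost_search(graph, "S", "G"))  # finds the cheaper indirect path
```

This illustrates why UCS is optimal for variable-cost actions: the direct edge to G is found first but expanded last, because cheaper partial paths dominate the fringe.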

Informed search: A* A* uses both backward costs and (estimates of) forward costs A* is optimal with admissible / consistent heuristics Heuristic design is key: often use relaxed problems What have we done so far? Adversarial state-based search Determining the best next action, given what opponents will do Choose actions using knowledge about the goal Assumes a deterministic problem with known rules Multiple agents, but in a zero-sum competitive game Adversarial Search (Minimax) Minimax search: A state-space search tree Players alternate turns Compute each node's minimax value:

the best achievable utility against a rational (optimal) adversary Use alpha-beta pruning for efficiency Can have multiple minimizing opponents Choose actions that yield best subtrees! Minimax values: computed recursively.

Terminal values: part of the game. What have we done so far? Knowledge-based agents Using existing knowledge to infer new things about the world Determining best next action, given changes to the world Choose actions using knowledge about the world Assumes a deterministic problem; may be able to infer rules Any number of agents, but limited to KB contents
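The recursive minimax computation from the adversarial-search review can be sketched as follows (a minimal illustration on a hypothetical two-ply tree, not code from the lecture):

```python
def minimax(node, maximizing):
    """Minimax value: best achievable utility against an optimal adversary.
    A node is either a terminal utility (a number) or a list of child nodes."""
    if isinstance(node, (int, float)):   # terminal value: part of the game
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Hypothetical tree: a max root over two min nodes.
tree = [[5, 6], [2, 8]]
print(minimax(tree, maximizing=True))  # min(5,6)=5, min(2,8)=2 -> max is 5
```

Alpha-beta pruning adds bounds to this same recursion so that subtrees that cannot affect the root value are skipped.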

Logical agents [Agent diagram: Percepts via Sensors, Actions via Actuators, mediated by a Knowledge Base] Knowledge Base Contains sentences describing the state of the world

Supports inference and derivation Dynamic; changes as a result of agent interactions with the environment! Summary: Knowledge-based agents Use knowledge about the world to choose actions Inference with existing knowledge + new observations Resolution, forward-backward chaining, instantiation and unification, etc. Knowledge represented in knowledge bases Contain statements about the world Structured with an ontology Represents how different kinds of objects/events/etc are categorized Supports higher-level inference Designed for a particular set of problems What have we done so far? Reinforcement learning agents Iteratively update estimates of state/action pair expected utilities

Adapting to random outcomes with expectation Choose actions using learned information from the world Handles stochastic and unknown problems Focused on learning process of a single agent Reinforcement Learning [Agent/Environment loop: agent receives state s and reward r, takes actions a] Basic idea:

Receive feedback in the form of rewards Agent's utility is defined by the reward function Must (learn to) act so as to maximize expected rewards All learning is based on observed samples of outcomes! Generalized Policy Iteration Evaluation: For fixed current policy π, use N starts into random (s0, a0) and run the policy to halt. Then, for each (s,a), go through the N episodes and find the M episodes that use (s,a). Update Q(s,a) with the average of discounted rewards starting at (s,a) in each episode S_i: Q_{k+1}(s,a) ← (1/M) Σ_{i=1}^{M} G_i(s,a), where G_i(s,a) is the discounted return following (s,a) in episode S_i

Improvement: With sampled Q values, get a better policy using policy extraction: π_{k+1}(s) = argmax_a Q(s,a) On-policy vs off-policy learning In Sarsa, we choose a specific next action a′ for updating Q(s,a). This is called on-policy learning, because we're using the current policy to help decide an action for learning from. Like policy iteration last time We can also look at all possible next actions A(s′), and use the best one to update Q(s,a). This is off-policy learning, because the action we use for updating is separate from the action we actually take. Similar to value iteration, but active Key things to know about

Intelligent Agents How do we formulate AI problems? What is the structure of an agent? Search How do uninformed search methods work? How do informed search methods work? Compare to uninformed What makes a good heuristic? How do we deal with opposing agents? Key things to know about Logic What methods are available to us for inference? What different formalisms do we know for representing knowledge? Why is structuring our knowledge useful? Decision Processes How do we handle stochastic outcomes of actions? How do we learn from observed experiences?

Search Problems Definition of Search Finding a (best) sequence of actions to solve a problem For now, assume the problem is Deterministic Fully observable Known Search Problem Mechanics A search problem consists of: A state space A successor function (with actions, costs)

A start state and a goal test A solution is a sequence of actions (a plan) which transforms the start state into a goal state State Space Graphs vs. Search Trees Each NODE in the search tree corresponds to an entire PATH in the state space graph. [State space graph and search tree example omitted] We construct both on demand, and we construct as little as possible. Uninformed search methods Breadth-first search Depth-first search Uniform cost search Breadth-First Search Strategy: expand a

shallowest node first Implementation: Fringe is a FIFO queue [Search-tier expansion diagram omitted] Depth-First Search Strategy: expand a deepest node first

Implementation: Fringe is a LIFO stack [Search-tree expansion diagram omitted] Uniform Cost Search Strategy: expand a cheapest node first:

Fringe is a priority queue (priority: cumulative cost) [Cost-contour expansion diagram omitted] BFS/DFS/UCS Breadth-first search Good: optimal, works well when many options, but not many actions required Bad: assumes all actions have equal cost Depth-first search Good: memory-efficient, works well when few options, but lots of actions required

Bad: not optimal, can run infinitely, assumes all actions have equal cost Uniform-cost search Good: optimal, handles variable-cost actions Bad: explores all options, no information about goal location Informed search How to efficiently solve search problems with variable-cost actions, using information about the goal state? Heuristics Greedy approach A* search Search Heuristics A heuristic is: A function that estimates how close a state is to a goal Designed for a particular search problem Examples: Manhattan distance, Euclidean distance for

pathing Note that the heuristic is a property of the state, not the action taken to get to the state! Admissible Heuristics A heuristic h is admissible (optimistic) if: 0 ≤ h(n) ≤ h*(n), where h*(n) is the true cost to a nearest goal Coming up with admissible heuristics is most of what's involved in using A* in practice. A* Search Uniform-cost orders by path cost, or backward cost g(n) Greedy orders by goal proximity, or forward cost h(n)

A* Search orders by the sum: f(n) = g(n) + h(n) [Worked g/h example omitted; example due to Teg Grenager] A*: Summary A* uses both backward costs and (estimates of) forward costs A* is optimal with admissible / consistent heuristics Heuristic design is key: often use relaxed problems Adversarial Search (Minimax) Deterministic, zero-sum games: Tic-tac-toe, chess, checkers One player maximizes result The other minimizes result Minimax search: A state-space search tree Players alternate turns Compute each node's minimax value: the best achievable utility against a rational (optimal) adversary Minimax values:

computed recursively Terminal values: part of the game Minimax Values [Example tree omitted: states under the agent's control take the max over children; states under the opponent's control take the min; terminal states have fixed values] Search Summary Formulating search problems Goal test, state space, successor function

Single-agent search Uninformed approaches: BFS, DFS, UCS Informed: Greedy, A* Heuristic design Adversarial multi-agent search Minimax and alpha-beta pruning Evaluation function design Not the only kind of multi-agent search! Search vs Logic Search What is the best action sequence to achieve a goal? Space of atomic states Successor function defined for each state Fully enumerates each possible successor Guided by knowledge about the goal Search vs Logic

Logic Reason about the world to identify good actions Space of attributive states Successor function defined over attribute values Successors effectively partial states (what has changed?) Guided by knowledge about the world Example state: Pos: [1,1], Dot1: True, Dot2: True; successors as partial states: Pos: [1,2], Dot4: False / Pos: [2,1], Dot8: False Logical agents [Agent diagram: Percepts via Sensors, Actions via Actuators, mediated by a Knowledge Base] Knowledge Base Contains sentences describing the state of the world Supports inference and derivation Dynamic; changes as a result of agent interactions with the environment! Knowledge Bases Model what the agent knows about the world

Store different kinds of knowledge Background knowledge the rules of the world Note that if this is an unknown problem, the agent may need to discover the rules! Typically does not change due to agent actions Situational knowledge what is currently true Describes the current state of the world Change from state to state, as a result of agent actions (successor function) Remember: KB sentences are the agent's model of the world, not the true world state! Logical Entailment Sentence A entails sentence B iff in every model where A is true, B is also true All humans are mortal Socrates is human Socrates is mortal
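Entailment can be checked mechanically by enumerating models: the KB entails a query iff the query holds in every model where the KB holds. A minimal propositional sketch (the symbols, the `entails` helper, and the propositionalized Socrates example are illustrative, not from the lecture):

```python
from itertools import product

def entails(kb, query, symbols):
    """KB entails query iff query is true in every model satisfying KB."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb(model) and not query(model):
            return False  # a model where KB holds but the query fails
    return True

# Propositionalized toy version of the Socrates example:
# (human -> mortal) and human should entail mortal.
kb = lambda m: (not m["human"] or m["mortal"]) and m["human"]
print(entails(kb, lambda m: m["mortal"], ["human", "mortal"]))  # True
```

This brute-force model check is exponential in the number of symbols, which is one motivation for the inference rules on the next slides.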

Proof by Resolution General idea: 1. Apply inference rules 2. Use complementary literals to simplify clauses 3. Rinse, repeat Forward-Backward Chaining Forward Chaining Reason forward to infer new facts from existing knowledge Backward Chaining Given a query, reason backward to find a chain of inference to prove it
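Forward chaining over definite clauses can be sketched as repeatedly firing any rule whose premises are all known, until nothing new can be inferred (the rule set and fact names here are hypothetical):

```python
def forward_chain(facts, rules):
    """Apply rules (premises, conclusion) until no new facts appear."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)   # infer a new fact
                changed = True
    return known

rules = [({"breeze", "adjacent"}, "possible_pit"),
         ({"possible_pit"}, "avoid_square")]
print(forward_chain({"breeze", "adjacent"}, rules))
```

Backward chaining runs the same rules in the other direction: start from the query, and recursively try to prove each premise.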

Expressiveness of Propositional Logic Want to say: There is a breeze in a square if it is next to a pit. With Propositional Logic, must exhaustively list possibilities First-Order Logic allows much more natural expression: First-Order vs Propositional First-Order Logic also involves statements about individual facts Includes three changes for better expressivity: 1. Objects 2. Quantifiers 3. Relations and Functions Allow FOL to better express statements about the real world! FOL Models Models (aka possible worlds) in first-order logic contain two kinds of

things: Objects the specific things we're talking about Wumpus, adventurer, gold, squares, arrow The set of these is the model's domain Must be nonempty! Have no meaning in and of themselves! Relations sets of 1+ tuples of objects related in a certain way Carry = { (adventurer, arrow) } Binary, only one pair Adjacent = { ([1,1], [1,2]), ([1,1], [2,1]), }

Binary, many pairs Adventurer = { (adventurer) } Unary, only true for one object Like a property Quantifiers Quantifiers allow expressing properties of collections of objects Universal quantification For every object that satisfies the antecedent, the consequent is true. Equivalent to All squares next to the wumpus are smelly. Almost always expressed as implications; otherwise not very informative. Quantifiers Quantifiers allow expressing properties of collections of objects Existential quantification There is some object that satisfies the logical statement. May be a single relation/conjunction/disjunction, e.g., The wumpus is located in some square. Or an implication

Fatal food exists. Order Matters! Everybody loves somebody Somebody loves everybody Logical Inference Summary Main point: it's still search! Primarily DFS This is how unification is implemented in Prolog Conjunction/disjunction/implication yield graph structures; can then use graph search. Questions for logical agents 1. What kinds of knowledge do we need? 2. How much knowledge do we need? 3. How do we structure it effectively?
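The unification step at the heart of Prolog-style inference can be sketched as follows (a minimal illustration under assumed conventions: terms are nested tuples, variables are strings starting with "?", and the occurs check is omitted for brevity):

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def unify(x, y, subst):
    """Return a substitution (dict) making x and y equal, or None."""
    if subst is None:
        return None
    if x == y:
        return subst
    if is_var(x):
        return unify_var(x, y, subst)
    if is_var(y):
        return unify_var(y, x, subst)
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for xi, yi in zip(x, y):
            subst = unify(xi, yi, subst)  # unify argument lists element-wise
        return subst
    return None

def unify_var(var, t, subst):
    if var in subst:
        return unify(subst[var], t, subst)
    new = dict(subst)
    new[var] = t                          # bind the variable
    return new

# Loves(?x, Socrates) unifies with Loves(Plato, ?y):
print(unify(("Loves", "?x", "Socrates"), ("Loves", "Plato", "?y"), {}))
```

Inference then becomes graph search over clauses, with unification deciding which edges exist.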

Structuring knowledge representation Ontology Structure/framework of concepts Database tables, essentially Knowledge Base Assertional facts about the world Database rows Note that database columns may be designed to encode relational info (e.g., CostOf); this is not like ontologies! Reasoning with categories Inheritance and classification make inference much more powerful Inheritance objects have the properties of their categories Classification categories can be recognized by object properties How do you pick a representation scheme?

Could attempt to make a general ontology for everything People keep trying; it hasn't worked yet! Too much information to handle with finite computation, anyway Better: use design considerations of problem What kinds of objects/events/measures/etc are relevant? Are there properties of the problem that suggest an organization? What sort of inferences do we need to support? Summary: Knowledge-based agents Use knowledge about the world to choose actions Inference with existing knowledge + new observations Resolution, forward-backward chaining, instantiation and unification, etc. Knowledge represented in knowledge bases Contain statements about the world Structured with an ontology Represents how different kinds of objects/events/etc are categorized Supports higher-level inference Designed for a particular set of problems
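The inheritance idea from "Reasoning with categories" can be sketched as property lookup that falls back to parent categories (the ontology contents and the `lookup` helper are hypothetical):

```python
# Each category maps to (parent, properties); None marks the root.
ontology = {
    "animal":  (None,     {"alive": True}),
    "bird":    ("animal", {"can_fly": True}),
    "penguin": ("bird",   {"can_fly": False}),  # overrides the category default
}

def lookup(category, prop):
    """Inheritance: walk up parent links until the property is found."""
    while category is not None:
        parent, props = ontology[category]
        if prop in props:
            return props[prop]
        category = parent
    raise KeyError(prop)

print(lookup("penguin", "can_fly"))  # False: penguin overrides bird
print(lookup("penguin", "alive"))    # True: inherited from animal
```

Classification runs the other way: given an object's observed properties, find the most specific category consistent with them.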

Non-deterministic search Markov Decision Process (MDP) Defined by: A set of states s ∈ S A set of actions a ∈ A A transition function T(s, a, s′) Probability that a from s leads to s′, i.e., P(s′ | s, a) Also called the model or the dynamics A reward function R(s, a, s′) Sometimes just R(s) or R(s′) A start state Maybe a terminal state Successor function now expanded! Rewards replace action costs. Expected utility

Stochastic Grid World Stochastic outcomes mean calculating expected utility of an action Sum utilities for possible outcomes, weighted by their likelihood: with outcomes R(s) = −10 (probability 0.1), R(s) = 5 (probability 0.8), and R(s) = 2 (probability 0.1), U(Up) = 0.1(−10) + 0.8(5) + 0.1(2) = −1 + 4 + 0.2 = 3.2 We'll define a better utility function later! An optimal policy always picks the actions with highest expected utility. Policy nuts and bolts Given a policy π and discount factor γ, we can calculate the utility of a policy over the state sequence S it generates: U^π(s) = Σ_t γ^t R(s_t) The optimal policy from state s maximizes this: π*(s) = argmax_π U^π(s) Which gives a straightforward way to choose an action Bellman equation for utility Idea: utility of a state is its immediate reward plus expected discounted utility of the next (best) state: U(s) = R(s) + γ max_a Σ_{s′} P(s′ | s, a) U(s′) Note that this is recursive! Solving for the utilities gives

Unique utilities (i.e., only one possible solution) Yields optimal policy! Solving Bellman iteratively: Value Iteration General idea: 1. Assign an arbitrary utility value to every state 2. Plug those into the right side of Bellman to update all utilities 3. Repeat until values converge (change less than some threshold) Policy Iteration Evaluation: For fixed current policy π, find values with policy evaluation: Solve exactly, if reasonably small state space Or, approximate: iterate until values converge: U_{k+1}(s) ← R(s) + γ Σ_{s′} P(s′ | s, π(s)) U_k(s′) Improvement: For fixed values, get a better policy using policy extraction One-step look-ahead: π_{k+1}(s) = argmax_a Σ_{s′} P(s′ | s, a) U(s′)
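The iterative Bellman updates above (value iteration) can be sketched on a tiny hypothetical MDP (the transition table, rewards, and the two-state example are made up for illustration):

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Repeatedly apply the Bellman update until utilities converge.
    T[(s, a)] is a list of (probability, next_state) pairs."""
    U = {s: 0.0 for s in states}           # 1. arbitrary initial utilities
    while True:
        new_U = {}
        for s in states:
            if not actions(s):              # terminal state: just its reward
                new_U[s] = R[s]
                continue
            new_U[s] = R[s] + gamma * max(  # 2. Bellman right-hand side
                sum(p * U[s2] for p, s2 in T[(s, a)])
                for a in actions(s))
        if max(abs(new_U[s] - U[s]) for s in states) < eps:
            return new_U                    # 3. converged
        U = new_U

# Hypothetical 2-state MDP: "go" reaches the exit 80% of the time.
states = ["s", "exit"]
T = {("s", "go"): [(0.8, "exit"), (0.2, "s")]}
R = {"s": -1.0, "exit": 10.0}
U = value_iteration(states, lambda s: ["go"] if s == "s" else [], T, R)
print(U)
```

Because each sweep contracts the error by a factor of γ, the loop converges to the unique fixed point of the Bellman equation, here U(s) = −1 + 0.9(0.8·U(exit) + 0.2·U(s)).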

Today: Reinforcement learning Defined by: A set of states s ∈ S A set of actions a ∈ A A start state Maybe a terminal state No longer know the transition function T(s, a, s′) or the reward function R(s) But, we do observe two things:

End state s′ after taking action a from state s Reward R(s′) when we enter s′ Reinforcement Learning [Agent/Environment loop: agent receives state s and reward r, takes actions a] Basic idea: Receive feedback in the form of rewards

Agent's utility is defined by the reward function Must (learn to) act so as to maximize expected rewards All learning is based on observed samples of outcomes! Offline (MDPs) vs. Online (RL) Offline Computation Online Learning Monte Carlo method (aka Direct Evaluation) Goal: Compute values for each state under π Using a fixed policy, so only one action Q(s,a) = U(s)! Idea: Average together observed sample values Act according to π Every time you visit a state, write down what the sum of discounted rewards turned out to be Note that you only know these at the end of execution!

Average those samples Monte Carlo, because using averaged behavior over many random samples Active Reinforcement Learning Full reinforcement learning: optimal policies (like value iteration) You don't know the transitions T(s,a,s′) You don't know the rewards R(s,a,s′) You choose the actions now Goal: learn the optimal policy and values In this case: Learner makes choices!

Fundamental tradeoff: exploration vs. exploitation This is NOT offline planning! You actually take actions in the world and find out what happens Generalized Policy Iteration Evaluation: For fixed current policy π, use N starts into random (s0, a0) and run the policy to halt. Then, for each (s,a), go through the N episodes and find the M episodes that use (s,a). Update Q(s,a) with the average of discounted rewards starting at (s,a) in each episode S_i: Q_{k+1}(s,a) ← (1/M) Σ_{i=1}^{M} G_i(s,a), where G_i(s,a) is the discounted return following (s,a) in episode S_i Improvement: With sampled Q values, get a better policy using policy extraction:

π_{k+1}(s) = argmax_a Q(s,a) TD learning: What we need s: previous state; a: action we took to get here; R(s′): observed reward; s′: current state; A(s′): actions we can take next TD evaluation Update current utility estimate for state s every time we take an action from it: U_{k+1}(s) = U_k(s) + α [R(s′) + γ U_k(s′) − U_k(s)], equivalently U_{k+1}(s) = (1 − α) U_k(s) + α [R(s′) + γ U_k(s′)] If we extend the lookahead to the end of the sequence, get Monte Carlo estimation In the limit, will converge to true expected utility of s And faster than Monte Carlo! Finite lookahead allows dealing with infinite experience sequences! Sarsa: learning policies with TD Like before, now estimate Q values for state-action pairs instead of state utilities: Q_{k+1}(s,a) ← Q_k(s,a) + α [R(s′) + γ Q_k(s′,a′) − Q_k(s,a)]

Uses all of s, a, R, s′, a′ Hence, Sarsa a, a′ both chosen by current (ε-greedy) policy Now, can use updated Q values to extract new policy π! On-policy vs off-policy learning In Sarsa, we choose a specific next action a′ for updating Q(s,a). This is called on-policy learning, because we're using the current policy to help decide an action for learning from. Like policy iteration last time We can also look at all possible next actions A(s′), and use the best one to update Q(s,a). This is off-policy learning, because the action we use for updating is separate from the action we actually take. Similar to value iteration, but active Q-learning: off-policy TD

To update Q(s,a), now use the estimated value for the best next action: Q_{k+1}(s,a) ← Q_k(s,a) + α [R(s′) + γ max_{a′} Q_k(s′,a′) − Q_k(s,a)] a is the action we actually took to get here All available a′ actions are considered After update, use current (ε-greedy) policy to choose the actual a′ to take This will estimate optimal Q values regardless of the quality of the policy being followed!
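The two updates can be sketched side by side; the only difference is which next action's Q value backs up the estimate (the states, actions, and the α, γ values here are illustrative):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9

def sarsa_update(Q, s, a, r, s2, a2):
    """On-policy: back up the Q value of the action a2 the policy actually chose."""
    Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions):
    """Off-policy: back up the best available next action, whatever we take next."""
    best = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])

Q = defaultdict(float)
Q[("B", "left")], Q[("B", "right")] = 4.0, 10.0
# Same transition A --go--> B with reward 1, but different backups:
sarsa_update(Q, "A", "go", 1.0, "B", "left")       # uses Q(B, left) = 4
print(Q[("A", "go")])                              # 0.5 * (1 + 0.9*4) = 2.3
Q[("A", "go")] = 0.0
q_learning_update(Q, "A", "go", 1.0, "B", ["left", "right"])  # uses max = 10
print(Q[("A", "go")])                              # 0.5 * (1 + 0.9*10) = 5.0
```

Here Sarsa's estimate tracks the (possibly exploratory) action actually chosen, while Q-learning's estimate tracks the greedy action, which is why it learns optimal Q values off-policy.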

Feature-based representations Describe a state using a vector of features (attributes) Features are functions from states to real numbers (often 0/1) that capture important properties of the state Example features: Distance to closest ghost Distance to closest dot Number of ghosts 1 / (dist to dot)² Is Pacman in a tunnel? (0/1)

etc. Is it the exact state on this slide? Can also describe a q-state (s, a) with features (e.g. action moves closer to food) Approximate Q-Learning Q-learning with linear Q-functions: Exact Q's: one value per (s,a) Approximate Q's: Q(s,a) = w_1 f_1(s,a) + … + w_n f_n(s,a) Intuitive interpretation: Adjust weights of active features E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features Formal justification:

online least squares
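The linear approximate Q-learning update can be sketched as follows (the feature functions, state names, and α, γ values are hypothetical, chosen only to illustrate the weight update):

```python
ALPHA, GAMMA = 0.1, 0.9

# Hypothetical features of a q-state (s, a); each returns a real number.
features = [
    lambda s, a: 1.0,                       # bias feature, always "on"
    lambda s, a: 1.0 if a == "toward_food" else 0.0,
]
weights = [0.0, 0.0]

def q_value(s, a):
    """Linear Q-function: weighted sum of feature activations."""
    return sum(w * f(s, a) for w, f in zip(weights, features))

def update(s, a, r, s2, actions):
    """Shift each weight in proportion to its feature's activation."""
    target = r + GAMMA * max(q_value(s2, a2) for a2 in actions)
    diff = target - q_value(s, a)           # TD error
    for i, f in enumerate(features):
        weights[i] += ALPHA * diff * f(s, a)

update("s0", "toward_food", 10.0, "s1", ["toward_food", "away"])
print(weights)  # both active features share the credit for the surprise
```

Credit (or blame) for the TD error lands only on features that were active, which is exactly the intuitive interpretation above and is equivalent to one step of online least squares.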