Coevolution and security

We have focused on best-response against a fixed opponent given that we know how he plays. I.e., we assumed we had a perfect opponent model. Of course this is, in general, not the case, which could make our calculated policy vulnerable.

In this chapter we will discuss coevolution. This technique can be used to find policies that are more secure against multiple opponent policies. The general idea is to find a policy that is secure against a certain group or population of opponent policies, then to evolve that population and find a new policy that is secure against the new population. By repeating this procedure, the final policy will be secure against all opponent policies; converging to a Nash equilibrium.

The objective of this investigation is twofold. On one hand it describes an alternative way of calculating a Nash-equilibrium. Although the two-player zero-sum case can be solved in polynomial time using linear programming as described in chapter 2, for large problems this remains expensive.

On the other hand, it tries to provide a way to compromise between security and best-response payoff, thus unifying the game- and decision theoretic perspectives.

7.1 Coevolution

The idea behind evolutionary algorithms is that there is population of individuals that represent candidate solutions. By evaluating these candidates against one or more tests their fitness is determined and the fittest produce the next generation. Coevolutionary methods differ from evolutionary methods in the way they treat the tests. Instead of having one evolving population of candidate solutions and a fixed set of tests, two evolving populations are maintained: one for the candidate solution and one for the tests.

In the poker games discussed in this thesis, the population of candidate solutions could consist of a number of policies for the gambler, in which case the corresponding tests would be a set of policies for the dealer. How well one of the gambler policies performs is measured by the outcome achieved against all the tests.

7.1.1 Solution concepts

For coevolution to have any meaning it must specify a goal or solution concept. This can be expressed as a set of candidate solutions satisfying some requirements.

Formally, let C and T be the sets of respectively all possible candidate solutions and tests. The the outcome of a particular candidate C € C against a test T € T is given by the interaction function or game, G : C x T ^ R. In the presence of chance moves in this game G(C, T) is defined to be the expectation of the outcome, E(C,T).

An example of a solution concept expressed using these variables is:

This solution concept is known as 'simultaneous maximization of all outcomes'. As it requires that there is a single solution that maximizes the outcomes against all possible tests, this is a very strong strong solution concept, but has limited application scope. In [15] an brief overview of various other solution concepts is given, among which the Nash-equilibrium, which we will treat in the next section.

7.1.2 Memory

An often encountered problem in coevolutionary approaches is that of forgetting [21], i.e., certain components of behavior, or traits, are lost in a next generation only to be needed again at a later stage. This is especially the case for games with intransitive cycles, such as the Rock-Scissors-Paper game, discussed in section 4.2.2.

In order to counter this forgetting of trades, memory mechanisms are employed. The idea is that in the coevolutionary path to the solution concepts various traits will have to be discovered. Traits that constitute the solution will have to be remembered by the memory.

7.2 Nash equilibrium solution concept

In this section we give an outline of a memory mechanism for reaching the Nash-equilibrium solution concept for symmetric zero-sum games as presented in [21] ("Nash-memory").

7.2.1 Symmetric games and Nash equilibria

In a symmetric game the form of the policy for both players is identical: they can take the same actions in the same information sets 1, as is the case in Rock-Scissors-Paper. Put differently: both players select their (possibly mixed) policy from the same set of pure policies available for the game.

Symmetric zero-sum game always have a value 0, because this is the expectation of a policy played against itself: Vn E(n,n) = 0 or, expressed in terms of

1This implies players take actions simultaneous.

^ (test evaluation)

J ^^ (linear programming)

Figure 7.1: One iteration of Nash-memory coevolution.

candidate solutions and tests: Vn G(n, n) = 0. This means that a Nash equilibrium policy provides a security-level payoff of 0 and that therefore we are searching for a, usually mixed, policy n such that Vn< G(n,n') > 0.

Let S(n) denote the security set of policy n, i.e., S(n) = {n'|G(n,n') > 0}. Now, the Nash-equilibrium solution concept can be expressed as:

7.2.2 Components of the Nash-memory

Let N and M be two mutually exclusive sets of pure policies. N is defined to be the support of mixed policy which will be the approximation of the Nash-policy during the coevolution process. Therefore this is the candidate solution.2

The policies that are not in N are not needed by to be secure against all encountered policies. These unused policies are stored in the set M. The fact that -kn is secure against all policies means that Nu M C S(-kn). Put in coevolutionary terms, M holds those policies, that are currently not needed to be secure against all encountered policies (N u M), in order not to forget particular traits they might embody.

Apart from the candidate solution and an additional memory M, the Nash-memory mechanism specifies a search heuristic H. This is an arbitrary heuristic that delivers new tests against which the candidate solution is evaluated.

7.2.3 The operation

We now turn to the actual working of the Nash-memory. To start, M is initialized as the empty set and N is initialized as a set containing an arbitrary pure policy and as the 'mixed' policy that assigns probability 1 to this pure policy.3 Then the first iteration begins.

2An alternative view is that the Nash-memory maintains a 'population' of candidate solutions consisting of one individual, which in turn consists of multiple of pure policies.

3In [21] the initialization is taken somewhat different, but this doesn't affect the working of the memory mechanism.

Algorithm 3 Nash-memory mechanism fN = initializePolicy N = support(nN)

For iteration = 1:nr.iterations

T = H() //set of tests from search heuristic Forall t in T If G(t,nN) > 0 W = W U {t} End End all_policies = N U M U W

// Calculate a new policy secure against all_policies with // linear programming: fN = LP(all_policies) N = support(nN)

M = all_policies \ N // unused policies stored in M End

Figure 7.1 shows one iteration of the Nash-memory. First, a set of test-policies, T, is delivered by the search heuristic. The policies in this set are evaluated against , to define the set of 'winners':

When this set is non-empty, clearly is not a Nash-equilibrium policy, as it is not secure against all policies, and therefore should be updated.

First a payoff matrix of all policies in MuNuW played against each other is constructed.4In this matrix the rows correspond to policies played by the first player, the columns to those of the second player. The entry (i,j) gives the (expected) outcome of policy i against j, G(ni,nj).

This matrix can than be used to define a linear program. Relating to section 2.2.3 and 2.3.4. the payoff matrix corresponds with A. Therefore this can be solved as outlined in section 2.3.4. The result will be the new policy , the policies to which it assigns positive weight is the new set N', the other policies are stored in M'.

The full algorithm is shown on the current page. Because s(-kn), the set of pure policies against which is secure, grows monotonically with each iteration, repeated application will converge to a Nash-equilibrium, provided that the search heuristic is able to find policies that beat our current estimate (that is, a non-empty W is found).

When resources are limited, it might not be feasible to store all policies encountered. Therefore it is possible to limit the size of M, by discarding policies that have not been recently used by using some heuristic. This, however, might re-introduce the problem of forgetting and will therefore not be considered any further in this thesis.

4Of course the outcomes of pure policies in M UN against each other can be cached, so only the outcomes of policies from W against other policies will have to be calculated to construct this matrix.

7.3 Coevolution for 8-card poker

In this section we apply the Nash-memory mechanism on 8-card poker. In doing so, we extend the Nash-memory for usage with asymmetric games. 5 Secondly, we use the method to calculate a best-response policy as described in chapter 3 to generate new tests. I.e., the search heuristic H we use is a procedure besiResponse(n) that constructs and solves a POMDP model of the game played against an opponent that uses policy n.

7.3.1 Asymmetric games

In order to apply the Nash-memory mechanism to 8-card poker, we need an extension to allow tackling asymmetric games.

A simple solution is to create a new compound game consisting of two games of 8-card poker; one played as gambler and one played as dealer. This compound game is symmetric and a particular policy i is given by nl = (nlgambler,nldealer^. We refer to this as naive symmetrization.

Using this new representation the Nash-memory mechanism can directly be applied without further changes. However, it is clear that the flexibility with which the new mixed policy is constructed is constrained: it is not possible to put more weight on a particular gambler policy nlgambler without putting the same weight on the corresponding dealer policy n%dealer.

In order to overcome this limitation we propose an extension of naive sym-metrization. Observe that in algorithm 3 there are only two reasons why the game must be symmetric: to determine whether a test policy beats the current mixed policy, G(t,nN) > 0, and because the next Nash-approximation is constructed from all encountered policies (M UNU W).

To overcome this, the proposed symmetrization applies the Nash-memory mechanism per player. I.e,. we maintain one sets Np, Mp, Wp, Tp and a Nash-approximation, np,N, for each player p = 1,2 (gambler, dealer). If, without loss of generality, we assume that the search heuristic delivers a single test policy for both players, T1 and T2, we can test whether the compound policy T = (T2, T1)6 beats the compound policy = (n1,N,n2,N), as:

If G(T, nN) > 0, then the current Nash-approximation, , is not secure against compound policy T. In this case the components of T are taken to be 'winners': W1 = T2 and W2 = T1.7

This results in two sets M1 UN1 U W1 and M2 UN2 U W2 with pure policies for respectively gambler and dealer. By constructing the payoff matrix for these pure policies and applying linear programming we calculate n and n2 n , from which M1, N1, M2 and N2 are constructed. The compound policy:

5It is already indicated in [21] that such an extension is possible.

6Note that a test policy Ti for player 1, is a policy for his opponent, player 2, and vice versa. 7

The remark from note 6 applies here too.

Algorithm 4 Asymmetric Nash-memory using bestResponse heuristic. For p = 1,2 //for both players

= initializePolicy(p)

While (converged For p = 1,2

N_stoch(p) = mixed2StochasticPolicy(np,N))

G(T,nN) = G(T(1), ni) + G(T(2). %2,N)) If G(T,^n) > 0 For p = 1,2

W(modulo(p,2)+1) = T(p) NMW(p) = N(p) U M(p) U W(p) End ni,N > n2,N = LP(NMW(1),NMW(2)) For p = 1,2

N(p) = support(nPiN) M(p) = NMW(p) \ N(p) End Else converged = true; End End is secure against all combinations of gambler and dealer policies from , Ni, M2 and N2 in the compound game.

7.3.2 Best-response heuristic

The search heuristic is an important aspect for coevolutionary approaches. It should be powerful enough to discover improvements to the current candidate solution. Within the Nash-memory mechanism this means that it has to find policies that beat the current Nash-approximation.

The approach as outlined in chapter 3 provides a suitable candidate: calculating the best-response policies against the current Nash approximations, ,n2,N. The best-response policies obtain the highest payoff possible. A desirable side effect is that this provides a convergence criterion: when the best-response policies are not able to attain a positive payoff in the compound game, i.e. g(t,-kn) = 0, then is a Nash-policy.

However, using the approach from chapter 3 we can calculate a best response against a stochastic policy. In contrast, the Nash-approximations, are mixed policies. This means it is necessary to convert a mixed policy to a stochastic policy. For now we assume this is done by a procedure mixed2StochasticPolicy. How this procedure works will be covered in detail in section 7.4.


information sets

































Table 7.1: A mixed policy for the gambler in 3-card poker. Shown is the probability of betting ('1').

Table 7.1: A mixed policy for the gambler in 3-card poker. Shown is the probability of betting ('1').

7.3.3 The resulting algorithm

The resulting algorithm is shown on the preceding page. Note that Mp,Np, Tp, Wp are denoted M(p),N(p),T(p), W(p) with p = 1,2 representing the player number.

The expression modulo(p, 2) + 1 assures that the assignments Wi = T2 and W2 = T1 are performed as explained in section 7.3.1.

The procedure LP() constructs the payoff matrix for the two sets of policies and solves the linear program defined. The entries in the payoff matrix can be cached to prevent re-calculating outcomes between pairs of pure policies. In particular, because only one pair of new policies is provided per iteration, only the outcomes of these have to be evaluated against the policies already in memory, i.e. W1 against M2, N2, W2 and vice versa.

7.4 From mixed to stochastic policies

In this section we explain how we can transform a mixed policy to a equivalent stochastic policy. First we will re-introduce some relevant concepts and illustrate the problem. Next, in section 7.4.2 we show that the realization weights are an adequate tool to tackle this problem and after that we discuss computing them.

7.4.1 Problem and concepts

Recall a policy is a mapping from information sets to actions. A deterministic or pure policy specifies exactly one action for each information set. A stochastic policy, on the other hand, is a single policy that specifies a probability distribution over actions for each information set.

A mixed policy is a set of, usually pure, policies together with a probability distribution over this set.8 Intuitively it is possible, at least for tree-like games, to convert a mixed policy to a stochastic policy. Exactly how to do this is not trivial, though.

We will make use of an example 3-card poker game. It is exactly like 8-card poker only with three cards: J, Q and K. Table 7.1 shows a mixed policy for the gambler for this game. Shown are the information sets the gambler has in this game and the probability of betting in those information sets according to 3 policies. Also shown are the probabilities of playing each of the three policies.

A naive approach to convert the mixed policy shown would be to multiply to the rows, i.e the probabilities of betting according to the policies, with the

8In general, the policies in the set can also be stochastic, but not mixed, policies themselves.

probability of the respective policy and add the results. This problem with this approach, however, is that does not respect the fact that the chance of reaching an information set also depends on the policy. Expressed differently, it does not take into concern the probability that a policy realizes a certain move.

Example 7.4.1 As an example consider information set'Jb'in table 7.1. When applying the naive approach the probability of betting in the resulting stochastic policy would become 0.2 ■ 1 + 0.3 ■ 0 + 0.5 ■ 0 = 0.2. In the original mixed policy, however, policy number 1 would specify 'bet' ('1') after observing the jack (information set 'J'). Therefore information set 'Jb' would never be realized using policy 1. As the other policies specify never to bet at 'Jb', the probability of betting at 'Jb' in the stochastic policy is therefore 0. □

In the above the word 'realizes' is stressed with good reason. The problem in concern is very much related to the sequence form and its realization weights.

Recall from section 2.3.2 that a sequence corresponds with a path from the root of the game-tree to a node n. The sequence ak (n) for player k is the string consisting of the concatenation of all labels of edges corresponding with player k's moves and observations. Equivalently, a sequence ak(n) corresponds with an information set of player k concatenated with an action that can be taken at that information set. As each node from the tree corresponds to exactly one sequence, the number of sequences, m, is bounded. We also write alk, with 1 < l < m.

Also recall that the realization weight p\(a\)9 of a sequence alk under policy nlk for player k, is defined as a conditional probability: pk (alk) is the probability that player k takes all the actions specified by alk given that all corresponding information sets are reached.

In the next subsection, we will show that the realization weights are an appropriate tool for our problem of converting a mixed policy ^k for player k to a stochastic one.

7.4.2 Using realization weights

Here we show that using realization weights, we can transform a mixed policy to a stochastic policy that describes the same dynamics, i.e, induces the same outcomes.

Formally, we want to find the probability of an action a at an information set Ik, P(a\Ik,p,k) corresponding to ^k for player k. The crucial step in this problem is that we have to weight the contribution to P(a\Ik ) of a policy ■Klk € ^k by the probability that information set Ik is actually realized by nk.

Theorem 7.4.1 To transform a mixed policy ^k for player k to a stochastic policy, realization weights for all policies € ^k are sufficient. For a particular action a and information set Ik, the stochastic policy is given by:

where o~k(Ik) is the sequence that leads to information set Ik and o~k(I'k) is the sequence that result from appending action a to o~k (Ik).

9We denote the realization weight with p here.

Proof When N players play policies (n1;...,nN), the probability of reaching a node n in the game-tree is given by:

where ft(n) is the product of all chance moves on the path from the root to n. In order to discriminate between the probabilities of moves of the player in concern and its opponents, this can be rewritten to:

P(n|n1,...,nN) = p1(^(n)) ■ ft(n) ■ JJ pk(^k(n)), k=2

in which k = 1 is an arbitrarily chosen player we focus on. Similarly, the actual chance of reaching a particular information set I1 can be given as:

P(/1 |n1,...,nN)= ^ (p1Mn)) ■ ft(n) ■ n PkK(n))J .

As player 1 can not discriminate between the nodes of I1, clearly his sequences for these nodes are the same and we write a1(/1), giving:

P(/1 |n1,...,nN)= p1 M^)) ■ £ (ft(n) ■ fl pkK(n))J .

Now let Popp = E neii ^ft(n) ■ nfc=2 pk(°"k(n))) denote the opponent (and nature) component of the realizing I1. When there are multiple policies n1 € , each played with a probability of P(n|), the probability of realizing I1 becomes:

Next we turn our attention to realizing both I1 and the desired action a. For a single policy n1 € this probability is:

P(I1,a|n1 ,Popp) = Popp ■ p1(^1(/1)) ■ P(a|n1 ,/1). For the mixed policy this becomes:

P(/1,a|M1,Popp) = Popp ■ ^ P(n1) ■ p1(^1(/1)) ■ P(a|ni ,/1).

Finally we can give the probability of action a given I1 for mixed policy :

Popp -Ei P(ni) ■ pi(oi(/i)) Ei P(ni) ■ pi(oi(/i)) ■ P(a|ni,/i)

Now note that the sequence <ri(Ii) followed by action a defines a new sequence, let's call this sequence <i(I[). The realization weight of this new sequence under policy i is p\(I1) = p\(<j1(I1)) • P(a|ni,I1). Therefore we can rewrite equation 7.2 totally in terms of priors and realization weights:

Now observing that the focus on player k = 1 was an arbitrary choice and that this procedure can be extended for any information set and action proves the theorem. □

7.4.3 Calculating realization weights

Having established that realization weights for the policies nk <G pk will give the solution to the problem, the next goal is to determine them. In contrast to the Gala system, we do not want to find realization weights that define an optimal policy, but simply want to extract the realization weights from a policy .

Let <k be the sequence for reaching some node n where player k is to move. Then the continuation <k 0 a is also a sequence for player k and the realization weights are given by the following recurrence relation:

Because P(a|nk, n) is a probability distribution that sums to 1 1 0, the total realization weight of continuations of a sequence, < k , sum to the realization of that sequence itself. I.e p\(<k) = p\(<k 0 a 1) + ... + p\(<k 0 an), exactly as was required in section 2.3.3.

As p\ (root) = 1 for any policy i, starting at the root and iteratively applying equation 7.3 while walking through the game-tree extracts all the realization weights.

We can also formulate this slightly different. Recall that in the proof of theorem 7.4.1, we wrote <k (Ik) for the sequence for player k for reaching any node in Ik , an information set for that player. When using this notation for equation 7.3, we get:

Now, observe that the continuation <k (Ik) 0 a will correspond with the sequence for all successor information sets, I'k, that can be reached from Ik when action a is chosen. By formalizing this it is possible to express everything in terms of information sets.

Definition 7.4.1 The realization weight of an information set Ik of player k under a policy nlk will be denoted p\ (Ik) and is defined as the realization weight of the sequence of any node n <G Ik:

Note, that the realization weight of an information set of another player, i.e., pk(Ii), k = l is undefined.

10In this setting where we considered pure policies , P(»¡n^,n) is 1 for exactly one action. In general, however, a mixed policy might also have stochastic policies in its support.

Algorithm 5 Calculate information set realization weights(nk) Forall IS in initial_ISs(k) rw(IS)=1

append(ISq,IS) //ISq is a queue End

While !empty(ISq) IS=pop(ISq) Forall a in ACTIONS

Forall sucIS in successor_ISs(IS,a,k) rw(sucIS)=rw(IS)-P(a|IS,nfc ) append(ISq,sucIS) End End End

As above, let I'k be any information set for player k, that can be reached from Ik when playing a. The recurrence relation now becomes:

This formulation expresses the close relation between information sets, action probabilities and realization weights more naturally. Also it gives a further formalization of the step taken to obtain equation 7.1 from equation 7.2. Using definition 7.4.1, the latter can be rewritten as:

consecutively applying 7.4 gives:

P(„ | „ T ) = P(nj) ' Pk (Ik) P(a 1 'Ik) = Ei P(nk) ■ Pk (Ik) •

Backwards substitution using definition definition 7.4.1, then immediately gives equation 7.1.

The new recurrence relation (eq. 7.4) also defines an algorithm to find the realization weights for information sets very naturally. This algorithm is shown on the current page and consists of two phases: the first phase finds all initial information sets for player fc, that are the information sets in which the player makes his first move of the game. The realization weights of these information sets are initialized to 1.11 The second phase consists of a pass through the game-tree finding successor information sets and calculating their realization weights.

11The sequence of an initial information set, is the root sequence, 0.

Figure 7.2: Partial game-tree for 3-card poker. The information sets Q and Qb for the first player are clearly indicated.

7.4.4 Calculating the stochastic policy

At this point, calculating a stochastic policy from a mixed policy has become almost trivial. Once the realization weights for the information sets are calculates, all one has to do is apply equation 7.5. We will give an example for the mixed policy from table 7.1.

Example 7.4.2 Figure 7.2 shows a part of the game-tree for 3-card poker. It shows 2 information sets: Q and Qb. In this example we will calculate the stochastic policy for these information sets.

The first thing we need to do is calculating the realization weights of the information sets under the different policies that make up the mixed policy from table 7.1.

As the gambler makes its first move when in Q, this is an initial information set and therefore its realization weight is 1 under all policies. In contrast Qb is not an initial information set and its realization weight is given by:

pi(Qb)= pi(Q) • P('0V, Q), where '0' indicates the action pass.12 This leads to the following table of realization weights:













Table 7.2: Realization weight for the policies in the support of the mixed policy. Now we can apply fill out equation 7.5 for Q, yielding:

Table 7.2: Realization weight for the policies in the support of the mixed policy. Now we can apply fill out equation 7.5 for Q, yielding:

12 Note we omit the subscripts indicating the player (which is 'gambler' throughout this whole example).

Worst case payoff of mixed policy

Worst case payoff of mixed policy

Figure 7.3: Results for the Nash-memory approach to 8-card poker. The dashed lines indicate the Nash-value.

P | „ Q) _ Ei P (ni) ■ ^(Q) ■ P ('IV, Q) P(11 ^ ,Q) _ Ei P(ni) ■ pi(Q)

_ 0.2 ■ 1 ■ 1 + 0.3 ■ 1 ■ 0 + 0.5 ■ 1 ■ 0 _ 0.2 ■ 1 + 0.3 ■ 1 + 0.5 ■ 1 0.2

For Qb this gives:

Ei P(ni) ■ pi (Qb) 0.2 ■ 0 ■ 1 + 0.3 ■ 1 ■ 1 + 0.5 ■ 1 ■ 0 0.2 ■ 0 + 0.3 ■ 1 + 0.5 ■ 1

Concluding the example.

7.5 Experiments

In this section we will describe some experiments performed using the Nash-memory mechanism as outlined in this chapter.

7.5.1 8-card poker

Algorithm 4 was implemented and applied to 8-card poker. Figure 7.3 shows the obtained results. It shows that it only takes a few iterations to obtain a policy that is fairly secure. This is a nice property, as it indicates that this technique might be applied for larger games to obtain an approximate Nash policy.

Worst case payoff of mixed policy Worst case payoff of mixed policy

Figure 7.4: Two larger poker-games. Left: 4-bet 8-card poker. Right: 2-round, 3-bet-per-round 6-card poker.

It also indicates that only a relatively small number of policies is needed to be secure. Further investigation made this even more plausible, as it turned out that the number of pure policies used by the mixed policy is even lower than the figure suggests: when reaching the Nash level (iteration 12) only 6 out of 12 pure policies are assigned weight for the both gambler and dealer policy.

Another observation that can be drawn from the figure is that, although convergence to Nash equilibrium is monotonic, because with each iteration the approximate Nash becomes secure against more policies13, the worst case payoff does not increase monotonically. Apparently, a particular policy against which it is not secure yet might become a best-response and do more damage.

7.5.2 Some larger poker games

After the encouraging results for 8-card poker some experiments were performed on larger poker games. We show resulting curves for two of them here. The first is an 8-card poker variant that allows up betting u to 4 coins bets, with a maximum raise of 2 coins. The game-tree for this game contains nearly 4000 nodes and has 274 sequences for each player.

The second game is a 2 round poker game with a deck of 6 cards, both players receive one card and play a bet-round, after which 1 public card appears face-up on the table. Then a final bet-round is played. In both bet-rounds a maximum of 3 coins coins can be betted per player. This game-tree for this game consists of over 18,000 nodes and has 2162 sequences for both players.

For these games, the obtained results are shown in figure 7.4. As was the case for 8-card poker, the Nash-memory is able to obtain a reasonable security level in a relatively low number of iterations.

Also the small number of policies needed for the support of the mixed policy was confirmed for these larger games. For 4-bet 8-card poker N contained 18 policies out of 100 on convergence. At iteration 150 for the 6-card poker game, the number of policies with positive weight was 29.14

13More formal, the set S(nN) grows monotonically.

14The algorithm was not fully converged at this point, as the compound policy still received

Worst-case (w.c.) and est. opp. model (e.o.m.) payoff - P1 Worst-case (w.c.) and est. opp. model (e.o.m.) payoff - P2

Figure 7.5: The tradeoff between security and higher payoff for 8-card poker. The estimated opponent model is uniform random.

For the larger games there seem to be more oscillations in worst-case payoff. This can probably be explained in the following way: because the game-tree for these games is larger and the horizon is deeper, more actions affect later stages of the game. Therefore the relatively small adjustment of the mixed policy can influence the realization weights of a lot of information sets. When a particular set of information sets is given more weight, but the policy specified for this set is not optimal, this can be exploited by the opponent.

7.5.3 Security vs. best-response payoff

As argued before, playing a Nash-equilibrium is too conservative, when the opponent is not expected to play optimal. On the other hand playing a best-response policy may present risks, as the opponent model may be inaccurate. In this experiment a way to find a tradeoff between potential winnings and security is examined.

The idea is as follows. The opponent model delivers two estimated opponent policies, one gambler and one dealer policy.1 5 First, the best-response policies against these estimated opponent policies are calculated. These best-response policies are used to initialize the Nash-memory mechanism, which then is run until convergence. The result is a series of mixed policies (for both gambler and dealer), starting with the best-response against the estimated opponent policy and ending with a Nash-equilibrium.

Each of these resulting mixed policies, however, can also be evaluated against the estimated opponent policy. When we do this for all of them, we know the worst-case payoff and the outcome against the estimated opponent model.

Figure 7.5 shows this evaluation for 8-card poker. It also shows a line that is a weighted average between the worst-case payoff and that obtained against the estimated opponent model. One should interpret the weights for this line (0.85 : 0.15 in this case) as the amount of trust in the opponent model versus a worst case payoff of -0.027 instead of 0.

15Expressed differently, it delivers an estimated opponent policy for the compound game.

Worst-case (w.c.) and est. opp. model (e.o.m.) payoff - P1 Worst-case (w.c.) and est. opp. model (e.o.m.) payoff - P2

Worst-case (w.c.) and est. opp. model (e.o.m.) payoff - P1 Worst-case (w.c.) and est. opp. model (e.o.m.) payoff - P2

Iteration Iteration

Figure 7.6: The security / potential winnings tradeoff for another estimated opponent. Especially for player 2 there are no useful peak values between the best-response and Nash-policy.

Iteration Iteration

Figure 7.6: The security / potential winnings tradeoff for another estimated opponent. Especially for player 2 there are no useful peak values between the best-response and Nash-policy.

the amount of trust that the opponent plays a best-response against the mixed policy.

Given such a trust ratio, any existing peeks in the weighted average identify those mixed policies that have a beneficial estimated- and worst-case outcome with respect to the amount of trust. As a consequence these policies should be considered a candidate. We have not considered which of these mixed policies should actually be used. One idea would be to randomly choose between them.

Unfortunately, whether these peek policies exist depends very much on the estimated opponent model. An example in which these peeks are missing is shown in figure 7.6. In particular the procedure seems to fail to identify useful mixed policies, when the best-response (or some other 'good'-response) against the estimated opponent model is not in the support of a Nash equilibrium.

Another issue observed is that the payoff against the estimated opponent is much larger for the first (best-response) policy than for any of the mixed policies.

7.6 Discussion

When comparing the Nash-memory approach with solving the sequence form (as in Gala) with respect to performance there are a couple of interesting differences. At this point, calculating a Nash-equilibrium using the Nash-memory approach consumes more time. However, it spends its time differently: mostly on constructing and solving the POMDP models, to calculate the best response, and determining outcomes between the encountered pure policies. Far less time is spent on linear programming, as the size of the linear programs to be solved is generally smaller. E.g. for the 2-round 6-card poker experiment the maximum size of the matrix was 150 x 150 versus 2162 x 2162 for solving sequence form. Also, the linear programs solved have a simpler constraint matrix (a row matrix, forcing the weights of the pure policies to sum to 1).

We expect that considerable speed-up can be obtained by streamlining the of implementation of POMDP model construction and solving. Moreover, approximate methods could be used for both solving the POMDP and evaluating the rewards. This might lead to this approach becoming competitive the sequence form solving in terms of performance. The anytime nature of the Nash-memory approach makes it even more appropriate for a lot of domains.

We will make a few remarks regarding the tradeoff as explained in section 7.5.3. Perhaps the best-response heuristic is not the most appropriate to use during the operation of the Nash-memory with as goal to search for a suitable candidate policy that trades off potential gain for security. There is a large gap between a failing opponent model and the opponent predicting our policy acting to minimize our profit. Put differently, perhaps the worst-case payoff is a too negative measure and we need to search for a weaker form of security.

A direction for this could be to analyze the type and magnitude of errors made by an opponent model. When this knowledge is available it could be possible to generate other opponent policies that fall within the expected bounds of error for the opponent model. The Nash-memory mechanism can than be employed to construct policies that are secure against all of them.

A different question regarding the Nash-memory mechanism that needs more research is the following. Currently the Nash memory is based on mixed policies. Would it be possible to directly use stochastic policies, or policies expressed in terms of realization weights? In this case we would not need to convert between mixed and stochastic policies as explained in section 7.4.

Another direction of future research would be to try to avoid solving a linear programming from the start in each iteration. There might be an approach to adjust the weights of the mixed policy without solving a complete linear program.

A final pointer is to focus on extending this approach to games with multiple players or games that are not zero-sum. A form of symmetrization might also be possible in this case. Calculating a best-response against two (or more) fixed opponents can be done by transforming the game to a POMDP, finding a secure mixture of policies could be done using any of the methods described in [45]. Perhaps an incremental weight-adjusting algorithm, as mentioned above, will also provide opportunities for these directions.

0 0

Post a comment