Now Reading
Echo Chess: The Quest for Solvability

Echo Chess: The Quest for Solvability

2023-08-30 14:44:28


Let’s make Chess extra enjoyable in Single-Participant. How laborious may or not it’s? ☠️

That is the story of venturing too deep, head-first, into the unknown. It began with a doodle on a chunk of paper. This doodle to be actual.

Some pics are solely value log(1k) phrases.

I first conceived of this recreation on a whim as a part of many technique and puzzle video games I’d been designing for enjoyable. I used to be musing with the thought of a chess-inspired Flip-Primarily based Technique (TBS) recreation that nobody would acknowledge as chess-based. My first makes an attempt purposefully stored steering the theme away from chess. Why go vanilla when there are such a lot of wilder thematic flavors and bolder mechanics to discover?

Uneven rewards, alternating motion guidelines, stochastic obstacles, stealth, morphing, and many others. A smart buddy and fellow technique recreation nerd* then stated to me:

The sport’s dope. However what’s flawed with individuals associating it with Chess? No must innovate that kernel away – their psychological load can be taken up by re-learning methods to transfer. Chess items are a common language. They assist them overcome the activation power and work out what’s occurring.

So I re-designed the principles, mechanics, and first few ranges with a sharpie and a whiteboard Figma and a LLaMA sandwich paper and a pencil. Then I began testing it with associates by organising this outdated picket board I discovered in storage. Quickly “the sport” seemed extra like this.

Fallen items = obstacles. Inverted ones = me working out of items (4 black rooks above).
Ballpens are for board resizing. No, there’s no “recreation engine”. Sure, I personally replace the sport state for you each time you progress. No, Hollywood hasn’t texted me but.

One thing shocking began occurring rapidly. Anybody who tried the sport bought hooked. Folks would maintain coming again making an attempt to beat ranges they couldn’t resolve. They’d ask me to manually reset the board to particular checkpoints in a puzzle so they might strive once more.

Chess masters and n00bs alike would describe feeling a rush of pleasure each time they’d attain the ‘Aha’ second of a maze. Then individuals wished to play “the sport” at house and present it to their associates. I’d ship them images of my scribbled stage designs so they might reproduce them on their very own with a bodily chess board or one thing like lichess.

In the future I simply figured this was getting comically unsustainable. So I made a decision to construct “the sport” into a correct factor on-line and name it Echo Chess.

And thus started a deep, deep rabbit gap…????

Echo Chess “Basic”

Gameplay Design

Three easy guidelines of Echo Chess — nonetheless related from v0 to today:

  1. You might be enjoying White, there isn’t any opponent. It’s essential to seize all items to win.

  2. You grow to be the “echo” of any piece you seize. Captured a bishop? Grow to be that bishop.

  3. You’ll be able to’t go via crimson obstacles. Discover the very best transfer order to clear the board.

And so it was that Echo Chess graduated from a sandwich paper to a Mechanical Turk mannequin to a minimalist net app, the place you simply hit a hyperlink and resolve the maze. No downloads, no installs, no load screens, no tutorials, no handholding. Simply faucet the hyperlink, work out the puzzle. An excessive experiment in ‘present, don’t inform’ design.

A brand new that means for ‘eat what you kill’.

What’s the catch? Since you’re reworked with each seize, the maze is successfully altering as you play.

Stage 6 of “Basic” v1. The upper ones get considerably tougher.

In later ranges, you unlock squad-like gameplay the place you get to maneuver and coordinate a number of items — and even cannibalize one in all your personal to clear the best way, if it comes all the way down to it.

The enemy of my enemy is one other particular person blocking the best way.

Recreation Mechanics Implications

I feel one of many causes Echo Chess is intriguing to so many numerous participant personas is that it’s (laughably) straightforward to study, but (frustratingly) laborious to grasp.

To provide you however a glimpse as to why that’s the case:

  • The second you get pleasure from any seize by any means, you lose all of your present skills. In case you had been a knight galavanting round obstacles, you simply misplaced all that by capturing a rook.

  • Landed your self a robust queen? Nice, you possibly can solely use it as soon as earlier than it’s gone. Assume it’s best to’ve saved capturing it for later once you’re in a pinch? Who says the choice would nonetheless be there if you happen to take one other path? Capturing order issues. A lot.

  • Have to clear the passage to cross to the opposite aspect of the board? Don’t neglect you’ll grow to be no matter blocker you’re clearing. Pawns = fewer levels of freedom. Maintain them until the tip, or higher get them over with rapidly?

Oh you assume it’d be a lot simpler if you happen to solely had extra white items in your aspect? Why don’t you strive fixing, say, Stage 13 beneath? (jk pls don’t truly strive it chilly turkey. It’s just about unimaginable.)

Stage 13, or the “puzzle from hell” as one of many early playtesters referred to as it.

In a short time you’ll understand Echo Chess is a puzzle recreation that’s deceivingly proof against brute drive. Click on-spamming gained’t get you far right here.

I truly had so as to add a ‘Give Up’ button as an early termination state for many who get stumped however need to save their excessive rating (in some methods, this makes puzzle video games completely different from the arcade style as a result of you possibly can by no means simply… die).

Balancing, Ranges and Recreation Design

Conventional recreation design tells us we should always intention to breed, each as strictly and as loosely as attainable, this type of sentiment when setting the problem of a recreation:

The ‘Difficulty Saw’ curve most designers swear their recreation has. It doesn’t.

A great recreation feels prefer it’s getting simpler as you begin mastering its mechanics, till a brand new mechanic/problem/twist is launched, and the problem shoots up once more. You strive making use of the outdated methods you’ve perfected nevertheless it’s no use. Then you definately get it: that you must snap out of the consolation zone and study some new SKilLz it’s pushing you to study. So that you truly #gitgud and the sport feels simpler once more, up till you hit the subsequent bump. Rinse and repeat.

In principle, that is the recipe for a really enjoyable recreation the place gamers enter the coveted ‘Circulate’ state of gaming.

In apply, this kinda works a tiny bit, till it doesn’t. Gamers include all types of prior expertise (having performed many video games of the identical style, or simply being wired a sure method), a large spectrum of (im)persistence, and all types of expectations as to what qualifies as a problem spike, or the place they would’ve drawn the road between offensively straightforward and unfairly brutal. You in some way find yourself each scaring the n00bs and boring the veterans.

Extra individuals will price your recreation as top-left-red than bottom-right-red.
For the reason that dumbing down of recent society, the Senseless zone has by no means been so fertile.

What’s the reply, then? Effectively, if you happen to’re pondering of constructing a recreation that’s ‘straightforward to study, laborious to grasp’, I feel it goes one in all two methods:

  • In case you’re studying this after the yr 2020 2010 2000, don’t make this recreation. Critically, don’t. Nobody will play it. Social media and OF tradition have in some way managed to turn strategy games into trash TV. You’ll be able to’t compete with that. Discover another factor as an alternative. You’ve been warned.

  • In case you’re somebody who’s 100% proof against good recommendation and you continue to insist on making this recreation, then the gods be with you. On the very least, although, please do these easy issues:

    • (1) let individuals skip to any stage they need, each time they need. Allow them to gauge their very own spice tolerance.

    • (2) if you happen to actually want to sprinkle new mechanics right here and there, add them in some micro-saw-tooth method.

    • (3) make your macro issue ramp a slow-building exponential curve as an alternative, nearer to this subsequent one. Newcomers will Dunning–Kruger their strategy to the early enjoyable half, and bold gamers will energy via to achieve the actually enjoyable stuff.

      • (Non-compulsory) provide sufficient replayability and variability for every issue tier in order that newbies can thrive in cuteness land without end, whereas GMs and PhDs play a math meta of their very own.

A extra fascinating curve for easy-to-learn-hard-to-master video games.

Cool. Let’s deliver it again to Echo Chess. Buddies who know me effectively will attest that, relating to any type of artistic expression, I ascribe to the (more-controversial-than-it-has-to-be) No Spoilers college of thought.

As a service to fellow nospoilerists on the market, I’ll solely break down one (fictional) instance intimately right here. Then I’ll allow you to benefit from the precise ranges your self.

In case your ranges don’t have anything to say, your recreation’s language is just too primitive.

Take into account the 4×4 board above. You’re enjoying King, and must seize two pawns, a knight, and a rook. There are 2 impassable obstacles. Which piece do you seize first? And is that this a great tutorial stage?

In case you’re new to Echo Chess, it means you fall in one in all two camps:

  • (1) You see a cute small stage. Solely 4 items. You brute-force your method into capturing each piece to see what occurs. All hail the #yolo king.

  • (2) You’ve binge-watched The Queen’s Gambit and have been slogging via, as a result of Covid. Or perhaps you’re an precise Chess participant. So you understand higher. You intend your 1st, 2nd, Nth transfer in your head and assign values to sure squares.

The issue if you happen to’re a type-2 particular person is that in Chess, you have already got some built-in instinct as to what usually makes a transfer good vs dangerous. You’ve spent so a few years training superior ways and openings that utilizing your instinct as a heuristic in your ‘ahead go’ exploration makes it much less brute-forcy than for type-1s. However in Echo Chess, your heuristics may be simply deceived. Is r > n or r < n?

Right here’s how an Echo Chess participant would sometimes resolve the board above utilizing a ‘backward go’ as an alternative (clearly there’s a couple of “proper” strategy to do these, however FWIW):

Step 1. Determine the highest pawn as an ‘finish state’ piece that has 0 legitimate strikes.
Step 2. Determine the underside pawn as solely ever in a position to seize one piece: the highest pawn.
Conclusion: LAST 2 captures MUST be backside p, then high p.
Step 3. Determine the underside pawn as non-capturable by a knight (no L’s can attain it).
Step 4. Backside p is non-capturable by p or n. However it’s our penultimate seize, so Okay can’t get it.
Conclusion: the rook MUST seize backside p. LAST 3 captures MUST be: r, p, p.
Step 5: r is non-capturable by the p’s. However it’s our Third-to-last seize, so Okay can’t get it.
Step 6: n should seize r. It wants a number of L-jumps however it might probably do it.
Conclusion: LAST 4 captures MUST be: n, r, backside p, then high p.
Step 7: All that’s left is for Okay to seize n. Straightforward.
Step 8: Recap of the ‘ahead go’ capturing order to clear the board.
Conclusion: backprop does the smoothest moonwalk.

In actuality, an ‘Echo Chess participant’ (wtv that’s) would instinctively have noticed the two pawns as finishers from the primary look, and certain computed Steps 3-5 fairly rapidly. I’d say the L-jumps of Step 6 are the one cryptic bit on this maze, assuming the suitable instinct has already been developed. Is it a great tutorial stage? Nope.

The onus of nurturing one of these game-mechanic-specific instinct is on the sport designer, not the participant. It ought to be (and is) skilled in a prior maze to this one. There are at the very least a dozen related heuristics to what we simply noticed that you simply’re certain to choose up whereas enjoying Echo Chess. I’ve designed every stage to hopefully convey the significance of a brand new one. In case you guess a few of them, please drop me a notice.

In case you’re curious concerning the tougher ‘Aha’s, strive ranges 11-15 on You’ll be able to change to any of them anytime by clicking the Stage dropdown on the high.

Scoring System

Let’s discuss scores. I’ve at all times believed {that a} well-designed scoring system ought to be optimized to correctly incentivize higher gameplay, not cheesing for top scores or APM. So naturally I baked that into the design of Echo Chess.

  • You get factors for every seize (larger per piece sort, larger general for tougher ranges).

  • Gameplay-wise, there’s (virtually) no distinction between a Q and a Okay. Why? As a result of single-player. If it’s at all times your flip, space-time will get bent. Okay * strikes = Q.

1operate updateScoreForCapture(pieceType) {

2 var pieceScores = {

3 'p': 10,

4 'n': 30,

5 'b': 30,

6 'r': 50,

7 'q': 80,

8 'ok': 100

9 }

10 runningScore += currentLevel * pieceScores[pieceType]


  • You get a “stage bonus” each time you resolve a maze. This will increase quadratically as ranges go up.

  • The extent bonus is penalized (in a compounding method) for each transfer you make. So transfer effectivity issues a ton —particularly in larger ranges the place the compound penalty hurts an even bigger base. You’d nonetheless at all times get some constructive bonus, nbd.

1operate updateScoreForLevelBonus(numMoves) {

2 var levelBase = 200 * Math.pow(currentLevel, 2)

3 // penalize by compounding -2% for each transfer used on this stage

4 var levelBonus = Math.ground(levelBase * Math.pow(0.98, numMoves))

5 levelBonus = Math.spherical(levelBonus / 5) * 5

6 runningScore += levelBonus


  • Leaping ranges utilizing the dropdown is allowed at any time, however the rating resets mechanically to keep away from dishonest the reside leaderboard.

  • You get a “time bonus” when ending the sport primarily based on how lengthy it took you to beat the entire thing. Essentially the most you might ever rating (even at hyperspace pace) is capped at a a number of of base, then it begins dropping as a reciprocal of time spent.

1operate updateScoreForGameCompletion() {

2 // time bonus drops reciprocally however is at all times in [0, +30%]

3 var cappedMultiplier = 0.3 / (1 + 0.005 * secondsElapsed)

4 timeBonus = Math.ground(cappedMultiplier * runningScore)

5 runningScore += timeBonus


Principally the scoring operate rewards maze-solving first, transfer effectivity second, and pace of completion final.

This incentivizes gamers to embrace the enjoyment of adventuring and truly play the maze till they really ‘get it’ with out untimely optimization. Then they will return and seek for extra artistic answer paths (which brings a pleasure of its personal).

Gamers at all times behave in methods the sport reinforces them to, whether or not they understand it or not. And a meta will emerge from any scoring system (LeetCode, anybody?). This follows instantly from evolutionary psychology —or from RL if you happen to favor robo-speak. Oh and speedrunners gonna speedrun it doesn’t matter what you do.

“Be certain that the successful technique can be the funnest one to play.” –Mark Rosewater, technique recreation designer and puzzlemaker.

Recreation Full Stack Improvement

So how did Echo Chess go from picket boards to pixels?

The frontend is, consider it or not, good outdated Javascript, jQuery, HTML, CSS. The backend is a Flask server hooked right into a ReplitDB. Async calls are made with good outdated AJAX. Echo Chess is retro via and thru.

Two open supply libraries helped fairly a bit for the earliest prototype’s client-side plumbing: chess.js and chessboard.js. In fact, an unlimited quantity of labor went in to go from a barebone chess strikes validator to a totally fledged chess variant puzzle recreation.

In truth, a ton of code in these libraries needed to be overwritten and prolonged to include chess variants mechanics. Listed here are however a number of examples:

  • King actions, checks, checkmate

  • Relative pins, absolute pins, immobilized items

  • Restrictions on kings per participant (< or > 1)

  • Piece promotions

  • Castling rights

  • Participant turns, or lack thereof

  • Half-moves, full strikes, attracts

  • Pawn double-step disabling

  • En-passant dynamics

  • Squares having obstacles or boundaries

  • Motion pathing with obstacles

  • Seize-based transformation

  • Stopping transformation reversion

  • Self-sacrifice dynamics

  • (you get the purpose)

The consumer aspect retains monitor of the sport state, stage state, strikes effectivity, scoring, timing and consistency throughout stage jumps, retries, restarts, switching recreation modes, and many others.

The server handles saving and retrieving scores, reside leaderboards, stage information validation and prediction (extra on that afterward). The most recent saved excessive scores get cached on the consumer and solely get up to date by the server when related.

Let’s check out a easy implementation instance from Echo Chess for instance how a standard chess engine can be utilized as a constructing block for chess variants.

Right here we’d like to focus on in inexperienced all of the squares a chunk can transfer to, considering the idea of obstacles we’ve outlined for this recreation. We begin by formalizing how an impediment can block stuff.

Left: all 3 on similar line, AND impediment between them.
Proper: all 3 on similar line, however impediment NOT between them.

Let’s write some easy code to seize this.

1operate obstacleOnPath(course, supply, goal, impediment) {

2 let [i1, j1] = getRankAndFile(supply)

3 let [i2, j2] = getRankAndFile(goal)

4 let [ik, jk] = getRankAndFile(impediment)

5 if (course == 'h') {

6 // horizontal block

7 if (i1 == i2 && i1 == ik && isBetween(jk, j1, j2)) {

8 return true

9 }

10 }

11 else if (course == 'v') {

12 // vertical block

13 if (j1 == j2 && j1 == jk && isBetween(ik, i1, i2)) {

14 return true

15 }

16 }

17 else if (course == 'd') {

18 // diagonal block

19 if (Math.abs(i2 - i1) == Math.abs(j2 - j1) &&

20 Math.abs(ik - i1) == Math.abs(jk - j1) &&

21 Math.abs(i2 - ik) == Math.abs(j2 - jk) &&

22 isBetween(ik, i1, i2) && isBetween(jk, j1, j2)) {

23 return true

24 }

25 }

26 return false


Now we will mix these ideas with the utilization of the chess engine to put in writing a easy highlighting operate for attainable strikes.

1import { Chess } from 'chess.js'

2const chess = new Chess()


4operate blockedTarget(supply, goal) {

5 var obstacles = ranges[currentLevel].obstacles

6 if (obstacles.contains(goal)){

7 return true

8 }

9 var pieceType = chess.get(supply).sort

10 if (pieceType == 'b' || pieceType == 'q') {

11 for (let impediment of obstacles) {

12 if (obstacleOnPath('d', supply, goal, impediment)) {

13 return true;

14 }

15 }

16 }

17 if (pieceType == 'r' || pieceType == 'q') {

18 for (let impediment of obstacles) {

19 if (obstacleOnPath('h', supply, goal, impediment) ||

20 obstacleOnPath('v', supply, goal, impediment)) {

21 return true;

22 }

23 }

24 }

25 return false;



28operate colorPossibleMoves(supply) {

29 var strikes = chess.strikes({

30 sq.: supply,

31 verbose: true

32 })

33 if (strikes.size == 0) return

34 greenSquare(supply)

35 for (var i = 0; i < strikes.size; i++) {

36 if (!blockedTarget(supply, strikes[i].to)) {

37 greenSquare(strikes[i].to)

38 }

39 }


From then on, we will name the colorPossibleMoves operate each time we decide up a white piece from a given sq., and it’ll shade all of the corresponding squares in inexperienced.

1operate onDragStart(supply, piece) {

2 // participant can solely decide up white items

3 if (^b/) != -1) {

4 $("#wrong-piece")[0].play();

5 return false

6 }

7 colorPossibleMoves(supply)

8 ...



11var board = new Chessboard('board', {

12 place: ranges[currentLevel].fen,

13 draggable: true,

14 onDragStart: onDragStart

15 ...


Typically it’s straightforward to neglect how constrained a cornered queen may be.

Quick ahead a number of thousand traces of code, and shortly sufficient, Echo Chess ‘Basic’ was reside! Buddies had been in a position to get pleasure from it on the go.

Since I had already meticulously crafted 15 puzzle ranges, there was sufficient in there to maintain everybody busy. For some time, the toughest puzzles remained unsolved. Dozens would strive, day in, time out, however only a few would be capable to attain the very finish. Weeks handed. After which it occurred. Folks began ending the sport.

That’s legit road cred proper right here.

Like a rush of feelings after binge-watching the final season of a favourite present, they celebrated, sighed, stared into the abyss, then cracked their knuckles, and got here again asking for extra. MOAR LEVELS. ASAP.

Echo Chess “ENDLESS”

As soon as once more, I used to be confronted with the bittersweet realization that the customers’ urge for food for Echo Chess had exceeded my means to manually sustain with “DM-ing” it. I knew I wanted to automate stage creation to make it, as soon as and for all, solely impartial from my involvement as a mammal.

Procedural Technology

And so was born the brand new ‘Countless’ mode for Echo Chess! A real “Tetris”-like puzzle recreation, utterly separate from ‘Basic Mode’, togglable in url params.

Countless Mode (right-side) even has its personal model, compliments of Angela

Infinite ranges, all randomly generated, all in actual time. Anytime you end a stage, you carry over your successful piece to the subsequent, and the board of the upcoming puzzle pivots round your present piece’s place.

Ranges mechanically swivel round you.

So what do we actually imply by ‘Countless’?

We randomly generate every stage from parametrized distributions of items, obstacles and limits (1). This offers us full management over adjusting the variability and issue of generations (2), and lets us tailor the start of every stage to coincide with the ending of the prior one (3).

That brings us to the time mechanic. How will we converge an infinite stage development? We make it a countdown race:

100 seconds. Time’s working out! (4) Wanna keep alive? Maintain fixing new ranges! You get extra time again for fixing larger ones. (5)

Coincidentally, with this new fast-paced enjoying model comes a shift from a pure technique/puzzle fixing to a full-on arcade mode. The scoring operate will get adjusted accordingly with bonuses scaling up or down with stage sizes and issue —and we get a brand new leaderboard that’s stored absolutely separate.

Encoding, Decoding

Now that we’re coping with so many ranges, we’ll want an environment friendly and lossless strategy to conveniently learn, write, encode, decode, ship, obtain, retailer, retrieve, remodel, analyze, increase, and render any stage configuration.

Unforunately my picket system falls a bit in need of Shannon entropy.

Chess gamers studying this are most likely already pondering of a number of such viable methods, like PGN, SAN, or UCI. These are all nice for chess strikes encoding and decoding, and will presumably be adjusted for our wants. However there’s truly a greater system we may reuse for the aim of recreation states specifically: the elegant Forsyth–Edwards Notation (FEN).

One row of a chess board transformed to a string utilizing FEN code.

The simplest strategy to perceive a FEN string is as follows:

  • Learn it left to proper, one row at a time, ranging from the highest row

  • Lowercase letters = black chess items

  • Uppercase letters = white chess items

  • Digits = variety of consecutive empty squares (ignore their shade)

  • Okaying♚, Queen♛, Rook♜, Bishop♝, Pawn♟, okNight♞

All rows of the board transformed and mixed to generate the board’s FEN string.

First order of enterprise: repurpose and increase the FEN encoding system for Echo Chess. Along with the whole lot else FEN does, I wanted a model that takes under consideration issues just like the places of obstacles, the dimensions and form of a stage, the place the boundary squares are, and so forth.

Creating a New Formal Notation

So I got here up with a brand new encoding that expands on FEN, which I’m providing up right here for anybody who may discover it useful for his or her work. I name it the “compoundFEN”.

One row of an Echo Chess board transformed to a string utilizing compoundFEN code.

Let’s think about the next instance from Echo Chess to see how the compoundFEN can be derived. To raised visualize this, we’ll overlay a grid over the extent to make the canonical 8×8 chess board pop.

Stage 7 of the unique Basic mode. Seems like all in all, it was all simply bricks within the wall.

We undergo every row, high to backside, ranging from the 0th to the seventh row — positive, we may as an alternative rely from 1 to eight like sapiens commoners, however what would the robots say?

  • (0) Nothing however a board boundary on this row displaying the confines of this specific stage measurement. Can’t actually consider these squares as ‘empty’ as a result of no motion is allowed on them. Let’s name every of those boundary squares an ‘X’. The royal Okay hasn’t expropriated the X letter but so we ought to be good to go.

  • (1) Okay so we begin with an ‘X’ for a boundary. Then we now have an precise empty sq., in order that’s a 1. Now a bunch of obstacles. They’re technically just like boundaries in that they block the participant’s movement, however they’ve a barely completely different use. Let’s name these impediment squares lowercase ‘x’ simply in case. Okay so we now have 5 of these ‘x’, adopted by one other ‘X’.

  • (2) This ought to be simpler. We now have ‘X’ adopted by an ‘x’, adopted by a bunch of black items (so all lowercase: q, p, b, ok, and r), then a last ‘X’, with 0 correct ‘empty’ squares to be famous anyplace.

  • (3) Similar factor right here: we begin with an ‘X’ adopted by an ‘x’, then we get a white piece (so it’s uppercase) and it’s an ‘N’, not a ‘Okay’, as a result of knights can’t even maintain their initials in an absolute monarchy. Then we proceed as earlier than with the alternating black pawns ‘p’ and obstacles ‘x’ till hitting the final boundary ‘X’.

  • (4) Easy row: ‘X’ then ‘x’, then a black king ‘ok’, then 4 ‘x’, and a last ‘X’.

  • (5) Attention-grabbing, a barely completely different one. ‘X’ and a pair of ‘x’, positive. Then 3 precise empty squares. Okay so we’ll characterize these as ‘3’ based on the bottom FEN conference. Then a black knight ‘n’ and a last ‘X’.

  • (6) Again to acquainted ones, we’ve bought this. ‘X1xbnrxX’.

  • (7) We already know this one, nothing however boundaries. ‘XXXXXXXX

Now we put all these rows collectively, separated by ‘/’, to get the complete ‘piece placement’ portion of the compoundFEN:

Let’s check this to see if our compoundFEN for Stage 7 appears to be like proper. Right here’s a fast decoding operate you need to use to transform from compoundFEN to a 2D board.

1def fen_to_board(compound_fen):

2 board = np.empty((8, 8), dtype=str)

3 board.fill(' ')

4 # seize 'piece placement' rows, ignore the remaining

5 fen_parts = compound_fen.break up(' ')

6 ranks = fen_parts[0].break up('/')

7 rank_index = 0

8 file_index = 0

9 for i in vary(len(ranks)):

10 rank = ranks[i]

11 for j in vary(len(rank)):

12 char = rank[j]

13 if char.isdigit():

14 # consecutive empty squares

15 file_index += int(char)

16 else:

17 board[rank_index][file_index] = char

18 file_index += 1

19 rank_index += 1

20 file_index = 0

21 return board

1compound_fen = 'XXXXXXXX/X1xxxxxX/XxqpbkrX/XxNpxpxX/XxkxxxxX/Xxx3nX/X1xbnrxX/XXXXXXXX w - - 0 1'

2board = fen_to_board(compound_fen)


1[['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']

2 ['X' ' ' 'x' 'x' 'x' 'x' 'x' 'X']

3 ['X' 'x' 'q' 'p' 'b' 'k' 'r' 'X']

4 ['X' 'x' 'N' 'p' 'x' 'p' 'x' 'X']

5 ['X' 'x' 'k' 'x' 'x' 'x' 'x' 'X']

6 ['X' 'x' 'x' ' ' ' ' ' ' 'n' 'X']

7 ['X' ' ' 'x' 'b' 'n' 'r' 'x' 'X']

8 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']]

Seems nice. Go forward and evaluate each cell of this board decoded from our compoundFEN to the Stage 7 screenshot above. Keep in mind: lowercase ‘x’ refers to obstacles, and uppercase ‘X’ refers to boundary squares.

And now right here’s an encoding operate to automate the conversion of boards to compoundFENs transferring ahead.

1def board_to_fen(board):

2 compound_fen = ''

3 for i in vary(len(board)):

4 row = board[i]

5 empty_count = 0

6 for j in vary(len(row)):

7 piece = row[j]

8 # rely consecutive empty squares

9 if piece == ' ':

10 empty_count += 1

11 else:

12 # add previous empty squares

13 if empty_count > 0:

14 compound_fen += str(empty_count)

15 empty_count = 0

16 # add this sq.'s content material

17 compound_fen += piece

18 # add row's trailing empty squares

19 if empty_count > 0:

20 compound_fen += str(empty_count)

21 if i < len(board) - 1:

22 compound_fen += '/'

23 compound_fen += ' w - - 0 1'

24 return compound_fen


2print(compound_fen == board_to_fen(board))

1XXXXXXXX/X1xxxxxX/XxqpbkrX/XxNpxpxX/XxkxxxxX/Xxx3nX/X1xbnrxX/XXXXXXXX w - - 0 1


If as an alternative you wished to examine the fundamental ‘customary’ FEN encoding for this similar Stage 7 instance, you might merely ignore all ‘x’ or ‘X’ codes within the compoundFEN and assume they’re a part of the consecutive empty areas. That’s as a result of customary FENs think about something that’s not a chess piece to be empty. That is what you’d get: 8/8/2qpbkr1/2Np1p2/2k5/6n1/3bnr2/8 w - - 0 1, which you’ll conveniently verify with this online chess board (simply take into account that it gained’t look as recognizable with out the Echo Chess elements).

The Drawback of Solvability

So now we now have an environment friendly, dependable strategy to procedurally generate random ranges, manipulate them, and serve them to our gamers. However one factor’s for positive: we don’t need to be serving unsolvable mazes. Powerful puzzles may be thrilling, damaged puzzles are simply trash.

How will we examine that for a given generated maze? Is it even theoretically attainable to drive randomly generated Echo Chess mazes to be solvable? To know what we’re up towards, let’s begin by looking at why fixing an Echo Chess maze is so difficult to begin with.

Every of those trivial examples illustrates a easy configuration the place the maze is utterly unsolvable.

Prime 2 configurations don’t have any legitimate first strikes as a consequence of obstacles or boundaries.
Backside 2 don’t have any legitimate first seize — successfully forming disconnected subgraphs.

Now let’s have a look at some much less-trivial examples of unsolvable configurations. See if you happen to can spot the problem earlier than trying on the solutions.

Prime-left: the rook is unimaginable to achieve. The capturable bishop acts as a blocker.
Prime-right: no piece exists that would attain the cornered queen.
Backside-left: if q is captured, can’t get b; but when b is captured, can’t get q.
Backside-right: Two bishops don’t have any legitimate strikes; seize one and also you’re caught.

At this stage, you might be beginning to develop some instinct for what made these mazes unsolvable. And but, right here’s an instance of a stage that feels unsolvable the primary dozen instances you strive it, however that does even have identified options.

In case you can’t see the answer, head to and take a look at it out your self.
It’s Stage 8 in Basic mode, from again after I was nonetheless designing each puzzle manually.

You is perhaps asking your self: okay, so fixing a maze on the fly isn’t that easy, however given sufficient prep time, what’s the large deal? As an alternative of producing utterly random ranges in actual time, why doesn’t this man simply serve a random stage from a pre-generated record of solvable ranges that he’s already curated beforehand?

Good query —that was my preliminary try. The rationale why that seems to be not such a good suggestion is that we now have no management over which carry-over piece and on which carry-over sq. the participant can be beginning any given stage, seeing that it relies upon solely on their prior answer path. Nonetheless, you may argue, you might both:

  • (1) make certain every stage has just one attainable finish level
    => Nope, meaning we’d be again to guide designs, beating the entire level of the Countless mode

  • (2) pre-generate and curate ranges for each mixture of piece sort and sq. the participant may find yourself in, simply in case

    => Actually? You need me to manually check and curate 8 x 8 x 6 = 384 combos for each stage?

  • or (3) abandon the entire idea of a carry-over piece and easily serve ranges from a curated pregen record no matter how the prior stage ended

    => That might truly break the core vibe of Countless that every one customers at the moment love: it actually charges like one steady, coherent, infinite stage being performed. To be truthful, manually modifying a pregen record remains to be a lot sooner than manually designing from scratch (MidJourney, anybody?) so this might nonetheless turn out to be useful for additional assistance on future Basic mode ranges design, nevertheless it gained’t give us what we want for Countless.

In any case, this nonetheless doesn’t resolve the first matter at hand, which is discovering an automatic methodology for producing mazes which have options.

One strategy to method this may be to strive utilizing Pathfinding Algorithms like Depth-First Search (DFS) with Backtracking, or variants of A* and heuristic-based approaches.

An instance of a maze generator utilizing DFS with recursive backtracking.
Supply: one thing I constructed for enjoyable as I used to be iterating on Echo Chess.

However earlier than we hit one other iceberg of delusion, let’s pause and assume a bit about what we’re actually making an attempt right here:

  1. To resolve an Echo Chess stage, we now have to clear the complete board.

  2. Clearing the board means capturing each piece.

  3. To seize a chunk, we now have to maneuver to its location.

  4. As soon as a chunk is captured, we will’t seize it once more.

  5. Subsequently, to resolve an Echo Chess stage, we now have to ‘go to’ each bit precisely as soon as.

Does this ring a bell? Dangerous information, you guys. It appears to be like like the issue of Echo Chess Solvability (ECS) could possibly be mappable to the Hamiltonian Path Drawback (HPP) of visiting every metropolis precisely as soon as in a graph.

If solely fixing a Hamiltonian Path Drawback was as straightforward as tracing a identified answer.
Supply: Combinatorica

In case you keep in mind your Thoeretical Pc Science class from again within the day, the HPP is an NP-complete downside.

And if you happen to additionally need to account for Echo Chess’s transfer effectivity mechanic, like discovering the shortest Hamiltonian path visiting all cities, then you definitely’re even risking NP-hard territory.

In case you don’t recall the HPP, don’t fear about it – you might be extra conversant in its shut cousin TSP since the Traveling Salesman Problem is such a infamous one. Any method you have a look at it, it’s best to most likely be un-pumped that we’re coping with these beasts.

The Travelly Boi. My cash’s on the Travelly Boi.
Supply: reddit, clearly.

Granted, with sufficient compute, we may seemingly try some approaches like Monte Carlo Tree Search, alpha-beta pruning, or others, however there are different features of the Echo Chess solvability downside that may be non-trivial to take care of — even when we developed a reproducible mapping from a stage’s beginning compoundFEN+motion guidelines to a graph principle illustration.

Issues like disappearing ‘cities’ (as a consequence of captures), ‘roads’ altering upon each metropolis go to (because of the echoing mechanism), a morphing topology because the graph is being explored: we would want to account for every of those meticulously.

Sure, some heuristic-based backtracking algorithm with intelligent pruning might put up an honest combat with all of this. However on the finish of the day, let’s not lose monitor of what we truly care about.

In the end, the primary purpose we’re after right here is answering this easy query: is that this stage we’re serving to the participant prone to be “solvable” or “not solvable”?

Keep in mind: we’re probably not searching for any precise options to a maze, that might be our gamers’ job —they will maintain all of the enjoyable to themselves. We simply want a easy boolean to inform us YES or NO for solvability. So on the very core of it, we’re coping with a Classification downside!

And that’s one thing we should always be capable to, in principle, deal significantly better with, assuming we set issues up appropriately. Why don’t we give {that a} shot? ????

Information mining

The factor that involves thoughts straight away once we consider any supervised studying method is information. Do we now have any, can we get any, what form does it are available, will we get it labeled, will or not it’s clear, will we now have sufficient, you understand the drill.

Fortunately, the preliminary Basic mode of Echo Chess has already achieved a really humble stage of traction and engagement, solely due to word-of-mouth from gamers having fun with the sport around the globe.

IIRC the sport’s being performed in 300+ cities throughout 60+ nations.

Okay so we now have some actual customers and we will serve them some recent ranges information, that’s nice. Let’s assume for a second that we will maintain getting related traction and engagement for Countless mode as we bought so removed from Basic. How will we get labeled information out of this?

Effectively if somebody will get to the subsequent stage, meaning they need to’ve discovered an accurate answer. Corollary: that stage they had been enjoying should be solvable. Simply by successful on the recreation, they’re labeling it for us! Now we simply must get our Genki Dama donors to maintain enjoying and fixing whereas having enjoyable.

Wait —isn’t {that a} catch-22? How are we going to:

  • (a) persuade sufficient individuals to play that recreation mode so we will

  • (b) crowdsource labeled information, which we’ll use to

  • (c) prepare a mannequin for predicting solvability, which is able to enable us to

  • (d) guess which generated ranges are solvable, in order that we will

  • (e) make one thing that’s enjoyable to play, which is vital with a purpose to

  • (f) persuade sufficient individuals to play that recreation mode?

I ponder the place Aquinas stands on scissors packaging.

The delicate trick is in (e). If we will make one thing that’s simply sufficient enjoyable to play to kickstart the primary cycle, we’ll have our prime mover. We’ll want it to be playable with out the person reaching a lifeless finish anytime an unsolvable stage exhibits up. That might kill the entire vibe.

Say whats up to the SHUFFLE button: your infinite Get-Out-of-Jail-Free card to shuffle these pesky lifeless ends away!

Okay, I assume we may acknowledge upfront there are imperfections and lifeless ends and hope somebody will nonetheless give the janky prototype a strive? Nah, we will do insanely higher than that. We make it a function, not a bug! Keep in mind Minesweeper from the 90s?

A pothole is a tragic accident. A mine is a strategic lure.

Introducing ENDLESS Mode. Feeling caught at any stage? Strive once more, or just ‘SHUFFLE’ it away and transfer on to the subsequent one ????????‍♀️

Be careful! Some ranges are purposefully sprinkled in there as unsolvable traps to be SHUFFLED away ????

Hone the ability of instantly recognizing dead-ends otherwise you’ll get caught in a treacherous maze ????

Do not be shy with SHUFFLING! You’ve gotten limitless shuffles – simply keep watch over the countdown otherwise you’ll be racing to Recreation Over⏱️????

Unsolvables are a function, not a bug. Quickly, till the ML’s prepared.

To supply that very same adrenaline rush old-school arcade video games are well-known for, we additionally cap the additional time that’s winnable from efficiently fixing ranges again on the countdown’s beginning 100sec. This lets our gamers maintain their guard up as they progress — and retains our ranges being diligently solved labeled.

And the very best half, no information by any means about any person is required. Echo Chess merely tracks its personal generated stage configs, and tags whether or not they had been solved or not. All information assortment is 100% nameless, not even ‘anonymized’. Proud to be supporting, a real privacy-obsessed, cookie-less, GDPR-native, open supply analytics platform.

Subsequent, I wanted to GTM with this crowdsourcing Trojan Horse. A savvy advertising buddy advisable I submit a promo video on Instagram (or, God forbid, Tiktok) to achieve newcomers from the informal gaming scene.

Not too happy with what occurred subsequent however, hey, information ain’t low-cost ngl.

I gained’t say I had to make use of the robotic voice, however I gained’t deny it both.
No clue who the face is on the finish btw – is {that a} Gen Z factor?

Imagine it or not, that lame promo + some trending track actually hit it off with the social gods. Quickly sufficient, Echo Chess grew to 1000s of WAUs solely via phrase of mouth. I began getting tagged in Twitter conversations from strangers in different languages.

Spannend again at you, brothers. Thank Godheid for Google Translate.

In order that was nice — it meant I now had actual customers (fortunately) playtesting the randomly generated mazes totally free, making an attempt their greatest to beat every one, and mechanically tagging a brand new maze as ‘solvable’ within the background each time they leveled up. Win-win-win. Growth.

Clearly although, the problem remained with figuring out all of the ‘unsolvable’ ranges. Making an attempt to crowdsource the labeling of those ones is extraordinarily prone to result in tons of false unfavorable labels that soiled the information, so it’s a nasty concept to delegate it. Simply because somebody on the market couldn’t resolve a stage doesn’t imply it’s unsolvable. Solely the other is true.

What’s one of the best ways to ensure your information is clear? Acquire it your self.

To recover from the hump of preliminary bootstrapping, I went forward and manually examined+tagged 500 completely different stage generations. Yep, 500. Manually.

Was it time-consuming? Conform to disagree. Did my head damage? Ice cream helped. Did I get pleasure from it? You wager. I imply I’m actually fixing a puzzle recreation, it is not like I am labeling dry paint.

Professional tip: gamify your ML information assortment. Video games are a great drug.

And I’ll allow you to in on just a little secret, however please don’t do that in prod. I used to be too lazy involved about getting carpal tunnel from clicking round 100s of ranges on my native dev env with a desktop mouse. So I pushed an easter-egg div to the reside cellular model of Echo Chess and I used it to tag all ‘unsolvables’ by tapping an invisible pixel on the display screen whereas on the go —the ‘solvables’ continued being tracked mechanically upon leveling up.

Which brings us to right this moment. Right here’s the information mining pipeline because it stands: consumer app -> Flask server -> ReplitDB -> export to csv -> Jupyter nb for evaluation + mannequin coaching.

From right here on out, as soon as we attain some fashions we really feel ‘adequate’ about, we will export them, import them into our Flask server, and deal with the Inference stage from there via common client-server communication.

However let’s not get forward of ourselves but. We proceed our post-data-mining journey on the Jupyter aspect.

Lastly enjoying with some information

We begin by pulling within the labeled ranges information right into a Jupyter pocket book. I’m utilizing Kaggle right here, however we will simply change it up with Colab, our personal field, or actually something.

1import pandas as pd

2df = pd.read_csv('levels-dataset.csv')


5548 rows × 7 columns

That’s 5,548 labeled, distinctive, procedurally generated ranges. Each single one in all them manually performed and licensed by a human participant as a solvable or unsolvable maze. Now we’re speaking????

We appear to have ~40 ranges (0.7%) that bought mined with an undefined compoundFEN. Let’s simply do away with these to maintain issues clear. No tears shed.

1df = df[~(df['compoundFen'] == 'undefined')]

2df.reset_index(drop=True, inplace=True)



Completed. 5.5k. That’s a pleasant spherical quantity.

Practice-Take a look at Splits

Earlier than we do anything, we should always randomly break up the information into three separate units: Coaching, Validation and Take a look at.

Keep in mind taking the SAT? Our buddy Invoice right here does and he’ll take us down reminiscence lane.

Sorry Invoice, we’re about to dissect your previous trauma.

It solely took Invoice a number of tries to nail these apply sheets that include solutions within the again. Positive, his first two makes an attempt had been abysmal. However that’s as a result of the check format is bizarre. As soon as Invoice noticed the right solutions he instantly bought what the fuss is about. Quickly he was acing each single query. Invoice’s mother was proud. Her Billy was at all times such a quick learner.

Then Invoice tried a ‘practical’ check. Downloaded a recent check financial institution, sight-unseen —even timed himself and all. However Invoice didn’t get so fortunate this time round. No sweat, he’d been there earlier than. Higher flunk these than the precise examination. A minimum of Invoice was now used to the ‘actual’ circumstances and the significance of guessing when getting unfamiliar questions.

By his tenth check, Invoice had actually perfected his guessing recreation. He taught his associates suggestions and methods, preaching about multiple-choice patterns and Bayesian chances of B’s and C’s in sections following alternating A’s and D’s. He confirmed them charts he’d been plotting of check financial institution Q&As. Invoice had lastly conquered the SAT.

Actual life generally is a slap within the face. Powerful classes can actually cook you through.

When Invoice lastly sat down for the actual factor on D-Day, he couldn’t shake that bizarre feeling that the questions he was getting appeared a bit… completely different. Nonetheless, Invoice had an unfair benefit: his genius ‘probablistic guessing’ method. He may’ve sworn it’s infallible —even considered patenting it. But Invoice by no means bombed so laborious as he did that day.

Might or not it’s that Invoice was finding out for the flawed model all alongside?

You, my buddy, had been majorly overfitting. You bought good at gaming ‘a’ system, simply not the one which mattered.

As an alternative of focusing his studying on universal concepts that generalize broadly, Invoice simply wasted his time developing with two-bit parlor methods to impress a handful of apply checks. And the saddest half: if he hadn’t gone via the three sobering phases, Invoice would’ve by no means even identified he sucked.

Let’s not repeat Invoice’s mistake. We wish our fashions to study truly helpful stuff that’s relevant to information they’ve by no means seen earlier than. The entire level is to check how effectively we will predict solvability for brand new unkown ranges. So let’s make sure to set some apart that we’ll maintain 100% untouched.

1# Cut up 10% into Take a look at, preserving the `solvability` distribution

2df, tst_df = train_test_split(df, test_size=0.1, random_state=42, stratify=df['solvability'])


4# Cut up the non-test portion (90%) into 70% Coaching, 20% Validation

5trn_df, val_df = train_test_split(df, test_size=0.2/0.9, random_state=42, stratify=df['solvability'])

1print(len(trn_df), len(val_df), len(tst_df))

13855 1102 551

Nice. We’ve now randomly break up our ranges into:

  • 70% Coaching Set in trn_df (3,855/5,508)

  • 20% Validation Set in val_df (1,102/5,508)

  • 10% Take a look at Set in tst_df (551/5,508)

Information Preprocessing and Characteristic Engineering

Given the data of EchoChess mechanics, we already know there are a lot of options of a stage that we may analyze and which may present some solvability predictive energy: issues like partitions, beginning items, variety of queens, and so forth. All this info may be extracted from a stage’s compoundFen, which suggests new options may be simply engineered and populated for each row.

With out boring you with the deets, I wrote a preprocessing operate that takes in a pandas df, and returns it with all the additional options which may be of curiosity. This operate is then referred to as like this.




Listed here are a few of the new columns added by proc_data:

  • levelSize

  • levelGeoShape

  • levelOrientation

  • numObstacleSquares

  • numConnectedWalls

  • lengthOfLongestWall

  • startingWhitePiece

  • isStartingPieceBlocked

  • numEndStates

  • numBlackKnights

  • numBlackBishops

  • numBlackRooks

  • numBlackQueens

  • numBlackKings

  • numEmptySpaces

  • emptySquaresProportion

  • blackPiecesProportion

  • obstacleSquaresProportion

  • +to encode a sure understanding of the relative spatial positioning of each board state, we generate the next 64 options (for every of the squares within the 8×8 board):

It is a pattern of what our Coaching set trn_df appears to be like like proper now.

3855 rows × 67 columns

I’ll solely name out the implementation of some attention-grabbing options we added that I feel showcase effectively the ability and suppleness of the compoundFen format.

1def get_starting_white_piece(fen):

2 return subsequent((c for c in fen if c.isupper() and c != 'X'), '')

1def count_black_pieces(fen):

2 excluded_chars = ['x', 'w']

3 return sum(1 for c in fen if c.islower() and c not in excluded_chars)

1def count_empty_spaces(fen):

2 piece_placement = fen.break up(' ')[0]

3 digits = [int(char) for char in piece_placement if char.isdigit()]

4 return sum(digits)

1def hasValidMove(rank, file, board):

2 piece = board[rank][file].decrease()

3 if piece not in ['r', 'b', 'n', 'k', 'q']:

4 return 0

5 def outOfBounds(i, j):

6 return (i < 0) or (i >= 8) or (j < 0) or (j >= 8)

7 def blockedMove(dx, dy):

8 if outOfBounds(rank+dx, file+dy):

9 return 1

10 # examine if it hits an impediment or boundary

11 return (board[rank+dx][file+dy].decrease() == 'x')

12 blockedCross = all(blockedMove(dx, dy) for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)])

13 blockedDiagonal = all(blockedMove(dx, dy) for dx, dy in [(1, 1), (-1, 1), (1, -1), (-1, -1)])

14 l_jumps = [(2, 1), (2, -1), (-2, 1), (-2, -1), (1, 2), (1, -2), (-1, 2), (-1, -2)]

15 blockedLJump = all(blockedMove(dx, dy) for dx, dy in l_jumps)

16 if piece == 'r' and blockedCross:

17 return 0

18 elif piece == 'b' and blockedDiagonal:

19 return 0

20 elif piece == 'n' and blockedLJump:

21 return 0

22 elif (piece == 'ok' or piece == 'q') and blockedCross and blockedDiagonal:

23 return 0

24 return 1

1def blocked_starting_piece(fen):

2 board = fen_to_board(fen)

3 white_pieces = ['R', 'B', 'N', 'K', 'Q']

4 for rank in vary(8):

5 for file in vary(8):

6 if board[rank][file] in white_pieces and not hasValidMove(rank, file, board):

7 return 1

8 return 0

1def numEndStates(fen):

2 board = fen_to_board(fen)

3 black_pieces = ['r', 'b', 'n', 'k', 'q']

4 blocked_count = 0

5 for rank in vary(8):

6 for file in vary(8):

7 if board[rank][file] in black_pieces and not hasValidMove(rank, file, board):

8 blocked_count += 1

9 return blocked_count

Every one of many quite a few features like these is then simply utilized to the person rows of the dataframe to create new options. As an example:

1df['numEndStates'] = df['compoundFen'].apply(numEndStates)

2df['blackPieces'] = df['compoundFen'].apply(getBlackPieces)

3df['numRooks'] = df['blackPieces'].apply(lambda x: countSpecificPiece(x, 'r'))

And so forth and so forth.

With a purpose to simplify our evaluation down the road, let’s additionally make a distinction, inside every of the three 90/20/10 datasetstrn_df, val_df, and tst_df, between the impartial variables (the X’s) and the dependent variable (the Y). Y is the column we’re making an attempt to foretell, i.e. solvability. X’s are all of the columns we’re utilizing to foretell Y.

Observe that this can be a column break up, not a row break up like train-val-test. So the quantity of knowledge in every set gained’t change.

1def xs_y(df):

2 ...

3 # `cats`, `conts` are names of categorical & steady impartial variables

4 xs = df[cats+conts].copy()

5 return xs,df['solvability'] if 'solvability' in df else None

1trn_xs, trn_y = xs_y(trn_df)

2val_xs, val_y = xs_y(val_df)

3tst_xs, tst_y = xs_y(tst_df)

Cool, that was straightforward. From right here, anytime we need to analyze, mannequin or predict the connection between the X’s and the Y for any of the three prepare/val/check units, we will merely use the corresponding pair.

Exploratory Information Evaluation

We are able to now carry out some preliminary EDA on the coaching set. As you already guessed, we’ll be utilizing trn_xs and trn_y for these plots.

Give a bishop an enormous open area, they usually’ll nonetheless miss out on (precisely) half its glory.
If life provides you bishops, commerce them for anything.
If there are sufficient queens to morph into on the market someplace, you’ll most likely be tremendous.
The smaller the extent, the extra constrained the exploration paths grow to be.
Extra items to seize -> extra levels of freedom -> extra paths to victory.
GMs ought to actually strive sprinkling some obstacles of their Chess matches.
6×6 ‘core’ stage measurement for Countless mode (excluding boundaries on recordsdata a & h or ranks 1 & 8).
Can you see the 4 ‘quad’ facilities the place knights have 0 strikes? (c3, f3, c6, f6)

Machine Studying

Primarily based on these preliminary EDA observations, let’s arrange a dummy classifier to get an excellent fast baseline. We’re simply eye-balling the above trn_df plots for now.

Keep in mind, we prepare a classifier on trn_df —or extra exactly on this case, we ‘eye-ball’ a dummy classifier on trn_df, then we validate our classifier’s predictive energy on val_df. The check set tst_dfis purposefully by no means touched! We’ll solely come again to it after we’ve chosen the successful mannequin utilizing val_df and we’re achieved with completely the whole lot.

1def naive_predict(df):

2 predictions = []

3 for _, row in df.iterrows():

4 # having queens on the board helps

5 if row['numQueens'] > 0:

6 predictions.append(True)

7 # having fewer obstacles helps

8 elif row['obstaclesPortion'] < 0.25:

9 predictions.append(True)

10 # assume the whole lot else unsolvable

11 else:

12 predictions.append(False)

13 return predictions

1naive_preds = naive_predict(val_xs)

2print(mean_absolute_error(val_y, naive_preds))


Right here’s methods to interpret this info:

  • ‘Imply Absolute Error’ (MAE) = how far off are we, on common, from appropriately predicting the solvability (val_y) of every of the 1,102 Echo Chess ranges in val_xs. We bought MAE=0.14 right here.

  • ‘Confusion Matrix’ = fancy time period to imply this 2×2 desk above which provides a way of how ‘confused’ our classifier was in its predictions. For each group of solvability actuals (vertical axis), it exhibits us what our predictions had been (horizontal axis).

    • We appropriately predicted ‘unsolvable’ for 27 true unsolvables (True Negatives, TN). Good.

    • We incorrectly predicted ‘unsolvable’ for 60 ranges that had been truly solvable (False Negatives, FN). Not so good.

    • We appropriately predicted ‘solvable’ for 920 true solvables (True Positives, TP). Good.

    • We incorrectly predicted ‘solvable’ for 95 ranges that had been truly unsolvable (False Positives, FP). Not good.

Our purpose now could be to hopefully enhance on this naive predictor which shall be used as a mininmum baseline for efficiency.

Let’s first evaluate these outcomes to a correct, however easy, Resolution Tree to examine whether or not our EDA-eye-balling predictor is that far off. As an alternative of us splitting hairs on the place the nice-looking teams ought to be separated on visible plots, a tree mannequin will recursively partition our ranges primarily based on the options that greatest separate them into distinct teams —maximizing each heterogeneity throughout the teams and homogeneity inside every.

1from sklearn.tree import DecisionTreeClassifier, export_graphviz

2import graphviz


4m = DecisionTreeClassifier(max_leaf_nodes=4).match(trn_xs, trn_y);


6def draw_tree(t, df, measurement=10, ratio=0.6, precision=2, **kwargs):

7 s=export_graphviz(t, out_file=None, feature_names=df.columns, crammed=True, rounded=True,

8 special_characters=True, rotate=False, precision=precision, **kwargs)

9 return graphviz.Supply(re.sub('Tree {', f'Tree {{ measurement={measurement}; ratio={ratio}', s))


11draw_tree(m, trn_xs, measurement=10)

Okay, so it’s first selecting up on the proportion of obstacles to board measurement, identical to we did. It has a extra exact cut-off —positive, we kinda winged that one. Then it’s trying on the variety of capturable items on the board, and at last the kind of the beginning piece. That makes lots of sense, the EDA plots for these had been fairly attention-grabbing. Not too dangerous.

1print(mean_absolute_error(val_y, m.predict(val_xs)))


MAE at 0.11. Some enchancment already. Perhaps we’ll see even higher outcomes with an even bigger tree that has extra leaves. Feels seasonal sufficient.

1m = DecisionTreeClassifier(max_leaf_nodes=16)

2m.match(trn_xs, trn_y)

3draw_tree(m, trn_xs, measurement=16)

1print(mean_absolute_error(val_y, m.predict(val_xs)))


Nope, our common error truly elevated a tiny bit. Let’s see if we get enhancements with a full-on Random Forest as an alternative.

Consider an RF as a crew of particular person Resolution Timber working collectively to make higher selections. Every one takes a random chunk of our ranges (rows) and a random subset of their options (columns) and tries to provide you with some intelligent prediction. Then they get collectively and mix their efforts. It’s like touring in massive teams, solely truly gratifying, not solely ineffective, and the place selections do get made.

1from sklearn.ensemble import RandomForestClassifier


3rf = RandomForestClassifier(100, min_samples_leaf=5, random_state=42)

4rf.match(trn_xs, trn_y)


6rf_preds = rf.predict(val_xs)

7print(mean_absolute_error(val_y, rf_preds))


1confusion_matrix(val_y, rf_preds)

Attention-grabbing. MAE appears to be like higher at 0.10, however we now appear to virtually at all times be predicting the ‘solvable’ class, resulting in a ridiculous quantity of False Positives: 111 unsolvables incorrectly categorised as solvable ????????‍♂️.

Keep in mind once we information mined all these labeled ranges, and the way crowdsourcing via gaming was solely letting us confidently collect solvables?

Effectively this was certain to occur, and we noticed it coming, however now we’ve ended up with an imbalanced stage dataset the place unsolvables are a lot much less widespread (~1:8 ratio).

Would possibly as effectively mindlessly predict ‘solvable’ throughout the board.

He’s not flawed, you understand.

Perhaps we will strive choosing a threshold larger than 50% for the mannequin to foretell True. That’d be, loosely talking, like asking it to extend its confidence in a stage’s solvability earlier than saying sure.

Hmm. What will we select as a threshold: 60%, 75%, 90%… even larger? They are saying ML is an iterative science. Let’s throw issues on the wall begin with an informed guess to get a really feel for the impression.

1threshold = 0.75

2raw_rf_preds = rf.predict_proba(val_xs)[:, 1]

3rf_preds = (raw_rf_preds >= threshold).astype(int)


5print(mean_absolute_error(val_y, rf_preds))


1confusion_matrix(val_y, rf_preds)

Take a look at us ping-ponging. We now have a barely worse MAE, however the quantity of False Positives has considerably improved (virtually halved) as a result of we at the moment are demanding the next confidence threshold.

I really feel like there’s a lacking half.

Eval Metrics

Earlier than we go any additional with ML mannequin iterations to see what works greatest, we want a crystal clear definition of what it means to be ‘greatest’ for our particular use case, or we would find yourself optimizing the flawed factor proper handed it.

“In case you intention at nothing, you’ll hit it each time.” —somebody who aimed toward sounding deep.

It’s clear there are trade-offs right here for enhancing a few of the metrics on the expense of others. However is there a wiser and extra systematic strategy to tune this threshold hyperparameter? Positive factor, we’ll want to decide on a intelligent level on the Precision-Recall and AUC-ROC curves, as we’ll see beneath. Defining that time is trickier than it feels.

Precision vs Recall

Provided that the last word purpose of all that is to reduce unsolvable ranges being served to the person, and on condition that the procedural technology of recent ranges is pretty environment friendly and ‘low-cost’ in some sense, one argument could possibly be to maximise ‘Precision’ (i.e. the proportion of ranges categorised as ‘solvable’ when they’re, certainly, solvable in actuality).

No, I gained’t embrace LaTeX formulation for Precision and Recall.
They ++affectation and –intuition.

You’ll be able to consider aiming for top precision as requiring a excessive confidence within the ranges we’re judging to be ‘solvable’, even when we occur to even be overconservatively discarding different ranges that had been borderline solvable however that we most well-liked classifying as ‘unsolvable’ simply in case.

In different phrases, we’d be keen to optimize for top precision even on the expense of a possible drop in ‘Recall’ (i.e. even when many probably legitimate ranges get missed out on via misclassification in our overzealous quest for purity of ‘solvables’).

Recall and True Constructive Fee (TPR) are actually the identical factor. Gandalf-level renaming.

FPR vs F1

Equally, we additionally definitely need to reduce the “False Positives Fee” or FPR ratio (i.e. proportion of ‘unsolvable’ ranges that mistakenly get categorised as ‘solvable’, thus ruining our purity of ‘solvables’). We are able to both plot FPR instantly for each threshold of our binary classifier, or use the Receiver Working Attribute (ROC) curve to research the FPR vs TPR (True Constructive Fee) trade-off. We’ll try this shortly beneath.

Nonetheless, given the category imbalance and that almost all of our dataset comes with constructive labels (‘solvable’ ranges), we will seemingly anticipate Precision on this dataset to be barely much less delicate than FPR to an ‘unsolvable’ stage being misclassified as ‘solvable’. It is because precision (and recall) would mirror principally the flexibility of prediction of the constructive class (solvables), not the unfavorable class (unsolvables) which is able to naturally be tougher to detect because of the smaller variety of samples.

On this case, we will anticipate FPR to be extra delicate than Precision to the mannequin’s precise efficiency on condition that the unfavorable class is the minority one within the dataset.

Subsequently, we’ll search for the optimum threshold vary that minimizes FPR, and maximize the F1 Rating inside that vary.

Why F1, for that secondary purpose? Making an attempt to maximise Precision even additional inside an already minimized FPR vary may result in undesirable extremes like extremely low Recall and/or TPR values. Conversely, it could possibly be tempting to go looking inside that vary for a ‘reasonable-recall’ level alongside the Precision-Recall curve, in order that fewer procedurally generated solvables get unnecessarily tossed out. However that is just about what F1 is doing for us underneath the hood both method, so F1 will get us two birds, one stone.

The trick is to maintain the grip free sufficient, otherwise you’ll find yourself letting go of each final stage.

We then outline a max % FPR we’re keen to just accept for our use case, say, 5%. Keep in mind, this may characterize how most of the generated ranges that we predict to be solvable – all else equal – find yourself being, sadly, unsolvable. We’ll nonetheless have some leeway in methods to make use of those generated ranges and their predictions in prod, however for now this looks like an affordable start line.

To recap our method:

Right here’s the code we’ll must implement it —minus trivia for plots and the like. We’ll be plotting a bunch of cool curves for various fashions nevertheless it’s all primarily based on the kernel beneath.

1from sklearn.metrics import roc_curve, precision_recall_curve


3def threshold_tune_binary_classifier(val_xs, val_y, mannequin, max_fpr_tolerance=0.05):


5 raw_preds = mannequin.predict_proba(val_xs)[:, 1]

6 fpr, tpr, thresholds_roc = roc_curve(val_y, raw_preds, drop_intermediate=False)

7 precision, recall, thresholds_pr = precision_recall_curve(val_y, raw_preds)

8 f1_scores = 2 * (precision * recall) / (precision + recall)

9 min_fpr_thresholds_range = thresholds_roc[fpr <= (min(fpr) + max_fpr_tolerance)]


11 # Discover thresholds, precisions, recollects inside min-FPR vary

12 sorted_idx = np.argsort(thresholds_pr)

13 idx_range = np.the place((thresholds_pr[sorted_idx] >= min(min_fpr_thresholds_range)) & (thresholds_pr[sorted_idx] <= max(min_fpr_thresholds_range)))

14 pr_idx_in_range = sorted(sorted_idx[idx_range])

15 pr_thresholds_in_range = thresholds_pr[pr_idx_in_range]

16 precision_in_range = precision[pr_idx_in_range]

17 recall_in_range = recall[pr_idx_in_range]


19 # Maximize F1 inside min-FPR vary

20 f1_in_range = 2 * (precision_in_range * recall_in_range) / (precision_in_range + recall_in_range)

21 min_fpr_max_f1_threshold = pr_thresholds_in_range[np.argmax(f1_in_range)]

22 min_fpr_max_f1_index = int(np.the place(thresholds_pr == min_fpr_max_f1_threshold)[0])

23 roc_target_index = int(np.the place(thresholds_roc == min_fpr_max_f1_threshold)[0])


25 return {

26 'threshold': min_fpr_max_f1_threshold,

27 'precision': precision[min_fpr_max_f1_index],

28 'recall': recall[min_fpr_max_f1_index],

29 'f1': f1_scores[min_fpr_max_f1_index],

30 'fpr': fpr[roc_target_index],

31 'tpr': tpr[roc_target_index]

32 }

Mannequin Choice and Tuning

Nice. So now we now have information, options, metrics, and a correct strategy to choose appropriate fashions. Let’s spin up some correct tabular fashions and see what’s up.

Tabular fashions

Random Forest

Keep in mind our rf? Let’s retune this mannequin to enhance the metrics we truly care about.

1rf_targets = threshold_tune_binary_classifier(val_xs, val_y, rf)

1Threshold: 0.9505685425685425

2Precision: 98.63%

3Recall: 43.98%

4F1 Rating: 60.83%

5True Constructive Fee: 43.98%

6False Constructive Fee: 4.92%

And I’ve bought some snazzy plots for this rf for us to take a look at. Don’t fear if issues look a bit complicated, we’ll undergo them collectively in a sec.

Left: FPR for each prediction threshold we may set. We’re virtually hitting our 5% max tolerance.
Proper: F1 for each attainable threshold. That’s the max we may get given our FPR constraint.
(X-axis the identical for each)
Left: Precision for each attainable threshold. As anticipated, FPR will get us a great worth right here ‘totally free’.
Proper: Corresponding Recall at every Precision worth. Once more, F1 lands us on a strong level ‘totally free’.
(Y-axis the identical for each)
Bonus: Corresponding TPR (i.e. Recall) for each FPR worth. This one particularly illustrates the trade-off of constructing overconservative predictions. However it’s just a little redundant for what we want.

What does this all imply?

It implies that for this mannequin and dataset, if we set an excessively conservative prediction threshold (i.e. must be >0.95 to be thought-about solvable, the whole lot else is assumed not), we will drop the False Constructive price all the way down to 4.9% and get a fairly excessive Precision of 98.6%. In different phrases, we’d be fairly assured that the solvable ranges we select are literally solvable.

Evidently this comes at a value. We’d solely be getting a True Constructive price of 44%, so we’d be throwing out 1 of each ~2.2 truly solvable ranges, say half of them. However that’s the very best that this mannequin may give us for our information given the metrics we care to optimize. Not too shabby tbh.

Left: orginal RF earlier than tuning a threshold. Proper: after tuning for min-FPR, max-F1.
We utterly reversed virtually all of the FPs at the price of solely ~half the TPs.
This TP loss would look scary if we hadn’t purposefully outlined our targets.

All in all, this appears to be like promising. We’ll most likely find yourself utilizing a number of fashions that may decide up on various things and ensemble them collectively. Let’s add this mannequin to our models_predictions ensembling dict.

1raw_preds = rf.predict_proba(val_xs)[:, 1

2rf_threshold_target = rf_targets['threshold']

3predictions = (raw_preds >= rf_threshold_target).astype(int)

4models_predictions['Random-Forest-tuned'] = {

5    'tuned_threshold': rf_threshold_target,

6    'raw_predictions': raw_preds,

7    'class_predictions': predictions


Balanced Random Forest

One other method is to make use of a Balanced Random Forest (BRF) which may additionally assist scale back the speed of false positives brought on by our imbalanced dataset.

1from imblearn.ensemble import BalancedRandomForestClassifier


3brf = BalancedRandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42)

4brf.match(trn_xs, trn_y)

This has a lot much less FPs than the PRE-tuning unbalanced Random Forest, nevertheless it doesn’t beat the threshold-tuned model of our unbalanced rf, given our metrics targets. Let’s have a look at if we will mix each approaches by threshold-tuning the BRF.

1brf_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, brf, max_fpr_tolerance)['threshold']

1Threshold: 0.6988891533303299

2Precision: 98.67%

3Recall: 45.51%

4F1 Rating: 62.29%

5True Constructive Fee: 45.51%

6False Constructive Fee: 4.92%

This tuned-BRF has very comparable FPR and Precision to the unbalanced tuned-RF, whereas additionally having barely TPR and F1. We’ll add it to the ensemble as effectively —similar code with brf as an alternative of rf.

XGBoost (eXtreme Gradient Boosting)

I’m pondering it’s value making an attempt to coach an XGBoost mannequin as an alternative —XGBoost tends to deal with unbalanced datasets barely higher, so it could possibly be a great match on this case.

Keep in mind the crew analogy we touched on for random forests? XGBoost is like that good child who added some construction across the decision-making crew. You’re nonetheless making a bunch of timber, however not solely randomly —as an alternative you attempt to repair the errors of the earlier timber as you go, paying additional consideration to the errors and adjusting every new tree accordingly. Therefore the boosting identify.

1import xgboost as xgb


See Also

3xgb_model = xgb.XGBClassifier(n_estimators=100, min_child_weight=5, random_state=42, scale_pos_weight=sum(trn_y==0)/sum(trn_y==1))

4xgb_model.match(trn_xs, trn_y)

This primary try of XGBoost appears to have a reasonably excessive FPR. Let’s strive threshold-tuning it see if it might probably match our wants higher.

1xgb_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, xgb_model, max_fpr_tolerance)['threshold']

1Threshold: 0.9594321846961975

2Precision: 98.92%

3Recall: 46.94%

4F1 Rating: 63.67%

5True Constructive Fee: 46.94%

6False Constructive Fee: 4.10%

Take a look at that beautiful child. Higher vitals throughout the board: decrease FPR, larger Precision, larger Recall/TPR, larger F1. Offered to the ensemble within the again.

1raw_preds = xgb_model.predict_proba(val_xs)[:, 1]

2predictions = (raw_preds >= xgb_threshold_target).astype(int)

3models_predictions['XGBoost-tuned'] = {

4 'tuned_threshold': xgb_threshold_target,

5 'raw_predictions': raw_preds,

6 'class_predictions': predictions


Characteristic Significance

One other attention-grabbing method we will strive is to solely choose crucial options based on the random forest, after which prepare our mannequin solely on these options, disregarding all others. The rationale this works in lots of circumstances is as a result of it might probably assist scale back noise, multicollinearity with much less related options, and make the mannequin much less susceptible to overfitting.

Random Forests are normally good at figuring out function significance in a dataset. Let’s have a look utilizing our rf mannequin.

1def get_top_features_importance(mannequin, trn_xs, min_imp=0, num_top_features=40):

2 plt.determine(figsize=(10, 8))

3 feature_importances = mannequin.feature_importances_

4 df = pd.DataFrame(dict(cols=trn_xs.columns, imp=feature_importances))

5 df = df[df['imp'] > min_imp].sort_values(by='imp').tail(num_top_features)

6 ax = df.plot('cols', 'imp', type='barh', legend=False, xlabel='', ylabel='')

7 plt.present()

8 return df.sort_values(by='imp', ascending=False)


10rf_top_features = get_top_features_importance(rf, trn_xs, min_imp=0.02)

Now let’s retrain our random forest solely on these high options. We’ll make certain to threshold-tune it as effectively.

1imp_trn_xs = trn_xs[rf_top_features.cols]

2imp_val_xs = val_xs[rf_top_features.cols]


4rf_top_feats = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42)

5rf_top_feats.match(imp_trn_xs, trn_y)


7top_feats_rf_threshold_target = threshold_tune_binary_classifier(imp_val_xs, val_y, rf_top_feats, max_fpr_tolerance)['threshold']

1Threshold: 0.9614293206793206

2Precision: 98.72%

3Recall: 47.14%

4F1 Rating: 63.81%

5True Constructive Fee: 47.14%

6False Constructive Fee: 4.92%

Anoter mannequin giving nice outcomes. Throw it within the ensemble bag. the drill.

Alternatively, we may examine if there’s a greater function choice subset based on the function significance interpretation of an XGBoost mannequin versus the Random Forest. On the whole, these ought to be shut sufficient nevertheless it doesn’t damage to examine for our information.

1def xgb_top_features_importance(xgb_model, importance_type, min_imp=0, num_top_features=40):

2 xgb.plot_importance(xgb_model, importance_type=importance_type, max_num_features=num_top_features,

3 title=f"Characteristic {importance_type} significance", xlabel=f"{importance_type}s")

4 xgb_imps_dict = xgb_model.get_booster().get_score(importance_type=importance_type)

5 xgb_imps_dict = dict(sorted(xgb_imps_dict.gadgets(), key=lambda merchandise: merchandise[1], reverse=True))

6 df = pd.DataFrame(record(xgb_imps_dict.gadgets()), columns=['features', f"{importance_type}_importance"]).head(num_top_features)

7 df = df[ df[f"{importance_type}_importance"] >= min_imp ]

8 return df


10xgb_new = xgb.XGBClassifier(n_estimators=100, min_child_weight=5, random_state=42, scale_pos_weight=sum(trn_y==0)/sum(trn_y==1))

11xgb_new.match(trn_xs, trn_y)

12xgb_top_features = xgb_top_features_importance(xgb_new, importance_type="acquire", num_top_features=15)

13x_imp_trn_xs = trn_xs[xgb_top_features.features]

14x_imp_val_xs = val_xs[xgb_top_features.features]

XGB assigns a considerably completely different rating right here and there, however the general assortment is analogous.

Apparently, once we retrain the XGB mannequin on xgb_top_features as an alternative of rf_top_features, we get barely worse outcomes. Even various the importance_type XGB makes use of to pick the highest options (throughout acquire, cowl or weight) doesn’t enhance a lot over the nice outdated rf_top_features method. We’ll simply maintain it easy and keep on with the unique appraoch we had.

I’d say we now have an honest begin thus far with these tabular fashions. There’s one thing far more intriguing we is perhaps lacking out on, although. I’m positive a few of you studying this might need been ready for it from the second we uttered the phrase ‘classification’.

CNNs and picture classification

CNN classifier utilizing chess board arrays

Taking inspiration from picture recognition approaches, since 2D chess boards are very structured and at all times are available the identical kind with a cell containing one in all a predefined set of lessons, we will prepare a Convolutional Neural Community (CNN) on the 2D chess board states extracted from each stage’s FEN.

  • (1) We’d want 13 classes for the content material of a cell on condition that pawns are excluded right here (Okay, Q, B, N, R, ok, q, b, n, r, _, x, X).

  • (2) We then one-hot-encode every of those 13 to keep away from any implicit relative rating between them.

  • (3) Each one-hot-encoding of the 13 can now be thought-about a separate channel for the CNN.

  • (4) So our enter tensor turns into (numSamples, 13, 8, 8) for each 8×8 chess board. Observe that for the reason that largest ranges in Countless mode are of measurement 6×6, we may additionally preconvert all 2D boards to 6×6 and make the enter tensor (numSamples, 13, 6, 6) if obligatory.

A chess board is actually this. However as an alternative of RGB, it’s KQRBN. Similar stuff, extra channels.
It’s all tensors all the best way down regardless.

This method may even seize the essence of the spatial information and relative positioning of items in a stage in a higher method than img classification of the rendered stage from the Echo Chess app, as a result of solely the ‘substance’ of a stage’s config is being educated on, with out all of the noise and variability linked to pointless visible components.

On the flip aspect, although, this will additionally make it difficult to easily fine-tune a pretrained SOTA CNN or img classifier from the net for the reason that enter format may grow to be considerably completely different from what is anticipated.

In any case, we will see how effectively a primary neural web performs on this method, after which take it from there. We’ll want to arrange the enter information accordingly.

1trn_X = np.array([fen_to_board(x) for x in trn_df['compoundFen'].values])

2one_hot_trn_X = np.array([one_hot_encode_2D_array(x, char_to_index) for x in trn_X])

3trn_X_tensors = torch.tensor(one_hot_trn_X)

We’ll additionally do the identical for val_X, trn_y, and val_y.

To decide on a CNN structure, let’s first begin with a easy baseline and we will iterate from there. On the whole, most CNNs may be regarded as just about some conv layers, normalization, and activation features, rinse and repeat. We’ll use the next easy arch as a begin.

1class ChessCNN(nn.Module):

2 def __init__(self):

3 tremendous(ChessCNN, self).__init__()

4 # Convolutional Layer 1

5 self.conv1 = nn.Conv2d(in_channels=13, out_channels=16, kernel_size=3, stride=1)

6 # Max Pooling Layer 1

7 self.maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2)

8 # Convolutional Layer 2

9 self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1)

10 # Absolutely Related Layer 1

11 self.fc1 = nn.Linear(in_features=32, out_features=64)

12 # Absolutely Related Layer 2 (Output Layer)

13 self.fc2 = nn.Linear(in_features=64, out_features=1)


15 def ahead(self, x):

16 x = F.relu(self.conv1(x))

17 x = self.maxpool1(x)

18 x = F.relu(self.conv2(x))

19 x = torch.flatten(x, 1)

20 x = F.relu(self.fc1(x))

21 x = torch.sigmoid(self.fc2(x))

22 return x

That is what our primary cnn appears to be like like:


2 (conv1): Conv2d(13, 16, kernel_size=(3, 3), stride=(1, 1))

3 (maxpool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)

4 (conv2): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))

5 (fc1): Linear(in_features=32, out_features=64, bias=True)

6 (fc2): Linear(in_features=64, out_features=1, bias=True)


We’ll prepare this cnn mannequin for 8 epochs with the Binary Cross-Entropy (BCE) because the Loss Perform, Adam because the optimizer, and a Studying Fee of 0.002. Why these selections? That’s what labored greatest in my experiments on this information.

1cnn = ChessCNN()

2criterion = nn.BCELoss()

3optimizer = optim.Adam(cnn.parameters(), lr=0.002)

4num_epochs = 8

5batch_size = 8

6for epoch in vary(num_epochs):

7 cnn.prepare()

8 for i in vary(0, len(trn_X_tensors), batch_size):

9 inputs = trn_X_tensors[i:i+batch_size].float()

10 labels = torch.tensor(trn_y[i:i+batch_size], dtype=torch.float32).view(-1, 1)

11 optimizer.zero_grad()

12 outputs = cnn(inputs)

13 loss = criterion(outputs, labels)

14 loss.backward()

15 optimizer.step()

Now that we now have a neural web classifier spun up, let’s tune its prediction threshold like we did for the opposite classifer fashions.

1def get_cnn_raw_preds(cnn, X_train, X_val):

2 cnn.eval()

3 with torch.no_grad():

4 trn_raw_predictions = cnn(X_train.float())

5 trn_raw_predictions = trn_raw_predictions.squeeze().numpy()

6 val_raw_predictions = cnn(X_val.float())

7 val_raw_predictions = val_raw_predictions.squeeze().numpy()

8 raw_preds_dict = {

9 "trn_raw_preds": trn_raw_predictions,

10 "val_raw_preds": val_raw_predictions

11 }

12 return raw_preds_dict


14trn_raw_predictions = get_cnn_raw_preds(cnn, trn_X_tensors, val_X_tensors)['trn_raw_preds']

15val_raw_predictions = get_cnn_raw_preds(cnn, trn_X_tensors, val_X_tensors)['val_raw_preds']

1cnn_threshold_target = threshold_tune_binary_classifier(val_X_tensors, val_y, cnn, max_fpr_tolerance, val_raw_preds=val_raw_predictions)['threshold']

Okay, now we’re speaking. However earlier than we rush and add this mannequin to our ensemble, it’s at all times good to examine our curves to see if something appears to be like off.

Hmm. This mannequin’s curves for FPR, Precision and (particularly) F1 are extremely steep within the neighborhood of the brink goal. I don’t really feel so nice about this information.

It implies that the mannequin could possibly be extraordinarily delicate to noise or slight variations within the information. And it’s particularly the case for the FPR worth that would fluctuate simply between 0 and 40% if the check information has sufficient variation in comparison with our coaching+validation units. Will probably be vital to maintain this in thoughts when deciding whether or not/methods to incorporate this mannequin’s predictions within the ensemble.

With all this experimentation we’ve been doing to take care of our imbalanced information, we nonetheless haven’t tried including the key sauce. Let’s rectify this subsequent.

Resampling to steadiness class distribution

Provided that solely a small proportion of ranges within the information are of the ‘unsolvable’ class, what if we simply tried to… oversample from this minority class to get a extra balanced dataset?

If we consider the unsolvable ranges we at the moment have pretty characterize the underlying distribution of unsolvables, let’s pattern extra of them! This has the potential to enhance mannequin efficiency by tackling the imbalanced information concern on the root.

We proceed to making an attempt varied resampling strategies like:

So now we return to the very starting, oversample, retrain, retune, and re-evaluate. For every resampling methodology and every mannequin. Enjoyable enjoyable enjoyable ???? Don’t fear I’ll fast-forward you to the attention-grabbing stuff.

From all of the experimentation I did, the resampling methodology that led to the very best efficiency was the common random oversampling with out interpolation utilizing the RandomOverSampler. SMOTE and its variant resampling strategies had been disappointingly not useful in enhancing outcomes on this dataset. In truth, efficiency truly suffered when making an attempt out fancier resampling strategies on this information.

1trn_xs_rnd_resampled, trn_y_rnd_resampled = resample_to_balance('RandomOverSampler', trn_xs,trn_y)

Good. Our coaching set is now at 6,856 ranges with 50-50 solvables.
It’s loopy to assume that every of those is an precise Echo Chess stage you possibly can play!

Oversampling with tuned-RF

Now that we’ve up to date our coaching set with oversampling, we will retrain (and re-tune) our promising random forest mannequin. Discover we’re utilizing the brand new variations right here: trn_xs_rnd_resampled and trn_y_rnd_resampled.

1rf_rnd_rs = RandomForestClassifier(100, min_samples_leaf=5, random_state=42)

2rf_rnd_rs.match(trn_xs_rnd_resampled, trn_y_rnd_resampled)

3rf_rnd_rs_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, rf_rnd_rs, max_fpr_tolerance)['threshold']

1Threshold: 0.8903113063407183

2Precision: 98.60%

3Recall: 43.06%

4F1 Rating: 59.94%

5True Constructive Fee: 43.06%

6False Constructive Fee: 4.92%

Positive, not dangerous, I’ll take it. Dump it in our ensemble.

1models_predictions['RandomOversampling-Random-Forest-tuned'] = {

2 'tuned_threshold': rf_rnd_rs_threshold_target,

3 'raw_predictions': rf_rnd_rs.predict_proba(val_xs)[:, 1] ,

4 'class_predictions': (rf_rnd_rs.predict_proba(val_xs)[:, 1] >= threshold).astype(int)


Oversampling with tuned-BalancedRF

1brf_rnd_rs = BalancedRandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=42)

2brf_rnd_rs.match(trn_xs_rnd_resampled, trn_y_rnd_resampled)

3brf_rnd_rs_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, brf_rnd_rs, max_fpr_tolerance)['threshold']

1Threshold: 0.8999404166095343

2Precision: 98.50%

3Recall: 40.10%

4F1 Rating: 57.00%

5True Constructive Fee: 40.10%

6False Constructive Fee: 4.92%

Okay, perhaps I’m getting just a little too choosy given the stellar choices we now have, however this one is perhaps pushing it a bit.

I imply we’re already oversampling, and utilizing a balanced RF to deal with imbalanced information, and tuning a threshold to be tremendous conservative (all on the threat of generalizability points). A minimum of we should always anticipate some loopy efficiency. I feel we go on this one.

Oversampling with tuned-XGBoost

I’ll spare you a similar traces of code. This one additionally bought higher with oversampling. Guess the place it went? Into the ensemble.

I feel this was truly our greatest mannequin thus far —laborious to maintain up with all these good goodies. Actual-talk, although, subsequent time I’m taking place such a deep rabbit gap, I’m clearly utilizing an MLOps platform like Weights & Biases to maintain monitor of all these mannequin iterations. IYKYK. You reside and also you study.

Ensembling all of the promising fashions

Up to now we’ve been including every promising mannequin to a models_predictions dict of dicts. Let’s convert it to a dataframe to work with it simply.

1models_predictions_df = pd.DataFrame.from_dict(models_predictions, orient='index')


1['Random-Forest-tuned' 'Balanced-Random-Forest-tuned' 'XGBoost-tuned' 'Top-features-RF-tuned' 'CNN-tuned' 'RandomOversampling-Random-Forest-tuned' 'RandomOversampling-XGBoost-tuned']

We are able to now experiment with completely different approaches of ensembling utilizing the Validation set to choose the successful one.

As an example, we may get the imply uncooked predictions throughout the ensemble after which use our present threshold_tune_binary_classifier operate to pick a great ensemble_raw_threshold_target prediction threshold for the avg uncooked preds that minimizes FPR and maximizes F1 scores inside our 5% tolerance.

1avg_models_raw_preds = models_predictions_df['raw_predictions'].imply()


3ensemble_raw_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, None, max_fpr_tolerance=max_fpr_tolerance, val_raw_preds=avg_models_raw_preds)['threshold']


5avg_raw_to_tuned_preds = (avg_models_raw_preds >= ensemble_raw_threshold_target).astype(int)

Equally, we may ensemble by averaging the category prediction themselves (as an alternative of uncooked preds), then choosing a great ensemble_classes_threshold_target for the avg of the category preds themselves.

1avg_models_class_preds = models_predictions_df['class_predictions'].imply()


3ensemble_classes_threshold_target = threshold_tune_binary_classifier(val_xs, val_y, None, max_fpr_tolerance=max_fpr_tolerance, val_raw_preds=avg_models_class_preds)['threshold']


5avg_class_to_tuned_preds = (avg_models_class_preds >= ensemble_classes_threshold_target).astype(int)

Or we may require that each these approaches agree (relative to every one’s tuned threshold respectively), if we wished to be much more overconservative on solvability.

1comb_preds = (avg_raw_to_tuned_preds * avg_class_to_tuned_preds > 0).astype(int)

However usually, every of those approaches dangers getting us into harmful overfitting territory as we squeeze greedily extra of the tuning juice out of the validation set. Apparently, the easier, non-tuned method, finally ends up being as performant too: ensembling via majority voting of the category predictions.

1majority_needed = math.ceil(num_models_in_ensemble / 2)

2majority_class_preds = (num_positive_class_preds >= majority_needed).astype(int)

Listed here are the ultimate outcomes, side-by-side, for every of those 4 ensembling methods, post-threshold-tuning (if relevant) on the validation set:

When selecting amongst gems, decide the colour you want.

Wanting on the TPs and FPs of every, regardless of which ensembling methodology is chosen, when judging a generated stage as solvable, we’d nonetheless guess proper ~99% of the time. For this and the explanations above, we’ll go together with the straightforward majority method as our ensembling method.

Extra particularly, our purpose when productizing this shall be to choose the “most-promisingly solvable” stage from a candidate record of randomly generated ranges. So we’ll kind first by class majority vote predictions, then by avg uncooked preds inside every class —this fashion even when no solvables are discovered, at the very least we’d be serving the most promising stage among the many unsolvable ones.

1# Instance utilization for inference

2ensemble_preds = predict_solvability(level_df, 'fashions')

3most_promising_levels = ensemble_preds.sort_values(by=['ensemble_preds', 'avg_raw_preds'], ascending=[False, False])

Inference on the Exterior Take a look at

Keep in mind that 10% of knowledge we utterly stashed away in tst_df ages in the past? We are able to now lastly examine how our ensemble is doing on it. If it sucks, we’re sort of again to sq. one like our buddy Invoice. Second of fact.

FPR of 1.2% on the untouched check set. I’m not crying, you’re crying.

Growth! And that is it! Extraordinarily promising outcomes on each the validation and check units. Once we’re able to push this to prod, we will load these best-performing fashions into the server and cache them, seize their pre-tuned thresholds, generate particular person predictions on the inference stage, and ensemble them collectively via majority voting.

Nice stuff. 6 classical ML + 1 DL mannequin. All easy architectures. Preprocessing, function engineering, hyperparameter tuning and a few utilization of oversampling right here and there to deal with the imbalanced dataset. A number of different minor issues however all in all only a classic ML modeling routine.

Genies had been the primary AGIs.

However wait! We nonetheless have two last methods up our sleeve to enhance this additional. Chess gamers will like these.

Trick #1: Information Augmentation

Taking inspiration from different area areas like picture processing, we will rapidly discover a chance to considerably increase our dataset by creating new legitimate, labeled samples from our present labeled ones:

  • Each rectangle stage may be mirrored both horizontally or vertically on the opposing aspect (relying on its form) to create one new, successfully equivalent, stage for classification.

  • Equally, each quad stage may be mirrored throughout the three opposing corners (both horizontally, vertically, or each horizontally and vertically) to create 3 new, successfully equivalent, ranges.

Prime-left: authentic stage. Prime-right: augmentation via vertical mirroring.
Backside-left: aug. via horizontal mirroring. Backside-right: aug. via v+h mirroring.

Every of those 4 ranges above are just about the identical and will command the identical solvability label, but they every have a unique compoundFEN enter. Importantly, additionally they precisely characterize the kind of ranges that get generated by our floor fact distribution —which we may be positive of, on condition that it’s a parametrized distribution we invented ourselves for the random stage generator.

Good information augmentation alternative + straightforward low hanging fruits. Let’s implement them rapidly.

1def mirror(compoundFen, course):

2 desk = fenToTable(compoundFen)

3 if course == 'h':

4 return tableToFen(desk[[0, 4, 5, 6, 1, 2, 3, 7]])

5 elif course == 'v':

6 return tableToFen(desk[:, [0, 4, 5, 6, 1, 2, 3, 7]])


8def increase(rows, mirroring_direction):

9 augmented_rows = rows.copy()

10 augmented_rows['compoundFen'] = augmented_rows['compoundFen'].apply(mirror, args=(mirroring_direction,))

11 augmented_rows['levelOrientation'] = augmented_rows['levelOrientation'].apply(switch_orientation, args=(mirroring_direction,))

12 return augmented_rows


14def mirror_augmentation(df):

15 rectangle_rows = df[df['levelGeoShape'] == 'rectangle']

16 quad_rows = df[df['levelGeoShape'] == 'quad']

17 h_rectangle_rows = rectangle_rows[rectangle_rows['levelOrientation'].isin(['h-top', 'h-bottom'])]

18 v_rectangle_rows = rectangle_rows[rectangle_rows['levelOrientation'].isin(['v-left', 'v-right'])]

19 augmented_rows = pd.concat([

20 augment(h_rectangle_rows, 'h'),

21 augment(v_rectangle_rows, 'v'),

22 augment(quad_rows, 'h'),

23 augment(quad_rows, 'v'),

24 # mirror quad diagonally across

25 augment(augment(quad_rows, 'v'), 'h')

26 ], ignore_index=True)

27 augmented_df = pd.concat([df, augmented_rows], ignore_index=True)

28 return augmented_df

And right here is an easy instance in motion.

1fen_test = "XXXXXXXX/XqbxBb1X/Xx1xnxrX/X1n1kq1X/XXXXXXXX/XXXXXXXX/XXXXXXXX/XXXXXXXX w - - 0 1"


1[['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']

2 ['X' 'q' 'b' 'x' 'B' 'b' ' ' 'X']

3 ['X' 'x' ' ' 'x' 'n' 'x' 'r' 'X']

4 ['X' ' ' 'n' ' ' 'k' 'q' ' ' 'X']

5 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']

6 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']

7 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']

8 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']]

1mirrored_fen = mirror(fen_test,'h')


1[['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']

2 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']

3 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']

4 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']

5 ['X' 'q' 'b' 'x' 'B' 'b' ' ' 'X']

6 ['X' 'x' ' ' 'x' 'n' 'x' 'r' 'X']

7 ['X' ' ' 'n' ' ' 'k' 'q' ' ' 'X']

8 ['X' 'X' 'X' 'X' 'X' 'X' 'X' 'X']]

It could look foolish, however these things truly permits us to 2X our ‘rectangle’ row samples and 4X our ‘quad’s, considerably rising the dimensions of our dataset.

1augmented_trn_df = mirror_augmentation(trn_df)

2print(len(trn_df), len(augmented_trn_df))

13855, 7877

That’s greater than a 2X improve within the measurement of our information! 7,877 Echo Chess ranges. I ponder how lengthy it’d take for somebody to play all of them.

It’s value mentioning too that we may equally implement additional information augmentations via easy 90-degree rotations, in addition to inside reflections, of any stage to generate new FENs out of each quad, rectangle, and even full stage. It’s not proven within the present model however I’ll simply go away you with this for inspiration:

A board is a board is a board.

Trick #2: Pre-predicting ‘assured unsolvables’

We are able to once more use our data of the Echo Chess recreation mechanics to instantly filter out any ‘assured’ unsolvable ranges earlier than even passing the baton to the ensemble mannequin – for instance any stage with:

  • No legitimate first transfer

  • No black items on board

  • Extra ‘end-state-pieces’, or black items with no legitimate strikes, than beginning white items

  • ‘Untethered’ beginning bishops —i.e. beginning with a bishop on a white sq. when all black items are on a black sq., or vice versa

  • And so forth…

We already noticed methods to examine for a lot of of those within the Characteristic Engineering part, so I gained’t bore you with the implementation once more. The important thing factor right here is that once we add all of the little tweaks like these, in addition to the information augmentation half, the oversampling, function engineering and have choice, and retrain our ensemble mannequin on the augmented set, the enhancements rapidly add up even additional.

We’ll see subsequent how that is all faring right this moment in prod on the ⭐️LIVE⭐️ app on

The FINAL End result on the LIVE APP

I can’t consider I’m saying this… However we’re FINALLY on the stage the place we will put all this in prod. It’s right here. It’s reside. Persons are utilizing it. With ML.

I do know you’re dying to learn how this all ends, so I’ll spare you the small print. That is the gist of it.

The live Echo Chess Endless mode in prod generates a solvable maze 99%+ of the time. You’ll be able to check it your self ????????

How? Each single time we have to serve a brand new stage to the participant, 50 (sure, FIFTY) completely different candidate ranges get procedurally generated utilizing parametrized random distributions.

On the server, we generate predictions for every of those 50 random gens utilizing every of the tuned ML/DL fashions we’ve chosen. We then ensemble them via majority voting, plus maintain monitor of every stage’s common uncooked predictions from each mannequin.

We then kind the 50 candidate gens by majority class first, and avg uncooked preds second. The “most-promisingly solvable” stage from these is then despatched to the consumer, rendered and served to be performed. The entire thing occurs in a split-second whereas the extent transition sound impact is enjoying.

IT WORKS, YOU GUYS! We made a enjoyable Single-Participant Chess ????

Future work, what didn’t make the minimize

I’d be remiss if I didn’t point out a few of the loopy concepts I wished to strive with Echo Chess earlier than I noticed this rabbit gap was getting a tad bit too deep.

If any of those resonates with you (or particularly when you have higher concepts than these), drop me a notice and perhaps I’ll strive it out within the upcoming model.

RE:Enhancing Predictions

Transformer-based NLP classification of compoundFENs

LLM k-shot in-context studying

  • As an alternative of utilizing a task-specific classifier, feed in all of the labeled FENs from the coaching set as few-shot examples in a immediate to an LLM, and supply it with the unlabeled validation+check units to make solvability predictions on (can do the inference one FEN at a time or in bulk via the LLM supplier’s API or a wrapper like LangChain).

  • Given the dimensions of the coaching units, we’d definitely want to make use of a large-context-window LLM like Anthropic’s Claude2 100k – and presumably even some artistic model of Retrieval Augmented Technology (RAG) and vector dbs.

JUMBO information augmentation

After EVERY participant transfer (which ends up in a brand new board state), save every interim compoundFEN as its personal distinctive ‘new stage’.

  • When the precise stage is solved (or is tagged as unsolvable), assign the identical solvability for all these distinctive interim FENs

  • That’s as a result of if a path exists from begin to finish, then definitionally all of the interim states of that path can attain the tip, in the event that they comply with that very same path.

Artificial technology of Unsolvables

  • For each stage with out a legitimate first transfer, generate N related ranges with the identical startingWhitePiece on that very same beginning sq., however with ALL different NON-obstacle squares changed with random combos of {black piece, impediment sq., empty sq.}. All of them gained’t have legitimate first strikes both.

  • Do the identical for each stage with extra “end-state” items than white items. Even shuffle the white items themselves in sort and placement, however maintain their similar rely. All these ranges shall be unsolvable too.

Precise Picture classification of ranges

  • Generate jpg screenshots of each stage utilizing its compoundFEN, then prepare a picture classifier on the labeled photos

  • Can fine-tune a pretrained SOTA classifier on this img dataset to get higher efficiency than coaching from scratch. And on condition that chess boards most likely don’t look as just like what ImageNet offers with normally as one thing like ‘cats’ vs ‘canine’, we’d seemingly decide one of many architectures from the PyTorch Picture Fashions (timm) that’s extra appropriate to datasets that aren’t very ‘ImageNet-like’.

  • has achieved some great analysis on such fashions and datasets.

Letting AI truly play utilizing Reinforcement Studying

I ponder how they’ll drink these beers.

From a ‘enterprise’ use case and user-centered design standpoint, we’re strong. We’ve additionally just about saturated the returns from enhancements in solvability prediction tbh. So let’s flip our eyes as an alternative to general Product and Recreation Design.

RE:Enhancing the Product expertise

  • Alter issue curve as participant solves extra procedurally-generated ranges (by way of various distribution parametrization, classification thresholds, chosen percentile from the sorted preds, and many others.)

  • Let gamers determine what gens appear to be utilizing an LLM interface: ‘I need extra knights, 4-square partitions, pawns with promotion, however maintain it straightforward’

  • Participant can unlock power-ups like:

    • Name reinforcements so as to add additional random white items on the board (begin with 0 out there, win additional name each X ranges)

    • Improve your white piece to Queen at any time (begin with 2 prompt promotions out there, win additional one each Y ranges)

    • Set off regional explosions by touchdown on a goal highlighted sq. that exhibits up each X ranges 

    • Portal squares to teleport the participant’s present piece to a different a part of the board

  • A community-driven stage maker for Echo Chess. Anybody may design their very own loopy stage and add it to the neighborhood to check out.

I’m positive there are tons of different attention-grabbing concepts as effectively. Curious what others will discover.


The sport is reside, it’s being performed by 1000s, each Basic and Countless modes are a success of their distinctive methods. ML actually saved the day for Countless. And I’ll at all times get pleasure from manually crafting hardcore puzzles for Basic.

So simply go forward and take a look at it out already ????

For all you chess specialists searching for a problem, strive beginning with ranges 8+ in Basic. And for particularly bold chess veterans, see what number of you possibly can resolve beginning 11+ (heads-up: it will get actually tough, actually rapidly!)

Hopefully this little rabbit gap of a narrative evokes different builders to create extra with chess, video games, ML, or the entire above. In case you come throughout something like that that piques your curiosity, please drop a hyperlink beneath. Would like to strive it out.

And if you happen to made it to the tip and really loved any of this (and even if you happen to didn’t), please let me know by sharing your ideas or offended rants. Suggestions is at all times welcome anytime.

Joyful puzzling ????️


Source Link

What's Your Reaction?
In Love
Not Sure
View Comments (0)

Leave a Reply

Your email address will not be published.

2022 Blinking Robots.
WordPress by Doejo

Scroll To Top