{"id":2861,"date":"2020-10-09T10:32:55","date_gmt":"2020-10-09T08:32:55","guid":{"rendered":"http:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/?p=2861"},"modified":"2022-09-07T10:59:54","modified_gmt":"2022-09-07T08:59:54","slug":"deep-reinforcement-learning-with-the-snake-game","status":"publish","type":"post","link":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/10\/09\/deep-reinforcement-learning-with-the-snake-game\/","title":{"rendered":"Deep Reinforcement Learning with the Snake Game"},"content":{"rendered":"\n<p>here we want to show our achievements and fails with deep reinforcement learning. The paper [1] describes an algorithm for training atari games such as the game breakout. Standford university tought this algorithm in its deep learning class <em>lecture 14 reinforcement learning<\/em>. Since the results looked pretty promising we tried to recreate the methods with our own implementations. We decided to try it with another game which is the snake game. Here the player controls a snake with four action keys (RIGHT, LEFT, UP, DOWN) moving on the screen. The player has to control the snake in a way it eats food and he has to avoid the frame&#8217;s borders otherwise the snake dies, see Picture 1 how the snake is going to die. In case it runs out the frame&#8217;s border the game terminates. Each time the snake eats food, the body length of the snake is increased. During the game the player has also to avoid the snake&#8217;s body otherwise the game terminates, as well. 
<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/09\/snakegame.png\" alt=\"\" class=\"wp-image-2871\" width=\"311\" height=\"301\" srcset=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/09\/snakegame.png 523w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/09\/snakegame-300x290.png 300w\" sizes=\"auto, (max-width: 311px) 100vw, 311px\" \/><figcaption>Picture 1: Snake Game: image of the game window<\/figcaption><\/figure>\n<\/div>\n\n\n<p>Picture 2 shows how the snake is moving from one position to the next towards the food (white box). The snake game we used has been created by Rajat Biswas and can be downloaded from <a href=\"https:\/\/gist.github.com\/rajatdiptabiswas\/bd0aaa46e975a4da5d090b801aba0611\">here<\/a>. However to access the images of the game and the action keys, we had to modify the code, which is copied into this post. Modifications had been very substantial so the code might not be recognizable anymore with exception of some variable names.     <\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"369\" height=\"124\" src=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/09\/snakeflow.png\" alt=\"\" class=\"wp-image-2869\" srcset=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/09\/snakeflow.png 369w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/09\/snakeflow-300x101.png 300w\" sizes=\"auto, (max-width: 369px) 100vw, 369px\" \/><figcaption>Picture 2: Three sequential images<\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">Basic Setups<\/h2>\n\n\n\n<p>In the beginning we tried a frame size of 100&#215;100 for each image. 
We soon found out that our graphics card reached its capacity and the computer crashed. The authors of [1] used a frame size of 84&#215;84; after these failed attempts we tried 50&#215;50. The objects of the game (snake, food) were still recognizable after resizing to 50&#215;50, so we kept this frame size, see <em>dim<\/em> in the code below.<\/p>\n\n\n\n<p>The snake in Rajat&#8217;s game was drawn in green. This raises a problem: reinforcement learning is based on the theory of the Markov Decision Process (MDP). In MDP we are dealing with states, and the current state alone tells you which action to take next; past states are not relevant. So if you take a look at the snake game in Picture 1, you find that we have drawn the snake&#8217;s head in magenta (R: 255, G: 0, B: 255). As a player you now always know where the head and the tail of the snake are by looking at a single image. In the original game the creator colored the complete snake green. In the Atari game example [1] the authors faced the problem of not knowing from which direction the flying ball was coming. They solved it by capturing four sequential images and stacking their grayscale versions into the channels of a new image. This made the dynamics of the flying ball visible.  <\/p>\n\n\n\n<p>A good lecture on the Markov Decision Process can be found <a href=\"https:\/\/www.youtube.com\/watch?v=i0o-ui1N35U\">here<\/a>. It is highly recommended to study MDP theory before programming anything in reinforcement learning; otherwise the code will be hard to understand. <\/p>\n\n\n\n<p>The code below defines some necessary directories: the data directory, containing a large sample of game images (also called the replay memory), and the model directory, containing the model&#8217;s checkpoint with the name <em>modelweightname<\/em>. 
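As an aside, the four-frame stacking trick from [1] described above can be sketched in a few lines. This is a minimal sketch under our own assumptions; the helper `make_frame_stacker` and the deque buffer are illustrative and not part of the game code:

```python
from collections import deque

import numpy as np

def make_frame_stacker(num_frames=4):
    """Keep the last num_frames grayscale frames; stacking them along the
    channel axis makes motion visible to a network that sees one input."""
    buffer = deque(maxlen=num_frames)

    def push(frame):
        buffer.append(frame)
        while len(buffer) < num_frames:   # pad with copies at episode start
            buffer.append(frame)
        return np.stack(buffer, axis=-1)  # e.g. (50, 50) frames -> (50, 50, 4)

    return push

push = make_frame_stacker()
state = push(np.zeros((50, 50), dtype=np.float32))
print(state.shape)  # -> (50, 50, 4)
```

With the magenta head in our snake images a single frame is already Markovian, so we did not need this trick ourselves.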
<\/p>\n\n\n\n<p>Rajat&#8217;s game used the string identifiers <em>RIGHT, LEFT, UP DOWN<\/em> to indicate a keystroke. The dictionaries <em>actionstonum<\/em> and <em>numtoactions<\/em> convert the string identifiers to numbers, and vice versa.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">dim = (50,50) \npathname = r\"\/home\/inf\/Dokumente\/Reinforcement\"\ndatadirname = \"data\"\nvaliddirname = \"valid\"\nmodeldirname = \"model\"\ndatacsvname = \"data.csv\"\nmodeljsonname=\"model-regr.json\"\nmodelweightname=\"model-regr.h5\"\n\nactionstonum = {\"RIGHT\": 0,\n           \"LEFT\": 1,\n           \"UP\" : 2,\n           \"DOWN\" : 3,\n          }\nnumtoactions = {0: \"RIGHT\",\n           1: \"LEFT\",\n           2: \"UP\",\n           3: \"DOWN\",\n          }<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">The Model<\/h2>\n\n\n\n<p>The model comes originally from [1], however we copied it from the kera&#8217;s reinforcement <a href=\"https:\/\/keras.io\/examples\/rl\/deep_q_network_breakout\/\">website<\/a>, because the author&#8217;s did not describe the model thoroughly enough to be able to reconstruct it. The function code <em>create_q_model<\/em> is copied in the code section below. It consists of five neural network layers. Three convolution layers and two fully connected layers. The last layer is a fully connected layer with four outputs. Each output represents one Q state of one action from the action space <em>RIGHT, LEFT, UP DOWN<\/em>.<\/p>\n\n\n\n<p>A Q state is a variable, which tells us how much food the snake will eat in future, if a certain action is taken. Eating food will give in our programming the player a reward of one. An Example: we assume the Q state for action RIGHT is two. 
This means that if the player chooses action RIGHT, the prediction is that the snake will eat food two more times.<br><\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def create_q_model():\n\n    inputs = layers.Input(shape=(dim[0], dim[1], 3,))\n\n    layer1 = layers.Conv2D(32, 8, strides=4, activation=\"relu\")(inputs)\n    layer2 = layers.Conv2D(64, 4, strides=2, activation=\"relu\")(layer1)\n    layer3 = layers.Conv2D(64, 3, strides=1, activation=\"relu\")(layer2)\n\n    layer4 = layers.Flatten()(layer3)\n\n    layer5 = layers.Dense(512, activation=\"relu\")(layer4)\n    action = layers.Dense(4, activation=\"linear\")(layer5)\n\n    return keras.Model(inputs=inputs, outputs=action)<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Breaking down the code<\/h2>\n\n\n\n<p>We have written the main code in the class <em>Game<\/em>, into which we incorporated Rajat&#8217;s code. By now his code is probably very hard to recognize. The class <em>Game<\/em> became pretty large, so we need to break it down. Below you see the class definition, the constructor <em>__init__<\/em> and the method <em>initialize<\/em>. <\/p>\n\n\n\n<p>The constructor sets the frame size of the game to 200&#215;200. Later, frame images are resized to 50&#215;50. The game needs four colors: black, white, green and magenta. The variables <em>imgresh1<\/em> and <em>imgresh2<\/em> are two sequential images of the game, e.g. two images in Picture 2. The first image is taken before the player takes an action and the second one after. Basically, these two images represent the current state and the future state. The constant <em>MAXREWARD<\/em> is the reward the player receives as soon as the snake eats food. 
The program assigns <em>PENALTY<\/em> to penalize the player when the snake hits the border or its own body. If an action just moves the snake, then <em>MOVEPENALTY<\/em> is assigned to the reward. We have actually set it to zero, so it has no effect.<br>We set the <em>BATCHSIZE<\/em> of the training data to 20. The code below actually shows 19, but one sample is added during training, which makes it 20.<br><em>DISCOUNT, ALPHA<\/em> and <em>EPSILON<\/em> are values of the Bellman equation. They are part of MDP theory, which we will not explain here; again, the MDP lecture is highly recommended. <em>REPLAYSIZE<\/em> is also explained in [1], from where we took the concept. It is the maximum number of current and future images stored in the replay memory. So the code here stores 40,000 current and future images on disk and uses the images to retrain the model. Two models are created with the <em>create_q_model <\/em>function: <em>model<\/em> and <em>model_target<\/em>. <em>model<\/em> is the current model, and <em>model_target<\/em> is the future model. The current model is retrained all the time during the game, while the future model is updated only once after playing numerous games. We use the Adam optimizer and the Huber loss function. The Huber loss is a combination of a quadratic and an absolute function; it penalizes large differences less severely than a purely quadratic loss. The remaining important code in the constructor is the loading of the weights into <em>model_target<\/em>.<\/p>\n\n\n\n<p>The method <em>initialize<\/em> initializes the pygame environment. Here the clock is set, and the frame size is set to 200&#215;200. The first positions of the snake (<em>snake_pos<\/em>) and the food (<em>food_pos<\/em>) are set randomly. The variable <em>changeto<\/em> holds the current action (RIGHT, LEFT, UP, DOWN) hit by the player. The variable <em>direction<\/em> mostly holds the same value as <em>changeto<\/em>. 
However, in some cases it does not: if, e.g., the snake is moving right and the player hits left, the snake continues to move right. This means that <em>changeto<\/em> now holds the value left, but <em>direction<\/em> the value right. The method <em>initialize<\/em> completes by loading the weights into the current model <em>model<\/em>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">class Game:\n\n    def __init__(self, lr=1e-3, checkpointparname=modelweightname):\n        \n        self.speed = 80\n        self.frame_size_x = 200\n        self.frame_size_y = 200\n\n        self.black = pygame.Color(0, 0, 0)\n        self.white = pygame.Color(255, 255, 255)\n        self.green = pygame.Color(0, 255, 0)\n        self.mag = pygame.Color(255, 0, 255)\n        \n        self.imgresh1 = None\n        self.imgresh2 = None\n        \n        self.reward = 0\n        self.MAXREWARD = 1.0\n        self.PENALTY = -1.0\n        self.MOVEPENALTY = 0.0\n        \n        self.BATCHSIZE = 19\n\n        self.DISCOUNT = 0.99\n        self.ALPHA = 0.3\n        if manual == True:\n            self.EPSILON = 0.999\n        else:\n            self.EPSILON = 0.3\n        \n        self.REPLAYSIZE = 40_000\n        self.overall_score = 0\n        self.overall_numbatches = 0\n\n        self.model = create_q_model()\n        self.model_target = create_q_model()\n\n        self.learningrate = lr\n        self.optimizer = keras.optimizers.Adam(learning_rate=self.learningrate, clipnorm=1.0)\n        self.loss_function = keras.losses.Huber()\n\n        self.checkpointname = os.path.join(pathname, modeldirname,checkpointparname)\n        print(f\"loading checkpoint: {self.checkpointname}\")\n        self.model_target.load_weights(self.checkpointname)\n        \n        self.overall_scores=[]\n        
self.checkpoint_counter=0\n        \n        self.shufflelist = []\n        \n    def initialize(self, i, j):\n\n        status = pygame.init()\n\n        if status[1] &gt; 0:\n            print(f'Number of Errors: {status[1]} ...')\n            sys.exit(-1)\n\n        pygame.display.set_caption(f\"{i}-{j}\")\n        self.game_window = pygame.display.set_mode((self.frame_size_x, self.frame_size_y)) \n\n        self.controller = pygame.time.Clock()\n   \n        posx = (random.randint(40,160)\/\/10)*10\n        posy = (random.randint(40,160)\/\/10)*10\n           \n        self.snake_pos = [posx, posy]\n        self.snake_body = [[posx, posy], [posx-10, posy], [posx-(2*10), posy]]\n\n        self.food_pos = [random.randrange(1, (self.frame_size_x\/\/10)) * 10, random.randrange(1, (self.frame_size_y\/\/10)) * 10]\n        self.food_spawn = True\n\n        self.direction = 'RIGHT'\n        self.changeto = self.direction\n\n        self.score = 0\n        self.numbatches = 0\n\n        self.event_happened = False\n        \n        self.model.load_weights(self.checkpointname)\n<\/pre>\n\n\n\n<p>The code below shows the methods <em>run, get_maxi, game_over<\/em> and <em>get_direction<\/em> of the class <em>Game<\/em>.<\/p>\n\n\n\n<p>The method <em>run<\/em> executes one complete game. In each loop iteration it draws the current image, checks for a keystroke, updates the graphical objects (snake head, snake body, food) and draws the next image. While executing <em>run<\/em>, the images are saved to disk with a unique number for identification. To prevent overwriting images, the method <em>get_maxi<\/em> reads the maximum identification number used so far and sets a counter <em>i<\/em> to it.  <br>At the beginning of each iteration, the method <em>run<\/em> saves an image of the game window to the opencv image <em>imgresh1<\/em>. 
We use the numpy methods <em>frombuffer<\/em> and <em>reshape<\/em> to retrieve the image, and the opencv method <em>resize<\/em> to reduce the image size to 50&#215;50. The image <em>imgresh1<\/em> is then fed into the neural network <em>model<\/em> to predict the four Q states. The index of the maximum Q state (tensorflow method <em>argmax<\/em>) gives the predicted action, a number between zero and three. Choosing the action with the maximum Q state is also part of the Bellman equation from MDP theory.<br>The method <em>run<\/em> reads the keystrokes and stores the information in the attribute <em>changeto<\/em>. With the help of the constant <em>EPSILON<\/em>, we bring some variation between prediction, randomness and direction into the game: if a random draw exceeds <em>EPSILON<\/em>, the method <em>run<\/em> follows the model&#8217;s prediction (exploitation); otherwise it is on the randomness\/direction side (exploration). So with this convention an <em>EPSILON<\/em> close to zero means mostly exploitation, and an <em>EPSILON<\/em> close to one means mostly exploration. Direction is given by the method <em>get_direction<\/em>. The constant <em>EPSILON<\/em> is also part of MDP theory.  <br>As mentioned above, <em>changeto<\/em> is an attribute telling you in which direction the player wants to move the snake. However, if e.g. the snake moves right and the player strikes the left key, the snake should still move right. For this we need the attribute <em>direction<\/em>. It holds the actual direction of the snake independent of the keystroke.<br>The method <em>run<\/em> then checks whether the snake has eaten the food. If yes, a new food position is drawn, the attribute <em>reward<\/em> is set to <em>MAXREWARD<\/em> and the body of the snake is enlarged. If no food has been eaten and no border has been hit, then the attribute <em>reward<\/em> is set to <em>MOVEPENALTY<\/em>. In case the snake hits the border or its own body, the attribute <em>reward<\/em> is set to <em>PENALTY<\/em>. 
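In isolation, the exploration\/exploitation switch just described can be sketched like this. The function <em>choose_action<\/em> is our own toy helper, and for the sketch we explore with a uniformly random action instead of the game&#8217;s <em>get_direction<\/em>:

```python
import numpy as np

def choose_action(q_values, epsilon, rng):
    """Mirror of the selection logic in run(): exploit the model's
    prediction when a random draw exceeds epsilon, otherwise explore."""
    if rng.random() > epsilon:
        return int(np.argmax(q_values))   # exploitation: best predicted Q state
    return int(rng.integers(0, 4))        # exploration: random action 0..3

rng = np.random.default_rng(0)
q = np.array([0.1, 2.0, -0.5, 0.3])       # pretend Q states for RIGHT, LEFT, UP, DOWN
print(choose_action(q, epsilon=0.0, rng=rng))  # -> 1 (epsilon 0: always exploits)
```

Note that with this convention a small epsilon means mostly exploitation, the opposite of the usual epsilon-greedy notation.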
This is done in the method <em>game_over<\/em>.<br>Finally, the snake is redrawn and an image of the game window is moved into <em>imgresh2<\/em>. The images <em>imgresh1<\/em> and <em>imgresh2<\/em> form the current state and the future state. <br>At the end, the method <em>run<\/em> retrains the model with the method <em>train<\/em>. We do this only for every fourth state transition, unless the snake has just scored, in order to emphasize scoring and terminations and to de-emphasize plain moves.<\/p>\n\n\n\n<p>If the snake hits the border, the method <em>game_over<\/em> is executed, which also terminates the method <em>run<\/em>. Here the model is retrained as well, with the current state (<em>imgresh1<\/em>), the future state (<em>imgresh2<\/em>), the action taken (<em>changeto<\/em>), the reward (<em>reward<\/em>) and the termination information.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">class Game:\n\n    # constructor and initializer, see above\n      \n    def run(self, i_index):\n        \n        i = i_index + self.get_maxi() + 1\n        j = 0\n\n        while True:\n            \n            img1 = np.frombuffer(pygame.image.tostring(self.game_window, \"RGB\"), dtype=np.uint8)\n            self.imgresh1 = np.reshape(img1,(self.frame_size_x,self.frame_size_y, 3))\n            self.imgresh1 = cv2.resize(self.imgresh1, dim, interpolation = cv2.INTER_NEAREST )\n\n            current_state = np.array(self.imgresh1, dtype=np.float32)\/255.0\n            state_tensor = tf.convert_to_tensor(current_state)\n            state_tensor = tf.expand_dims(state_tensor, 0)\n            action_probs = self.model(state_tensor, training=False)\n            theaction = tf.argmax(action_probs[0]).numpy()\n            \n            for event in pygame.event.get():\n                if 
event.type == pygame.QUIT:\n                    pygame.quit()\n                    return\n                # Whenever a key is pressed down\n                elif event.type == pygame.KEYDOWN:\n\n                    if event.key == pygame.K_UP or event.key == ord('w'):\n                        self.changeto = 'UP'\n                    if event.key == pygame.K_DOWN or event.key == ord('s'):\n                        self.changeto = 'DOWN'\n                    if event.key == pygame.K_LEFT or event.key == ord('a'):\n                        self.changeto = 'LEFT'\n                    if event.key == pygame.K_RIGHT or event.key == ord('d'):\n                        self.changeto = 'RIGHT'\n                    \n                    # Esc -&gt; Create event to quit the game\n                    if event.key == pygame.K_ESCAPE:\n                        pygame.event.post(pygame.event.Event(pygame.QUIT))\n\n\n            if np.random.random() &gt; self.EPSILON:\n                self.changeto = numtoactions[theaction]\n            else:\n                if manual != True:\n                    #self.changeto = numtoactions[np.random.randint(0, len(actionstonum))]\n                    self.changeto = self.get_direction();\n                    \n            if self.changeto == 'UP' and self.direction != 'DOWN':\n                self.direction = 'UP'\n            if self.changeto == 'DOWN' and self.direction != 'UP':\n                self.direction = 'DOWN'\n            if self.changeto == 'LEFT' and self.direction != 'RIGHT':\n                self.direction = 'LEFT'\n            if self.changeto == 'RIGHT' and self.direction != 'LEFT':\n                self.direction = 'RIGHT'\n\n            if self.direction == 'UP':\n                self.snake_pos[1] -= 10\n            if self.direction == 'DOWN':\n                self.snake_pos[1] += 10\n            if self.direction == 'LEFT':\n                self.snake_pos[0] -= 10\n            if self.direction == 'RIGHT':\n                
self.snake_pos[0] += 10\n\n            self.snake_body.insert(0, list(self.snake_pos))\n            if self.snake_pos[0] == self.food_pos[0] and self.snake_pos[1] == self.food_pos[1]:\n                self.score += 1\n                self.reward = self.MAXREWARD\n                self.food_spawn = False\n            else:\n                self.snake_body.pop()\n                self.reward = self.MOVEPENALTY\n\n            if not self.food_spawn:\n                self.food_pos = [random.randrange(1, (self.frame_size_x\/\/10)) * 10, random.randrange(1, (self.frame_size_y\/\/10)) * 10]\n            self.food_spawn = True\n\n            self.game_window.fill(self.black)\n            n = 0\n            for pos in self.snake_body:\n\n                if n == 0:\n                    pygame.draw.rect(self.game_window, self.mag, pygame.Rect(pos[0], pos[1], 10, 10))\n                else:\n                    pygame.draw.rect(self.game_window, self.green, pygame.Rect(pos[0], pos[1], 10, 10))\n                n += 1\n                \n\n            pygame.draw.rect(self.game_window, self.white, pygame.Rect(self.food_pos[0], self.food_pos[1], 10, 10))\n\n            if self.snake_pos[0] &lt; 0 or self.snake_pos[0] &gt; self.frame_size_x-10:\n                self.game_over(i,j)\n                return\n            if self.snake_pos[1] &lt; 0 or self.snake_pos[1] &gt; self.frame_size_y-10:\n                self.game_over(i,j)\n                return\n\n            for block in self.snake_body[1:]:\n                if self.snake_pos[0] == block[0] and self.snake_pos[1] == block[1]:\n                    self.game_over(i,j)\n                    return\n\n            pygame.display.update()\n\n            img2 = np.frombuffer(pygame.image.tostring(self.game_window, \"RGB\"), dtype=np.uint8)\n            self.imgresh2 = np.reshape(img2,(self.frame_size_x,self.frame_size_y, 3))\n            self.imgresh2 = cv2.resize(self.imgresh2, dim, interpolation = cv2.INTER_NEAREST )\n            \n  
          self.controller.tick(self.speed)\n\n            if j &gt; 0:\n                if self.reward == self.MAXREWARD:\n                    self.train(i,j, False)\n                elif j%4 == 0:\n                    self.train(i,j, False)\n      \n            j += 1\n                  \n    def game_over(self,i,j):\n        self.reward = self.PENALTY\n\n        img2 = np.frombuffer(pygame.image.tostring(self.game_window, \"RGB\"), dtype=np.uint8)\n        self.imgresh2 = np.reshape(img2,(self.frame_size_x,self.frame_size_y, 3))\n        self.imgresh2 = cv2.resize(self.imgresh2, dim, interpolation = cv2.INTER_NEAREST )\n        \n        self.train(i,j, True)\n        \n        self.overall_score += self.score\n        \n        self.game_window.fill(self.black)\n        pygame.display.flip()                         \n        pygame.quit()\n\n    def get_maxi(self):\n        \n        maxi = 0\n        \n        for item in self.shufflelist:\n            curr = item[0]\n            s = re.findall(r'\\d+', curr)[0]\n            if int(s) &gt; maxi:\n                maxi = int(s)\n        \n        return maxi\n\n    def get_direction(self):\n\n        x = self.snake_pos[0] - self.food_pos[0]\n        x1 = self.snake_body[1][0] - self.food_pos[0]\n        \n        y = self.snake_pos[1] - self.food_pos[1]\n        y1 = self.snake_body[1][1] - self.food_pos[1]\n        \n\n        direction = None\n        direction_h = None\n        direction_v = None\n\n        if x &gt; 0:\n            direction_h = 'LEFT'\n        else:\n            direction_h = 'RIGHT'\n\n        if y &gt; 0:\n            direction_v = 'UP'\n        else:\n            direction_v = 'DOWN'\n                           \n\n        if abs(x) &gt; abs(y):\n            direction = direction_h\n            \n            if y == y1 and (abs(x) &gt; abs(x1)):\n                #print(f\"  hit v x: {abs(x)} x1: {abs(x1)} y: {y} y1: {y1}\")\n                direction = direction_v\n        else:\n         
   direction = direction_v\n            if x == x1 and (abs(y) &gt; abs(y1)):\n                #print(f\"  hit h x: {abs(y)} x1: {abs(y1)} y: {x} y1: {x1}\")\n                direction = direction_h\n            \n        return direction<\/pre>\n\n\n\n<p>The code below shows the class <em>Game<\/em> again, this time without the methods <em>run, game_over, get_direction<\/em> and <em>get_maxi<\/em>, which were described above.<\/p>\n\n\n\n<p>In the code below we implemented the methods from the description in [1] and from deep reinforcement learning theory. In [1] the authors suggested a replay memory to avoid similarities between the current states within a batch of training data. Their argument was that such correlations can cause feedback loops, which are bad for convergence.<br><em>Game<\/em>&#8217;s method <em>load_replay_memory<\/em> loads all current states (<em>currentpicname<\/em>), actions (<em>action<\/em>), rewards (<em>reward<\/em>), future states (<em>nextpicname<\/em>) and termination information into the list <em>shufflelist<\/em>. The list <em>shufflelist<\/em> is then randomly reordered; it now represents the replay memory.<br>The method <em>save_replay_memory<\/em> saves a pandas data sheet (<em>datacsvname<\/em>) to disk with all the information needed to reload <em>shufflelist<\/em> at a later time.<br>The method <em>pop_batch<\/em> picks the first set of entries from the replay memory. Each entry of the list <em>shufflelist<\/em> holds the current and future states as file names. The method <em>pop_batch<\/em> loads both images (<em>img1, img2<\/em>) and adds them to a batch list, which is finally returned. The entries of the list are used for training the neural network <em>model<\/em>. The method <em>push_batch<\/em> does the opposite: it appends a batch list to the end of the list <em>shufflelist<\/em>.<\/p>\n\n\n\n<p>We now move on to the method <em>get_X<\/em>. It prepares the batch list for training the neural network. 
The images in the batch list (current state at index 0, future state at index 3) are converted to a numpy array and normalized by dividing by 255.0.<\/p>\n\n\n\n<p>The method <em>backprop<\/em> does the training of the neural network. Its code was basically taken from the <a href=\"https:\/\/keras.io\/examples\/rl\/deep_q_network_breakout\/\">keras site<\/a>. It also shows the Bellman equation, which is covered by deep reinforcement learning theory. In short, it predicts the future rewards (<em>future_rewards<\/em>) from the future states (<em>Xf<\/em>) with <em>model_target<\/em>. It uses the Bellman equation to calculate the new Q states (<em>updated_q_values<\/em>) to be trained into <em>model<\/em>. Finally it calculates the loss value (<em>loss<\/em>), which is needed to train the model with the current batch list (<em>X<\/em>).<\/p>\n\n\n\n<p>The <em>Game<\/em>&#8217;s method <em>train<\/em> is simply a wrapper. It pulls a batch list with the method <em>pop_batch<\/em>. The current and future images (<em>imgresh1, imgresh2<\/em>) are saved to disk (<em>write<\/em>) and added to the batch list, which increases the batch list size by one (from 19 to 20). Then it trains the model with the method <em>backprop<\/em> and pushes the batch list back into <em>shufflelist<\/em>. 
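The Bellman target computation inside <em>backprop<\/em> can be condensed into a small NumPy sketch. The helper name <em>q_targets<\/em> is our own illustration; the real method obtains the future Q values from <em>model_target.predict<\/em> and then fits <em>model<\/em> against these targets with the Huber loss:

```python
import numpy as np

DISCOUNT, PENALTY = 0.99, -1.0

def q_targets(rewards, future_q, done):
    """Bellman targets as in backprop: reward plus discounted best future Q
    for ongoing transitions; terminated transitions get the bare penalty."""
    rewards = np.asarray(rewards, dtype=np.float32)
    done = np.asarray(done, dtype=np.float32)
    best_future = np.max(future_q, axis=1)        # max over actions of Q(s', a')
    targets = rewards + DISCOUNT * best_future
    return targets * (1.0 - done) - done * abs(PENALTY)

fq = np.array([[0.5, 1.0, 0.0, 0.2],              # predicted future Q rows
               [0.3, 0.1, 0.4, 0.0]], dtype=np.float32)
print(q_targets([1.0, -1.0], fq, [0.0, 1.0]))     # approx [1.99, -1.0]
```

The first sample scored (reward 1.0) and continues, so its target is 1.0 + 0.99 * 1.0; the second sample terminated, so its target collapses to the penalty.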
<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">class Game:\n    # constructur and initialize, see above\n    # run, game_over, get_direction, get_maxi, see above \n\n    def load_replay_memory(self):\n\n        f = open(os.path.join(os.path.join(pathname, datadirname, datacsvname)), \"r\")\n        \n        df = pd.read_csv(f, index_col = 0) \n\n        for index, row in df.iterrows():\n\n            currentpicname = row[\"currentstate\"]\n            action = actionstonum[row[\"action\"]]\n            reward = row[\"reward\"]\n            nextpicname = row[\"nextstate\"]\n            terminated = row[\"terminated\"]\n            sxpos = row[\"sxpos\"]\n            sypos = row[\"sypos\"]\n            fxpos = row[\"fxpos\"]\n            fypos = row[\"fypos\"]\n            \n            self.shufflelist.append([currentpicname,action,reward,nextpicname, terminated, sxpos, sypos, fxpos, fypos])\n\n        random.shuffle(self.shufflelist)\n    \n        f.close()\n        \n        return\n\n    def save_replay_memory(self):\n           \n        data = []\n        \n        if len(self.shufflelist) == 0:\n            return\n        \n        if len(self.shufflelist) &gt; self.REPLAYSIZE:\n            \n            self.numbatches = len(self.shufflelist) - self.REPLAYSIZE\n            self.overall_numbatches += self.numbatches\n            \n            for i in range(len(self.shufflelist) - self.REPLAYSIZE):\n                item = self.shufflelist.pop(0)\n                os.remove(os.path.join(self.path,item[0]))\n                os.remove(os.path.join(self.path,item[3]))\n                \n        for (cs, act, rew, fs, term, sx, sy, fx, fy) in self.shufflelist:\n            \n            data.append({'currentstate': cs, 'action': numtoactions[act], 'reward': rew, 
'nextstate': fs, 'terminated': term, 'sxpos': sx, 'sypos': sy, 'fxpos': fx, 'fypos': fy})\n            \n        df = pd.DataFrame(data) \n        \n        df.to_csv(os.path.join(self.path, datacsvname)) \n        \n        return\n    \n    \n    def pop_batch(self, batchsize):\n       \n        batch = []\n        files = []\n    \n        for i in range(batchsize):\n            \n            item = self.shufflelist.pop(0)\n            \n            img1 = cv2.imread(os.path.join(self.path, item[0]),cv2.IMREAD_COLOR )\n            img2 = cv2.imread(os.path.join(self.path, item[3]),cv2.IMREAD_COLOR )\n\n            batch.append([img1, item[1], item[2], img2, item[4], item[5], item[6], item[7], item[8]])\n            files.append((item[0],item[3]))\n\n        return batch, files\n\n    def push_batch(self, batch, files):\n       \n        for index,item in enumerate(batch):\n\n            self.shufflelist.append([files[index][0], item[1], item[2], files[index][1], item[4], item[5], item[6], item[7], item[8]])\n    \n        return\n    \n    def get_X(self, batch, state):\n\n        assert state == 0 or state == 3 # 0 is currentstate, 3 is future state\n        \n        X = [item[state] for item in batch]\n        X = np.array(X, dtype=np.float32)\n        X \/= 255.0\n        \n        return X\n    \n\n    def backprop(self, batch):\n\n        rewards_sample = [batch[i][2] for i in range(len(batch))]\n        action_sample = [batch[i][1] for i in range(len(batch))]\n        done_sample = tf.convert_to_tensor([float(batch[i][4]) for i in range(len(batch))])\n\n        X =  self.get_X(batch, 0)\n        Xf = self.get_X(batch, 3)\n        future_rewards = self.model_target.predict(Xf)\n\n        updated_q_values = rewards_sample + 0.99 * tf.reduce_max(future_rewards, axis=1)\n        updated_q_values = updated_q_values * (1 - done_sample) - done_sample*abs(self.PENALTY)\n\n    \n        masks = tf.one_hot(action_sample, 4)\n\n        with tf.GradientTape() as 
tape:\n\n            q_values = self.model(X)\n\n            q_action = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)\n\n            loss = self.loss_function(updated_q_values, q_action)\n            \n        grads = tape.gradient(loss, self.model.trainable_variables)\n        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))\n        \n    def train(self, i, j, term):\n        \n        # https:\/\/pythonprogramming.net\/training-deep-q-learning-dqn-reinforcement-learning-python-tutorial\/\n        \n        currentstate = \"current_{}_{}.png\".format(i,j)\n\n        nextstate = \"next_{}_{}.png\".format(i,j)      \n        \n        batch, files = self.pop_batch(self.BATCHSIZE)\n            \n        batch.append([self.imgresh1, actionstonum[self.changeto], self.reward, self.imgresh2, term, self.snake_pos[0], self.snake_pos[1], self.food_pos[0], self.food_pos[1]])\n        files.append((currentstate, nextstate))\n        \n        self.write(i,j)\n         \n        self.backprop(batch)\n        \n        self.numbatches += 1\n            \n        self.push_batch(batch, files)   \n  \n        return\n    \n<\/pre>\n\n\n\n<p>The remaining methods of <em>Game<\/em> are utility methods. The first utility method is <em>print_benchmark<\/em>. After training models, we needed an indication of whether the training was heading in a positive or a negative direction. The method <em>print_benchmark<\/em> gives us this indication. In the method <em>print_benchmark<\/em> we first create two lists: <em>maxlist<\/em> and <em>penaltylist<\/em>. The list <em>maxlist<\/em> contains all states where the snake ate food. The list <em>penaltylist<\/em> contains all states where the snake ran into the window frame border or into its own body. Second, we test the trained model on the content of <em>maxlist<\/em> and <em>penaltylist<\/em> and check whether the predictions are correct. 
In case of an incorrect prediction a statistical value <em>pmerror<\/em> is updated, which indicates how often the model wrongly predicts the move towards the food position. Ideally this value is zero. The value <em>pterror<\/em> indicates how often the snake will go into a termination state. This number should ideally be zero as well. Other benchmarks are the averaged Q states. These are very important to consider, because they should not explode and should converge to the same number.<\/p>\n\n\n\n<p>The methods <em>print_score<\/em> and <em>print_overall_score<\/em> print out the score of the current game or the score of a large set of games. Scores can also be used as benchmarks.<\/p>\n\n\n\n<p>Finally, we have the method <em>run_replay_memory<\/em>. It loads the complete replay data into the list<em> shufflelist<\/em> and runs the method <em>backprop<\/em> on the content of shufflelist. There is no need to use the method <em>run<\/em> to train the model here. However, we still need the method <em>run<\/em> to generate more training data and refresh the replay memory.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">class Game:\n    # constructor and initializer, see above\n    # run, game_over, get_direction, get_maxi, see above \n    # load_replay_memory, save_replay_memory, pop_batch, push_batch, get_X, backprop, train, see above\n  \n    def print_benchmark(self):\n\n        maxlist = []\n        penaltylist = []\n        averagestates = np.zeros(4)   # accumulate Q values as arrays, not lists\n        averagepenalty = np.zeros(4)\n        pmerror = 0\n        pterror = 0\n\n        for (cs, act, rew, fs, term, sx, sy, fx, fy) in self.shufflelist:\n            if rew == self.MAXREWARD or rew == 30.0:\n                maxlist.append((cs,act,rew,fs,term))\n            if rew == self.PENALTY:\n                penaltylist.append((cs,act,rew,fs,term))\n        print(f\"Number of maxrewards in shufflelist: {len(maxlist)}, perc: {100*len(maxlist)\/len(self.shufflelist)}\")\n        print(f\"Number of terminations in shufflelist: {len(penaltylist)}, perc: {100*len(penaltylist)\/len(self.shufflelist)}\")\n        \n        count = 0\n        \n        print(\"Testing maxlist\")\n        for i in range(len(maxlist)):\n            img = cv2.imread(os.path.join(pathname, datadirname, maxlist[i][0]), cv2.IMREAD_COLOR)\n            states = self.model.predict(np.array([img])\/255.0, batch_size=1, verbose=0)[0]\n            averagestates += states\n            if np.argmax(states) != maxlist[i][1]:\n                count += 1\n        pmerror = 100*count\/len(maxlist)\n        print(f\"Number of predicted errors in maxlist: {count}, perc: {pmerror}\")\n        print(f\"Q Values for max: {averagestates\/len(maxlist)}\")\n        \n        count = 0\n        \n        print(\"Testing penaltylist\") \n        for i in range(len(penaltylist)):\n            img = cv2.imread(os.path.join(pathname, datadirname, penaltylist[i][0]), cv2.IMREAD_COLOR)\n            states = self.model.predict(np.array([img])\/255.0, batch_size=1, verbose=0)[0]\n            averagepenalty += states\n            if np.argmax(states) == penaltylist[i][1]:\n                count += 1\n        pterror = 100*count\/len(penaltylist)\n        print(f\"Number of predicted terminations in penaltylist: {count}, perc: {pterror}\")\n        print(f\"Q Values for penalty: {averagepenalty\/len(penaltylist)}\")\n        \n        return pmerror, averagestates\/len(maxlist), averagepenalty\/len(penaltylist)\n    \n    def print_score(self):\n        print(f\" ----&gt; TIME IS {datetime.now():%Y-%m-%d_%H-%M-%S}\")\n        print(f\" ----&gt; SCORE is {self.score}\")\n        print(f\" ----&gt; NUM OF BATCHES is {self.numbatches}\")\n        return self.score, self.numbatches\n    \n    def print_overall_score(self):\n        print(f\"--&gt; TIME IS {datetime.now():%Y-%m-%d_%H-%M-%S}\")\n        print(f\"--&gt; OVERALL SCORE is {self.overall_score}\")\n        print(f\"--&gt; OVERALL NUM OF BATCHES is {self.overall_numbatches}\")\n        return self.overall_score, self.overall_numbatches     \n    \n    def run_replay_memory(self, epochs = 5):\n        self.model.load_weights(self.checkpointname)\n        self.load_replay_memory()\n        for j in range(epochs):\n            \n            for i in range(len(self.shufflelist)\/\/(self.BATCHSIZE+1)):\n                if i%500 == 0:\n                    print(i)\n                batch, files = self.pop_batch(self.BATCHSIZE+1)\n                self.backprop(batch)\n                self.push_batch(batch,files)\n\n            self.print_benchmark()\n            self.save_checkpoint()\n<\/pre>\n\n\n\n<p>The class <em>Game<\/em> has now been explained sufficiently. Below is the code that executes instances of <em>Game<\/em> to generate training data and to train the model. The function <em>run_game<\/em> instantiates<em> Game<\/em> several times (<em>range(iterations)<\/em>) within a loop, with a given learning rate as a parameter. In an inner loop it initializes and runs instances of <em>Game<\/em> with<em> initialize<\/em> and <em>run<\/em> and prints the benchmark with <em>print_benchmark<\/em>. If one of the benchmarks improves, a checkpoint of the model is saved. During the iterations the list <em>shufflelist<\/em> is saved back to disk with the method <em>save_replay_memory<\/em>. 
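The pop and push cycling of the replay list can be illustrated with a minimal stand-alone sketch. The ReplayBuffer class and its integer "transitions" below are hypothetical stand-ins for the post's actual Game class and image-based records:

```python
import random

# Minimal sketch of the replay-memory cycling used by run_replay_memory:
# pop a batch from the front of the shuffled list, train on it, then push
# it back so one epoch revisits every transition exactly once.
class ReplayBuffer:
    def __init__(self, transitions):
        self.items = list(transitions)
        random.shuffle(self.items)

    def pop_batch(self, batchsize):
        # pop from the front, like Game.pop_batch
        return [self.items.pop(0) for _ in range(batchsize)]

    def push_batch(self, batch):
        # append to the back, like Game.push_batch
        self.items.extend(batch)

buffer = ReplayBuffer(range(10))
seen = []
for _ in range(len(buffer.items) // 2):   # one epoch with batch size 2
    batch = buffer.pop_batch(2)
    seen.extend(batch)                    # stand-in for backprop(batch)
    buffer.push_batch(batch)
```

Because popped batches go to the back, one pass of length len(shufflelist)//batchsize visits every transition once, which is why run_replay_memory can loop over the whole list per epoch.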
<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def run_game(learning_rate = 1.5e-06, epochs = 5, benchmin = 68.0):\n    lr = [learning_rate for i in range(epochs)]\n\n    iterations = len(lr)\n    benches = []\n    qms = []\n    qps = []\n    overallscores = []\n    counter = 0\n\n    for i in range(iterations):\n        print(f\"{i}: learning rate: {lr[i]}\")\n        game = Game(lr[i], \"model-regr.h5\")\n        k = 150\n        game.load_replay_memory()\n        for j in range(k):\n            game.initialize(i, j)\n            game.run(j)\n        bench, qm, qp = game.print_benchmark()\n        benches.append(bench)\n        qms.append(qm)\n        qps.append(qp)\n        game.save_replay_memory()\n        game.save_checkpoint(f\"model-regr_{i}_{lr[i]:.9f}_{bench:.2f}.h5\")\n        if bench &lt; benchmin:\n            benchmin = bench\n            game.save_checkpoint()\n        else:\n            counter += 1\n        if counter == 3:\n            counter = 0\n            lr = [x * 0.5 for x in lr]   # halve the learning rate for the remaining iterations\n            \n        overallscore = game.print_overall_score()\n        overallscores.append(overallscore)\n    return benches, qms, qps\n     <\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Executing the code<\/h2>\n\n\n\n<p>The function <em>run_game<\/em> can be executed as shown in the code below. Parameters are the learning rate, the number of epochs and the benchmark threshold.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">run_game(1.5e-06, 5, 60.0)<\/pre>\n\n\n\n<p>To run training on the complete replay memory we execute the code below. 
The first parameter of the <em>Game<\/em> class is the learning rate, the second parameter is the name of the checkpoint with the neural network's weights to be updated. The method<em> run_replay_memory<\/em> trains the neural network for a number of epochs, in this case five.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">game = Game(6.0e-07, \"model-regr.h5\")\ngame.run_replay_memory(5)<\/pre>\n\n\n\n<p>The function <em>run_game<\/em> and the method <em>run_replay_memory<\/em> both execute <em>print_benchmark<\/em>, which prints out an indication of how successful the training was. Below you find an output of<em> print_benchmark<\/em>. It shows how many times the snake eats food (<em>maxrewards<\/em>) and how many terminations we find in the replay memory. The lists <em>maxlist<\/em> and <em>penaltylist<\/em> (see description of <em>print_benchmark<\/em>) are used to indicate how well the model predicts the eating of food and the terminations. In the example below you find that the eating of food is badly predicted in 51% of all cases and terminations in 41% of all cases. The print below also shows averaged values of the Q states for each action (RIGHT, LEFT, UP, DOWN). Ideally they should contain numbers in a close range. You can interpret a Q state in the following way: if you take the action RIGHT, the expected future reward is 0.51453086. This number should go up during training, but it also has its limits. The score of the snake game will always be finite, because the snake grows larger each time it eats food. The limit is indirectly given by the size of the frame window. 
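The target computation behind these Q states, as implemented in backprop (reward plus discounted maximum future Q value, with terminated transitions forced to the negative penalty), can be made concrete with a small numeric sketch. All values here are made up for illustration:

```python
import numpy as np

# Hypothetical batch of two transitions: the first ate food, the
# second ran into a wall (terminated).
rewards  = np.array([1.0, 0.0])
future_q = np.array([[0.5, 0.6, 0.4, 0.3],      # target-model Q(s', a)
                     [0.2, 0.1, 0.3, 0.25]])
done     = np.array([0.0, 1.0])                  # termination flags
PENALTY  = -10.0
DISCOUNT = 0.99

# Same arithmetic as in backprop: r + gamma * max_a Q_target(s', a),
# and for terminated transitions the target collapses to -|PENALTY|.
targets = rewards + DISCOUNT * future_q.max(axis=1)
targets = targets * (1.0 - done) - done * abs(PENALTY)
# first entry: 1 + 0.99 * 0.6 = 1.594; second entry: -10.0
```

The one-hot action masks in backprop then make sure the loss compares these targets only against the Q value of the action that was actually taken.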
It is important to know that the Q states of the actions must on average be the same number, because statistically you get the same reward whichever direction the snake takes.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Number of maxrewards in shufflelist: 2783, perc: 6.9575\nNumber of terminations in shufflelist: 2042, perc: 5.105\nTesting maxlist\nNumber of predicted errors in maxlist: 1437, perc: 51.63492633848365\nQ Values for max: [0.51453086 0.5192453  0.50427304 0.48402   ]\nTesting penaltylist\nNumber of predicted terminations in penaltylist: 856, perc: 41.9196865817\nQ Values for penalty: [0.21559233 0.16576494 0.22125778 0.210446  ]\nsaving checkpoint: \/home\/inf\/Dokumente\/Reinforcement\/model\/model-regr.h5<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>We ran the code for days (specifically <em>run_game<\/em> and <em>run_replay_memory<\/em>) and experimented with the hyperparameters. During the runs we produced millions of frames. In general we first set the <em>EPSILON<\/em> value to 0.3. This means that about 30% of the moves went in the direction of the food with the help of <em>Game<\/em>&#8216;s method <em>get_direction<\/em>. The remaining moves were predicted by the model.<\/p>\n\n\n\n<p>We were able to see that all average Q state values in <em>maxlist<\/em> printed out by <em>print_benchmark<\/em> went above 1.6. This matches our observation that the score went to around 270 after playing 150 games.<\/p>\n\n\n\n<p>After running the game further we set the<em> EPSILON<\/em> value to 0.2, so more moves were predicted by the model. Now the overall score went down: this time we reached around 100 after playing 150 games. One could see how the average Q state values went down after some time.<\/p>\n\n\n\n<p>On the <a href=\"https:\/\/keras.io\/examples\/rl\/deep_q_network_breakout\/\">keras site<\/a>, there is a note that it takes 50 million frames to receive very good results. 
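The EPSILON mixing described above (a fraction of moves steered towards the food, the rest chosen by the model's Q values) can be sketched as follows; toward_food is a hypothetical stand-in for Game's get_direction, not the post's actual implementation:

```python
import random

EPSILON = 0.3
ACTIONS = ["RIGHT", "LEFT", "UP", "DOWN"]

def toward_food(snake_pos, food_pos):
    # Hypothetical stand-in for Game.get_direction: greedy move toward food.
    # (y grows downward, as in pygame window coordinates)
    if food_pos[0] > snake_pos[0]:
        return "RIGHT"
    if food_pos[0] < snake_pos[0]:
        return "LEFT"
    return "UP" if food_pos[1] < snake_pos[1] else "DOWN"

def choose_action(snake_pos, food_pos, q_values):
    # With probability EPSILON follow the food heuristic,
    # otherwise take the model's argmax action.
    if random.random() < EPSILON:
        return toward_food(snake_pos, food_pos)
    return ACTIONS[max(range(len(ACTIONS)), key=lambda a: q_values[a])]

action = choose_action((5, 5), (9, 5), [0.1, 0.2, 0.3, 0.05])
```

Lowering EPSILON, as described above, simply shifts more of these decisions from the heuristic to the model.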
However, at some point we stopped experimenting further. We saw that progress was extremely slow or nonexistent. We are inclined to say that methods other than deep reinforcement learning are more promising. Despite the simplicity of the snake game, which is comparable to Atari games, we were a little discouraged after reading [2].<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Outlook<\/h2>\n\n\n\n<p>Applied research must provide solutions for a programmer who wants to develop an application, so the programmer must get some benefit from it. However, we find that there are too many hyperparameters to tune (size of replay memory, learning rate, batch size, EPSILON value, DISCOUNT value, dimension of status images; the list does not end here) and progress is still extremely slow or nonexistent. A programmer wants to have results within a defined period of time. We have the feeling we do not have the development time under control with deep reinforcement learning.<\/p>\n\n\n\n<p>Maybe there was just not enough training time for experimentation. Maybe the model with its five layers is not appropriate and needs fine tuning. Anyway, we think that other ways of machine learning (supervised learning) are much more successful than deep reinforcement learning.<\/p>\n\n\n\n<p>We are thinking about estimating the head position and food position with a trained neural network, given an image of the window frame. Then we calculate the moving direction for the snake. In the code above we actually record the head and food positions and save them into the replay memory&#8217;s csv file for each current state and future state, so this information comes for free. Small experiments showed that the head and food positions can be predicted very well with the same neural network model (instead of Q states we estimate the positions of the snake's head and the food).<\/p>\n\n\n\n<p>Anyway, we have not completely given up, but we are putting in a break here. 
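This regression idea can be sketched as a small Keras model: the same kind of convolutional network, but with a four-value output for the head and food coordinates instead of Q states. The layer sizes here are hypothetical, not the post's actual five-layer model:

```python
import numpy as np
import tensorflow as tf

# Sketch only: map a 100x100 game frame to (sxpos, sypos, fxpos, fypos)
# instead of four Q values. Layer sizes are hypothetical.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 100, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4),   # linear output: a regression target, not Q states
])
model.compile(optimizer="adam", loss="mse")  # train against the recorded positions

frame = np.random.rand(1, 100, 100, 3).astype(np.float32)
coords = model(frame).numpy()   # shape (1, 4)
```

The training targets come directly from the sxpos, sypos, fxpos and fypos columns that the replay memory's csv file already stores.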
So more may come in the future.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Acknowledgement<\/h2>\n\n\n\n<p>Special thanks to the University of Applied Science Albstadt-Sigmaringen for providing the infrastructure and the appliances to enable this research.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p><a href=\"https:\/\/www.cs.toronto.edu\/~vmnih\/docs\/dqn.pdf\">[1]: Playing Atari with Deep Reinforcement Learning, Volodymyr Mnih et al.<\/a><\/p>\n\n\n\n<p>[2]: <a href=\"https:\/\/www.alexirpan.com\/2018\/02\/14\/rl-hard.html\">Deep Reinforcement Learning Doesn&#8217;t Work Yet<\/a><\/p>\n\n\n\n<p>[3]: <a href=\"https:\/\/keras.io\/examples\/rl\/deep_q_network_breakout\/\">Deep Q-Learning for Atari Breakout<\/a><\/p>\n\n\n\n<p>[4]: <a href=\"https:\/\/gist.github.com\/rajatdiptabiswas\/bd0aaa46e975a4da5d090b801aba0611\">Snake Game.py<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here we want to show our achievements and failures with deep reinforcement learning. The paper [1] describes an algorithm for training atari games such as the game breakout. Stanford University taught this algorithm in its deep learning class lecture 14 reinforcement learning. 
Since the results looked pretty promising we tried to recreate the methods with &hellip; <a href=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/10\/09\/deep-reinforcement-learning-with-the-snake-game\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Deep Reinforcement Learning with the Snake Game<\/span><\/a><\/p>\n","protected":false},"author":24,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[4,3,13,5,7,12],"class_list":["post-2861","post","type-post","status-publish","format-standard","hentry","category-allgemein","tag-ai","tag-deep-learning","tag-deepq","tag-ki","tag-neural-network","tag-reinforcement-learning"],"_links":{"self":[{"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/posts\/2861","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/users\/24"}],"replies":[{"embeddable":true,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/comments?post=2861"}],"version-history":[{"count":465,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/posts\/2861\/revisions"}],"predecessor-version":[{"id":4836,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/posts\/2861\/revisions\/4836"}],"wp:attachment":[{"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/media?parent=2861"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/categories?post=2861"},{"taxonomy"
:"post_tag","embeddable":true,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/tags?post=2861"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}