{"id":2589,"date":"2020-08-20T11:58:18","date_gmt":"2020-08-20T09:58:18","guid":{"rendered":"http:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/?p=2589"},"modified":"2022-09-07T11:02:00","modified_gmt":"2022-09-07T09:02:00","slug":"street-scene-segmentation-with-a-unet-model","status":"publish","type":"post","link":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/08\/20\/street-scene-segmentation-with-a-unet-model\/","title":{"rendered":"Street Scene Segmentation with a UNET-Model"},"content":{"rendered":"\n<p>For the summer semester 2020 we offered an elective class called <em>Design Cyber Physical Systems<\/em>. This time only few students enrolled into the class. The advantage of having small classes is that we can focus on a certain topic without loosing the overview during project execution. The class starts with a brain storming session to choose a topic to be worked on during the semester. There are only a few constraints which we ask to apply. The topic must have something to do with image processing, neural networks and deep learning. The students came up with five ideas and presented them to the instructor. The instructor commented on the ideas and evaluated them. Finally the students chose one of the idea to work on until the end of the semester.<\/p>\n\n\n\n<p>This time the students chose to process street scenes taken from videos while driving a car. A street scene can be seen in Figure 1. The videos were taken in driving direction through the front window of a car.  The assignment the students have given to themselves is to extract street, street marking, traffic signs, and cars on the video images. This kind of problem is called semantic segmentation, which has been solved in a similar fashion as in the previous <a href=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/08\/15\/image-processing-for-an-autonomous-guided-vehicle-in-a-learn-factory\/\">post<\/a>. 
Therefore many functions in this post are quite similar to those in the last post.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/streetscene-1024x720.png\" alt=\"\" class=\"wp-image-2626\" width=\"597\" height=\"420\" srcset=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/streetscene-1024x720.png 1024w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/streetscene-300x211.png 300w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/streetscene-768x540.png 768w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/streetscene.png 1312w\" sizes=\"auto, (max-width: 597px) 100vw, 597px\" \/><figcaption>Figure 1: Street scene<\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">Creating Training Data<\/h2>\n\n\n\n<p>The students took about ten videos while driving a car to create the training data. Since videos are just sequences of images, they randomly selected around 250 images from the videos. Above we mentioned that we want to extract the street, street markings, traffic signs and cars. These are four categories. There is one more for everything that is none of those (the background category). This makes five categories. Below you find the code of three Python dictionaries. The first is the dictionary <em>classes<\/em>, which maps label names to category numbers (for example, the name <em>Strasse<\/em>, German for street, is mapped to the number <em>1<\/em>). The dictionary <em>categories<\/em> maps them the other way around. Finally the dictionary <em>colors<\/em> assigns a color to each category number. 
<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">classes = {\"Strasse\" : 1, \n           \"strasse\" : 1,\n          \"Streifen\" : 2,\n           \"streifen\" : 2,\n          \"Auto\" : 3, \n          \"auto\" : 3,\n          \"Schild\" : 4,\n           \"schild\" : 4,\n         }\n\ncategories = {1: \"Strasse\", \n              2: \"Streifen\",\n              3: \"Auto\", \n              4: \"Schild\",           \n         }\n\ncolors = {0 : (0,0,0), \n          1 : (0,0,255), \n          2 : (0,255,0),\n          3 : (255,0,0), \n          4 : (0,255,255),         \n         }\n\ndim = (256, 256) <\/pre>\n\n\n\n<p>To create training data we decided to use the tool <em>labelme<\/em>, which is described <a href=\"http:\/\/labelme.csail.mit.edu\/Release3.0\/\">here<\/a>. You can load images into the tool and mark regions by drawing polygons around them, see Figure 2. The regions are given category names, which are the same as the keys of the Python dictionary <em>classes<\/em>, see code above. You can save the polygons, their categories and the image itself to a json file. Now it is very easy to parse the json file with a Python parser.  
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"645\" src=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/labeling-1-1024x645.png\" alt=\"\" class=\"wp-image-2607\" srcset=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/labeling-1-1024x645.png 1024w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/labeling-1-300x189.png 300w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/labeling-1-768x484.png 768w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/labeling-1-1536x968.png 1536w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/labeling-1.png 1616w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 2: Tool labelme<\/figcaption><\/figure>\n\n\n\n<p>The image itself is stored in the json file in a base-64 representation. Python provides tools to extract the image as well. Videos generally do not contain square images, so we need to convert them into square images, which are better suited for training the neural network. We have written the following two functions to fulfill this task: <em>makesquare2<\/em> and <em>makesquare3<\/em>. The difference between these functions is that the first handles grayscale images and the second handles RGB images. 
Both functions evenly crop the left and right portions of the original video image to create a square image.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def makesquare2(img):\n    \n    assert(img.ndim == 2) \n    \n    edge = min(img.shape[0],img.shape[1])\n        \n    img_sq = np.zeros((edge, edge), 'uint8')\n    \n    if(edge == img.shape[0]):\n        img_sq[:,:] = img[:,int((img.shape[1] - edge)\/2):int((img.shape[1] - edge)\/2)+edge]\n    else:\n        img_sq[:,:] = img[int((img.shape[0] - edge)\/2):int((img.shape[0] - edge)\/2)+edge,:]\n\n    assert(img_sq.shape[0] == edge and img_sq.shape[1] == edge)\n    \n    return img_sq<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def makesquare3(img):\n    \n    assert(img.ndim == 3)\n    \n    edge = min(img.shape[0],img.shape[1])\n        \n    img_sq = np.zeros((edge, edge, 3), 'uint8')\n    \n    if(edge == img.shape[0]):\n        img_sq[:,:,:] = img[:,int((img.shape[1] - edge)\/2):int((img.shape[1] - edge)\/2)+edge,:]\n    else:\n        img_sq[:,:,:] = img[int((img.shape[0] - edge)\/2):int((img.shape[0] - edge)\/2)+edge,:,:]\n\n    assert(img_sq.shape[0] == edge and img_sq.shape[1] == edge)\n    \n    return img_sq<\/pre>\n\n\n\n<p>Below you see the two functions <em>createMasks<\/em> and <em>createMasksAugmented<\/em>. We need these functions to create the training data, i.e. original images and mask images, from the json files. 
Both functions have been described in the previous <a href=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/08\/15\/image-processing-for-an-autonomous-guided-vehicle-in-a-learn-factory\/\">post<\/a>, therefore we only dig into the differences. <\/p>\n\n\n\n<p>In both functions you find a special handling for the category <em>Strasse<\/em> (Street in English). It is based on the following condition:<\/p>\n\n\n\n<p class=\"has-text-align-center\"><em>classes[shape[&#8216;label&#8217;]] != classes[&#8216;Strasse&#8217;]<\/em><\/p>\n\n\n\n<p>Regions marked as <em>Strasse<\/em> (Street in English) can overlap with regions marked as <em>Streifen<\/em> (Street Marking in English) or <em>Auto<\/em> (Car in English). We do not want street regions to overrule the street marking or car regions, otherwise the street markings or the cars would disappear from the masks. For this reason we first create a separate mask <em>mask_strasse<\/em>. All other category masks are subtracted from <em>mask_strasse<\/em>, and then <em>mask_strasse<\/em> is added to <em>finalmask<\/em>. 
Again, we just want to make sure that the street marking mask and the other masks are not overwritten by the street mask.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def createMasks(sourcejsonsdir, destimagesdir, destmasksdir):\n\n    assocf = open(os.path.join(path,\"assoc_orig.txt\"), \"w\")\n    count = 0\n    directory = sourcejsonsdir\n    for filename in os.listdir(directory):\n        if filename.endswith(\".json\"):\n            print(\"{}:{}\".format(count,os.path.join(directory, filename)))\n            \n            f = open(os.path.join(directory, filename))\n            data = json.load(f)\n            img_arr = data['imageData']  \n            imgdata = base64.b64decode(img_arr)\n\n            img = cv2.imdecode(np.frombuffer(imgdata, dtype=np.uint8), flags=cv2.IMREAD_COLOR)\n            \n            assert (img.shape[0] &gt; dim[0])\n            assert (img.shape[1] &gt; dim[1])\n           \n            finalmask = np.zeros((img.shape[0], img.shape[1]), 'uint8')\n        \n            masks=[]\n            masks_strassen=[]\n            mask_strasse = np.zeros((img.shape[0], img.shape[1]), 'uint8')\n            \n            for shape in data['shapes']:\n                assert(shape['label'] in classes)\n\n                vertices = np.array([[point[1],point[0]] for point in shape['points']])\n                vertices = vertices.astype(int)\n\n                rr, cc = polygon(vertices[:,0], vertices[:,1], img.shape)\n                mask_orig = np.zeros((img.shape[0], img.shape[1]), 'uint8')\n\n                mask_orig[rr,cc] = classes[shape['label']]\n                if classes[shape['label']] != classes['Strasse']:\n                    masks.append(mask_orig)\n                else:\n                    masks_strassen.append(mask_orig)\n         
           \n            for m in masks_strassen:\n                mask_strasse += m\n                    \n            for m in masks:\n                _,mthresh = cv2.threshold(m,0,255,cv2.THRESH_BINARY_INV)\n                finalmask = cv2.bitwise_and(finalmask,finalmask,mask = mthresh)\n                finalmask += m\n\n            _,mthresh = cv2.threshold(finalmask,0,255,cv2.THRESH_BINARY_INV)\n            mask_strasse = cv2.bitwise_and(mask_strasse,mask_strasse, mask = mthresh)\n            finalmask += mask_strasse  \n                \n            img = makesquare3(img)\n            finalmask = makesquare2(finalmask)\n\n            img_resized = cv2.resize(img, dim, interpolation = cv2.INTER_NEAREST)\n            finalmask_resized = cv2.resize(finalmask, dim, interpolation = cv2.INTER_NEAREST)\n            \n            filepure,extension = splitext(filename)\n            \n            cv2.imwrite(os.path.join(destimagesdir, \"{}o.png\".format(filepure)), img_resized)\n            cv2.imwrite(os.path.join(destmasksdir, \"{}o.png\".format(filepure)), finalmask_resized)\n\n            assocf.write(\"{:05d}o:{}\\n\".format(count,filename))\n            assocf.flush()\n            count += 1\n\n        else:\n            continue\n    f.close()<\/pre>\n\n\n\n<p>In <em>createMasks<\/em> you find the usage of <em>makesquare3<\/em> and <em>makesquare2<\/em>, since the video images are not square. 
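As a quick sanity check of the cropping rule, the helper below reproduces what <em>makesquare2<\/em> does on a landscape array; the name <em>center_square<\/em> is ours, written just for this sketch:

```python
import numpy as np

def center_square(img):
    # same cropping rule as makesquare2/makesquare3: keep the short edge,
    # trim the long edge evenly on both sides
    edge = min(img.shape[0], img.shape[1])
    off_y = (img.shape[0] - edge) // 2
    off_x = (img.shape[1] - edge) // 2
    return img[off_y:off_y + edge, off_x:off_x + edge]

# a toy landscape "video frame", 6 rows by 10 columns
frame = np.arange(6 * 10).reshape(6, 10).astype('uint8')
sq = center_square(frame)
print(sq.shape)   # prints: (6, 6) -> two columns trimmed on each side
```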
However, we prefer to use square images for neural network training.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def createMasksAugmented(ident, sourcejsonsdir, destimagesdir, destmasksdir):\n\n    assocf = open(os.path.join(path,\"assoc_{}_augmented.txt\".format(ident)), \"w\")\n\n    count = 0\n    directory = sourcejsonsdir\n    for filename in os.listdir(directory):\n        if filename.endswith(\".json\"):\n            print(\"{}:{}\".format(count,os.path.join(directory, filename)))\n\n            f = open(os.path.join(directory, filename))\n            data = json.load(f)\n\n            img_arr = data['imageData']  \n            imgdata = base64.b64decode(img_arr)\n\n            img = cv2.imdecode(np.frombuffer(imgdata, dtype=np.uint8), flags=cv2.IMREAD_COLOR)\n            \n            assert (img.shape[0] &gt; dim[0])\n            assert (img.shape[1] &gt; dim[1])\n            \n            zoom = randint(75,90)\/100.0\n            angle = (2*random()-1)*3.0\n            img_rotated = imutils.rotate_bound(img, angle)\n            \n            xf = int(img_rotated.shape[0]*zoom)\n            yf = int(img_rotated.shape[1]*zoom)            \n            \n            img_zoomed = np.zeros((xf, yf, img_rotated.shape[2]), 'uint8')\n            img_zoomed[:,:,:] = img_rotated[int((img_rotated.shape[0]-xf)\/2):int((img_rotated.shape[0]-xf)\/2)+xf,int((img_rotated.shape[1]-yf)\/2):int((img_rotated.shape[1]-yf)\/2)+yf,:] \n        \n\n            finalmask = np.zeros((img_zoomed.shape[0], img_zoomed.shape[1]), 'uint8')\n            mthresh = np.zeros((img_zoomed.shape[0], img_zoomed.shape[1]), 'uint8')\n            masks=[]\n            masks_strassen=[]\n            mask_strasse = np.zeros((img_zoomed.shape[0], img_zoomed.shape[1]), 'uint8')\n\n            for 
shape in data['shapes']:\n                assert(shape['label'] in classes)\n\n                vertices = np.array([[point[1],point[0]] for point in shape['points']])\n                vertices = vertices.astype(int)\n\n                rr, cc = polygon(vertices[:,0], vertices[:,1], img.shape)\n                mask_orig = np.zeros((img.shape[0], img.shape[1]), 'uint8')\n                mask_orig[rr,cc] = classes[shape['label']]\n                \n                mask_rotated = imutils.rotate_bound(mask_orig, angle)\n                \n                mask_zoomed = np.zeros((xf, yf), 'uint8')\n                mask_zoomed[:,:] = mask_rotated[int((img_rotated.shape[0]-xf)\/2):int((img_rotated.shape[0]-xf)\/2)+xf,int((img_rotated.shape[1]-yf)\/2):int((img_rotated.shape[1]-yf)\/2)+yf] \n                \n                if classes[shape['label']] != classes['Strasse']:\n                    masks.append(mask_zoomed)\n                else:\n                    masks_strassen.append(mask_zoomed)\n\n            for m in masks_strassen:\n                mask_strasse += m\n                    \n            for m in masks:\n                _,mthresh = cv2.threshold(m,0,255,cv2.THRESH_BINARY_INV)\n                finalmask = cv2.bitwise_and(finalmask,finalmask,mask = mthresh)\n                finalmask += m\n\n            _,mthresh = cv2.threshold(finalmask,0,255,cv2.THRESH_BINARY_INV)\n            mask_strasse = cv2.bitwise_and(mask_strasse,mask_strasse, mask = mthresh)\n            finalmask += mask_strasse    \n    \n            # contrast-&gt; alpha: 1.0 - 3.0; brightness -&gt; beta: 0 - 100\n            alpha = 0.8 + 0.4*random();\n            beta = int(random()*15)\n    \n            img_adjusted = cv2.convertScaleAbs(img_zoomed, alpha=alpha, beta=beta)\n        \n        \n            img_adjusted = makesquare3(img_adjusted)\n            finalmask = makesquare2(finalmask)\n\n            img_resized = cv2.resize(img_adjusted, dim, interpolation = cv2.INTER_NEAREST)\n        
    finalmask_resized = cv2.resize(finalmask, dim, interpolation = cv2.INTER_NEAREST)\n            \n            filepure,extension = splitext(filename)\n            \n            if randint(0,1) == 0:\n                cv2.imwrite(os.path.join(destimagesdir, \"{}{}.png\".format(filepure, ident)), img_resized)\n                cv2.imwrite(os.path.join(destmasksdir, \"{}{}.png\".format(filepure, ident)), finalmask_resized) \n            else:\n                cv2.imwrite(os.path.join(destimagesdir, \"{}{}.png\".format(filepure, ident)), cv2.flip(img_resized,1))\n                cv2.imwrite(os.path.join(destmasksdir, \"{}{}.png\".format(filepure, ident)), cv2.flip(finalmask_resized,1))    \n                \n            assocf.write(\"{:05d}:{}\\n\".format(count, filename))\n            assocf.flush()\n            count += 1\n\n        else:\n            continue\n    f.close()<\/pre>\n\n\n\n<p>The function <em>createMasksAugmented<\/em> above does basically the same thing as <em>createMasks<\/em>, but the image is randomly zoomed and rotated. Also the brightness and contrast are randomly adjusted. The purpose is to create more varied images from the original image for regularization.<\/p>\n\n\n\n<p>The original images and mask images for neural network training can be generated by calling the functions below. In the directory<em> fullpathjson <\/em>you find the source json files. The parameters <em>fullpathimages<\/em> and <em>fullpathmasks<\/em> are the destination directory names. <\/p>\n\n\n\n<p>Since we labeled 250 video images and stored them to json files, the functions below create altogether 1000 original and mask images. 
The function <em>createMasksAugmented<\/em> can be called additional times to generate even more data.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">createMasks(fullpathjson, fullpathimages, fullpathmasks)\ncreateMasksAugmented(\"a4\",fullpathjson, fullpathimages, fullpathmasks)\ncreateMasksAugmented(\"a5\",fullpathjson, fullpathimages, fullpathmasks)\ncreateMasksAugmented(\"a6\",fullpathjson, fullpathimages, fullpathmasks)<\/pre>\n\n\n\n<p>In Figure 3 below you find one result of the <em>createMasks<\/em> function. On the left there is an original image, in the middle the corresponding mask image, and on the right the overlaid image. The traffic sign region is marked yellow, the street is red, the street marking green, and the car blue.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"341\" src=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/testdatargb-1024x341.png\" alt=\"\" class=\"wp-image-2611\" srcset=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/testdatargb-1024x341.png 1024w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/testdatargb-300x100.png 300w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/testdatargb-768x256.png 768w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/testdatargb.png 1101w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 3: Original Image, mask image and overlaid image <\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">Training the model<\/h2>\n\n\n\n<p>We used different kinds of UNET models and compared the 
results from them. The UNET model from the previous <a href=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/08\/15\/image-processing-for-an-autonomous-guided-vehicle-in-a-learn-factory\/\">post<\/a> had mediocre results. For this reason we switched to a UNET model from a different author. It can be found <a href=\"https:\/\/github.com\/divamgupta\/image-segmentation-keras\">here<\/a>. The library <em>keras_segmentation<\/em> gave us much better predictions. However, we do not have as much control over the model itself, which is the reason we created our own UNET model, inspired by <em>keras_segmentation<\/em>.<\/p>\n\n\n\n<p>Above we already stated that we have five categories. Each category gets a number and a color assigned. The assignment can be seen in the Python dictionary below.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">colors = {0 : (0,0,0), \n          1 : (0,0,255), \n          2 : (0,255,0),\n          3 : (255,0,0), \n          4 : (0,255,255),         \n         }\n\n\ndim = (256, 256) <\/pre>\n\n\n\n<p>The UNET model we created is shown in the code below. It consists of a contracting and an expanding path. In the contracting path we have five convolution operations, followed by batch normalizations, ReLU activations and max poolings. In between we also use dropouts for regularization. The results of these operations are saved into the variables (<em>c0, c1, c2, c3, c4<\/em>). In the expanding path we implemented the upsampling operations and concatenate their outputs with the variables (<em>c0, c1, c2<\/em>) from the contracting path. Finally the code performs a softmax operation. <\/p>\n\n\n\n<p>In general the concatenation of the outputs with these variables shows benefits during training. 
Gradients of upper layers often vanish during backpropagation, and concatenation is a method to prevent this behavior. However, whether this is really the case here has not been proven.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def my_unet(classes):\n    \n    dropout = 0.4\n    input_img = Input(shape=(dim[0], dim[1], 3))\n    \n    #contracting path\n    x = (ZeroPadding2D((1, 1)))(input_img)\n    x = (Conv2D(64, (3, 3), padding='valid'))(x)\n    x = (BatchNormalization())(x)\n    x = (Activation('relu'))(x)\n    x = (MaxPooling2D((2, 2)))(x)\n    c0 = Dropout(dropout)(x)\n    \n    x = (ZeroPadding2D((1, 1)))(c0)\n    x = (Conv2D(128, (3, 3),padding='valid'))(x)\n    x = (BatchNormalization())(x)\n    x = (Activation('relu'))(x)\n    x = (MaxPooling2D((2, 2)))(x)\n    c1 = Dropout(dropout)(x)\n\n    x = (ZeroPadding2D((1, 1)))(c1)\n    x = (Conv2D(256, (3, 3), padding='valid'))(x)\n    x = (BatchNormalization())(x)\n    x = (Activation('relu'))(x)\n    x = (MaxPooling2D((2, 2)))(x)\n    c2 = Dropout(dropout)(x)\n    \n    x = (ZeroPadding2D((1, 1)))(c2)\n    x = (Conv2D(256, (3, 3), padding='valid'))(x)\n    x = (BatchNormalization())(x)\n    x = (Activation('relu'))(x)\n    x = (MaxPooling2D((2, 2)))(x)\n    c3 = Dropout(dropout)(x)\n    \n    x = (ZeroPadding2D((1, 1)))(c3)\n    x = (Conv2D(512, (3, 3), padding='valid'))(x)\n    c4 = (BatchNormalization())(x)\n\n    #expanding path\n    x = (UpSampling2D((2, 2)))(c4)\n    x = (concatenate([x, c2], axis=-1))\n    x = Dropout(dropout)(x)\n    x = (ZeroPadding2D((1, 1)))(x)\n    x = (Conv2D(256, (3, 3), padding='valid', activation='relu'))(x)\n    e4 = (BatchNormalization())(x)\n    \n    x = (UpSampling2D((2, 2)))(e4)\n    x = (concatenate([x, c1], axis=-1))\n    x = Dropout(dropout)(x)\n    x = 
(ZeroPadding2D((1, 1)))(x)\n    x = (Conv2D(256, (3, 3), padding='valid', activation='relu'))(x)\n    e3 = (BatchNormalization())(x)\n    \n    x = (UpSampling2D((2, 2)))(e3)\n    x = (concatenate([x, c0], axis=-1))\n    x = Dropout(dropout)(x)\n    x = (ZeroPadding2D((1, 1)))(x)\n    x = (Conv2D(64, (3, 3), padding='valid', activation='relu'))(x)\n    x = (BatchNormalization())(x)\n\n    x = (UpSampling2D((2, 2)))(x)\n    x = Conv2D(classes, (3, 3), padding='same')(x)\n    \n    x = (Activation('softmax'))(x)\n    \n    model = Model(input_img, x)\n        \n    return model<\/pre>\n\n\n\n<p>As described in the last <a href=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/08\/15\/image-processing-for-an-autonomous-guided-vehicle-in-a-learn-factory\/\"> post<\/a>, we use a different kind of mask representation for training than the masks created by <em>createMasks<\/em>. In our case, the UNET model needs masks in the shape 256x256x5. The function <em>createMasks<\/em> creates masks in shape 256&#215;256 though. For training it is preferable to have one layer for each category. This is why we implemented the function <em>makecolormask<\/em>. It is the same function as described in the last <a href=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/08\/15\/image-processing-for-an-autonomous-guided-vehicle-in-a-learn-factory\/\">post<\/a>, but it is optimized and performs better. 
See code below.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def makecolormask(mask):\n    ret_mask = np.zeros((mask.shape[0], mask.shape[1], len(colors)), 'uint8')\n    \n    for col in range(len(colors)):\n        ret_mask[:, :, col] = (mask == col).astype(int)\n                       \n    return ret_mask<\/pre>\n\n\n\n<p>Below we define callback functions for the training period. The callback function <em>EarlyStopping <\/em>stops the training after ten epochs if there is no improvement in the validation loss. The callback function <em>ReduceLROnPlateau<\/em> reduces the learning rate if there is no improvement in the validation loss after three epochs. And finally the callback function<em> ModelCheckpoint<\/em> creates a checkpoint of the UNET model&#8217;s weights whenever the validation loss improves.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">modeljsonname=\"model-chkpt.json\"\nmodelweightname=\"model-chkpt.h5\"\n\ncallbacks = [\n    EarlyStopping(patience=10, verbose=1),\n    ReduceLROnPlateau(factor=0.1, patience=3, min_lr=0.00001, verbose=1),\n    ModelCheckpoint(os.path.join(path, dirmodels,modelweightname), verbose=1, save_best_only=True, save_weights_only=True)\n]<\/pre>\n\n\n\n<p>The code below is used to load batches of original images and mask images into lists, which are fed into the training process. 
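To see what ends up being fed to the network, here is <em>makecolormask<\/em> applied to a toy 2x2 mask (the values are made up, just for illustration):

```python
import numpy as np

# same dictionary as above, only its length matters here
colors = {0: (0, 0, 0), 1: (0, 0, 255), 2: (0, 255, 0), 3: (255, 0, 0), 4: (0, 255, 255)}

def makecolormask(mask):
    # one binary layer per category (one-hot encoding of the category numbers)
    ret_mask = np.zeros((mask.shape[0], mask.shape[1], len(colors)), 'uint8')
    for col in range(len(colors)):
        ret_mask[:, :, col] = (mask == col).astype(int)
    return ret_mask

toy = np.array([[0, 1], [3, 4]], 'uint8')
onehot = makecolormask(toy)
print(onehot.shape)     # prints: (2, 2, 5) -> one layer per category
print(onehot[1, 0, 3])  # prints: 1 -> pixel (1,0) belongs to category 3 (Auto)
```

Exactly one layer is set per pixel, which matches the softmax output of the model.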
The function <em>generatebatchdata <\/em>has been described in the last <a href=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/08\/15\/image-processing-for-an-autonomous-guided-vehicle-in-a-learn-factory\/\">post<\/a>, so we omit the description since it is nearly identical.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def generatebatchdata(batchsize, fullpathimages, fullpathmasks):\n  \n    imagenames = os.listdir(fullpathimages)\n    imagenames.sort()\n\n    masknames = os.listdir(fullpathmasks)\n    masknames.sort()\n\n    assert(len(imagenames) == len(masknames))\n    \n    for i in range(len(imagenames)):\n        assert(imagenames[i] == masknames[i])\n\n    while True:\n        batchstart = 0\n        batchend = batchsize    \n        \n        while batchstart &lt; len(imagenames):\n            \n            imagelist = []\n            masklist = []\n            \n            limit = min(batchend, len(imagenames))\n\n            for i in range(batchstart, limit):\n                if imagenames[i].endswith(\".png\"):\n                    imagelist.append(cv2.imread(os.path.join(fullpathimages,imagenames[i]),cv2.IMREAD_COLOR ))\n                if masknames[i].endswith(\".png\"):\n                    masklist.append(makecolormask(cv2.imread(os.path.join(fullpathmasks,masknames[i]),cv2.IMREAD_UNCHANGED )))\n\n\n            train_data = np.array(imagelist, dtype=np.float32)\n            train_mask= np.array(masklist, dtype=np.float32)\n\n            train_data \/= 255.0\n    \n            yield (train_data,train_mask)    \n\n            batchstart += batchsize   \n            batchend += batchsize<\/pre>\n\n\n\n<p>The generators <em>generator_train<\/em> and <em>generator_valid<\/em> are instantiated from <em>generatebatchdata<\/em> below. 
We use a batch size of two. We had trouble with the training process breaking as soon as the batch size was too large. We assume this is because of a memory overflow in the graphics card; setting the batch size to a lower number worked fine. The method <em>fit_generator<\/em> starts the training process. The training took around ten minutes on an NVIDIA 2070 graphics card. The accuracy reached 98% and the validation accuracy 94%. This indicates quite some overfitting; however, we are aware that we have very few images to train on. Originally we created only 250 masks from the videos; the rest of the training data resulted from data augmentation.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">generator_train = generatebatchdata(2, fullpathimages, fullpathmasks)\ngenerator_valid = generatebatchdata(2, fullpathimagesvalid, fullpathmasksvalid)\nmodel.fit_generator(generator_train,steps_per_epoch=700, epochs=10, callbacks=callbacks, validation_data=generator_valid, validation_steps=100)<\/pre>\n\n\n\n<p>After training we saved the model. First the model structure is saved to a json file. Second the weights are saved by using the <em>save_weights <\/em>method.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">model_json = model.to_json()\nwith open(os.path.join(path, dirmodels,modeljsonname), \"w\") as json_file:\n    json_file.write(model_json)\n\nmodel.save_weights(os.path.join(path, dirmodels,modelweightname))<\/pre>\n\n\n\n<p>The <em>predict<\/em> method of the model predicts masks from original images. It returns a list of masks, see code below. 
The parameter <em>test_data<\/em> is a list of original images.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">predictions_test = model.predict(test_data, batch_size=1, verbose=1)<\/pre>\n\n\n\n<p>The masks in <em>predictions_test<\/em> from the model prediction have a 256x256x5 shape. Each layer represents a category. It is not convenient to display this mask, so <em>predictedmask<\/em> converts it back to a 256&#215;256 representation. The code below shows <em>predictedmask<\/em>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def predictedmask(masklist):\n    y_list = []\n    for mask in masklist:\n        assert mask.shape == (dim[0], dim[1], len(colors))\n        \n        imgret = mask.argmax(axis=2).astype(np.uint8)\n        \n        y_list.append(imgret)\n                    \n    return y_list<\/pre>\n\n\n\n<p>Another utility function for display purposes is <em>makemask<\/em>, see code below. It converts the 256&#215;256 mask representation to a color mask representation by using the Python dictionary <em>colors<\/em>. 
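Incidentally, the <em>argmax<\/em> in <em>predictedmask<\/em> is simply the inverse of the one-hot encoding done by <em>makecolormask<\/em>; a toy round trip (made-up values) demonstrates this:

```python
import numpy as np

# a toy 2x2 category mask
mask = np.array([[0, 2], [4, 1]], 'uint8')

# one-hot encode it, layer per category, as makecolormask does
onehot = np.zeros((2, 2, 5), 'uint8')
for col in range(5):
    onehot[:, :, col] = (mask == col)

# argmax over the layer axis recovers the category numbers, as predictedmask does
recovered = onehot.argmax(axis=2)
print((recovered == mask).all())   # prints: True
```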
Again, this is a function from the last <a href=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/08\/15\/image-processing-for-an-autonomous-guided-vehicle-in-a-learn-factory\/\">post<\/a>, but optimized for performance.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def makemask(mask):\n    ret_mask = np.zeros((mask.shape[0], mask.shape[1], 3), 'uint8')\n\n    # add the RGB color of category col to every pixel belonging to it\n    for col in range(len(colors)):\n        layer = mask[:, :] == col\n        ret_mask[:, :, 0] += ((layer)*(colors[col][0])).astype('uint8')\n        ret_mask[:, :, 1] += ((layer)*(colors[col][1])).astype('uint8')\n        ret_mask[:, :, 2] += ((layer)*(colors[col][2])).astype('uint8')\n    \n    return ret_mask<\/pre>\n\n\n\n<p>In Figure 4 you find one prediction from an original image. The original image is on the left side. 
The predicted mask is shown in the middle, and the combined image on the right side.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"337\" src=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/predictdatargb-1024x337.png\" alt=\"\" class=\"wp-image-2612\" srcset=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/predictdatargb-1024x337.png 1024w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/predictdatargb-300x99.png 300w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/predictdatargb-768x253.png 768w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/predictdatargb.png 1112w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 4: Original image, predicted mask and overlaid image<\/figcaption><\/figure>\n\n\n\n<p>You can see that the street was recognized pretty well; however, parts of the sky were classified as street as well. The traffic sign was labeled too, but not completely. As mentioned before, we assume that much more training data would yield better results.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Creating an augmented video<\/h2>\n\n\n\n<p>The code below creates an augmented video from an original video the students made at the beginning of the project. The name of the original video is <em>video3.mp4<\/em>; the name of the augmented video is <em>videounet-3-drop.mp4<\/em>. The method <em>read<\/em> of <em>VideoCapture<\/em> reads each single image from <em>video3.mp4<\/em> and stores it in the local variable <em>frame<\/em>. The image in <em>frame<\/em> is normalized and the mask image is predicted. The function <em>predictedmask<\/em> converts the mask into a 256&#215;256 representation, and the function <em>makemask<\/em> creates a color image from it. 
Finally, <em>frame<\/em> and the color mask are overlaid and saved to the new augmented video <em>videounet-3-drop.mp4<\/em>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">cap = cv2.VideoCapture(os.path.join(path,\"videos\",'video3.mp4'))\nif cap.isOpened():\n    print(\"Opening video stream or file\")\n    out = cv2.VideoWriter(os.path.join(path,\"videos\",'videounet-3-drop.mp4'),cv2.VideoWriter_fourcc(*'MP4V'), 25, (256,256))\n    while cap.isOpened():\n        ret, frame = cap.read()\n        if ret == False:\n            break\n\n        # normalize the frame and predict its mask\n        test_data = []\n        test_data.append(frame)\n        test_data = np.array(test_data, dtype=np.float32)\n        test_data \/= 255.0\n        predicted = model.predict(test_data, batch_size=1, verbose=0)\n        assert(len(predicted) == 1)\n        pmask = predictedmask(predicted)\n        mask = makemask(pmask[0])\n\n        # overlay the original frame with the color mask and save it\n        weighted = np.zeros((dim[0], dim[1], 3), 'uint8')\n        cv2.addWeighted(frame, 0.6, mask, 0.4, 0, weighted)\n        out.write(weighted)\n        cv2.imshow('Frame',weighted)\n        if cv2.waitKey(25) &amp; 0xFF == ord('q'):\n            break\n\n    out.release()\n\ncap.release()\ncv2.destroyAllWindows()<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>In Figure 5 you see a snapshot of the augmented video. The street is marked pretty well; you can even see that the sidewalk is not marked as street. On the right side you find a traffic sign in yellow. The prediction works well here, too. On the left side you find blue markings. The color blue categorizes cars. 
However, there are no cars in the original image, so this is a misinterpretation. The street markings in green are not shown very well: in some parts of the augmented video they appear very strong, while in others you see only fragments, as in Figure 5.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"905\" src=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/videosnap-1024x905.png\" alt=\"\" class=\"wp-image-2609\" srcset=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/videosnap-1024x905.png 1024w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/videosnap-300x265.png 300w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/videosnap-768x679.png 768w, https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/files\/2020\/08\/videosnap.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 5: Snapshot of the augmented video<\/figcaption><\/figure>\n<\/div>\n\n\n<p>Overall we are very happy with the result of this project, considering that we labeled only 250 images. We are convinced that we would get better results with more training images. To mitigate the problem we used data augmentation: from the 250 images we created around 3000 images. We also randomly flipped the images horizontally in a code version not discussed here.<\/p>\n\n\n\n<p>We think that we have too much overfitting, which can only be solved with more training data. 
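<\/p>\n\n\n\n<p>The horizontal flipping mentioned above can be sketched as follows. This is a minimal illustration with our own, hypothetical function names (<em>flip_pair<\/em> and <em>augment_dataset<\/em> are not from the project code); it assumes the images and masks are NumPy arrays and flips each pair together so the category labels stay aligned with the pixels.<\/p>\n\n\n\n

```python
import numpy as np

# Hypothetical sketch of the horizontal-flip augmentation mentioned
# in the text; the function names are ours, not from the project code.
def flip_pair(image, mask):
    # Flip an image (H x W x 3) and its mask (H x W) along the width
    # axis. Both must be flipped together so the labels stay aligned.
    return np.fliplr(image), np.fliplr(mask)

def augment_dataset(images, masks):
    # Append a flipped copy of every image/mask pair,
    # doubling the amount of training data.
    aug_images, aug_masks = list(images), list(masks)
    for img, msk in zip(images, masks):
        flipped_img, flipped_msk = flip_pair(img, msk)
        aug_images.append(flipped_img)
        aug_masks.append(flipped_msk)
    return aug_images, aug_masks
```

\n\n\n\n<p>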
Due to the time limit, we had to restrict ourselves to fewer training images.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Acknowledgement<\/h2>\n\n\n\n<p>Special thanks to the class of Summer Semester 2020 <em>Design Cyber Physical Systems<\/em> for providing the videos taken while driving and for labeling the images used as training data for the neural network. We appreciate this very much, since it is a lot of effort.<\/p>\n\n\n\n<p>Also special thanks to the University of Applied Sciences Albstadt-Sigmaringen for providing the infrastructure and the equipment that enabled this class and this research.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>For the summer semester 2020 we offered an elective class called Design Cyber Physical Systems. This time only few students enrolled into the class. The advantage of having small classes is that we can focus on a certain topic without loosing the overview during project execution. The class starts with a brain storming session to &hellip; <a href=\"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/2020\/08\/20\/street-scene-segmentation-with-a-unet-model\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Street Scene Segmentation with a 
UNET-Model<\/span><\/a><\/p>\n","protected":false},"author":24,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[4,6,3,5,7,14],"class_list":["post-2589","post","type-post","status-publish","format-standard","hentry","category-allgemein","tag-ai","tag-classification","tag-deep-learning","tag-ki","tag-neural-network","tag-unet"],"_links":{"self":[{"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/posts\/2589","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/users\/24"}],"replies":[{"embeddable":true,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/comments?post=2589"}],"version-history":[{"count":262,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/posts\/2589\/revisions"}],"predecessor-version":[{"id":4840,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/posts\/2589\/revisions\/4840"}],"wp:attachment":[{"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/media?parent=2589"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/categories?post=2589"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www3.hs-albsig.de\/wordpress\/point2pointmotion\/wp-json\/wp\/v2\/tags?post=2589"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}