Recognizing Seam Errors and Breaks with a residual neural network

Introduction

Once a year we offer a class called Forschungsprojekt Industrie 4.0 (Research Project Industry 4.0), which is a practical assignment for enrolled students. This time the goal of the assignment was to create a quality control system that recognizes errors and breaks on textile seams. Six students enrolled in this specific assignment.

In Figure 1 you see, in the upper left, a sewing machine from above. We converted a lamp holder into a camera holder by attaching a USB camera to it (1 in Figure 1). Then we moved the USB camera right next to the sewing machine (2 in Figure 1) to get a camera view of the sewed textile.

Figure 1: Setup

While the sewing machine is running, the USB camera takes images of the textile with its sewed string. Each image is sent to a quality control system, which should recognize errors and breaks. So the assignment for the students is to create this quality control system.

Preparations

We decided to use a neural network to recognize breaks and errors in the images taken by the USB camera. To train the neural network, the students needed to gather numerous training images.

For this purpose the students cut textile stripes, which can be seen in Figure 2. Then the students sewed seams along the textile stripes while the USB camera recorded them. The videos were saved as mp4 files.

Sewing machines produce errors or breaks only in rare cases. This is why we had to generate the errors and breaks ourselves.

Figure 2: Stripes

In Figure 3 you can see how we created errors and breaks. The left picture shows how an error is produced by sticking scissors under the seam and widening the string. The right picture shows how we produce a break by cutting the string with the scissors. So we already have two categories to distinguish: “errors” and “breaks”. From now on we call them attributes. There are two more attributes to come.

Figure 3: Prepare errors and breaks

On sewing machines you can set the distance between stitches. Possible values are e.g. 2 mm or 4 mm. We decided to use exactly these two values and defined the attribute “length” for this distinction. If the attribute “length” is true, we have a stitch distance of 2 mm; if it is false, we have a stitch distance of 4 mm.

The last attribute we call “good”. It simply means that we recognize the seam on the image as a good seam. If the attribute “good” is set to false, there is a problem with an “error”, a “break” or the “length”. This makes four attributes altogether: “good”, “error”, “break” and “length”.

Figure 4: Images of categories “good”, “error”, “break” and “length”

In Figure 4 you find an image for each attribute. On the left you see a “good” image. In the second picture from the left you see an “error”. In the second picture from the right you find a “break”. The rightmost picture shows a stitch distance of 4 mm, i.e. the attribute “length” has the value false.

You might have noticed that the rightmost picture and the second picture from the right have the same stitch distance. So the second picture from the right shows a “break” and a stitch distance of 4 mm. This means the classifications are not exclusive. An image from the USB camera can show all attributes as true, all attributes as false, or any other combination.
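To make this concrete, here are a few illustrative attribute combinations in the order “good”, “error”, “break”, “length” (the example values are made up and only show possible combinations):

# Illustrative label vectors, ordered as ["good", "error", "break", "length"].
good_seam_2mm   = [True,  False, False, True ]   # clean seam with 2 mm stitch distance
break_only_4mm  = [False, False, True,  False]   # broken seam with 4 mm stitch distance
error_and_break = [False, True,  True,  True ]   # error and break on the same image, 2 mm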

Creating labeled data

In Figure 1 you can see the setup of the sewing machine and the attached USB camera. We used this setup to create videos while sewing strings on the textile stripes (Figure 2). Since errors and breaks do not happen very often, we prepared the textile stripes with the scissors and created new videos from them. Our goal was to have balanced data, which means that there is a good proportion of errors and breaks in our data set. We are not describing the code to create the videos here; you could actually use any cell phone for this task.

However, we want to show how we labeled the training images resulting from the videos. The code below sets the base path and assigns the video file path to the variable video. We will also use a file called data.csv to store the values (true or false) of each attribute for every image. It is basically a lookup table. Its filename is assigned to datacsv.

basepath = r"..."
videos = "videos"
videopath = os.path.join(basepath,videos)

videonamepre = "video_2021_05_05_09_34_57__2mm_Kombi"
videonameext = "mp4"
datacsv = videonamepre + "_data.csv"
videoname = videonamepre + "." + videonameext
video = os.path.join(videopath,videoname)

The list picnames contains the names of the attributes, see the code below. The list picstats contains the true or false values of the attributes for one image. The variable picspath is the path of the directory where we store the images extracted from the videos.

picnames = ["good","error","break","length"]
picstats = [False, False, False, False]

picspath = os.path.join(basepath, "pics")

The function save_entry is shown below. It uses the Python library pandas to append the classification information (parameters name, picnames and picstats) to the data.csv file. If a data.csv file already exists, save_entry reads it in and loads its content into the list datalist. It then puts the name, picnames and picstats values into the dictionary dataitem and appends it to datalist. Finally, save_entry converts datalist into a pandas data frame and writes the content back into the data.csv file.

def save_entry(name, picnames, picstats):

    datalist = []
    csvpath = os.path.join(picspath, datacsv)

    # If a data.csv file already exists, load its entries into datalist first.
    if os.path.isfile(csvpath):
        df = pd.read_csv(csvpath)
        for index, row in df.iterrows():
            datalist.append({'name': row['name'],
                             picnames[0]: row[picnames[0]],
                             picnames[1]: row[picnames[1]],
                             picnames[2]: row[picnames[2]],
                             picnames[3]: row[picnames[3]]})

    # Append the new entry and write everything back to data.csv.
    dataitem = {'name': name,
                picnames[0]: picstats[0],
                picnames[1]: picstats[1],
                picnames[2]: picstats[2],
                picnames[3]: picstats[3]}
    datalist.append(dataitem)
    df = pd.DataFrame(datalist)
    df.to_csv(csvpath)

    return

The code below is a small application to label the images from the previously created videos. First it opens the video video with OpenCV’s VideoCapture constructor. The code then runs a loop to process each single image of the video. OpenCV’s function waitKey stops the code execution until a key is pressed. If the user presses the key “n”, the code reads in the next image of the video. It then puts text with the classification information onto the image and displays it. Basically it displays the keys g for “good”, e for “error”, b for “break”, l for “length” and their attribute values. The user can press the keys g, e, b and l to toggle the attribute values. If the user presses the s key, the application saves the attribute values into the data.csv file with the function save_entry and writes the image into the pics directory.

count = 0
cap = cv2.VideoCapture(video)

if cap.isOpened():
    ret, frame = cap.read() 

while(cap.isOpened()):

    if ret == True:
        framedis = frame.copy()
        framedis = cv2.putText(framedis, str(count), (5, 15), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255,255,255), 2, cv2.LINE_AA)
        framedis = cv2.putText(framedis, str(count), (5, 15), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0,0,0), 1, cv2.LINE_AA)
        framedis = cv2.putText(framedis, picnames[0] + " (g) " + str(picstats[0]), (35, 15), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255,255,255), 3, cv2.LINE_AA)
        framedis = cv2.putText(framedis, picnames[0] + " (g) " + str(picstats[0]), (35, 15), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0,0,0), 1, cv2.LINE_AA)
        framedis = cv2.putText(framedis, picnames[1] + " (e) " + str(picstats[1]), (35, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255,255,255), 3, cv2.LINE_AA)
        framedis = cv2.putText(framedis, picnames[1] + " (e) " + str(picstats[1]), (35, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0,0,0), 1, cv2.LINE_AA)
        framedis = cv2.putText(framedis, picnames[2] + " (b) " + str(picstats[2]), (35, 45), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255,255,255), 3, cv2.LINE_AA)
        framedis = cv2.putText(framedis, picnames[2] + " (b) " + str(picstats[2]), (35, 45), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0,0,0), 1, cv2.LINE_AA)
        framedis = cv2.putText(framedis, picnames[3] + " (l) " + str(picstats[3]), (35, 60), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255,255,255), 3, cv2.LINE_AA)
        framedis = cv2.putText(framedis, picnames[3] + " (l) " + str(picstats[3]), (35, 60), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0,0,0), 1, cv2.LINE_AA)
        
        cv2.imshow('frame',framedis)
    else:
        break
        
    key = cv2.waitKey(0) & 0xFF
    if key == ord('q'):
        break
    if key == ord('n'):
        ret, frame = cap.read()
        count += 1
    # Toggle the attribute values with the keys g, e, b and l.
    if key == ord('g'):
        picstats[0] = not picstats[0]
    if key == ord('e'):
        picstats[1] = not picstats[1]
    if key == ord('b'):
        picstats[2] = not picstats[2]
    if key == ord('l'):
        picstats[3] = not picstats[3]
            
            
    if key == ord('s'):
        save_entry(videonamepre+f"_{count}.png", picnames, picstats)
        cv2.imwrite(os.path.join(picspath, videonamepre+f"_{count}.png"),frame)       
            
cap.release()
cv2.destroyAllWindows()

In Figure 5 you can see how the application displays the current image of a video. The user can change the attribute values by pressing one of the described keys and save the image together with its attribute values by pressing s.

Figure 5: Labeling

The students working on this project created around 15000 images and saved their attribute values with this application into csv files.
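For illustration, the first lines of such a csv file written by save_entry could look like this (the label values in these rows are made-up examples):

,name,good,error,break,length
0,video_2021_05_05_09_34_57__2mm_Kombi_0.png,True,False,False,True
1,video_2021_05_05_09_34_57__2mm_Kombi_1.png,False,True,False,True
2,video_2021_05_05_09_34_57__2mm_Kombi_2.png,False,False,True,True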

Preprocessing the data

For training we need both training data and validation data, which we want to keep separate. We do this by creating two files with lookup tables, a train.csv file and a valid.csv file. Both files contain the image file names, their path locations and their attribute values. We also take a small number of images and assign them to a test data file which we call test.csv.

The function createcsv below creates a new csv file, whose name is passed in as the parameter datacsv, from the list piclist. The list piclist contains image filenames and their attribute values.

def createcsv(datacsv, targetpath, piclist):

    datalist = []

    # Collect the name and the four attribute values of every image.
    for item in piclist:
        datalist.append({'name': item[0],
                         classes[0]: item[1][classes[0]],
                         classes[1]: item[1][classes[1]],
                         classes[2]: item[1][classes[2]],
                         classes[3]: item[1][classes[3]]})

    df = pd.DataFrame(datalist)
    df.to_csv(os.path.join(targetpath, datacsv))

The code below opens all csv files located inside the fullpathpic directory one by one. It reads each file’s content into a pandas data frame df. The code iterates through the content and moves each entry into the list picnamelist. So each entry of picnamelist contains the full path and filename of an image together with its attribute values.

The list picnamelist is then shuffled with the shuffle function of the random module, and 20 percent of the picnamelist elements are moved into validlist. The code moves another two percent of the picnamelist elements into testlist, and finally the code moves the remaining content into trainlist (around 78 percent). The code subsequently calls the function createcsv with validlist, testlist and trainlist as parameters to store the lookup tables in valid.csv, test.csv and train.csv.

picnamelist = []
validlist = []
testlist = []
trainlist = []

# Collect the full image paths and the attribute values from all csv files.
for file in os.listdir(fullpathpic):
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join(fullpathpic, file), index_col=0)

        for index, row in df.iterrows():
            picname = row["name"]

            if os.path.isfile(os.path.join(fullpathpic, picname)):
                picnamelist.append([os.path.join(fullpathpic, picname), row])


random.shuffle(picnamelist)

num = 20*len(picnamelist) // 100
validlist = picnamelist[:num]
testlist = picnamelist[num:num+num//10]
trainlist =  picnamelist[num+num//10:]

createcsv("valid.csv", basepath, validlist)
createcsv("test.csv", basepath, testlist)
createcsv("train.csv", basepath, trainlist)

Training

The code below defines the basepath and the locations of the lookup tables for the training, validation and testing data (traincsv, validcsv and testcsv). The code stores the model in the path model_path; the model filename is modelsaved. The list classes contains the attributes, like picnames above (note that the code above and the code below belong to different code files).

The pandas read_csv function reads the training and validation lookup tables; their shape attribute gives the number of training and validation images, which we assign to lentrain and lenvalid.

basepath = r"..."
traincsv = os.path.join(basepath, "train.csv")
validcsv = os.path.join(basepath, 'valid.csv')
testcsv = os.path.join(basepath, 'test.csv')
model_path = os.path.join(basepath, 'models')

classes = ["good","error","break","length"]

now = datetime.datetime.now()

modelsaved = "model.h5"

lentrain = pd.read_csv(traincsv, index_col = 0).shape[0]
lenvalid = pd.read_csv(validcsv, index_col = 0).shape[0]

We have described the function generatebatchdata previously here, so we don’t go into too much detail. As parameters we pass the previously created lookup tables, which are opened with pandas’ read_csv function. The code moves the filenames into the list filenames and their attribute values into the list classnumbers. During training, batchsize elements are taken from both lists filenames and classnumbers, then the images are opened with OpenCV’s function imread and returned to the training process. Around 70 percent of the images are augmented in brightness and contrast with OpenCV’s convertScaleAbs function.

def generatebatchdata(batchsize, datacsv, classes):

    filenames = []
    classnumbers = []
    
    df = pd.read_csv(datacsv, index_col = 0) 
        
    for index, row in df.iterrows():
        filenames.append(row["name"])
        classnumbers.append([int(row[classes[0]]), int(row[classes[1]]), int(row[classes[2]]), int(row[classes[3]])])
           
    while True:
        batchstart = 0
        batchend = batchsize    
        
        while batchstart < len(filenames):
            
            imagelist = []
            classlist = []
            
            limit = min(batchend, len(filenames))

            for i in range(batchstart, limit):
                # Read the image and resize it to the target size dim.
                img = cv2.resize(cv2.imread(filenames[i], cv2.IMREAD_COLOR), dim, interpolation=cv2.INTER_AREA)
                # Randomly change brightness and contrast for around 70 percent of the images.
                if random.random() > 0.3:
                    alpha = 0.8 + 0.4*random.random()
                    beta = int(random.random()*15)
                    img = cv2.convertScaleAbs(img, alpha=alpha, beta=beta)

                imagelist.append(img)
                classlist.append(classnumbers[i])


            train_data = np.array(imagelist, dtype=np.float32)
            train_data -= train_data.mean()
            train_data /= train_data.std()
            train_class= np.array(classlist, dtype=np.float32)

            yield (train_data,train_class)    

            batchstart += batchsize   
            batchend += batchsize

We instantiate two generators from generatebatchdata: generator_train and generator_valid. We set the batch size for training to 20 and for validation to one.

batchsizetrain = 20
batchsizevalid = 1

generator_train = generatebatchdata(batchsizetrain, traincsv , classes)
generator_valid = generatebatchdata(batchsizevalid, validcsv, classes)

The functions relu_bn, residual_block, and create_res_net in the code below create a residual neural network (ResNet). Dorian Lazar supplied the code on github and it can be found here. We are not going too much into detail, but create_res_net implements a ResNet from elements as shown in Figure 6. You see here one residual block element for the case that the downsample parameter is set to true. The function create_res_net appends several such blocks into one complete ResNet.

Figure 6: Residual Block

We slightly modified the code at the end of the function create_res_net. As output we have a dense layer with four neurons, which is the length of the list classes (i.e. the number of attributes we use). As the last activation function we use the sigmoid function. Using the sigmoid function in combination with the binary_crossentropy loss function is common practice for multi-label classification problems.

from tensorflow import Tensor
from tensorflow.keras.layers import (Input, Conv2D, ReLU, BatchNormalization,
                                     Add, AveragePooling2D, Flatten, Dense)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def relu_bn(inputs: Tensor) -> Tensor:
    relu = ReLU()(inputs)
    bn = BatchNormalization()(relu)
    return bn

def residual_block(x: Tensor, downsample: bool, filters: int, kernel_size: int = 3) -> Tensor:
    y = Conv2D(kernel_size=kernel_size,
               strides= (1 if not downsample else 2),
               filters=filters,
               padding="same")(x)
    y = relu_bn(y)
    y = Conv2D(kernel_size=kernel_size,
               strides=1,
               filters=filters,
               padding="same")(y)

    if downsample:
        x = Conv2D(kernel_size=1,
                   strides=2,
                   filters=filters,
                   padding="same")(x)

    out = Add()([x, y])
    out = relu_bn(out)
    return out

def create_res_net():
    
    inputs = Input(shape=(dim[0], dim[1], 3))
    num_filters = 64
    
    t = BatchNormalization()(inputs)
    t = Conv2D(kernel_size=3,
               strides=1,
               filters=num_filters,
               padding="same")(t)
    t = relu_bn(t)
    
    num_blocks_list = [2, 4, 2]
    for i in range(len(num_blocks_list)):
        num_blocks = num_blocks_list[i]
        for j in range(num_blocks):
            t = residual_block(t, downsample=(j==0 and i!=0), filters=num_filters)
        num_filters *= 2
    
    t = AveragePooling2D(4)(t)
    t = Flatten()(t)
    outputs = Dense(len(classes), activation='sigmoid')(t)
    
    model = Model(inputs, outputs)

    return model

The code below creates a model with the function create_res_net and assigns it to the variable model. The variable model is compiled with the binary_crossentropy loss function. The summary method shows the structure of the model.

model = create_res_net()  
model.compile(Adam(lr=.00001), loss="binary_crossentropy", metrics=['accuracy'])
model.summary()

For training and validation you need to specify the number of steps: stepstrainimages and stepsvalidimages. We can calculate them by dividing the number of training images by the training batch size and the number of validation images by the validation batch size.

stepstrainimages = lentrain//batchsizetrain
stepsvalidimages = lenvalid//batchsizevalid

Below, the code trains the model with its fit method. Parameters are the generators generator_train and generator_valid. The numbers of steps (stepstrainimages and stepsvalidimages) also need to be given. After execution, the fit method returns its history, whose last loss, val_loss, accuracy and val_accuracy values are put into the checkpoint filename modelsaved. By looking at the model’s checkpoint name, we get an indication of the quality of the training. The code saves the checkpoint with the save_weights method.

hist = model.fit(generator_train,steps_per_epoch=stepstrainimages, epochs=10, validation_data=generator_valid, validation_steps=stepsvalidimages)

tl=hist.history['loss'][-1]
vl=hist.history['val_loss'][-1]
ta=hist.history['accuracy'][-1]
va=hist.history['val_accuracy'][-1]

modelsaved = f"model_{net}_{tl:.2f}_{vl:.2f}_{ta:1.3f}_{va:1.3f}.h5"
model.save_weights(os.path.join(model_path,modelsaved))

After training we compared the training losses with the validation losses. We found that the training losses are smaller than the validation losses, which indicates overfitting. In the result section below we show the prediction results on the testing data.
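A simple way to check this is to plot the loss curves stored in the history object returned by fit. The following is a minimal sketch, assuming matplotlib is installed; hist is the return value of the fit call above:

import matplotlib.pyplot as plt

# Plot training and validation loss per epoch from the Keras history object.
plt.plot(hist.history['loss'], label='training loss')
plt.plot(hist.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('binary crossentropy loss')
plt.legend()
plt.show()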

Result

We created lookup tables not only for the training and validation data, but also for the testing data in the file test.csv. We set aside around 300 images for this. Then we ran the 300 images through the model’s predict method and compared the results with the attribute values in the lookup table. We are not showing our evaluation code here. Below you find the percentages of correct answers for the attributes “good”, “error”, “break” and “length”. We see that “length” reaches 100 percent correctness (all images were predicted correctly concerning the attribute “length”), while “error” and “break” have a prediction accuracy of around 97 percent. Only the “good” category has less accuracy. We explain this by the fact that the students’ decisions for “good” during labeling could be quite error prone: six students can label the attribute “good” very subjectively, while the decision to label an “error”, a “break” or the “length” is much clearer to make.

Good Accuracy: 82.70270270270271 
Error Accuracy: 97.2972972972973 
Break Accuracy: 97.02702702702702 
Length Accuracy: 100.0
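We did not show our evaluation code above; a sketch of how such a per-attribute comparison could look is given below. It assumes the variables testcsv, classes, dim and model from the training script above, and it thresholds the sigmoid outputs at 0.5, which is our own choice:

# Hypothetical evaluation sketch: compare thresholded predictions with test.csv.
df = pd.read_csv(testcsv, index_col=0)

images, labels = [], []
for _, row in df.iterrows():
    img = cv2.resize(cv2.imread(row["name"], cv2.IMREAD_COLOR), (dim[0], dim[1]), interpolation=cv2.INTER_AREA)
    images.append(img)
    labels.append([int(row[c]) for c in classes])

test_data = np.array(images, dtype=np.float32)
test_data -= test_data.mean()
test_data /= test_data.std()
labels = np.array(labels)

# Sigmoid outputs above 0.5 count as attribute value True.
predicted = (model.predict(test_data) > 0.5).astype(int)

for i, c in enumerate(classes):
    accuracy = (predicted[:, i] == labels[:, i]).mean() * 100
    print(f"{c} accuracy: {accuracy:.2f} %")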

To further validate the test images, we created heatmaps. A heatmap points out the pixels which lead to the neural network’s classification decision. We have described the code for creating heatmaps here, so we will not show it again in this post.

Figure 7 shows four heatmaps. The leftmost picture correctly shows an error in blue. This means that the model correctly identified the pixels around the seam error, which led to the right classification decision. The same is true for the second picture from the left: we have a break here, and the heatmap correctly shows the pixels of the break in green. Since we have multi-label classification, there are cases where we have an error and a break at the same time. The second picture from the right shows this case. Here too the decision was made correctly, with the pixels around the breaks in green and around the error in blue. The rightmost picture shows in red the pixels leading to the length decision.

Figure 7: Heatmaps

One last remark. The generatebatchdata function described above did some data augmentation by changing the brightness and the contrast of 70 percent of all images. Actually we did even more data augmentation, which is not shown here. The 15000 images were doubled to 30000 images by modifying them in the following way: we took a portion from the top of each image and appended it to the bottom. Figure 8 illustrates how we did this. These are simple OpenCV and NumPy operations, so we leave out the original code here; a small sketch follows Figure 8.

Figure 8: Data Augmentation
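The augmentation described above can be sketched with plain NumPy slicing. This is a minimal sketch, not our original code; the cut position and the file names are arbitrary examples:

import cv2
import numpy as np

def shift_seam_vertically(image, cut):
    # Move the top 'cut' rows of the image to the bottom.
    top = image[:cut, :, :]
    rest = image[cut:, :, :]
    return np.vstack((rest, top))

# Hypothetical file names for illustration.
img = cv2.imread("seam_image.png", cv2.IMREAD_COLOR)
augmented = shift_seam_vertically(img, cut=img.shape[0] // 3)
cv2.imwrite("seam_image_shifted.png", augmented)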

Acknowledgement

Special thanks to the class of Summer Semester 2021 of Forschungsprojekt Industrie 4.0 for providing the 15000 images used as training data for the neural network. We appreciate this very much, and we know how much effort you have put into this.

Also special thanks to the University of Applied Sciences Albstadt-Sigmaringen for hosting the learn-factory and providing the equipment to enable this research.

Gradient-weighted Class Activation Mapping with fruit images

In a previous blog post we documented methods and code for recognizing fruits with a neural network on a Raspberry Pi. This time we want to go one step further and describe the method and code to create heatmaps for fruit images. The method we use here is called Gradient-weighted Class Activation Mapping (Grad-CAM). A heatmap tells us which pixels of a fruit image lead the neural network to assign the input image to a specific class.

We believe that heatmaps are very useful, especially for validation after a neural network has been trained. We want to know whether the neural network is really making the right decision based on the given information (such as a fruit image). A good example is the classification of wolf and husky images once made by researchers (“Why Should I Trust You?”, see References). The researchers actually had pretty good results, until somebody figured out that most wolf images were taken in a snowy environment, while the husky images were not. The neural network mostly associated a snowy environment with a wolf. A husky was therefore classified as a wolf when the image was taken with snow in the background.

For a deeper neural network validation, we can use heatmaps to see how the classification decision was made. Below we show how we generate heatmaps from fruit images. The Keras website helped us a lot in writing our code, see also References.

The Setup

We use three different classes of images: Apfel (apple) images, Orange images and Tomate (tomato) images, see Figure 1. The list classes, see the code below, contains strings describing the classes. We filled the training, validation and test directories with fruit images similar to those in Figure 1. Each class of fruit images went into its own directory named Apfel, Orange or Tomate.

Figure 1: Images of an apple, an orange and a tomato

The code below uses the paths traindir, validdir and testdir. The weights and the structure of the neural network model (modelsaved and modeljson) are saved into the model directory.

The classes Apfel, Orange and Tomate are mapped to numbers with the Python dictionary classnum. The variable dim defines the size of the images.

basepath = "/home/....../Session1"
traindir = os.path.join(basepath, "pics" , "train")
validdir = os.path.join(basepath, "pics" , 'valid')
testdir = os.path.join(basepath, "pics" , 'test')
model_path = os.path.join(basepath, 'models')

classes = ["Apfel","Orange","Tomate"]
classnum = {"Apfel":0, "Orange":1, "Tomate":2}

net = 'convnet'

now = datetime.datetime.now()

modelweightname = f"model_{net}_{now.year}-{now.month}-{now.day}_callback.h5"
modelsaved = f"model_{net}.h5"
modeljson = f"model_{net}.json"


dim = (100,100, 3)

The Model

The following code shows the convolutional neural network (CNN) in Keras. We have four convolutional layers and two dense layers. The last dense layer has three neurons; each output neuron indicates whether an image is predicted as an apple, an orange or a tomato. Since the classes are exclusive (an apple cannot be a tomato), we use the softmax activation function for the last layer.

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, BatchNormalization,
                                     Flatten, Dense)
from tensorflow.keras.optimizers import Adam

def create_conv_net():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(dim[0], dim[1], 3)))
    model.add(BatchNormalization())
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(len(classes), activation='softmax'))
    #opt = Adam(lr=1e-3, decay=1e-3 / 200)
    return model

The function create_conv_net creates the model. It is compiled with the categorical_crossentropy loss function, see the code below.

model = create_conv_net()
model.compile(Adam(lr=.00001), loss="categorical_crossentropy", metrics=['accuracy'])

The focus here is to describe the Grad-CAM method, not the training of the model, therefore we leave out the specifics of training. More details on training can be found here. The code below loads a previously prepared model checkpoint (Keras method load_weights) and shows the structure of the model with the summary method. This is useful information, because we have to look for the last convolutional layer; it is needed later.

model.load_weights(os.path.join(model_path,"model_convnet_2021-7-1_callback.h5"))
model.summary()

Now we have to create a new model with two output layers. In Figure 2 you see a simplified representation of the CNN we use. The last layer (in orange) represents the dense layer; the last CNN layer is in green. Due to the filters of the CNN layer we have numerous channels. The Grad-CAM method requires the outputs of these channels. So we need to create a new model, based on the one above, which outputs both the channel outputs of the last CNN layer and the classification decision of the dense layer.

Figure 2: Simplified CNN

The code below iterates through the layers of the CNN model in reverse order and finds the last batch_normalization layer. We modeled the neural network in a way that each CNN layer is followed by a batch_normalization layer, so we pick the batch_normalization layer as an output. The code produces a new model gradModel with model‘s input, and with model‘s last CNN layer (actually the last batch_normalization layer) and last dense layer as outputs.

gradModel = None

for layer in reversed(model.layers):
    if "batch_normalization" in layer.name:
        print(layer.name)
        gradModel = Model(inputs=model.inputs, outputs=[model.get_layer(layer.name).output, model.output])
        break

The Grad-CAM method

The function getGrad below executes the model gradModel with an image passed in as the parameter img. It returns the results (conv, predictions) of the last CNN layer (actually the batch_normalization layer) and the last dense layer. We are interested in the gradients of the last CNN layer’s output with respect to the loss of a specific class. The dictionary classnum (defined above) maps a class name to a class number, which addresses the loss inside predictions, see the command below.

loss = predictions[:, classnum[classname]]

The gradients of the last CNN layer with respect to the loss of a specific class are calculated with tape‘s gradient method. The function getGrad returns the input image, the output of the last CNN layer (conv) and its gradients.

def getGrad(img, classname):

    testpics = np.array([img], dtype=np.float32)

    with tf.GradientTape() as tape:

        conv, predictions =  gradModel(tf.cast(testpics, tf.float32))
        
        loss = predictions[:, classnum[classname]]

    grads = tape.gradient(loss, conv)

    return img, conv, grads

The function getHeatMap below creates a heatmap from the image’s last CNN layer output and its gradients. Inside getHeatMap, the TensorFlow method reduce_mean takes the mean values (pooled_grads) of grads. Each mean value indicates how important the output of the corresponding channel of the last CNN layer is. The function multiplies the mean values with the corresponding CNN layer outputs (convpic) and sums them up into heatmap. Since we want an image to look at, the function getHeatMap rectifies, normalizes and resizes the heatmap before returning it.

def getHeatMap(conv, grads):
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))
    convpic = conv[0]
    heatmap = convpic @ pooled_grads[..., tf.newaxis]
    heatmapsq = tf.squeeze(heatmap)
    heatmapnorm = tf.maximum(heatmapsq, 0) / tf.math.reduce_max(heatmapsq)
    heatmapnp = heatmapnorm.numpy()
    heatmapresized = cv2.resize(heatmapnp, (dim[0], dim[1]), interpolation = cv2.INTER_AREA)*255
    heatmapresized = heatmapresized.astype("uint8")
    return heatmapresized

Testing the images

The apple, orange and tomato images are stored in separate directories. The function testgrads, see the code below, collects the image paths from the test directory into the lists pics0, pics1 and pics2, which are then concatenated into the list pics.

The function testgrads normalizes the list of images and predicts their classifications (variable predictions). It calls the getGrad function and the getHeatMap function to receive a heatmap for each image. NumPy’s argmax function outputs a number which indicates whether the image is an apple, an orange or a tomato and moves the result into pos. Finally the heatmap, which is a grayscale image, is converted into a color image (apple heatmaps are colored green, orange heatmaps blue and tomato heatmaps red). The function testgrads then returns a list of colored heatmaps, one for each image.

def testgrads(picdir):

    pics0 = [os.path.join(picdir, classes[0], f) for f in os.listdir(os.path.join(picdir, classes[0]))]
    pics1 = [os.path.join(picdir, classes[1], f) for f in os.listdir(os.path.join(picdir, classes[1]))]
    pics2 = [os.path.join(picdir, classes[2], f) for f in os.listdir(os.path.join(picdir, classes[2]))]
  
    pics = pics0 + pics1 + pics2

    imagelist = []
    
    for pic in pics:
        img = np.zeros((dim[0], dim[1],3), 'uint8')
        img = cv2.resize(cv2.imread(pic,cv2.IMREAD_COLOR ), (dim[0], dim[1]), interpolation = cv2.INTER_AREA)
        imagelist.append(img)
    
    train_data = np.array(imagelist, dtype=np.float32)
    
    train_data -= train_data.mean()
    train_data /= train_data.std()
    
    predictions = model.predict(train_data)
    
    heatmaps = []
    
    for i in range(len(train_data)):   
        heatmapc = np.zeros((dim[0], dim[1],3), 'uint8')
        pos = np.argmax(predictions[i])
        img, conv, grads = getGrad(train_data[i], classes[pos])
        heatmap =  getHeatMap(conv, grads)
        
        if pos == 0:
            posadj = 1
        elif pos == 1:
            posadj = 0
        elif pos == 2:
            posadj = 2    
        
        heatmapc[:,:,posadj] = heatmap[:,:]    
        
        heatmaps.append(heatmapc)
    
            
    return imagelist, heatmaps

Below is the code which calls the function testgrads. The parameter testdir is the path of the test images.

imagelist, heatlist = testgrads(testdir)

Result

In Figure 3 you find three images (apple, orange and tomato) passed through the testgrads function, see the top row. In the middle row, you find the outputs of the testgrads function; these are the visualized heatmaps. The bottom row of Figure 3 shows images which were merged with OpenCV’s addWeighted function (a small sketch follows Figure 3). The heatmap pixels indicate which group of pixels of the original image led to the prediction decision. You see that the black heatmap pixels indicate that the background pixels do not contribute to the decision. The same is true for the fruit stems.

Figure 3: Original and heatmap images
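For completeness, the merging of the original images with their heatmaps can be done with OpenCV’s addWeighted function. A minimal sketch, assuming imagelist and heatlist come from the testgrads call above; the blending weights are an arbitrary choice:

# Blend each original image with its colored heatmap.
overlays = [cv2.addWeighted(img, 0.6, heat, 0.4, 0) for img, heat in zip(imagelist, heatlist)]

# Show the first blended image as an example.
cv2.imshow("overlay", overlays[0])
cv2.waitKey(0)
cv2.destroyAllWindows()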

References

Keras: https://keras.io/examples/vision/grad_cam/

pyimagesearch: https://www.pyimagesearch.com/2020/03/09/grad-cam-visualize-class-activation-maps-with-keras-tensorflow-and-deep-learning/

“Why Should I Trust You?”: https://arxiv.org/pdf/1602.04938.pdf

Fruit Recognition: https://www3.hs-albsig.de/wordpress/point2pointmotion/2020/03/26/fruit-recognition-on-a-raspberry-pi/