In the digital era, numerous images are produced every day, in medicine as elsewhere, and although they can be classified, the segmentation and recognition of the objects they contain are much less detailed. With the recent development of artificial intelligence and convolutional neural network algorithms, it is possible to achieve excellent results. This article analyzes the differences between neural networks in Matlab and their ability to recognize cells in light microscopy. The possibility of segmenting the border and the internal area is also introduced, extending the algorithm from microscopy to the recognition of moles in dermatoscopy.

For several years there has been a need in biology to automate cell recognition. The ability to recognize a specific cell within any microscopic image is much sought after. For decades this task was carried out by human operators, but the exponential increase in the computing power of modern computers has made it possible to introduce different algorithms. Today there are different techniques for identifying objects within images, falling within the field of segmentation, but most of them suffer from problems related to the quality, shape, size, convexity and many other geometric properties of the object of interest. The latest development is the application of neural networks and transfer learning to fully automate segmentation, with excellent performance.



Segmentation is formally defined as "splitting an image into a set of non-overlapping regions that when combined make the whole image." In practice, it is a set of techniques for extracting information from images on the basis of some of their properties, including morphological ones, by assigning to each pixel a label from a set of classes. There are different segmentation methods and principles, and all of them distinguish between foreground and background. The foreground is what is of interest, and it must therefore be separated from the background. There are four main categories of methods:

  • Pixel-based
  • Edge-based
  • Region-based
  • Model-based

To these can be added supervised methods, which use neural networks and transfer learning to automate recognition.

Evaluation metrics

In addition to the starting image, a ground truth (GT) is needed, i.e. a reference mask that indicates which pixels of the image actually belong to the object of interest. By comparing this mask with the binary mask (BW) produced by the segmentation, some quality metrics can be defined.

There will be false positive pixels (FP), i.e. pixels assigned to the object that are not actually part of it. The true positives (TP) are the pixels of the object that are correctly recognized. Finally, the false negatives (FN) are the pixels of the object that are missed. On the basis of these three quantities, two main metrics can be defined.

C M=\frac{T P}{T P+F N}=\frac{T P}{\text { Total area in GT }}\\
C R=\frac{T P}{T P+F P}=\frac{T P}{\text { Total area in BW }}

The completeness (CM) indicates how much of the starting object the segmentation has actually captured, so it penalizes under-segmentation, while the correctness (CR) accounts for over-segmentation. From these two, the F-measure (FM) is introduced as their harmonic mean, which takes both under-segmentation and over-segmentation into account. This parameter lies in the range [0; 1], and the closer it is to 1, the better.

F M=\frac{2 \cdot C M \cdot C R}{C M+C R}\quad  \in \left[0;1\right]
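The metrics above can be computed directly from two logical masks. The following is a minimal sketch in plain Matlab, assuming `GT` is the ground-truth mask and `BW` the binary mask returned by the segmentation. Both are small hypothetical toy arrays of the same size, not data from the article. Substituting the definitions into FM shows that it coincides with the Dice coefficient 2TP/(2TP+FP+FN), and the sketch checks this numerically.

```matlab
% Toy ground truth (GT) and segmentation result (BW): 1 = object, 0 = background
GT = logical([0 1 1 0;
              0 1 1 0;
              0 0 0 0]);
BW = logical([0 1 1 1;
              0 0 1 1;
              0 0 0 0]);

TP = nnz(GT & BW);    % object pixels correctly recognized
FP = nnz(~GT & BW);   % pixels assigned to the object that do not belong to it
FN = nnz(GT & ~BW);   % object pixels that were missed

CM = TP / (TP + FN);          % completeness: TP over the total area in GT
CR = TP / (TP + FP);          % correctness: TP over the total area in BW
FM = 2 * CM * CR / (CM + CR); % F-measure: harmonic mean of CM and CR

fprintf('TP=%d FP=%d FN=%d  CM=%.3f CR=%.3f FM=%.3f\n', TP, FP, FN, CM, CR, FM);
```

Here TP = 3, FP = 2 and FN = 1, so CM = 0.75, CR = 0.6 and FM = 2/3. If TP is 0, both CM and CR are 0 and FM is undefined (0/0), so that case should be handled separately.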

Deep learning

Today, neural networks and machine learning are highly developed, thanks in part to the recent increase in computing power. The introduction of very powerful computers and high-resolution cameras to the consumer market has led to the use of neural networks for image processing.

Machine learning arose as an alternative to traditional programming, to enable the machine to extract features of interest from the data autonomously. Deep learning was developed for this purpose. Its techniques are based on artificial neural networks organized in several layers, each of which computes values that are in turn passed on to the next.

Convolutional networks

Visual data and, more generally, two-dimensional data are often processed with convolutional neural networks (CNNs). These networks consist of one or more convolutional layers with a feed-forward architecture. In other words, the connections between the units do not form cycles, and the information moves in only one direction, forward from the input nodes.

A CNN is therefore an algorithm that takes an image as input and assigns importance to some of its aspects. By analyzing different aspects of different images, it is able to distinguish one from the other. The various CNNs share some common elements in their architecture.

The input layer accepts images of a certain size and with a certain number of channels. The image is then passed to a first convolutional layer, where a first filtering takes place. This extracts low-level features such as edges and sudden changes in brightness.

Next come the pooling layers. They downsample the feature maps, keeping the strongest responses, which makes the features less sensitive to small changes in position and also filters part of the noise. Rectified Linear Unit (ReLU) layers follow a convolutional layer and set the negative values it produces to zero. Finally, the fully connected (FC) layers perform the actual classification.
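The following toy sketch in plain Matlab makes the sequence of layers concrete. It applies the three basic operations to a small 6 x 6 image containing a bright vertical bar. The 3 x 3 kernel is hand-picked to respond to left-to-right brightness changes, whereas a real CNN learns its kernels during training. It is followed by a ReLU and a 2 x 2 max pooling with stride 2. `filter2` computes a correlation, which is the operation convolutional layers actually perform.

```matlab
% 6x6 grayscale image: a bright vertical bar (columns 3-4) on a dark background
I = [zeros(6,2) ones(6,2) zeros(6,2)];

% Hand-picked 3x3 kernel responding to left-to-right brightness changes
K = [-1 0 1;
     -1 0 1;
     -1 0 1];

% Convolutional layer: 'valid' correlation gives a 4x4 feature map
F = filter2(K, I, 'valid');   % rising edge -> +3, falling edge -> -3

% ReLU layer: negative responses are set to zero
R = max(F, 0);

% Max pooling layer: 2x2 windows with stride 2 -> 2x2 output
P = zeros(size(R) / 2);
for r = 1:size(P, 1)
    for c = 1:size(P, 2)
        block = R(2*r-1:2*r, 2*c-1:2*c);
        P(r, c) = max(block(:));
    end
end

disp(F); disp(R); disp(P);
```

The pooled map is half the size in each direction, but it still records that the positive edge response lies in the left half of the image.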

Transfer learning

The complexity of the neural network is evident, and accurate training requires a high computational cost. However, there is a technique for adapting an already trained network to a task other than the one it was trained for. The idea is to take the lower layers of an already trained network and add new final layers adapted to the problem of interest. New training sets and optimization algorithms are then used to adapt the classification to the new problem.


What follows is an analysis of a basic Matlab implementation of some neural networks for cell segmentation. Training and segmentation performance are also evaluated. Matlab provides the Deep Learning Toolbox, a platform for designing and implementing deep neural networks with pre-trained algorithms and models.


The first step in properly training a neural network is to select the images, the classes and the identification masks of what you want to represent. That is, you need the reference images together with masks that identify the objects of interest within them. Here the problem is binary (cell and background), so binary masks are used, with 0 on the background and 1 on the cell.

At the computational level, in Matlab everything refers to a datastore variable. This is a single variable that holds information such as file names and paths without importing the files into the workspace, and therefore without occupying memory. The whole collection is loaded and processed only at the time of actual training.

Dataset 1


The first step in using a pre-trained neural network effectively is to train it correctly. Training a network consists of providing an adequate number of pairs, each containing a training image and the corresponding segmentation result. This allows the network to adjust its parameters and coefficients so that it can perform the required task.

In particular, the goal here is a binary segmentation that separates the foreground from the background, i.e. identifies the cell and distinguishes it from the background. The first analyses use a reference dataset divided into two large sets: the first part is used for training and the second for a subsequent analysis of the network.


The first network to be tested is ResNet50, a convolutional network 50 layers deep. The pre-trained network can classify images into 1000 object categories, and in Matlab it comprises a total of 177 layers. The input layer shows that the network requires 3-channel (RGB) images of size 224 x 224, which must be kept in mind when preparing the training and test datasets. More information can be obtained by reading the structure of the network.


Alternatively, the network can be opened in the Deep Learning Network Analyzer (analyzeNetwork) to explore its layers, which makes its complexity even more evident.

Block diagram of the layers of the neural network

The first step is to consider the reference dataset and what you want the network to achieve. In this case there are two classes to identify (background and foreground). The network is trained with 61 images stored in an ImageDatastore. There are as many ground truths, stored in another datastore, pxds. Besides the GT images, this datastore also records that there are two classes, labeled black and white, i.e. 0 and 1 respectively.

filePath = matlab.desktop.editor.getActiveFilename;
newPath = [fileparts(filePath) filesep]; % folder of the current script
addpath(strcat(newPath,'functions')); %set path for functions

It is also a good idea to pre-process these images so that they are compatible with the input layer of the network. The first step is to generate an augmented dataset. This is a datastore that produces additional images by applying rigid transformations (rotations, reflections and translations) to the starting set. This is done with imageDataAugmenter() and pixelLabelImageDatastore(). The first defines the augmentation operations, while the second combines the starting images and labels with those operations and with the image size and number of channels required by the network.

imds = imageDatastore(fullfile(newPath,'dataset','FRAME_TRAIN'));
pxds = pixelLabelDatastore(fullfile(newPath,'dataset','GT_TRAIN'),["N","B"],[0 1]);

Convolutional networks

The next step is to implement the network. We therefore create a DeepLab v3+ convolutional network that is directly ready for segmentation. Creating it requires the image size, the number of classes and the base network. The resulting network contains additional layers that will be trained.

This convolutional network classifies the individual pixels of the image according to the classes it was trained on. When working with binary classes where black pixels, representing the background, predominate, it is convenient to weight the pixel classification layer according to the frequency of each class. Neglecting this normalization would unbalance the classes and bias the training in favor of the predominant class.

numClasses=2; % foreground and background
net = resnet50; % pre-trained ResNet-50 (requires its support package)
imageSize=net.Layers(1).InputSize; %read size directly from net
augmenter = imageDataAugmenter('RandRotation',[0 360],'RandXReflection',true,'RandXTranslation',[-10 10],'RandYTranslation',[-10 10]);
pximds = pixelLabelImageDatastore(imds,pxds,'DataAugmentation',augmenter,'OutputSize',imageSize,'ColorPreprocessing','gray2rgb');
lgraph = deeplabv3plusLayers(imageSize, numClasses, "resnet50"); 
% this block balances the training with respect to the background and
% foreground classes
tbl = countEachLabel(pximds);
totalNumberOfPixels = sum(tbl.PixelCount);
frequency = tbl.PixelCount / totalNumberOfPixels;
classWeights = 1./frequency;
pxLayer = pixelClassificationLayer('Name','labels','Classes',tbl.Name,'ClassWeights',classWeights);
lgraph = replaceLayer(lgraph,"classification",pxLayer);
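To see what the inverse-frequency weighting above does, consider some hypothetical pixel counts (not taken from the dataset) in which the background is nine times more frequent than the cell. The sketch repeats the same arithmetic on a plain vector instead of the table returned by countEachLabel.

```matlab
% Hypothetical pixel counts: background (majority) and cell (minority)
pixelCount = [900000; 100000];

totalNumberOfPixels = sum(pixelCount);
frequency = pixelCount / totalNumberOfPixels;   % [0.9; 0.1]
classWeights = 1 ./ frequency;                  % about [1.111; 10]

% Weighted pixel count: each class now carries the same total weight
weightedCount = classWeights .* pixelCount;     % both equal totalNumberOfPixels
disp([frequency classWeights weightedCount]);
```

Without weights, a network that labeled every pixel as background would already be 90% accurate on these counts. With the weights, each misclassified cell pixel costs nine times as much as a background pixel, so both classes contribute equally to the loss.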

Training options

The network can then be trained with the trainNetwork() command, which takes:

  • Data store with training data
  • Layers
  • Training options
Diagram of datasets and training settings

Several training options can be customized. The most important are the solver, the number of epochs, the mini-batch size, the number of iterations and some output settings.

The solver optimizes the network parameters. Matlab implements several variants of stochastic gradient descent, an iterative technique for training the network: adam, rmsprop and sgdm (SGD with momentum). They behave differently and lead to different results depending on the problem under consideration.
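The idea behind sgdm can be illustrated on a toy problem. The sketch below minimizes the one-dimensional quadratic E(θ) = (θ − 3)² with the momentum update θ ← θ − α∇E(θ) + γ(θ − θ_prev). The learning rate α = 0.1, the momentum γ = 0.9 and the number of iterations are assumptions chosen for illustration, not values from the article. Unlike in real training, the gradient here is exact rather than estimated on a mini-batch. adam and rmsprop would additionally rescale the step for each parameter using running averages of squared gradients.

```matlab
% Toy loss E(theta) = (theta - 3)^2 and its gradient
gradE = @(theta) 2 * (theta - 3);

alpha = 0.1;   % learning rate (assumed)
gamma = 0.9;   % momentum coefficient (assumed)

theta = 0;        % current parameter
thetaPrev = 0;    % parameter at the previous iteration
for k = 1:300
    % gradient step plus a fraction of the previous update (momentum)
    step = -alpha * gradE(theta) + gamma * (theta - thetaPrev);
    thetaPrev = theta;
    theta = theta + step;
end
fprintf('theta after 300 iterations: %.6f\n', theta);
```

The momentum term makes θ oscillate around the minimum before settling, but it converges to θ = 3.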

Another variable of interest is the mini-batch size, i.e. the subset of the training set over which the network updates its parameters at each iteration. Each complete pass over the training set is called an epoch, and a maximum number of epochs can be set as an upper limit. This limit depends on the size of the network and on how much the transfer learning needs to be refined.

Performance ratings

It is also possible to monitor the progress of training with the option "Plots", "training-progress", which shows the training accuracy and loss.

options = trainingOptions('sgdm', ...
    'MaxEpochs',30, ...
    'MiniBatchSize',8, ...
    'Plots','training-progress'); % plot accuracy and loss during training

The trainNetwork() command can run on the GPU, which speeds up training and allows it to scale to multi-GPU systems. To verify this, performance was evaluated on three systems that differ considerably in both architecture and peripherals.

  • Mac. Software: macOS Monterey. Hardware: 3.1 GHz i5, 8 GB DDR3, SSD NVMe, Iris Plus Graphics 650
  • Win1. Software: Windows 11. Hardware: 2.7 GHz i7-10850H, 32 GB DDR4, SSD NVMe, Nvidia Quadro P620
  • Win2. Software: Windows 11. Hardware: 2.3 GHz i7-11800H, 32 GB DDR4, SSD NVMe, Nvidia RTX 3060 Laptop
Device | Epochs | Iterations | Time | Mini-batch accuracy | Mini-batch loss | Base learning rate

Given the large difference in speed and the advantage of using CUDA on Nvidia cards, the following results refer to the Win1 workstation.


A first analysis can be based on the segmentation metrics computed on a new image dataset. Loading it, passing it to the network for segmentation and evaluating the FM parameter gives a practical estimate of how well the network works.

[net, info]= trainNetwork(pximds,lgraph,options);
GT over original image
Segmented GT over original image
GT vs segmented GT
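for l = 1:length(f_test)
    % f_test is assumed to be a dir() listing of the test images
    testImage = imread(fullfile(f_test(l).folder, f_test(l).name));
    C_test = semanticseg(testImage, net);   % pixel-wise classification
    D = labeloverlay(testImage, C_test);    % overlay the labels on the image
    imshow(D); pause(0.5); drawnow;
    clear C_test D testImage;
end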
FM for the 34 images

It is then possible to see how some training parameters affect the final result and the performance of the network.


A first factor of interest is certainly the solver. Comparing the different solvers under otherwise identical conditions, RMSProp gives a higher accuracy, while SGDM gives a lower loss.

Solver | Time | Accuracy | Loss | Mean (FM) | Max (FM) | Min (FM) | Std (FM)

Although the training metrics favor RMSProp, the segmentation metrics, obtained by actually running the network, favor SGDM and RMSProp, which show higher average values. Moreover, SGDM has not only a higher mean but also a higher peak value and a lower standard deviation.

Trend of the FM parameter with the different test images.

The following tests are carried out starting from these conditions.


Testing the number of epochs instead, it is evident that too low a value makes the network imprecise. Although a maximum of 60 epochs gives a higher accuracy, an equally high FM is obtained with only 20 epochs. That corresponds to 140 iterations instead of 420, and therefore to much faster training.
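The iteration counts follow from the size of the training set. With the 61 training images split into mini-batches of 8, and assuming that trainNetwork discards the incomplete final mini-batch of each epoch (as described in the trainingOptions documentation), each epoch contains 7 iterations:

N_{\text{it/epoch}}=\left\lfloor \frac{61}{8} \right\rfloor = 7 \qquad\Rightarrow\qquad 20 \cdot 7 = 140,\quad 60 \cdot 7 = 420

This matches the 140 and 420 iterations reported above.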

Batch dimensions

The mini-batch size can also be analyzed, and the results show that the initial choice (8) is the optimal one.

Learning rate

Analyzing the initial learning rate, the optimal value is evident. It is therefore advisable to choose the value 0.01, which leads to higher and less dispersed FM values.


Training depends mainly on the quality and size of the training set. There are therefore some tricks for obtaining a larger number of images from the original ones by applying rotations, translations and reflections. The best results are obtained with reflection: with reflection alone, the mean FM values are much higher.

Augmenter options | Mean (FM)
Only rotation | 0.4659
Only reflection | 0.7989
Only translation | 0.755

Furthermore, by investigating the translation it is possible to observe how the best performances are obtained with translations of +/- 50 pixels.

Results applying only the translation augmenter in the interval [5 10 20 50 100].
Results applying all the options for the augmenter with investigation of the translation in the interval [5 10 20 50 100].

When all the available options are combined, i.e. simultaneous rotation, reflection and translation, the results are more stable and still high. The optimal translation window, however, shifts to ±20 pixels. The charts nevertheless show that the variations are small and that the F-measure remains very high.

Pre-trained libraries and networks

The evolution and variation of the segmentation metrics can also be investigated with different networks. In particular, the DeepLab v3+ implementation allows five different base networks:

  • ResNet50
  • ResNet18
  • InceptionResNetV2
  • Xception
  • MobileNetV2

The results show that the best choice is ResNet50. The other networks give less accurate results, with the exception of InceptionResNetV2, which, however, takes 5 times as long to compute.

Edge segmentation

Another topic of interest is the search for the border region of a particular cell. This search can easily be extended to any object present in the image. To show the scalability of these algorithms, a dataset of moles for dermoscopy is considered. Although the images may seem very different, their information content is the same: an object and its border immersed in a background of no interest.

pathImage = strcat(newPath,'dataset\mole\Immagini'); % link for training frame

Starting from the dataset, the images are resized to make them compatible with the network. The training labels are also prepared so that there is 0 on the background, 1 on the border and 2 inside the object.

pxds = pixelLabelDatastore(strcat(pathExport,'GT'),["BACK","BORDER","INSIDE"],[0 1 2]); % link for GT images
imds = imageDatastore(strcat(pathExport,'IMG'));

Training and testing the network makes it possible to verify that it works correctly on a separate test set.


Affine transformations

In this case the problem is no longer binary and the images already have 3 channels (RGB), which slightly changes how the augmented datastore is created. The principle remains the same as in the previous case, but different functions are used.

dsTrain = combine(imds, pxds);
xTrans = [-10 10];
yTrans = [-10 10];
dsTrain = transform(dsTrain, @(data)augmentImageAndLabel(data,xTrans,yTrans));
numClasses = 3; % BACK, BORDER, INSIDE
lgraph = deeplabv3plusLayers(imageSize, numClasses, "resnet50");

After the image datastore and the pixel-label datastore are combined, a transformation function is applied. It applies the same random affine transformation (here a rotation and a translation) to each image and to its labels.

function data = augmentImageAndLabel(data, xTrans, yTrans)
% Apply the same random rotation and translation to each image and its labels
for i = 1:size(data,1)
    tform = randomAffine2d(...
        'Rotation',[0 360],...
        'XTranslation', xTrans, ...
        'YTranslation', yTrans);
    % keep the output view centered on the input image
    rout = affineOutputView(size(data{i,1}), tform, 'BoundsStyle', 'centerOutput');
    data{i,1} = imwarp(data{i,1}, tform, 'OutputView', rout); % image
    data{i,2} = imwarp(data{i,2}, tform, 'OutputView', rout); % pixel labels
end
end

Morphological operators

Some morphological operators are also used to generate the border from the GT, which identifies the entire object. The GT is eroded, and the eroded mask is then added to the original GT. Converting from logical to integer values, this gives:

  • 0 over background with label 'BACK'
  • 1 over object border with label 'BORDER'
  • 2 inside the object with label 'INSIDE'

For the final presentation, a closing operator is also applied to the region identified as internal to the object. The inside of the object is then subtracted from the border to further clean up the image. Optimizing the size of the structuring elements used by the morphological operators increases the precision of the algorithm.
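The label construction described above (the GT plus its erosion) can be sketched in plain Matlab. To keep the example free of dependencies, the erosion with a 3 x 3 square is written with conv2: a pixel survives only if all nine pixels in its neighborhood belong to the object. With the Image Processing Toolbox, imerode(GT, strel('square', 3)) plays the same role, although results may differ for objects touching the image edge because of how the image is padded. The toy mask and the element size are illustrative assumptions. A larger element gives a thicker border.

```matlab
% Toy GT: a 5x5 object (rows/columns 2-6) in a 7x7 image
GT = false(7);
GT(2:6, 2:6) = true;

% Erosion with a 3x3 square: keep pixels whose whole neighborhood is object
se = ones(3);
eroded = conv2(double(GT), se, 'same') == nnz(se);

% Sum of the masks as integers: 0 = BACK, 1 = BORDER, 2 = INSIDE
labels = uint8(GT) + uint8(eroded);

classNames = {'BACK', 'BORDER', 'INSIDE'};
for k = 0:2
    fprintf('%-6s (%d): %d pixels\n', classNames{k+1}, k, nnz(labels == k));
end
```

For this mask the result is 24 background pixels, a 16-pixel border and 9 inside pixels. The integer image can be saved and read back through pixelLabelDatastore with the label IDs [0 1 2], as in the datastore call shown earlier.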


Segmentation with convolutional networks has great potential in the medical field. It is useful in biology: once the cell has been identified, the information it contains can be used for any further analysis. At the same time, it scales easily to the identification of particular objects, such as tumors, spots, fractures and other alterations, that may be present in different medical images.


All the code is available in the GitHub repository:

