Diabetic Retinopathy Grading

Brief Description

This project is a term project for the Medical Image Analysis course offered in Spring 2022 by Prof. Nirmalya Ghosh at IIT Kharagpur. Sneha and I teamed up to complete this project.

The final report of the project is accessible here: Report

The final slides of the project are accessible here: Slides

Problem

Diabetic Retinopathy (DR) is a complication of diabetes that affects the eye. DR is the major cause of blindness in India, which accounts for 30% of DR cases in the world. Early diagnosis of DR can reduce the risk of blindness by 90%. DR can be identified using the following features:

  1. Hemorrhages
  2. Exudates
  3. Abnormal growth of blood vessels
Now let's see how DR can be classified into different grades. The following table shows the classification of DR.

Classification of DR into severity grades

| Class | Label | Description |
| --- | --- | --- |
| No DR | 0 | No presence of DR features; clean eye |
| Mild | 1 | At least one microaneurysm present on retinal exam |
| Moderate | 2 | Multiple microaneurysms, dot-and-blot hemorrhages, venous beading, and/or cotton wool spots |
| Severe | 3 | Cotton wool spots, venous beading, and severe intraretinal microvascular abnormalities (IRMA) |
| Proliferate | 4 | Growth of new blood vessels, bleeding blood vessels, retinal detachment |

(Example fundus images for each grade are omitted.)

Methods

Method 1 : Deep Feature Extraction + External Classification

The first method extracts features from the images using the pretrained Inception V3 model. Inception V3 is trained on the ImageNet dataset and can be used as a feature extractor. The extracted features are used as inputs to different ML classifiers. The statistical results are tabulated for each classifier with and without data augmentation; a sketch of the pipeline is shown below.
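Below is a minimal sketch of this pipeline, assuming torchvision and scikit-learn; the preprocessing values and the classifier settings are illustrative assumptions, not the project's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image
from sklearn.ensemble import RandomForestClassifier

# Load Inception V3 pretrained on ImageNet and drop its classification head,
# so a forward pass in eval mode returns the 2048-d pooled feature vector.
inception = models.inception_v3(pretrained=True)
inception.fc = nn.Identity()
inception.eval()

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),  # Inception V3 expects 299x299 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(image_paths):
    feats = []
    for path in image_paths:
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(inception(x).squeeze(0).numpy())
    return feats

# The extracted features become ordinary tabular inputs for any ML classifier.
# train_paths / train_labels are placeholders for the dataset's image list:
# clf = RandomForestClassifier(n_estimators=200)
# clf.fit(extract_features(train_paths), train_labels)
```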

Evaluation metrics for DR binary classification - without data augmentation

| Classifier | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| KNN | 0.7848 | 0.79 | 0.78 | 0.78 |
| SVM | 0.7818 | 0.78 | 0.78 | 0.78 |
| Random Forest | 0.8091 | 0.81 | 0.81 | 0.81 |
| XGBoost | 0.8121 | 0.81 | 0.81 | 0.81 |
| DNN | 0.7787 | 0.77 | 0.77 | 0.77 |

From the results above it is clear that Random Forest and XGBoost perform well, with XGBoost outperforming by a slight margin. Letting the model learn image invariances can lead to improved performance, so the following data augmentations are performed to increase the performance of the model (see the sketch after this list):

  • Rotation
  • Resizing
  • Gaussian Noise
  • Histogram Equalization
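A minimal sketch of these four augmentations, assuming a recent torchvision; the rotation angle, target size, and noise level are illustrative assumptions rather than the project's exact settings.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Adds zero-mean Gaussian noise to a tensor image."""
    def __init__(self, std=0.01):
        self.std = std
    def __call__(self, x):
        return (x + torch.randn_like(x) * self.std).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.RandomEqualize(p=0.5),       # histogram equalization (applied to the PIL image)
    transforms.RandomRotation(degrees=15),  # rotation
    transforms.Resize((299, 299)),          # resizing to the network's input size
    transforms.ToTensor(),
    AddGaussianNoise(std=0.01),             # Gaussian noise
])
```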

The results of the models with data augmentation are as follows.

Evaluation metrics for DR binary classification - with data augmentation

| Classifier | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- |
| KNN | 0.9064 | 0.90 | 0.90 | 0.90 |
| SVM | 0.9102 | 0.91 | 0.91 | 0.91 |
| Random Forest | 0.9707 | 0.97 | 0.97 | 0.97 |
| XGBoost | 0.9707 | 0.97 | 0.97 | 0.97 |
| DNN | 0.9757 | 0.97 | 0.97 | 0.97 |

Our data augmentations worked well and significantly improved the performance of the models. It is obvious from the results that the DNN is the best performing classifier, with 97.57% accuracy. Now we will look into different deep neural networks for end-to-end classification. Remember that we have not yet addressed our main problem, multi-class classification.

Method 2 : End-to-End Deep Neural Net based Classification

Our approach now is to use state-of-the-art deep neural networks for end-to-end classification. For this task we can either fine-tune a pretrained network or train from scratch.

  • Training from scratch: This method works well only if we have a large set of images (we have only 2667 training images for binary classification and 2963 for multi-class classification).
  • Fine-tuning: This method works well even with a small set of images (2667 and 2963 are good enough to produce reasonable results). All fine-tuned models are pretrained on the ImageNet dataset, a large dataset with over 1 million images and 1000 categories.
The following architectures are used, with the hyperparameters and training recipe mentioned below.
  1. VGG - VGG19 is a standard 19-layer convnet that achieved state-of-the-art results on ImageNet in 2014. It is commonly used as a feature extractor for downstream tasks such as image classification, detection, and segmentation.
  2. ResNet - ResNet builds upon the VGG network. It utilizes skip connections that jump over layers, which helps gradient propagation and local-global feature interactions. We will be using ResNet18, the 18-layer variant.
  3. EfficientNet - EfficientNet builds on the premise that appropriate depth, width, and resolution of a network are essential for best performance. It uses compound depth, width, and resolution scaling factors to develop a network that is efficient for a given task. We will be using EfficientNet-B0, the base network.
  4. ConvNeXt - Following Vision Transformers, ConvNeXt combines the best practices from both natural language and vision into a set of ConvNeXt variants. ConvNeXt models are the current state-of-the-art convolutional neural networks for image classification on the benchmark ImageNet dataset. We will be using the ConvNeXt-Tiny variant for this task.
The models are trained on a Tesla T4 GPU (16 GB memory) on AWS SageMaker Studio Lab. Binary/categorical cross-entropy loss is used as the cost function to penalize false classifications. The Adam (adaptive moment estimation) optimizer with learning rate 3e-4 and momentum 0.9 is used to optimize the error. Batch size is chosen so that the GPU's memory capacity is not exceeded; batch sizes of 32, 64, and 128 are used. A sketch of this setup is shown below.
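A minimal sketch of this fine-tuning setup in PyTorch, using ResNet18 as the example backbone; the dataloader and the rest of the training loop are assumed.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 5                      # 2 for the binary task, 5 for multi-class
model = models.resnet18(pretrained=True)            # ImageNet-pretrained backbone
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new classification head

criterion = nn.CrossEntropyLoss()    # categorical cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999))

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def train_one_epoch(loader):
    model.train()
    for images, labels in loader:    # batch size (32/64/128) is set in the DataLoader
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```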

We tried training from scratch, but it did not go well; the maximum accuracy achieved was ~90%. If time permits, I will post those results as well (the experiments would have to be re-run). Instead, we fine-tune the pretrained networks, replacing their final layers to match the number of classes. Following are the results for binary classification ({No DR, DR}) and multi-class classification ({No DR, Mild DR, Moderate DR, Severe DR, Proliferate DR}) using fine-tuning.

Evaluation metrics for binary DR classification - using fine-tuning

| Model | Accuracy | Precision | Recall | F1 Score | False classifications (out of 800) |
| --- | --- | --- | --- | --- | --- |
| VGG19 | 0.977 | 0.98 | 0.98 | 0.98 | 18 |
| ResNet18 | 0.974 | 0.97 | 0.97 | 0.97 | 21 |
| EfficientNet-B0 | 0.974 | 0.97 | 0.97 | 0.97 | 21 |
| ConvNeXt-Tiny | 0.981 | 0.98 | 0.98 | 0.98 | 13 |

From the above results we can see that ConvNeXt-Tiny is the best performing model for binary classification, with an accuracy of 0.981 and only 13 false classifications out of 800.
Evaluation metrics for multi-class DR classification - using fine-tuning

| Model | Accuracy | Precision | Recall | F1 Score | False classifications (out of 704) |
| --- | --- | --- | --- | --- | --- |
| VGG19 | 0.9644 | 0.96 | 0.92 | 0.94 | 25 |
| ResNet18 | 0.99 | 0.98 | 0.98 | 0.98 | 7 |
| EfficientNet-B0 | 0.997 | 0.99 | 0.99 | 0.99 | 2 |
| ConvNeXt-Tiny | 0.995 | 0.99 | 0.99 | 0.99 | 3 |

From the above results we can see that EfficientNet-B0 is the best performing model for multi-class classification, with an accuracy of 0.997 and only 2 false classifications out of 704.

Model Understanding

We want a model that not only classifies DR but also gives reasoning for its classifications. There are quite a few popular methods that can be used to understand a model; we will be using visual explanations: saliency maps, gradient-based attributions, DeepLIFT, and guided backpropagation. A sketch of how such attributions can be computed is shown below, and the following table shows the outputs of the visual explanations from the model.
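A minimal sketch of computing such attributions with the Captum library; `model` is the trained classifier and `image` a preprocessed input tensor, both assumed here.

```python
import torch
from captum.attr import IntegratedGradients, DeepLift, Saliency

def explain(model, image, target_class):
    """Computes saliency, integrated-gradients, and DeepLIFT attributions
    for one input image with respect to the given target class."""
    model.eval()
    image = image.unsqueeze(0).requires_grad_()

    saliency = Saliency(model).attribute(image, target=target_class)
    ig = IntegratedGradients(model).attribute(
        image, target=target_class, baselines=image * 0)  # black-image baseline
    dl = DeepLift(model).attribute(
        image, target=target_class, baselines=image * 0)

    # Each result has the input's shape; overlay it on the fundus image as a heatmap.
    return saliency, ig, dl
```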

Model interpretation for each class using Integrated Gradients and DeepLIFT: for each class (Mild, Moderate, Severe, and Proliferate DR), the input image, overlaid gradient magnitudes, overlaid integrated gradients, and DeepLIFT attributions are shown. (Attribution images omitted.)
From the above attributions we can see where the model is looking when it produces a classification result.

Conclusion

The obtained results indicate that modern deep networks outperform traditional methods by significant margins (even without data augmentation). The train and test sets are obtained from the APTOS 2019 Blindness Detection dataset. The shown results are not representative of the real world: they are obtained on a small set of images, and 99.7% accuracy does not mean that the DR recognition problem is solved. Gaps still exist in taking these models to the real world. Following is a list of a few (out of many) problems.

Problems
  1. Expert labels are very costly. Hoping to create a large dataset (millions of images) with expert labels is an infeasible task.
  2. Noisy labels create a mess for the models; the quality of the labels is of prime importance.
  3. Data bias is a huge problem in the medical domain. Models are often trained on a single dataset collected from one source / hospital / region, which induces a local bias in the model. Model performance hardly matters if the model cannot work well on a different dataset.
Forward directions
Researchers are currently working towards the following solutions.
  1. Since expert labels are costly, developing robust and high-performance unsupervised / semi-supervised algorithms can be a game changer.
  2. Training models on multiple datasets can make them more robust. Federated learning can be used to train models on multiple datasets: the model is trained in a distributed process where the data stays securely at its origin (see the sketch below).
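As a rough illustration of the federated averaging (FedAvg) idea, here is a minimal sketch; `local_train_fn` and the per-site loaders are assumed placeholders, not part of this project.

```python
import copy
import torch

def federated_round(global_model, local_loaders, local_train_fn):
    """One round of FedAvg over several hospitals/sites."""
    local_states = []
    for loader in local_loaders:
        local_model = copy.deepcopy(global_model)  # each site starts from the global weights
        local_train_fn(local_model, loader)        # train locally; raw data never leaves the site
        local_states.append(local_model.state_dict())

    # Only the model weights travel to the server, where they are averaged
    # into the new global model.
    avg_state = copy.deepcopy(local_states[0])
    for key in avg_state:
        avg_state[key] = torch.stack(
            [state[key].float() for state in local_states]).mean(dim=0)
    global_model.load_state_dict(avg_state)
    return global_model
```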