Sarica et al.


Advanced Feature Selection in Multinominal Dementia Classification from Structural MRI Data

Alessia Sarica 1, Giuseppe Di Fatta 2, Garry Mark Smith 2,3, Mario Cannataro 1, and James Douglas Saddy 3, for the Alzheimer’s Disease Neuroimaging Initiative 4

  1. Department of Medical and Surgical Sciences, Magna Graecia University of Catanzaro, Italy, sarica,cannataro@unicz.it
  2. School of Systems Engineering, University of Reading, UK, g.difatta,g.m.smith@reading.ac.uk
  3. Centre for Integrative Neuroscience and Neurodynamics (CINN), University of Reading, UK, j.d.saddy@reading.ac.uk
  4. Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

Summary

This algorithm employs a fully automatic feature selection approach for discovering the best subset of attributes in terms of classification accuracy. A workflow for this purpose has been implemented on the Konstanz Information Miner (KNIME) 2.9 [1]. The K-Surfer plugin [2] has been used for importing FreeSurfer [3] data into KNIME, and the R (Interactive) plugin has been used for integrating R with KNIME.

KNIME workflow designed for implementing the proposed algorithm.

1. MRI Feature Generation

Explanation: The CADDementia MRIs have been pre-processed (segmentation and reconstruction) in order to obtain structural measures of brain areas.

Software: FreeSurfer 5.2 [4] has been used for pre-processing MRIs.

Command:

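# Run the full FreeSurfer pipeline on every test_*.nii.gz file
for f in /studies/2014/CADDementia/test_*.nii.gz; do
  id=$(basename "$f" .nii.gz)   # subject ID = file name without the .nii.gz extension
  recon-all -i "$f" -subjid "$id" -sd /studies/2014/CADDementia -all -hippo-subfields
done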

2. Selection of subjects from ADNI

The data for training our algorithm have been obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The ADNI cohort consists of three protocols: (i) ADNI1, containing control, Alzheimer’s disease (AD) and Mild Cognitive Impairment (MCI) subjects; (ii) ADNIGO, which extends ADNI1 with a new group of participants identified as Early Mild Cognitive Impairment (EMCI); and (iii) ADNI2, which assesses participants from ADNI1/ADNIGO in addition to new subjects affected by MCI, now called Late MCI (LMCI). Besides imaging resources, ADNI provides cortical reconstruction and volumetric segmentation performed by FreeSurfer, using both longitudinal and cross-sectional processing. In particular, ADNI1 MRIs, acquired by 1.5T scanners, have been processed with FreeSurfer version 4.3, while ADNIGO and ADNI2 MRIs, acquired by 3T scanners, have been processed with FreeSurfer version 5.1. In this study, the criterion for choosing the dataset from ADNI is its similarity to the CADDementia data. Thus, only T1-weighted MRIs acquired by 3T scanners and processed by FreeSurfer 5.1 have been chosen. In particular, two text files were considered:

  • UCSFFSX51_ADNI1_3T_02_04_14.csv
  • UCSFFSX51_02_04_14.csv

containing subjects from ADNI1 and from ADNI1/ADNIGO/ADNI2, respectively. Two more text files:

  • ADNI BaselineList 3T_8_28_12.csv
  • ADNIMERGE.csv

were used for extracting gender, age and diagnosis of the subjects.

  • Only scans of the baseline visit have been included.

Other measures, such as the MMSE, Hachinski or NPI-Q scores, have not been considered here, since they are not provided for the CADDementia dataset. The database has been filtered using KNIME version 2.9. In a first phase, rows have been filtered by:

  • including those subjects that have the attribute NL or CN (both stand for healthy control, here renamed HC), AD (Alzheimer’s disease), MCI (Mild Cognitive Impairment in ADNI1, called Late Mild Cognitive Impairment in ADNIGO and ADNI2) or LMCI (Late Mild Cognitive Impairment) in the columns related to the diagnosis;
  • including those subjects that have the attribute Pass in the columns related to the quality of the processing phase, in both ADNI1 and ADNI1_GO_2;
  • including only those subjects that have the attribute bl in column VISCODE in the ADNI1 dataset and the attribute v04 in column VISCODE in the ADNI1_GO_2 dataset; both values identify the baseline scan;
  • including only those subjects that have the attribute complete in column STATUS in the ADNI1_GO_2 dataset;
  • including only those subjects that have the attribute Non-Accelerated T1 in column IMAGETYPE in the ADNI1_GO_2 dataset;
  • excluding those subjects with missing values.
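The row filters listed above can be sketched in base R on a toy table. This is only an illustration, not the KNIME workflow itself: the table contents are invented, and the column names Diagnosis and OVERALLQC are assumptions (VISCODE, STATUS and IMAGETYPE are the names given in the text).

```r
# Toy stand-in for the ADNI1_GO_2 table (invented values).
# Diagnosis and OVERALLQC are assumed column names.
adni <- data.frame(
  RID       = 1:6,
  Diagnosis = c("CN", "NL", "AD", "LMCI", "EMCI", "AD"),
  OVERALLQC = c("Pass", "Pass", "Pass", "Pass", "Pass", "Fail"),
  VISCODE   = "v04",
  STATUS    = "complete",
  IMAGETYPE = "Non-Accelerated T1",
  Volume1   = c(1.0, 1.1, NA, 0.9, 1.2, 0.8),
  stringsAsFactors = FALSE
)

# Keep rows that satisfy every inclusion criterion
keep <- adni$Diagnosis %in% c("NL", "CN", "AD", "MCI", "LMCI") &
  adni$OVERALLQC == "Pass" &
  adni$VISCODE   == "v04" &
  adni$STATUS    == "complete" &
  adni$IMAGETYPE == "Non-Accelerated T1"
filtered <- adni[keep, ]

# Exclude subjects with missing values
filtered <- filtered[complete.cases(filtered), ]

# NL and CN both denote healthy controls, renamed HC
filtered$Diagnosis[filtered$Diagnosis %in% c("NL", "CN")] <- "HC"
```

In this toy table, subject 3 is dropped for a missing value, subject 5 for its EMCI diagnosis and subject 6 for failing quality control.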

In a second phase, the two datasets, ADNI1 and ADNI1_GO_2, have been concatenated and the columns renamed to follow the FreeSurfer naming convention. Subjects from class LMCI were renamed MCI and the dataset was split into three sub-datasets: HCvsAD, HCvsMCI and ADvsMCI. The three groups have been randomly sampled (with a fixed seed) by diagnosis, in order to obtain a balanced dataset and prevent the classification algorithm from favouring the larger class. The final dataset has a total of 210 subjects (70 for each class) and 200 columns (197 features plus Diagnosis, Gender and Age).

For this phase, the following KNIME workflow has been designed:

KNIME workflow designed for selecting subjects from ADNI database.
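As a rough illustration of the balancing step, the following base R sketch draws 70 subjects per class with a fixed seed and then builds the three pairwise sub-datasets. The toy data, the seed value and the order of operations (sampling before splitting) are assumptions; the original step was carried out with KNIME nodes.

```r
set.seed(1)        # hypothetical seed; the value used in the workflow is not reported
n_per_class <- 70  # class size reported in the text

# Toy data with unbalanced classes (invented sizes and values)
toy <- data.frame(
  Diagnosis = rep(c("HC", "MCI", "AD"), times = c(120, 150, 80)),
  Feature   = rnorm(350)
)

# Draw the same number of row indices from every diagnosis group
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$Diagnosis),
                     function(rows) sample(rows, n_per_class)))
balanced <- toy[idx, ]

# Pairwise sub-datasets used by the binary classifiers
HCvsAD  <- balanced[balanced$Diagnosis %in% c("HC", "AD"), ]
HCvsMCI <- balanced[balanced$Diagnosis %in% c("HC", "MCI"), ]
ADvsMCI <- balanced[balanced$Diagnosis %in% c("AD", "MCI"), ]
```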

3. Feature Selection and Classification Model Inference

The adopted workflow is composed of five steps:

  1. IntraCranial Volume normalization;
  2. Feature Selection with three techniques;
  3. Z-Score normalization;
  4. Binary classification;
  5. Multi-class classification.


1. IntraCranial Volume normalization. Each volume is divided by the total intracranial volume of the subject. The following KNIME workflow has been designed to perform this step automatically.

KNIME workflow designed for applying the Intra Cranial Volume normalization.
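A minimal base R sketch of this normalization is given below. The column names and values are illustrative assumptions, as is the choice to leave non-volumetric measures (here a thickness column) untouched.

```r
# Toy FreeSurfer-style table; column names and values are illustrative only
fs <- data.frame(
  Diagnosis               = c("HC", "AD"),
  Left.Hippocampus        = c(4000, 3000),
  Right.Hippocampus       = c(4200, 3100),
  lh_entorhinal_thickness = c(3.4, 2.8),
  IntraCranialVol         = c(1600000, 1500000)
)

# Columns holding volumes (assumed to be known in advance)
volume_cols <- c("Left.Hippocampus", "Right.Hippocampus")

# Divide each volume by the intracranial volume of the same subject
fs_norm <- fs
fs_norm[volume_cols] <- lapply(fs[volume_cols],
                               function(v) v / fs$IntraCranialVol)
```

After this step the volumes are expressed as fractions of the intracranial volume, which makes them comparable across subjects with different head sizes.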

2. Feature Selection. The core of the advanced feature selection consists of sequentially applying (a) a Correlation filter, (b) a Random Forest (RF) filter and (c) a Support Vector Machines (SVM) wrapper to the training dataset, in order to identify the subset of features that provides the highest binary classification accuracy. Four different combinations of these feature selection techniques are considered and tested.

A KNIME workflow has been designed for automatically executing these three feature selection steps, as follows:

KNIME workflow designed for automatically applying advanced feature selection.

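a. Correlation-based filter. For the application of the Correlation filter, we used the R library caret [5]. The following implementation has been written in R and integrated in the KNIME workflow using an R snippet:

library(caret)
d <- knime.in
# Column 1 holds the class label; correlations are computed on the features only
corMat <- cor(d[,-1])
#corrplot(corMat, order = "hclust")
# Indices (relative to d[,-1]) of features to drop, using a 0.90 correlation cutoff
highCor <- findCorrelation(corMat, 0.90)
# Shift the indices by one to address columns of d; keep d as-is if nothing is flagged
dc <- if (length(highCor) > 0) d[,-(highCor + 1)] else d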

b. Random Forest (RF) filter. For the application of the Random Forest filter, we used the R library caret [6]. The following implementation has been written in R and integrated in the KNIME workflow using an R snippet:

library(caret)
d <- knime.in
#Feature selection by Univariate filter
sbf <- sbf(d[,-1],d[,1], sbfControl = sbfControl(functions = rfSBF,verbose = TRUE,method = "cv"))
df <- subset(d, select=c(predictors(sbf)))

c. Support Vector Machines (SVM) wrapper. For the application of the SVM wrapper, we used the R library caret [7]. The following implementation has been written in R and integrated in the KNIME workflow using an R snippet:

library(caret)
d <- knime.in 
#Feature selection by Wrapper
rfem <- rfe(d[,-1],d[,1],sizes = c(2, 5, 10, 30),rfeControl = rfeControl(functions = caretFuncs, verbose = TRUE,number = 10),method = "svmRadial")
dw <- subset(d, select=c(predictors(rfem)))

3. Z-Score normalization. The Z-Score normalization is applied automatically by passing the following argument to the train command used for SVM classification (see the next section):

preProc=c("center","scale")
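For readers unfamiliar with these two options, the following base R sketch reproduces the arithmetic by hand: each feature is centred on its training mean and divided by its training standard deviation, and the same statistics are reused for new data. The toy values and column names are invented; in the workflow itself caret performs this step internally.

```r
# Toy training features and one new subject (invented values)
train_x <- data.frame(vol = c(1, 2, 3, 4, 5), thick = c(10, 20, 30, 40, 50))
new_x   <- data.frame(vol = 3, thick = 60)

# Statistics are estimated on the training data only
mu    <- sapply(train_x, mean)
sigma <- sapply(train_x, sd)

# "center" subtracts the mean, "scale" divides by the standard deviation
z_train <- scale(train_x, center = mu, scale = sigma)
z_new   <- scale(new_x,   center = mu, scale = sigma)
```

Reusing the training statistics for new subjects keeps the test data from influencing the normalization.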

4. Binary classification. For the SVM binary classification, we used the R library caret [8]. The following implementation has been written in R and integrated in the KNIME workflow using an R snippet:

library(caret)
#10-fold cross validation
ten.fold <- trainControl(method = "cv",number = 10, classProbs=TRUE, savePred=TRUE)
d <- knime.in
#SVM radial 
set.seed(2511)
SVMr <- train(Diagnosis~.,d, method="svmRadial",preProc=c("center","scale"), trControl = ten.fold, allowParallel = FALSE, importance=TRUE)

5. Multi-class classification. For the multi-class classification, the one-versus-one (OVO) method has been used. The final output of the OVO method is derived from the probabilities calculated during the previous phase, collected in a score matrix. The Voting Strategy (VOTE) method has been chosen here for its simplicity and robustness.

The KNIME sub-workflow designed for obtaining the score matrix from the binary classification probabilities is shown below:

KNIME workflow designed for automatically applying the multi-class classification by the VOTE strategy.


In particular, the R code for VOTE, integrated in the KNIME workflow by using R snippet, is:

 # One predicted label per subject (one row of scoreMatrix per subject)
 classes <- c("HC","MCI","AD")
 winner <- matrix(data=NA, nrow=nrow(scoreMatrix), ncol=1)
 # Probability stored in the column whose name matches 'name', for subject i
 p <- function(i, name) scoreMatrix[i, grep(name, colnames(scoreMatrix))]
 for(i in seq_len(nrow(scoreMatrix))){
   # Pairwise matrix: votes[k,j] is the probability of class k in the "k vs j" model
   votes <- matrix(c(0, p(i,"HC_MCI.HC"), p(i,"HC_AD.HC"),
                     p(i,"HC_MCI.MCI"), 0, p(i,"AD_MCI.MCI"),
                     p(i,"HC_AD.AD"), p(i,"AD_MCI.AD"), 0),
                   nrow=3, ncol=3, byrow=TRUE)
   # Turn each pair of probabilities into one vote for the more probable class
   for(k in 1:nrow(votes)){
     for(j in 1:ncol(votes)){
       if(k != j){
         if(votes[k,j] > votes[j,k]){
           votes[k,j] <- 1
           votes[j,k] <- 0
         }
         else{
           votes[k,j] <- 0
           votes[j,k] <- 1
         }
       }
     }
   }
   # The class with most votes wins; which.max returns the first class on ties
   winner[i] <- classes[which.max(rowSums(votes))]
 }
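To make the voting rule concrete, here is a small self-contained sketch applied to a single subject with invented probabilities. The helper name vote_ovo and the numbers are hypothetical; the column naming (for example HC_MCI.HC) follows the snippet above.

```r
# Hypothetical helper: majority vote over the three pairwise classifiers.
# p is a named numeric vector with the six pairwise class probabilities.
vote_ovo <- function(p) {
  votes <- c(HC = 0, MCI = 0, AD = 0)
  # Each duel: model name, first class, second class
  duels <- list(
    c("HC_MCI", "HC",  "MCI"),
    c("HC_AD",  "HC",  "AD"),
    c("AD_MCI", "MCI", "AD")
  )
  for (duel in duels) {
    p_first  <- p[[paste0(duel[1], ".", duel[2])]]
    p_second <- p[[paste0(duel[1], ".", duel[3])]]
    # The more probable class of the pair receives one vote
    duel_winner <- if (p_first > p_second) duel[2] else duel[3]
    votes[duel_winner] <- votes[duel_winner] + 1
  }
  # Class with the most votes (first one in case of a tie)
  names(votes)[which.max(votes)]
}

# Invented probabilities for one subject
scores <- c(HC_MCI.HC = 0.30, HC_MCI.MCI = 0.70,
            HC_AD.HC  = 0.60, HC_AD.AD   = 0.40,
            AD_MCI.AD = 0.20, AD_MCI.MCI = 0.80)
predicted <- vote_ovo(scores)
```

Here MCI wins two of the three pairwise comparisons (against HC and against AD), so the subject is labelled MCI.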


References

  • Alessia Sarica, Giuseppe Di Fatta, and Mario Cannataro. K-Surfer: A KNIME Extension for the Management and Analysis of Human Brain MRI FreeSurfer/FSL Data. Brain Informatics and Health, 481-492. Springer International Publishing, 2014.
  • Remi Cuingnet, Emilie Gerardin, Jerome Tessieras, Guillaume Auzias, Stephane Lehericy, Marie-Odile Habert, Marie Chupin, Habib Benali, and Olivier Colliot. Automatic classification of patients with Alzheimer’s disease from structural MRI: A comparison of ten methods using the ADNI database. NeuroImage, 56(2):766-781, 2011. Multivariate Decoding and Brain Reading.
  • Isabelle Guyon and Andre Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157-1182, March 2003.


Contacts

For questions on the method, please contact Alessia Sarica, sarica@unicz.it