CS-E109 Final Project - Fall 2014

Suraj Khetarpal

Classifier Prediction Accuracy Based on Monkey Brain fMRI

Please click below to view my Project Screencast, which contains a summary of the project.
You can also click on one of my Process Books. The updated Process Book contains a great deal of additional information and visual aids that are not featured in the Dec 10 submission.

Project Screencast - View Me!!!
Process Book Dec 10 Submission
Process Book Updated! With Bonus Features!
Dataframes

Overview

I am very interested in artificial intelligence and its application to medicine and medical research. I am also excited by the possibility of using machine learning to develop tools that help people better understand the human brain. Hopefully, by better understanding how the brain learns, we will be able to develop better machine learning methodologies (and vice versa). Research in this area can also shed light on how the brain processes information and how it learns. For example, if we have good prediction models for brain activity, we can use them to determine when a baby monkey's brain is starting to recognize different types of visual stimuli, which could in turn help us understand how that brain develops in the first place. The potential benefits are enormous. That is why I decided to make this project about regression analysis and machine learning using fMRI brain scan data. I used fMRI scans that were taken while monkeys were shown different types of visual stimuli; these experiments were done on monkeys rather than humans for legal reasons. Using the results of the scans, I applied regression analysis and machine learning to develop a classifier that can accurately predict the visual stimulus from a monkey's fMRI brain activity.

This project is an analysis of fMRI data that comes from an experiment in which a monkey named Paul was repeatedly shown 4 types of images. His brain activity was recorded via fMRI, and this data was consolidated into files that I have been able to read using Matlab.
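
For readers who prefer Python, here is a minimal sketch of how such a consolidated file could be loaded for analysis, assuming it is a standard MATLAB .mat file. The file name 'paul_fmri.mat' and the variable name 'voxel_timeseries' are hypothetical placeholders, not the actual names used in the project.

    # Minimal sketch: load a consolidated fMRI file into Python.
    # 'paul_fmri.mat' and 'voxel_timeseries' are hypothetical placeholders.
    from scipy.io import loadmat

    data = loadmat('paul_fmri.mat')
    voxels = data['voxel_timeseries']   # e.g. array of shape (n_voxels, n_timepoints)
    print(voxels.shape)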

I have performed a linear regression on the fMRI data to estimate how individual voxels (cubic millimeters of brain tissue) responded to the images. Because the dataset was so large, I focused on a region of the brain known as the Ventral Visual Stream.

I then fed this data into 6 different types of machine learning classifiers and tested their ability to predict image type based on the voxel response values. Because the Ventral Visual Stream contains over 22,000 voxels, I had to select which voxels' data to feed to the classifiers. I tested 7 different methods for selecting voxels, including two that involve using preselected voxels that I knew would give good results. In the end, I was able to achieve prediction accuracies of around 78% without using the preselected voxels. Using preselected voxels, I achieved 93% accuracy.

The Data

The fMRI monkey brain scan data comes from the Livingstone Lab at the Department of Neurobiology at Harvard Medical School. The fMRI data measures blood-flow activity levels for each voxel of the brain, collected at two-second increments. In total, the fMRI files contain over a billion data points! As can be seen in the visual below, the raw data is noisy and tends to drift. One way to deal with this problem is to use linear regression techniques.

After performing a linear regression on the raw fMRI data, I was able to calculate "response" values that measure how much an individual voxel responded to each image shown during the experiment.
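
As a rough illustration of the idea (not the exact design matrix used in the project), each voxel's time series can be regressed on indicator regressors for the image categories plus baseline and drift terms; the fitted coefficients then serve as that voxel's per-category response values. The input names and shapes below are assumptions.

    import numpy as np

    # Hypothetical inputs:
    #   ts          - one voxel's signal over time, shape (n_timepoints,)
    #   stim_onsets - 0/1 indicator matrix, shape (n_timepoints, n_image_types)
    def voxel_responses(ts, stim_onsets):
        n = len(ts)
        drift = np.linspace(0, 1, n)                  # models slow signal drift
        X = np.column_stack([np.ones(n),              # baseline level
                             drift,                   # linear drift term
                             stim_onsets])            # one column per image type
        beta, *_ = np.linalg.lstsq(X, ts, rcond=None)
        return beta[2:]                               # per-image-type response values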

As can be seen in the visual below, it is not easy to differentiate the response values that come from the different image categories.

Analysis

I tested 6 different machine learning classifiers (a minimal setup sketch follows the list):
1. K-Nearest Neighbors
2. Decision Trees
3. Random Forest
4. Support Vector Machines
5. Gaussian Naive Bayes
6. Bernoulli Naive Bayes
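
To make the comparison concrete, a minimal scikit-learn setup for these six classifiers might look like the sketch below. Default hyperparameters are shown; the exact settings used in the project may have differed.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB, BernoulliNB

    # The six classifiers compared in this project, with default settings.
    classifiers = {
        "K-Nearest Neighbors":    KNeighborsClassifier(),
        "Decision Tree":          DecisionTreeClassifier(),
        "Random Forest":          RandomForestClassifier(),
        "Support Vector Machine": SVC(),
        "Gaussian Naive Bayes":   GaussianNB(),
        "Bernoulli Naive Bayes":  BernoulliNB(),
    }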

However, this brings us to a major challenge. It is not effective to feed 22,770 voxels to a classifier. So what do we do?

At this point, I began to think about different ways to narrow down the number of voxels I would use in a machine learning classifier. I wanted to use voxels that are "informative", meaning that they have a strong and consistent response to the experiment stimuli. In the end, I came up with 7 different methods to select voxels, shown below. The third and fourth methods involve variance calculations that I came up with on my own. The fifth method takes the voxel set generated by the fourth method and uses a decision tree's feature-importance ranking to identify the two most important voxels. Finally, the last two methods are in a sense cheating: I use voxels that neuroscientists have determined work well in machine learning applications, especially for distinguishing image type 2.
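
To give a flavor of the non-cheating ideas, the sketch below shows a generic variance-based filter and a decision-tree feature-importance ranking. It assumes a response matrix X of shape (n_trials, n_voxels) and image labels y; it is an illustration, not the project's exact favorable-variance calculation.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def select_by_variance(X, k=20):
        # Keep the k voxels whose response values vary the most across trials.
        variances = X.var(axis=0)
        return np.argsort(variances)[-k:]

    def select_by_importance(X, y, k=2):
        # Fit a decision tree and keep the k voxels it ranks as most important.
        tree = DecisionTreeClassifier().fit(X, y)
        return np.argsort(tree.feature_importances_)[-k:]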

Analysis & Results

After running each set of voxels through each classifier, I came up with the following results:

My results were of mixed quality. Several selection methods performed poorly, including random selection, highest response, and favorable variance. I was surprised that my favorable variance method did so poorly, worse than even random selection! However, selection by favorable image variance did much better, almost matching the neuroscientist-preselected voxels. Another surprise was how well the Important Features method worked; I did not think that two voxels alone could achieve an accuracy of 78%! I was also surprised by how poorly the support vector machine performed, the worst of any classifier. I expected the SVM to perform well because it is known for handling high-dimensional feature spaces, as is the case here with 22,770 potential features. However, the Gaussian Naive Bayes classifier seems to be the ideal classifier for this task.
Naturally, the greatest accuracies were achieved using the preselected voxels after the data was narrowed down to only image types 2 and 3. However, this test had an unfair advantage: with only two image types, a 50% accuracy rate can be reached simply by guessing classes at random. Regardless, when we cheat in this manner, we break the 90% accuracy mark. Too bad we could not do so without the help of prior knowledge.
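
For reference, scoring one selected voxel subset across all of the classifiers can be done with cross-validated accuracy. Here is a minimal sketch that reuses the hypothetical X, y, and classifiers dictionary from the earlier sketches, along with an index array voxel_idx produced by one of the selection methods.

    from sklearn.model_selection import cross_val_score

    # X, y, classifiers, and voxel_idx are assumed to come from the sketches above.
    X_subset = X[:, voxel_idx]

    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X_subset, y, cv=5, scoring="accuracy")
        print(f"{name}: {scores.mean():.2%} accuracy (+/- {scores.std():.2%})")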

In conclusion, if you don't have a neuroscientist to tell you which voxels to use, my Important Features method along with a Gaussian Naive Bayes classifier is your best bet.