This is my second post about a paper I wrote, this time about "Label Stability in Multiple Instance Learning", published at MICCAI 2015. Here you can see a short spotlight presentation from the conference. The paper focuses on a particular type of algorithm that can detect abnormalities in medical scans, and on a potential problem with such algorithms.
What if I told you that I like the words “cat”, “bad” and “that”, but I don’t like the words “dog”, “good” and “this”? Did you spot a pattern? To give you a hint, there is one letter that I like in particular*, and I don’t like any word that doesn’t contain that letter. Note that you are now able to do two things: tell whether I’d like any other word from the dictionary, AND explain why.
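This puzzle is small enough to solve by brute force. Here is a minimal Python sketch (purely illustrative, not the method from the paper) that treats each word as a "bag" of letter "instances" and searches for the letter that separates the two groups:

```python
# Toy illustration: each word is a "bag" of letter "instances".
# A bag is "liked" if it contains at least one key letter.
# This brute-force search is for illustration only.

positive_bags = ["cat", "bad", "that"]   # words I like
negative_bags = ["dog", "good", "this"]  # words I don't like

def find_key_letters(pos, neg):
    """Letters present in every liked word and absent from all disliked words."""
    candidates = set.intersection(*(set(w) for w in pos))
    candidates -= set.union(*(set(w) for w in neg))
    return candidates

def classify(word, key_letters):
    """Bag-level prediction: liked if any letter is a key letter."""
    return any(c in key_letters for c in word)

key = find_key_letters(positive_bags, negative_bags)
print(key)                       # the letter explaining the pattern
print(classify("banana", key))   # True
print(classify("up", key))       # False
```

Like me with my words, the code can now both label a new word and point at the letter responsible, which is exactly the "what" and "why" distinction above.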
You are probably asking yourself what this has to do with medical scans. Multiple instance learning (MIL) algorithms are faced with a similar puzzle: each of the scans in group 1 (the words I like) has abnormalities (a particular letter), and each of the scans in group 2 is healthy. How can we tell whether a scan has abnormalities, and where those abnormalities are? If an algorithm figures this out, it can detect abnormalities in images it hasn’t seen before. This is a toy example of what the output could look like:
Detecting whether there are any abnormalities is slightly easier than finding their locations. For example, imagine that the scan above has the lung disease COPD, and therefore contains abnormalities. Now imagine that our algorithm’s output has changed:
The algorithm still correctly detects that the scan contains abnormalities, but the locations are different. If the locations of the abnormalities were clinically relevant, this would be a problem!
Of course, in the ideal case we would evaluate such algorithms on scans in which the regions with abnormalities have been manually annotated by experts. The problem is that we don’t always have such annotations – otherwise we probably wouldn’t need MIL algorithms in the first place. Therefore, the algorithms are often evaluated only on whether they have detected ANY abnormalities.
In the paper I examined whether we can say something more about an algorithm’s output without having the ground truth. For example, we would expect a good algorithm to be stable: for the same image, slightly different versions of the algorithm should detect the same abnormalities. My experiments showed that an algorithm that is good at detecting whether there are any abnormalities isn’t necessarily stable. Here is an example:
Here I compare different algorithms – represented by points of different colors – on the task of detecting COPD in chest CT scans. The y-axis measures how good an algorithm is at detecting COPD (whether there are any abnormalities) – the higher the value, the better. This is typically the measure researchers in my field would use to choose the “best” algorithm.
I proposed also examining the quantity on the x-axis, which measures the stability of the detections: a value of 0.5 means that multiple slightly different versions of the same algorithm agree on only 50% of the abnormalities they detected. Now we can see that the algorithm with the highest performance (green square) isn’t the most stable one. If the locations of the abnormalities are clinically relevant, it might be a good idea to sacrifice a little performance by choosing a more stable algorithm (blue circle).
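The agreement idea can be sketched in a few lines. Note that the overlap measure below (a Jaccard index over the sets of flagged regions) and the example numbers are my illustrative assumptions here, not necessarily the exact definition used in the paper:

```python
# Illustrative stability measure: the fraction of flagged abnormal
# regions that two slightly different versions of the same algorithm
# agree on. The Jaccard overlap used here is an assumption for
# illustration; the paper's exact definition may differ.

def detection_agreement(detected_a, detected_b):
    """Jaccard overlap between two sets of flagged region indices."""
    a, b = set(detected_a), set(detected_b)
    if not a and not b:
        return 1.0  # both versions flag nothing: perfect agreement
    return len(a & b) / len(a | b)

# Hypothetical regions flagged by two versions of the same algorithm:
run_1 = [3, 7, 12, 19]
run_2 = [3, 7, 14, 21]

print(detection_agreement(run_1, run_2))  # 2 shared of 6 flagged -> ~0.33
```

An agreement near 1.0 means the versions point at the same locations; values like 0.33 mean the bag-level answer may be right while the localization is largely unreproducible.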
A more general conclusion: think carefully about whether your evaluation measure really reflects what you want the algorithm to do.