In my PhD I am working on multiple instance learning (MIL). This is a weakly supervised classification problem where only groups (bags) of examples (instances) are labeled. For example, a bag could be the set of regions of interest in a patient’s CT scan, labeled with the overall diagnosis, while the labels of the individual regions of interest are not known.
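To make the bag/instance setup concrete, here is a minimal Python sketch. The toy feature values and the names `bags`, `bag_labels` and `bag_label_from_instances` are made up for illustration and are not taken from any of my datasets. The helper encodes the commonly used "standard" MIL assumption (a bag is positive if at least one of its instances is positive), which is only one of several assumptions an application can make.

```python
# Hypothetical toy data: each bag is a list of instance feature vectors.
# The numbers are invented purely for illustration.
bags = [
    [(0.1, 0.9), (0.4, 0.2), (0.8, 0.7)],  # e.g. regions of interest from one scan
    [(0.3, 0.3), (0.2, 0.1)],              # bags may contain different numbers of instances
]

# Only one label per bag is available for training; instance labels are hidden.
bag_labels = [1, 0]


def bag_label_from_instances(instance_labels):
    """Standard MIL assumption: a bag is positive if any instance is positive.

    In a real MIL problem the instance labels are unknown; this function only
    shows how the bag label would relate to them under this assumption.
    """
    return int(any(label == 1 for label in instance_labels))
```

Other applications may need a different rule, for example one where a bag is positive only if a certain fraction of its instances is positive, and that choice can change which algorithms work well.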
This type of problem arises in many different applications, such as image classification, text classification, and molecule activity prediction. I try to find as many of these applications as possible, and to compare the assumptions made in each problem and how these affect algorithm performance. To do these experiments, I collected several datasets and converted them into the same MATLAB format. I have now also created a dataset repository, where these datasets can be downloaded:
It’s not a long list, but there are already more datasets than 90% of the papers use 🙂 I will be adding more in the future (if I get permission from the original authors, that is).