Plenteous AI, Exiguous Medical Data

The astonishing success of artificial intelligence (AI) over the past few years in solving complex recognition problems has brought about a disruptive and ongoing revolution. However, one shouldn’t overlook how much of the foundational work that has culminated in this exciting new technology goes back many years. Without perceptrons, backpropagation, neocognitrons, convolutional neural networks, restricted Boltzmann machines, and autoencoders, there would be no "deep learning" as we understand it today. From Frank Rosenblatt to Geoffrey Hinton, thousands of computer scientists, mathematicians, and engineers have invested countless research hours and dollars and published tens of thousands of papers to get us to this point. Ironically, ‘this point’ is where many are now afraid that AI will take over human affairs! While this fear is largely baseless, it is nonetheless a reminder of just how capable AI has become.

One of the most striking attributes of this AI revolution is the set of principles by which so much of its development has been guided. Despite the many breakthroughs achieved, and their significant economic and societal impacts, the AI community has been unflinchingly generous in ensuring the democratization of its assets.

There is no patent on amazingly accurate classifiers like SVMs (support vector machines) or Random Forests. Nor is there one on impressive visualization algorithms like t-SNE (t-distributed stochastic neighbor embedding), or on ingenious refinements such as the Adam optimizer, dropout, stochastic gradient descent, and kernel tricks, on which so many advances have relied. Even more astounding is the absence of intellectual property claims on any of the pre-trained deep networks. These networks were painstakingly trained by passionate young researchers on weekends and evenings over many weeks, months, and years when, by rights, they really should have been having fun or taking a vacation. Yet one can now download and deploy DenseNet, ResNet, EfficientNet, and many other capable AI models entirely free of charge. One may speculate that the AI community has not been driven by a desire to make money, at least not for general-purpose technologies.

It is commonly understood that AI operates best when it is fed large and representative data cohorts. In many fields, acquiring data is feasible without significant obstacles. In most application scenarios, AI researchers can collect their data without being dependent on permissions (e.g., general face recognition, autonomous driving, stock market prediction). In some other fields, the custodians of data, who may or may not be the owners, feel prohibited or inhibited from sharing it (e.g., finance, law, medicine).

Many of us have found hospitals to be frustratingly steadfast in their reluctance to release data. Patient privacy and data security are always mentioned as the main reasons. However, these objections struggle to hold water when thoroughly capable "anonymizer" technologies, such as off-the-shelf healthcare AI/NLP solutions, and highly secure storage and communication channels have been readily available for some time.

There is also a wealth of data out there that comes with patient consent, and significantly more data from retrospective studies that would qualify for a consent waiver. So, what is the real problem? Why, in contrast to the generosity exemplified by the AI community, are hospitals and clinics so unwilling to share the critical data required to meet their peers halfway and foster exciting new advances in healthcare AI? There is no doubt that all of us, including the patients, would benefit immensely from such openhandedness.

"Well, it is not that simple," administrative colleagues from hospitals – colleagues who primarily reside in the commercialization and tech transfer divisions – are quick to inform us. Hospitals, not just private ones, often expect a share of both downstream IP and revenue before agreeing to any release. As long as the data is labeled, this is a perfectly legitimate expectation, particularly coming from the expert physicians. However, calculating the contribution of manually annotated data to IP is no easy task. For instance, a small set of labeled data does not qualify as an "intellectual" contribution when even the most sophisticated AI solution cannot use it for training. But what about unlabeled data? What about millions of records and images that are "raw"? Why are such data not being shared?

The fact of the matter is that data repositories are gold mines. Naturally, many are eager to get their fair share of the "El Dorado". There is no shame in this when done with the best interest of patients at heart and conducted based on the common rules of engagement and ethical principles. However, we should not hide behind patient privacy and data security.

There were many people back in the 1950s who thought Jonas Salk (virologist, 1914-1995) was naive when he decided not to patent his polio vaccine formula. "Could you patent the sun?" he replied to flabbergasted journalists who asked why he would not secure his intellectual property to make money. In 2021, and amid a scary pandemic with countless uncertainties, there is – far and wide – no Jonas Salk of data to be found.

This is a guest post by Dr. Hamid R. Tizhoosh, a Professor in the Faculty of Engineering at the University of Waterloo, where he leads the KIMIA Lab (Laboratory for Knowledge Inference in Medical Image Analysis).
