
Crunch Time: Constructing a Nationwide Health Database

Data science has recently matured into an industry-changing discipline, driven by exponentially growing computing power, data, and wildly better algorithms. Netflix can predict what will keep you watching with stunning accuracy, Facebook can manipulate users on a massive scale, and Amazon knows what you’re going to buy next before you do. Data science has transformed advertising and financial services, amongst hundreds of other industries. Any field that is predicated on prediction and sits upon large data sets is ripe for optimization.

On the surface, medicine seems like a perfect candidate for such wild improvements—our society generates an ever-accelerating wealth of medical data. The view of each person’s medical life is slowly coming into focus at higher resolution, and our insight into our health has evolved from annual snapshots to a daily feed. In addition to the traditional tests done during a yearly checkup, our picture of our health has come to include a variety of health-tracking apps and devices such as the Fitbit and Apple Watch, which, even in their natal stage, have saved lives. Medicine has begun to fall firmly under the aegis of big data—large, comprehensive, and generated in real time.

It should be noted that large-scale data collection, like any sharp tool, can cut both ways. The recent Strava fiasco illuminates the dark side of big data: when the fitness-tracking company Strava released anonymized activity data from its users, it inadvertently exposed the locations of several US Army bases in the Middle East. Even if data isn’t voluntarily released, it can still be compromised. Equifax recently exposed the identity data of 143 million Americans to hackers. China has all of our federal personnel files. Health insurer Anthem allowed the data of up to 80 million customers to be compromised. History has shown that wherever data aggregates, it can be targeted. Granular tracking of our day-to-day lives could be a boon for our health, but it could also be a nightmare for our privacy.

Both the promise and the dangers of large-scale data collection are acutely highlighted by the burgeoning field of genomics. The genome codes for every cell and interaction in our body, and understanding it—if we are able to—promises a wealth of insight unparalleled in the history of medicine, such as the ability to predict cancers and help prevent heart attacks. As we grow more adept at modifying our own genomes through technologies like CRISPR—a groundbreakingly precise gene-editing technology developed in 2013—interpretation of genomic data will remain the last frontier standing between us and broad genomic health interventions. However, accurate genetic interpretation could also blaze a path towards discrimination and exploitation on an unprecedented scale, from discriminatory insurance premiums to ads targeted at those known to carry a genetic predisposition for addiction.

Such spectacular and terrifying implications remain outside the realm of possibility for now. Although we have sequenced over half a million human genomes, our ability to interpret that information remains nearly non-existent. Data science allows us to discern effect given cause, but in order to do so, we need our causal data—in this case, genomes—to be mapped to the effects we are attempting to predict—the health outcomes. Genome sequencing will continue to grow as its price continues its exponential drop: sequencing the first human genome, completed in 2003, cost $2.7 billion, while by 2014 the cost had fallen to just under $1,000. However, until we can generate genome-outcome pairs at scale, our ability to draw pertinent insights from both genomic and supplementary health data will be handicapped. Current methods of generating actionable genome-outcome pairs are rudimentary at best, often relying on opt-in programs with clunky surveys.
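To make the notion of a genome-outcome pair concrete, here is a minimal sketch in Python of what such a record might look like. The field names, identifiers, and example variants are purely illustrative assumptions, not a real schema or dataset:

```python
from dataclasses import dataclass

# Illustrative sketch of a "genome-outcome pair": variant calls from a
# sequenced genome paired with a recorded health outcome. All names and
# values below are invented for the example.

@dataclass
class GenomeOutcomePair:
    patient_id: str   # anonymized identifier
    variants: dict    # rsID -> copies of the risk allele (0-2)
    outcome: str      # the health outcome we want to predict

training_set = [
    GenomeOutcomePair("p001", {"rs7903146": 2}, "type_2_diabetes"),
    GenomeOutcomePair("p002", {"rs7903146": 0}, "healthy"),
]
```

Assembling millions of records in this shape is exactly the bottleneck: the genomes exist, but linking them to reliable outcomes is what today’s opt-in surveys do poorly.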

Also outside of our current capacities are the preventive benefits of large-scale data analysis—our current system excels at treating sick people, but fails at keeping people healthy. The goal of healthcare is ultimately healthy people, and prevention is integral to maintaining health, yet it is neglected in our current system. Prevention is inherently paired with prediction, and large-scale data analysis offers a solution to the problem of prediction. With enough data we would be able to establish relationships between genes and diabetes, level of activity and heart failure, the excretion of a new enzyme and cancer—in short, we would be able to discern our current heading and adjust accordingly.
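As a sketch of what such prediction looks like in practice, the toy model below fits a logistic regression on fabricated numbers, estimating risk from a gene-variant count and an activity level. It is a minimal illustration of the technique, not a clinical model; a real version would require the large, integrated dataset this article argues for:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [copies of a risk allele (0-2), average daily steps (thousands)]
# All data is fabricated for illustration.
X = np.array([[2, 3.1], [0, 9.4], [1, 5.0], [2, 4.2], [0, 11.0], [1, 8.3]])
y = np.array([1, 0, 1, 1, 0, 0])  # 1 = developed the condition, 0 = did not

model = LogisticRegression().fit(X, y)

# Estimated risk for a new patient: one risk allele, 6,000 steps per day
print(model.predict_proba([[1, 6.0]])[0, 1])
```

The model itself is commodity technology; what is scarce is the labeled data feeding it.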

Although the future holds promise, the current state of health data is fractured. The US healthcare system is split among three groups: payers (insurance companies), providers (hospitals and private practices), and patients. Data is isolated in proprietary silos along these lines: payers hold claims data, while providers hold only the clinical data generated within their own systems.

Additionally, any attempt to aggregate health data in the U.S. faces significant privacy hurdles. Each health insurer has its own silo of data, as does each health provider. Furthermore, health insurers only have access to claims-based data, whereas clinicians have access to full diagnostic data but may not have access to the entire history of the patient. In other words, health insurers may see that they paid for a lipid panel, and that the patient then requested a follow-up with a cardiologist, but they cannot see the actual results of the test. On the other hand, although clinicians do get to see the results, in addition to previous tests within their system, they are not able to see a patient’s entire health history. Different clinicians may operate within different health systems, further fracturing the patient’s data. Although a clinician can place a request for further pertinent health history, the process is far from frictionless.
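The split can be made concrete with a rough sketch: below, the payer’s record and the clinician’s record describe the same lipid panel, yet neither side holds the complete picture. The record shapes and field names are invented for illustration:

```python
from dataclasses import dataclass

# Invented record shapes illustrating the fragmentation described above.

@dataclass
class PayerClaim:            # what the insurer sees
    patient_id: str
    procedure_code: str      # a billing code, e.g. CPT "80061" (lipid panel)
    amount_paid: float       # ...but no actual test results

@dataclass
class ProviderResult:        # what the clinician sees
    patient_id: str
    test_name: str
    ldl_mg_dl: float         # the result itself...
    # ...but only for tests run within this provider's own system

claim = PayerClaim("p001", "80061", 42.50)
result = ProviderResult("p001", "lipid panel", 131.0)
```

Joining these two views requires either the patient or an integrated system like the ones discussed below.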

Given the privacy constraints imposed by HIPAA, the only participants in the system who have access to the entirety of the data are the patients themselves. Therefore, the patient is the key to the quest for a large-scale, integrated database of health outcomes and the fruitful, democratized analysis that would follow. However, fragmentation among health providers complicates the process of data consolidation, even for a patient with the full legal right to the data. Some systems continue to maintain solely paper-based records, and working within such a heavily administered, non-standardized system poses significant challenges.

Payer-provider hybrids, such as Kaiser, offer incremental progress towards integration. Kaiser both insures its population and provides the healthcare it pays for. Not only does this better align incentives, it also allows Kaiser to maintain a more cohesive and interconnected database of patient health data than other insurers or providers.

Although Kaiser is progressive, it still operates as a data silo. In order to move towards a society-scale health database, we need to both implement standards for the storage of medical data and make the process of acquiring one’s full health history seamless. Only at that point will patients be able to begin to offer their anonymized health data for analysis at scale. In order to maximize progress, this data must be accessible to anyone with an internet connection. Any given person has access to open-source machine learning APIs and to cheap, high-performance cloud computing, but these resources are useless without data, the oil of the internet age. Health data should be as democratically accessible as stock data. Only then can the intelligence of crowds and the incessant drumbeat of capitalism turn their formidable focus upon elucidating the mystery of our health. Data is infrastructure, and our society needs to treat it that way. Health would be a great place to start.
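Candidate standards for such storage already exist. HL7 FHIR, for example, exposes health records as JSON resources over a REST API, and the sketch below shows what retrieving a patient’s lab results from a FHIR server could look like. The server URL and patient identifier are placeholders, not a real endpoint:

```python
import requests

# Hypothetical example of pulling lab results from a FHIR server.
# FHIR_BASE and the patient ID are placeholders for illustration.
FHIR_BASE = "https://fhir.example-hospital.org"

resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "p001", "category": "laboratory"},
    headers={"Accept": "application/fhir+json"},
)
resp.raise_for_status()

# Print each lab observation's name and value from the returned bundle
for entry in resp.json().get("entry", []):
    obs = entry["resource"]
    value = obs.get("valueQuantity", {}).get("value")
    print(obs["code"]["text"], value)
```

If every provider spoke a standard like this, assembling one’s full health history would be a matter of a few requests rather than months of paperwork.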

About the Author

Jake Martin '20 is a Senior Staff Writer for the Culture Section of the Brown Political Review. Jake can be reached at jake_martin@brown.edu.
