Protecting Big Data Used for Research
National Institutes of Health Works on Safeguards
As the National Institutes of Health rolls out many "big data" research endeavors involving human genomes, electronic health records and other sensitive data, it's tackling the challenges involved in ensuring security.
"The kinds of data sets that will truly yield the biomedical insights we need are so large that we need to think about how to get that data all in one place and analyze it," says Eric Green, M.D., Ph.D., director of the National Human Genome Research Institute at NIH, in an interview with Information Security Media Group [transcript below].
That means being able to securely provide access to the data to hundreds of scientists around the world, explains Green, who was also recently named NIH's acting director of data science, a new position that's being created because of the great volume of informatics work under way.
NIH is investigating new models for aggregating and integrating the data, including putting the data onto cloud platforms so that researchers can share and access it, he says.
"When you start moving [data] into the cloud, or you move it ... beyond where the data might have originally been generated, it poses challenges and raises lots of questions," he says. "We need to protect that data and make sure it's appropriately secure."
Among the challenges in using big data for research, he says, are validating sources of data, protecting the identities of patients providing information and making sure the researchers accessing the data are properly credentialed.
In the interview, Green also discusses:
- A new NIH scientific data council that will address research challenges, including privacy issues;
- NIH funding for projects that will investigate the use of cloud, encryption and other technologies;
- The impact on NIH projects of inconsistent state privacy and security regulations.
Green has been director of the National Human Genome Research Institute at NIH since 2009. In January, he was named NIH acting associate director of data science, a position for which NIH is currently recruiting. Previously, Green was the scientific director of NHGRI and director of the NIH Intramural Sequencing Center. He holds a Ph.D. in cell biology from Washington University in St. Louis, where he also completed medical school.
NIH and Big Data
MARIANNE KOLBASUK MCGEE: Could you tell us a little bit about your role at NIH and why the new associate director of data science position is being created?
ERIC GREEN: My role at NIH is to direct one of the 27 institutes and centers. I'm the director of the National Human Genome Research Institute. In addition to that, I've now been asked on an interim basis to be the acting associate director for data science as part of a new emphasis on big data and biomedical research, in particular data science as it pertains to the large amounts of biomedical research data being generated by NIH-supported research and other supported research in biomedicine.
MCGEE: Tell us a little bit about the kinds of informatics work under way at NIH and where data privacy and security fits in?
GREEN: There's a large amount of data being generated and informatics research being done at NIH, but an even greater amount takes place outside of NIH, supported by NIH grants and contracts.
If I speak broadly to all NIH-supported research, both being done within the government and outside of the government, I would tell you that we're in the middle of a change in biomedical research. Ten or 15 years ago we were data-poor and analysis-capable. Now, because of major technology advances in a number of areas, we find ourselves data-rich and analysis-poor. We cannot analyze the data as quickly as we generate it because the technologies we now have for generating data are so spectacular.
When it comes to issues of data privacy and security, those are most relevant to biomedical research data that have been generated on human subjects, participants of research projects that are involved in large studies. But in addition, we have lots of data that we're generating of non-human data: mice, flies, worms, yeast, bacteria and so forth.
Emerging Security, Privacy Challenges
MCGEE: Related to the research on human data, as more patient data from electronic health records, genomic sequencing, medical imaging, etc., becomes available to NIH for research projects, what are the emerging challenges related to data security and privacy there?
GREEN: There's great excitement in being able to analyze very large data sets coming out of electronic health records, as you say, and then integrating those with increasing amounts of genomic data and imaging data, even environmental-exposure data and so forth. The kinds of data sets that will truly yield the biomedical insights that we need will have to be so large that we need to think about how to get all that data in one place and analyze it.
Then, of course, you want to have a community of scientists - hundreds of scientists around the world - being able to analyze that data in different and creative ways to get the greatest insights. Moving that data around, therefore, because of its size, becomes a challenge. One of the things we're looking at now, investigating and doing some pilot studies on, are what the different models one might use to have that analysis take place. Perhaps [it's] not by moving the data around but putting the data in one site and having people access it. Perhaps [it's] putting it up in the cloud. But there are lots of questions. We need to protect that data and make sure that it's appropriately secure ... because that's what we promised the individuals who are participants.
But when you start moving it into the cloud, or you move it to different sites, or you start moving it beyond where that data might have originally been generated, it poses challenges and raises many questions. Part of the new initiative that NIH will be pursuing now is to investigate new models for being able to get the kind of power of the studies I just described to you - aggregating all this data, integrating all this data, making it available widely to the scientific community - but at the same time ensuring security and privacy that's so needed as part of what we do when we do research on humans.
MCGEE: When it comes to security for that sort of data, what are the challenges? Is it a matter of authenticating the users who are having access to it, or making sure that it's secure on cloud-based platforms? What are some of those challenges?
GREEN: It's all of the above. It includes validating the source of data, making sure that when you integrate different data sets you're appropriately doing it and keeping the proper identification of individuals so that what you attribute to an individual - some piece of data - is correct. At the same time, you need to make sure that the people who are accessing that data have been properly credentialed to do that work. We have a whole set of approaches that we use to allow an individual to do what's called "human subjects research," and that requires appropriate approvals and appropriate authentication of the individuals. And as the data becomes distributed, you need to be honoring all of these steps, but doing so in a distributed way.
Navigating a Regulatory Patchwork
MCGEE: Because many of your projects involve multiple parties, researchers from different places, contributors of data from different sources, when it comes to privacy regulations across different states and the federal government, there's an inconsistent patchwork. Does that inconsistency in terms of security and privacy regulations impact the research that you're able to do at NIH, or does it complicate the informatics projects? If so, how?
GREEN: I wouldn't say it overwhelmingly complicates it up until now. Going forward it will become an issue that we need to be carefully monitoring. Part of what is being created as part of this new initiative at NIH is something called the Scientific Data Council. And that Scientific Data Council has many responsibilities, one of which is to be a body with expertise and appropriate authorities for monitoring these kinds of obstacles as they might develop, and also to develop new approaches when new challenges come up - to be able to tackle some of these issues and have it also anticipate what might become challenging down the road. Some of these issues around privacy regulations are the kinds of things that will need to be monitored, with strategies appropriately developed to help overcome any barriers.
Securing Big Data Projects
MCGEE: NIH has a lot of big data projects under way. What are the security and privacy challenges related to those efforts, and how are you addressing those challenges?
GREEN: It's similar to what I was saying earlier. If you take as one example, among large projects that are going on now, projects to sequence the genomes of many individuals as part of clinical research studies and other research studies, the real benefit of having large data sets available is that you're not just looking at dozens of individuals and their data, or hundreds or thousands, but tens of thousands, even eventually hundreds of thousands of individuals.
In order to do that, you need all the data in one place to be able to compute off it all at once. You get more statistical power the more people you have data from. But when you go to do that, you need to aggregate all that data into one place to allow computation to take place. And when you do that, you're moving data around and so you have to be cognizant of privacy and security: where that data is; who can access it; who has the right to access it; what permissions have been given; what stipulations have been associated with the use of that data and so forth. It just becomes challenging. But those are the kinds of things that we're now looking into with figuring out the best solutions.
MCGEE: Are there any new or emerging security technologies that you think will be important to your biomedical research, for instance the use of encryption or any other new security technologies that evolve?
GREEN: ... We're funding individuals both here at NIH and elsewhere to look at these kinds of things, especially when it comes to cloud computing, encryption technologies and so forth. All this is under active investigation.