ELSIcon2022 • Paper • June 3, 2022
J. Scott Roberts, Kayte Spector-Bagdady
The National Institutes of Health (NIH) is investing vast resources into building accessible and demographically diverse genomic databanks. But ensuring data availability is only one step toward enabling life-saving genetic advances. Researchers must also use these diverse data resources. Relatively little effort has been put into this second component: evaluating what drives researchers to choose between different genomic data resources.
This study presents a broad empirical assessment of what drives genetic researchers when selecting and using different kinds of datasets, including those held by industry, academia, and government. Overall, we found that a major factor in the use of private industry genetic data is the services that companies provide alongside access, including sharing large, cleaned, and harmonized datasets with support for questions and further analyses. By contrast, many researchers bemoaned the “complicated hassle” that often went hand in hand with use of NIH datasets. Because NIH data are deposited by many different researchers with different analytic approaches and platforms, they are often burdensome to clean and investigate. On the other hand, government datasets were found to be broadly available to support many kinds of genetic research, whereas using industry datasets involved strict limitations on data availability in addition to other constraining terms and conditions, such as mandatory industry co-authorship of resultant publications.
Overall, our work sets up important next steps for NIH efforts to support both data accessibility and data use, enabling genetic advances that are generalizable across diverse communities.