In the research and education community, digital data continues to be a key driver of scientific discovery and innovation. The ability to access, manipulate, and share data is becoming increasingly important, and many institutions are looking for ways to form collaborative partnerships and share data to propel initiatives and unlock new opportunities. In addition, educational institutions must properly prepare students to enter the workforce with the data expertise and skills necessary to meet the current and future needs of business organizations. “Many fields are generating a massive amount of data, and this data is often too large to be sent across from different sites,” explains Vasant G. Honavar, Professor and Edward Frymoyer Chair of Information Sciences and Technology and Professor in the Computer Science, Bioinformatics and Genomics, and Neuroscience Graduate Programs at The Pennsylvania State University. “Analyzing and interpreting the data where this information resides is often the only option, because moving the data around is not feasible. However, many times institutions want to access the data, work on research in a virtual environment, and use the infrastructure as a medium for facilitating this distributed collaboration.”
“We conceptualized the Virtual Data Collaboratory (VDC) to address the growing importance of data across all fields of science and engineering, and critical needs to support collaborative data-driven research,” said Manish Parashar, the lead and original Principal Investigator for the VDC project.
Development of the VDC was supported by a $4 million National Science Foundation (NSF) grant and is a collaboration among Rutgers, The State University of New Jersey, Penn State, and Temple University. The VDC is envisioned as a federated and coordinated cyberinfrastructure across the Rutgers University campuses in New Jersey and multiple campuses in Pennsylvania and New York, connected by a high-speed network, with the potential to incorporate academic and research institutions across the nation. “The grant allowed us to build an infrastructure where research data created at Rutgers and other collaborating universities could be stored, discovered, and reused,” says Grace Agnew, Associate University Librarian for Digital Library Systems, Rutgers University. “A large-scale research data infrastructure is critical for Rutgers to continue to advance as a research institution.” The VDC project is in partnership with Edge, the New Jersey statewide research and education network, and the Keystone Initiative for Network Based Education and Research (KINBER). The advanced peering connection between KINBER’s PennREN network and the Edge network allows for greater capacity, speed, and enhanced research capabilities between institutions across Pennsylvania, New Jersey, and beyond.
Seamless Data Accessibility
Data sharing is a central component of the scientific process, and it helps to expand the field of data science and to train future generations effectively. “In the past, a group of scientists gathered around a microscope or a telescope to share insights, enabling great discovery to be made,” says Manish Parashar, Ph.D., Director of the Rutgers Discovery Informatics Institute (RDI2) and Distinguished Professor of Computer Science, Rutgers University. “Now, researchers analyze shared data collectively to make new discoveries. The goal of the VDC was to create a means for members of a research community to similarly access and work on shared data sets collaboratively to drive research and data-driven discoveries.” The VDC is a data cyberinfrastructure designed to enable data-intensive, interdisciplinary, and collaborative research and inspire data-driven science and engineering discoveries. The goal of the VDC is to provide seamless access to data and tools that can help educators, students, researchers, and industry leaders foster learning communities and bring Big Data skills into the classroom. In addition, the VDC helps data scientists and engineers develop and apply advanced federated data management and analysis tools for high-impact scientific applications. “Research groups have become more interdisciplinary and this trend is a key element for establishing the VDC,” says Ivan Rodero, Ph.D., Associate Research Professor, RDI2, Rutgers University. “While there are currently many data depositories, without access to shared information, we would never know what other discoveries could be revealed. Conducting research individually can be very challenging. By providing access to data, tools, and resources, we can help researchers create new science models and discoveries.”
Federated Data Management
To further enhance the research ecosystem and assist both humans and machines in scientific discovery, a set of standards called the FAIR Guiding Principles was developed. Created in the academic community, these standards have since been embraced by scientific funding agencies and are meant to improve the findability, accessibility, interoperability, and reusability of digital assets. Rutgers University played an important role in conceptualizing and constructing the VDC and creating access to FAIR data. “At Rutgers, we recognize the challenges and importance of sharing FAIR data,” explains Parashar. “In a conversation with Dr. Honavar and his team, they were thinking about similar issues and we found research cases where sharing data would make sense. For example, Dr. Honavar was working on a proteomics use case with a dataset at Rutgers that they needed to share. Similarly, there was a use case around smart cities with CUNY. Driven by these science needs and a shared vision, we realized we needed to at least double such an environment like the VDC to support the science and research people wanted to do. This shared vision and need led to the proposal that we submitted to the NSF.”
Data sharing allows researchers to review, confirm, or challenge each other’s findings, as well as test alternative theories and explore new avenues of reasoning — all with the purpose of driving scientific discovery. “Moving data around is not really feasible,” explains Parashar. “One option is to keep the data where the information originated and connect the data in a federated environment, which was the motivation for architecting the VDC the way we did. While making data discoverable, we must determine how institutions maintain the ownership of data.”
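The federated approach Parashar describes, where data stays at its home site and only results travel, can be illustrated with a minimal Python sketch. This is not the VDC's software; the site names, the values, and the `SiteSummary` type are assumptions made purely for illustration. Each site computes a small summary locally, and a coordinator combines the summaries into a global mean without ever seeing the raw records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SiteSummary:
    """Small aggregate a site shares instead of its raw records (hypothetical type)."""
    count: int
    total: float


def summarize_locally(values: List[float]) -> SiteSummary:
    # Runs where the data lives; only this summary leaves the site.
    return SiteSummary(count=len(values), total=sum(values))


def federated_mean(summaries: List[SiteSummary]) -> float:
    # Combines the per-site summaries into one global mean.
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no data at any site")
    return sum(s.total for s in summaries) / n


# Hypothetical per-site measurements (illustrative values only).
site_data: Dict[str, List[float]] = {
    "site_a": [1.0, 2.0, 3.0],
    "site_b": [4.0, 5.0],
}
summaries = [summarize_locally(values) for values in site_data.values()]
global_mean = federated_mean(summaries)
```

In this sketch each institution keeps ownership of its records, and what crosses the network is a few numbers per site. Real federated analyses are considerably more involved, but the division of labor is the same.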
Enabling optimal use of research data and methods also requires a level of harmonization in terms of multiple interfaces and management mechanisms. “One of the challenges is that data cannot leave certain sites, either due to ownership constraints, the data set is too large, or the bandwidth is simply not enough to move the data around at a necessary rate,” adds Honavar. “If an institution has a large depository, they likely do not want the resource replicated in numerous places. Plus, data analysis often takes place at multiple sites. The VDC strives to address these challenges and help researchers in many different scientific fields find relevant data and the tools needed for successful analysis.”
Ensuring Data Discoverability
Rutgers University Libraries are a key part of the VDC team and worked with the lead department, RDI2, to implement the data services layer according to their design. “The Libraries engage with and support Rutgers users across the spectrum, from incoming first-year students to faculty members engaged in groundbreaking research,” says Agnew. “We understand and represent user needs. We were tasked with designing the data services layer which is the user-facing part of the project. Our design encompassed adding, discovering, and reusing data. We took a unique approach to ensuring the discoverability and reuse of data by designing an interface that links data with the person who created the information, the tools used to analyze the data, and the intermediate research products that are created around the data before the peer-reviewed publications begin. We designed a strategy that not only supports the workflow of the researcher, but helps other researchers, perhaps in other disciplines, understand the context of the data and how the information is used as part of the discovery process.”
Education and Outreach
An important focus of the VDC is to help train the next generation of scientists and instill the skills and knowledge necessary for them to leverage data and cyberinfrastructure to address important research problems. Educational programs are delivered across multiple disciplines and academic levels and are available through universities like Temple University, Rutgers, Penn State, Drexel, and CUNY, as well as through high school workshops and post-graduate seminars. “The bulk of our effort has been devoted to building up the VDC infrastructure and demonstrating the use cases,” explains Honavar. “The broader use of this infrastructure is leveraging the VDC platform to deliver educational programs focused on an interdisciplinary aspect of research, connectivity to external repositories, and collaboration.”
The VDC extends beyond the platform’s collaborating institutions and leverages NSF investments to provide cyberinfrastructure typically not available to community colleges, state-associated colleges and universities, and regional liberal arts colleges and universities. “As part of the NSF-funded VDC project, the RDI2 is developing educational modules to help researchers solve their data issues and increase the impact of their research. Through a variety of programs including high school workshops, undergraduate learning modules, and online seminars, educators can help foster the next generation of data science researchers and professionals. Internships are also available to provide an opportunity for students to work with VDC mentors to develop their skills. Students are able to gain hands-on experience in their prospective field, enhance their career potential, collaborate in globally-influential projects, and learn more about Big Data and artificial intelligence (AI),” says Forough Ghahramani, Education and Outreach Co-Lead for the VDC project. According to Ghahramani, “The educational material is shared on the VDC website, datacollaboratory.org, so that it can be used in courses and programs across NJ, PA, and beyond. A goal is to foster increased use of the VDC while building sustainable data science peer-learning communities.”
FAIR Guiding Principles
Findable: Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services.
Accessible: Once users find the required data, they need to know how the data can be accessed, possibly including authentication and authorization.
Interoperable: The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
Reusable: Metadata and data should be well described so that they can be replicated and/or combined in different settings.
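To make "machine-readable metadata" concrete, here is a minimal Python sketch of a dataset description that software could check automatically. The field names, their grouping under each principle, and the example record are assumptions for illustration only; they do not come from the VDC or from any formal metadata standard.

```python
import json
from typing import Dict, List

# Hypothetical field names, grouped by the FAIR principle they support.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "findable": ["identifier", "title", "keywords"],
    "accessible": ["access_url", "access_policy"],
    "interoperable": ["format"],
    "reusable": ["license", "creator"],
}


def missing_fields(record: dict) -> Dict[str, List[str]]:
    """Return, per principle, the required fields that are absent or empty."""
    gaps: Dict[str, List[str]] = {}
    for principle, fields in REQUIRED_FIELDS.items():
        absent = [name for name in fields if not record.get(name)]
        if absent:
            gaps[principle] = absent
    return gaps


# Illustrative record; every value here is made up.
record = {
    "identifier": "example-dataset-001",
    "title": "Example proteomics dataset",
    "keywords": ["proteomics", "example"],
    "access_url": "https://example.org/datasets/001",
    "access_policy": "authenticated",
    "format": "text/csv",
    "creator": "Example Lab",
    # "license" is deliberately omitted to show a detected gap.
}

metadata_json = json.dumps(record, indent=2, sort_keys=True)  # shareable form
gaps = missing_fields(record)
```

Because the record is plain JSON, a catalog can index it for discovery, and a simple check such as `missing_fields` can flag descriptions that would be hard to reuse, here the missing license.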
The VDC is available to other universities through the Internet2 high-speed networking facilities, KINBER in Pennsylvania, and Edge in New Jersey. “The VDC is also designed to bridge to existing collaboratories, such as the Protein Data Bank,” shares Agnew. “So much of the data in the VDC will be ‘virtual’ because the information exists in other collaboratories but is accessible via the VDC.”
Central to the VDC vision is the regional data science DMZ network that provides services to enable efficient and transparent access to data and computing capabilities, as well as a flexible and scalable architecture for data-centric infrastructure federation. “A critical element of enabling a virtual data collaboratory is providing the network bandwidth and connectivity to allow the seamless access and sharing of data,” says Parashar. “Edge allows us to achieve this goal through their regional high performance optical network. The organization is a very important partner because they provide the fundamental infrastructure that we can further build upon.”
Honavar says institutions like Penn State and Rutgers have significant infrastructure, and there are smaller colleges and universities that are interested in taking advantage of these large datasets and analysis expertise that they may not have access to. “Providing the connectivity and access to resources like the VDC can help broaden the reach of every institution’s research efforts.”
Going forward, a main focus of the VDC will be further developing educational programs that leverage the VDC platform. “When we first developed the concept, we asked ourselves, can this idea help research and enhance education, connectivity, and outreach activities?” says Parashar. “Our experience has shown that the VDC is accomplishing these goals. Now, we want to build upon these conceptual ideas and put together a more coherent deployment of the VDC that can enable the science that we envisioned. We look to build use cases that can start using this platform. The next step after that is to look to the VDC as a model for building something that could be used at a larger scale. We want to investigate similar deployments in other places and build upon a complementary infrastructure that exists like the Pacific Research Platform (PRP) to create a more national or international infrastructure.”
Honavar says the VDC was a proof-of-concept exercise that showed this idea could be successfully achieved. “The lessons learned will inform anything we do going forward. When you think about collaborative discoveries, the process would involve not just sharing of data, but sharing tools and expertise and integrating people as part of the ecosystem. You can imagine scientists at different locations contributing different tools that would weave together into a collaborative workflow that operates across data at multiple locations. By scaling up these kinds of efforts, this vision would be feasible.” Public funding, like the grant provided by the NSF, will also be critical for accomplishing the goals set forth by the VDC. “We are working to find synergy with the industry and using resources in the Cloud to help support our mission,” says Rodero.
As the VDC evolves, Parashar says further building the community is vital to gaining insight into the tools that are needed to conduct the research projects of today and the future. “We cannot possibly provide every tool that every discipline may need, so a model of co-creation for the next generation of the infrastructure will be essential. For example, a community of biologists that uses a particular kind of data may get involved in the creation of the next generation of tools in collaboration with people who have computer science or infrastructure expertise.” As a state-of-the-art data-intensive computing, storage, and networking solution, the VDC will continue to integrate existing data depositories and help to blaze new trails that support interdisciplinary research, expand access, and increase the impact of data science worldwide.