Anyone involved with research in healthcare and life science knows that the funding model often has unintended consequences. In academia as well as in industry, a researcher or principal investigator (PI) writes a grant proposal describing a project and explaining why they need money for it. If the grant is approved, the PI receives funding to use as they see fit. If the research program involves data science, it will need IT infrastructure, typically a combination of the following elements:
- GPU compute
- Flash storage
- Object storage
- A software stack including hypervisors, operating systems, and containers
- An orchestration layer
- Free or open-source environments such as Jupyter Notebooks, Python, and Kubeflow
- Automation tools such as Ansible
- MLOps tools such as the NetApp® AI Control Plane and NetApp Data Science Toolkit
How shadow AI takes hold
The PI or grant recipient rightfully controls how the grant is spent. This is good in terms of shielding scientists from external pressure, but it can create redundancy and inefficiency in an institution. Imagine an organization in which several research teams are undertaking data science projects. Given the way financial support is allocated, each team can build an almost identical infrastructure stack to support their work. This is inefficient, because the compute and storage resources in each of those environments are unlikely to be fully used, and building the stack also takes away from the time the teams have for their research.
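Back-of-the-envelope arithmetic makes the inefficiency concrete. The team counts, utilization rate, and headroom factor below are illustrative assumptions, not measurements from any real deployment:

```python
import math

# Illustrative assumptions: three teams each buy 8 GPUs for their own stack,
# but each team's GPUs sit busy only ~30% of the time on average.
teams = 3
gpus_per_team = 8
avg_utilization = 0.30  # fraction of GPU-hours actually consumed

# Siloed model: GPUs purchased vs. GPU capacity actually doing work
siloed_gpus = teams * gpus_per_team                   # 24 GPUs bought
busy_gpu_equivalent = siloed_gpus * avg_utilization   # 7.2 GPUs' worth of work

# Pooled model: size one shared cluster for the same workload, with a
# 50% buffer for peak demand (also an assumption)
headroom = 1.5
pooled_gpus = math.ceil(busy_gpu_equivalent * headroom)

print(f"Siloed: {siloed_gpus} GPUs, pooled: {pooled_gpus} GPUs")
```

Under these assumptions, a shared pool of 11 GPUs could absorb the same workload that 24 siloed GPUs serve today, less than half the hardware spend.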
Installing and configuring a data science infrastructure stack takes time. Beyond getting these environments ready for production, someone must be available to provide ongoing support and troubleshooting when problems arise. The local IT team may be reluctant to support a project they had no part in architecting or deploying, viewing it as the start of a long-term engagement that they don't have the resources for.
This is how shadow IT (or shadow AI) starts. Research teams want to make their own decisions and control their resources, which creates islands of almost identical, underused infrastructure that may not follow the organization’s data security best practices. Fortunately, this suboptimal outcome is easy to avoid. Reducing duplication, maximizing resource utilization, simplifying operations, and strengthening data security and governance are goals that everyone is interested in pursuing.
Data science as a service to the rescue
In our age of anything and everything as a service, why not apply a proven strategic and operational blueprint to data science infrastructure? You can share resources, lower costs, and give time back to data scientists so they can focus on their work.
There is consensus among data scientists that they spend almost half their time (or even up to 75%!) on tasks that are not data science; for example:
- Configuring hardware and software
- Resource orchestration
- Container management
- Resource scheduling and assignment
- Production management
- Repository management
- Version control
- Data wrangling
Deploying a data science as a service (DSaaS) infrastructure stack and making it available to researchers through a self-service portal that includes chargeback allows them to keep their independence and control of their grant funding. It also delivers a more complete and secure solution while eliminating shadow AI silos. This is a case in which everyone wins. Researchers get easy access to the IT resources they need at a lower price than they would pay if they built the infrastructure themselves, and the organization avoids silos, increases efficiency, and promotes security and compliance.
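Chargeback itself can be simple. The sketch below shows a minimal per-project cost calculation; the unit rates and usage figures are invented for illustration, and a real portal would pull usage from metering data rather than hard-coded values:

```python
from dataclasses import dataclass

# Hypothetical unit rates; a real deployment would derive these from
# actual infrastructure costs and amortization schedules.
RATES = {
    "gpu_hour": 2.50,        # $ per GPU-hour
    "flash_tb_month": 80.0,  # $ per TB-month of flash storage
    "object_tb_month": 15.0, # $ per TB-month of object storage
}

@dataclass
class ProjectUsage:
    """Metered resource consumption for one research project in one month."""
    name: str
    gpu_hours: float
    flash_tb_months: float
    object_tb_months: float

    def monthly_charge(self) -> float:
        return (self.gpu_hours * RATES["gpu_hour"]
                + self.flash_tb_months * RATES["flash_tb_month"]
                + self.object_tb_months * RATES["object_tb_month"])

# Example: an invented project's monthly usage
usage = ProjectUsage("genomics-pilot", gpu_hours=120,
                     flash_tb_months=2, object_tb_months=10)
print(f"{usage.name}: ${usage.monthly_charge():.2f}")  # genomics-pilot: $610.00
```

Each grant is billed only for what its project consumed, which is how researchers keep control of their funding while the hardware is shared.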
Using the NetApp AI Control Plane and Data Science Toolkit, data scientists and engineers gain powerful tools that also alleviate the burden on IT. Researchers get an AI data- and experiment-management solution, and they also gain the ability to perform data management tasks from within the software environments they commonly use, like Jupyter Notebooks, Kubeflow and Apache Airflow pipelines, and Python. Bringing storage system API integration into the data science realm gives power to the data scientists so they can work more efficiently and avoid having to open IT tickets for common tasks.
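To illustrate the pattern of storage management from inside a notebook, here is a minimal sketch. The `snapshot_volume` function and its in-memory registry are hypothetical stand-ins written so the example runs anywhere; they are not the toolkit's actual API, so consult the toolkit's documentation for the real function names and parameters:

```python
# Hypothetical stub standing in for a storage-toolkit call; the real toolkit
# would talk to the storage system's API instead of a local dict.
def snapshot_volume(volume: str, snapshot: str, registry: dict) -> None:
    """Record a named, point-in-time snapshot of a data volume (stub)."""
    registry.setdefault(volume, []).append(snapshot)

# From a notebook cell, a data scientist could version a training dataset
# before each experiment, instead of opening an IT ticket:
snapshots: dict = {}
snapshot_volume("imaging_dataset", "pre_experiment_baseline", snapshots)
snapshot_volume("imaging_dataset", "pre_experiment_tuned", snapshots)
print(snapshots["imaging_dataset"])
```

The point is the workflow, not the stub: snapshotting, cloning, and restoring data become one-line calls in the same environment where the model code lives.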
NetApp has been helping healthcare organizations deploy data science as a service for years and has refined a process that is optimized for speed, accuracy, and customer satisfaction. For a description of our approach to DSaaS in an academic setting, read Data Science as a Service—Prototyping an integrated and consolidated IT infrastructure combining enterprise self-service platform and reproducible research.
Adding MLOps tools rounds out the offering by helping data scientists automate, streamline, and speed up feature engineering, pipeline deployment, continuous integration and continuous deployment (CI/CD), and model monitoring. To learn more, read our technical reports TR-4834 and TR-4841. You can also visit netapp.ai to learn more about our AI solutions.
If you would like to start a conversation about data science as a service, contact your NetApp Sales team or your NetApp partner, or get in touch with us.