Denodo Datafest – My talk & Concluding Thoughts on Data Science @ Scale

Denodo Datafest NYC has concluded with excellent talks and discussions around data virtualization, real-time data APIs, information fabric, data catalogs, and real-world big data problems. It was great meeting and speaking to technologists and business professionals like Brian Hopkins, Michele Goetz, Luke Slotwinski, Mike Sofen, Nicolas Brisoux, Madeleine Corneli, Avinash Deshpande, Paco Hernandez, Kurt Jackson, Arif Rajwani, Mano Vajpey, Vincent L'Archeveque, Minnix, Rajat Sinha, Vineet Talwar, Steve Young, Steven Leistner, Ignacio, Ravi Shankar and many others.

Adnan Masood Speaker Panel Data Science AI Machine Learning

I had the distinct pleasure of speaking to the audience about data science bottlenecks, i.e. how real data science use cases unfold once the MLOps lifecycle becomes critical. The case studies included training a Faster R-CNN model to detect lab equipment, customer escalation triage probability analysis from a complaint dataset, using an attention network for machine comprehension in contract analysis for key terms in LIBOR, building meaningful healthcare ontologies to respond to real-time queries, and working with a pretrained VGG16 model for roof age and condition detection. In essence, data science roadblocks span the entire lifecycle.

Those of us who have toiled beyond hello-world projects in the enterprise know very well that getting meaningful data is hard in the presence of data silos and data lake catch-22s. Data virtualization helps provide fast results by integrating disparate data from any enterprise source, big data and cloud, in real time. The approach differs from traditional ETL solutions, which replicate data: data virtualization leaves the data in the source systems and simply exposes an integrated view of all the data to data consumers. From a data science perspective, data virtualization helps by providing a logical data layer that integrates all enterprise data siloed across disparate systems, building the Information Fabric.
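
To make the contrast with ETL concrete, here is a minimal sketch of the idea in plain Python. It is not how Denodo or any specific product implements virtualization; the two sources, the `customer_complaints` view, and all table and column names are hypothetical. The point is only that the view joins the sources at query time instead of copying them into a third store.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: an operational database (in-memory SQLite here).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Hypothetical source 2: a CSV export from a separate complaints system.
COMPLAINTS_CSV = "customer_id,severity\n1,high\n2,low\n1,medium\n"


def read_complaints():
    """Read complaint rows from the CSV source each time it is called."""
    return csv.DictReader(io.StringIO(COMPLAINTS_CSV))


def customer_complaints():
    """Virtual view: join both sources at query time, yielding rows lazily."""
    names = dict(crm.execute("SELECT id, name FROM customers"))
    for row in read_complaints():
        yield {"customer": names[int(row["customer_id"])], "severity": row["severity"]}


rows = list(customer_complaints())
```

Because nothing is replicated, a change in a source system shows up the next time the view is queried, which is the property that makes the approach attractive for fast-moving data science work.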

I believe the data challenges can be divided into three main categories, with virtualization, MLOps, and explainability each solving part of the puzzle.

Availability Challenges

Data silos, multiple modalities, and fragmented data that needs cleaning
Data quality and amount of training data
Ingestion, analysis, transformation, training, serving, and model development

Continuity Challenges

Deployment / operationalization
Delivery of accurate models with consideration for model and data drift
Relevant business accuracy measures
Versioning of data and models for governance
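
As an illustration of the data drift item above, one common way to quantify drift on a single numeric feature is the Population Stability Index (PSI). The sketch below is an assumption on my part about how such a check could look, not something shown in the talk; the `psi` function, the bin count, and the sample data are all hypothetical.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]        # training-time feature values
same = list(baseline)                           # serving data with no drift
shifted = [0.5 + i / 200 for i in range(100)]   # serving data shifted upward

no_drift_score = psi(baseline, same)
drift_score = psi(baseline, shifted)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the threshold should be tuned per feature and per business context rather than taken as fixed.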

and finally

Explainability Challenges

Interpretability and explainability
Reproducibility