My Experience

Human-AI Empowerment Lab (Clemson University)

May 2024 – Present


AI Data Scientist

As an AI Data Scientist at the Human-AI Empowerment Lab, I led the development of an AI-powered comment triage system designed to optimize and automate the feedback review process for large-scale document reviews. This system leverages cutting-edge Natural Language Processing (NLP) and Machine Learning (ML) techniques to classify and prioritize feedback across multiple dimensions, including urgency, sentiment, actionability, and thematic relevance.

AI-Powered Comment Triage System


Overview:

This project aimed to streamline and optimize the feedback review process through an AI-driven triage system. By deploying large language models, including Gemma 2B (via Hugging Face) and Gemini 1.5 Pro (via Google AI Studio), the system achieved a 25% improvement in feedback classification accuracy and reduced manual triage effort by 50%.


Key Project Highlights:

  • Research-Driven Approach: Identified key research gaps in AI-based document triage systems, leading to a novel framework that combined rule-based logic with machine learning models. This work was submitted to the ACM SIG Conference.
  • NLP & ML Pipeline: The system included sentiment analysis using Hugging Face models, topic modeling with LDA, and zero-shot/few-shot learning to adapt to new, unseen feedback without retraining (a minimal sketch of these stages follows this list).
  • Automation & Scalability: The system processed large volumes of comments using Clemson's Palmetto HPC for scalable model training, reducing training time by 55% through efficient utilization of 60 CPUs and 8 GPUs.
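
As a minimal sketch of the sentiment and zero-shot stages referenced in the pipeline bullet above, the snippet below wires together two off-the-shelf Hugging Face pipelines. The default models and the triage label set are illustrative assumptions, not the exact configuration used in the lab.

```python
# Sketch of the sentiment and zero-shot stages of a comment-triage pipeline.
# Default pipeline models and the candidate label set are placeholder assumptions.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")         # default sentiment model
zero_shot = pipeline("zero-shot-classification")   # default NLI-based model

labels = ["urgent", "actionable", "question", "praise"]  # hypothetical triage labels

def triage(comment: str) -> dict:
    """Attach a sentiment label and triage-label scores to a single comment."""
    sent = sentiment(comment)[0]
    z = zero_shot(comment, candidate_labels=labels)
    return {
        "comment": comment,
        "sentiment": sent["label"],
        "sentiment_score": round(sent["score"], 3),
        "top_label": z["labels"][0],
        "label_scores": dict(zip(z["labels"], [round(s, 3) for s in z["scores"]])),
    }

if __name__ == "__main__":
    print(triage("Section 3 contradicts the budget table; please fix before Friday."))
```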

Technical Skills & Contributions:

  • Machine Learning & NLP: Developed a resilient AI pipeline leveraging Gemma 2B, Gemini 1.5 Pro, and Hugging Face models for sentiment analysis, intent classification, and topic modeling.
  • Data Engineering: Designed and maintained robust ETL pipelines, ensuring the system could handle varied feedback types from multiple projects.
  • Optimization & Automation: Automated the feedback triage process, reducing manual intervention by 50%, and integrated rule-based logic to enhance decision-making.
  • Cloud Infrastructure: Accelerated model training using Clemson's HPC cluster and cloud resources, improving computational efficiency by 55%.
  • Research Publication: Co-authored a research paper on the AI-Powered Comment Triage System, submitted to the ACM SIG Conference.

Key Outcomes:

  • 25% Increase in Feedback Classification Accuracy: Enhanced the system's ability to identify and prioritize urgent, actionable feedback.
  • 50% Reduction in Manual Triage Time: Automated classification and prioritization processes, significantly reducing human intervention.
  • 95% F1 Score in Few-shot Learning Scenarios: Achieved high performance with minimal training data, demonstrating the system's effectiveness in large-scale environments.
  • 55% Faster Training Time: Leveraged cloud infrastructure and parallel computing to accelerate model training and deployment.

Role and Contributions:

  • Led the end-to-end design and development of the AI-powered comment triage system, from problem definition to model deployment.
  • Conducted research on AI-driven document triage systems, addressing key limitations, and submitted the findings to the ACM SIG Conference.
  • Collaborated with cross-functional teams, including data engineers and stakeholders, to ensure the system met real-world needs for large-scale document review.
  • Contributed to the broader academic and AI community by sharing insights and research findings in a published paper.

Conclusion:

The AI-Powered Comment Triage System successfully automated and optimized the feedback review process, improving classification accuracy by 25% and reducing manual triage time by 50%. By addressing key research gaps, this system provided a scalable solution for AI-driven document management, earning recognition through its submission to the ACM SIG Conference.

Clemson Engineers for Developing Communities (CEDC)

August 2023 - May 2024


Data Analyst

As a Data Analyst, I led a project combining Python and Power BI to analyze qualitative data from interviews on disaster preparedness and recovery. I conducted text preprocessing, term frequency analysis, and topic modeling in Python, while leveraging Power BI for spatial visualization and interactive analysis. This dual-platform approach provided comprehensive insights into community resilience and disaster management, enhancing policy-making and community engagement efforts.


Disaster Management and Community Resilience Analysis


Overview:

In disaster management and community resilience, understanding the narratives and experiences of individuals and stakeholders is paramount. Through semi-structured interviews, valuable insights emerge, shedding light on perceptions, challenges, and strategies related to disaster preparedness and recovery efforts. This project aims to analyze these narratives, identify frequently occurring terms, uncover underlying themes, and explore potential variations based on geographic locations.


Analytical Approach:

This analysis employs a comprehensive methodology combining Python and Power BI to extract meaningful insights from qualitative data.


Python Analysis:

The Python analysis aimed to understand disaster experiences and resilience discussions through data preprocessing, exploratory data analysis, text analysis, and geographic analysis.


Key Steps and Findings:

Loading and Importing Packages:

Utilized Pandas for data manipulation, ensuring efficient data handling and preparation.

Data Exploration:

Performed statistical analysis to understand data distributions and trends. Visualized the data to identify patterns and outliers.

Handling Missing Values:

Identified and addressed missing values to ensure the reliability of the dataset. Applied imputation techniques to fill gaps in the data.

Text Preprocessing:

Used NLTK and spaCy for tokenization, lemmatization, and removal of stopwords. Standardized text data to enhance consistency and analysis quality.
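
The snippet below is a minimal illustration of this preprocessing step using spaCy (NLTK stopwords could be swapped in); the sample sentence is invented and the small English model is an assumed choice.

```python
# Illustrative preprocessing: tokenization, stopword/punctuation removal, lemmatization.
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, lemmatize, and drop stopwords, punctuation, and whitespace."""
    doc = nlp(text.lower())
    return [tok.lemma_ for tok in doc
            if not tok.is_stop and not tok.is_punct and not tok.is_space]

print(preprocess("The county's emergency teams were rebuilding flooded roads."))
# expected output along the lines of: ['county', 'emergency', 'team', 'rebuild', ...]
```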

Error Resolution:

Detected and corrected inconsistencies, such as typos and misclassifications. Implemented data validation steps to maintain data integrity.

Exploratory Data Analysis:

Analyzed the distribution of responses across different sectors. Used visualization tools to uncover insights into sector-specific trends.

Term Frequency Analysis:

Analyzed the most common words used in the text data to identify key themes. Created word frequency dictionaries to quantify term importance.

Creating Dictionary:

Organized word frequencies into a comprehensive dictionary for further analysis. Enabled efficient retrieval and analysis of key terms.
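
A minimal sketch of this step, assuming the tokenized responses come from a preprocessing function like the one sketched above; the sample responses here are placeholders.

```python
# Building a term-frequency dictionary from already-tokenized interview responses.
from collections import Counter

tokenized_responses = [
    ["community", "emergency", "response", "plan"],
    ["county", "emergency", "shelter", "community", "support"],
]

term_freq = Counter()
for tokens in tokenized_responses:
    term_freq.update(tokens)

print(term_freq.most_common(5))  # e.g. [('community', 2), ('emergency', 2), ...]
```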

Topic Modeling:

Used Latent Dirichlet Allocation (LDA) to uncover latent topics within the text data. Identified themes such as community engagement, support, infrastructure, and natural disasters.
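
The following sketch shows how LDA can be fitted over tokenized responses with gensim; the toy corpus and topic count are assumptions standing in for the real interview transcripts.

```python
# LDA topic modeling over tokenized responses with gensim (toy corpus).
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["community", "engagement", "volunteer", "support"],
    ["flood", "infrastructure", "road", "bridge", "damage"],
    ["county", "emergency", "shelter", "support", "community"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=42, passes=10)
for topic_id, terms in lda.print_topics(num_words=4):
    print(topic_id, terms)
```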

Geographic Analysis:

Geocoded location names and mapped coordinates to provide spatial insights. Analyzed geographic patterns to understand regional variations in responses.
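
Below is a hedged sketch of the geocoding step using geopy's Nominatim geocoder, which is one possible tool rather than necessarily the one used in the project; the place names are examples.

```python
# Geocoding place names so responses can be mapped to coordinates.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="resilience-analysis-demo")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # respect usage limits

for place in ["Clemson, South Carolina", "Charleston, South Carolina"]:
    loc = geocode(place)
    if loc:
        print(place, "->", round(loc.latitude, 4), round(loc.longitude, 4))
```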

Integration with Power BI:

Shifted to Power BI for advanced spatial visualization and analysis. Combined Python and Power BI to leverage the strengths of both tools.


Key Findings:

Text Preprocessing:

Cleaned and standardized text data, removing stopwords to focus on meaningful content.

Term Frequency Analysis:

Identified common words such as "community," "emergency," and "county," indicating key themes.

Topic Modeling:

Revealed vital themes, including community engagement, assistance, infrastructure concerns, and natural disasters.

Geographic Analysis:

Provided spatial insights into the data, highlighting regional variations in disaster experiences and resilience discussions.


Power BI Analysis:

The Power BI analysis explored regional variations in disaster experiences and resilience discussions.


Key Steps and Findings:

Loading Data:

Imported data from an Excel file into Power BI. Ensured the inclusion of geographic locations, speakers, sectors, roles, and comments.

Data Transformation and Cleaning:

Performed data cleaning tasks such as removing duplicates and correcting errors. Standardized data formats to ensure consistency and accuracy.

Creating Full Location:

Combined granular location data with state information to create a unique "Full Location" field. Enhanced the accuracy of geographic analysis by identifying each locale precisely.

Data Modeling:

Verified that the data model was correctly set up. Ensured accurate column derivation for effective analysis.

Creating Visualizations:

Generated interactive visualizations, including maps, slicers, and card visuals. Enabled dynamic data filtering and exploration.

Finalizing and Sharing:

Published the dashboard within Power BI workspace. Provided detailed documentation of the process for reference and replication.


Key Findings:

Map Visualization:

Integrated sector information and provided a geographical overview of data distribution.

Slicers and Card Visuals:

Enabled users to filter data dynamically and gain summarized insights.

Interactive Analysis:

Facilitated dynamic exploration and comparison of data points across visualizations.


Comparing Python and Power BI:

Python:

Provided in-depth text analysis, revealing key themes and patterns within the data. It offered advanced NLP capabilities for comprehensive text preprocessing and topic modeling.

Power BI:

Offered user-friendly geographic visualization and enhanced interactive analysis capabilities. It allowed for effective visualization of spatial data and integration with Python for a holistic analysis approach.

Limitations and Strengths:

Python:


Strengths:

Advanced text analysis capabilities, comprehensive NLP tools, and flexibility in data manipulation.

Limitations:

Requires significant coding expertise and computational resources.


Power BI:


Strengths:

User-friendly interface, robust geographic visualization, and interactive analysis features.

Limitations:

Limited advanced text analysis capabilities compared to Python.

Role and Contributions:

As a Data Analyst Intern and team leader for this project, I:

1. Conducted a detailed KT Analysis to compare tools and select the most suitable ones for our analysis.

2. Introduced Python and Power BI to enhance the analytical approach, leveraging their strengths for comprehensive insights.

3. Created comprehensive documentation, including step-by-step guides and video tutorials, to ensure my teammates could effectively replicate the analysis.

4. Provided training and support to teammates, ensuring they could use Python, Power BI, and Tableau efficiently.

5. Emphasized data cleaning and collection strategies to improve analysis accuracy and reliability.

Conclusion:

This project successfully combined Python and Power BI to analyze qualitative data on disaster experiences and community resilience. By leveraging the strengths of both tools, we gained comprehensive insights and provided actionable findings to stakeholders. This contributed to informed policy-making, improved practices, and enhanced community engagement efforts, ultimately supporting disaster management and community resilience initiatives.

Data Scientist

During my tenure as a Data Scientist at Clemson Engineers for Developing Communities, I worked on the DSR3P Fund Navigator project. This project involved developing and implementing advanced machine learning models to automate the categorization of federal grants, significantly improving the grant application process for low-capacity communities and successfully classifying over 70 grants for future automation.


Overview of DSR3P Fund Navigator Project

Federal grants provide financial assistance for public services and support the economy. However, the grant application process can be difficult due to eligibility requirements, fund-matching requirements, and the specific technical needs of the application. Although many of these grants target the most vulnerable communities, higher-capacity communities often out-compete low-capacity communities for the grants because they can submit more competitive applications, typically with dedicated staff for grant writing and technical support. The application process can take significant time and effort, leading low-capacity communities, which are already overtaxed, not to apply or to miss requirements. This, in turn, widens the resource gap between communities. The DSR3P team aims to reduce this gap by developing a tool that matches community needs with grants they may be eligible for.


Key Contributions and Achievements:


Data Preprocessing and Cleaning:

1. Built and deployed a robust data preprocessing pipeline using Microsoft Azure cloud services.

2. Reduced data inconsistencies by 95% through detailed documentation and automated validation scripts.

Natural Language Processing (NLP) Techniques:

1. Utilized NLP techniques for text analysis, including tokenization, removal of stop words, and transformation of text data into numerical representations.

2. Generated word clouds and performed topic modeling using Latent Dirichlet Allocation (LDA) to uncover underlying themes in grant descriptions.
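
As an illustration of the word-cloud step in item 2, the snippet below uses the wordcloud package over a placeholder string standing in for the real grant descriptions.

```python
# Generating a word cloud from grant descriptions (placeholder text).
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

grant_text = (
    "infrastructure resilience funding rural water systems "
    "community development block grant hazard mitigation"
)
wc = WordCloud(width=800, height=400, background_color="white",
               stopwords=STOPWORDS).generate(grant_text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.savefig("grant_wordcloud.png", dpi=150, bbox_inches="tight")
```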

Machine Learning Model Development:

1. Trained and evaluated various machine learning models, including Logistic Regression, Random Forest, Support Vector Machines (SVM), Naive Bayes, and Gradient Boosting Machines (XGBoost).

2. Achieved high accuracy, precision, recall, and F1-scores, with Logistic Regression and Multinomial Naive Bayes performing particularly well.

Feature Engineering and Model Refinement:

1. Enhanced feature engineering by incorporating bi-grams and tri-grams.

2. Addressed class imbalance using the Synthetic Minority Over-sampling Technique (SMOTE); a combined sketch with the n-gram features follows this list.

3. Conducted thorough error analysis and iterative refinement to improve model performance.
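
Below is a minimal sketch combining the n-gram features and SMOTE balancing mentioned above with a Logistic Regression baseline; the grant texts and labels are synthetic stand-ins, not project data.

```python
# TF-IDF with uni/bi/tri-grams, SMOTE oversampling on the training split only,
# and a Logistic Regression baseline. The corpus below is synthetic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

samples = {
    "infrastructure": ("rural water infrastructure grant", 40),
    "mitigation":     ("wildfire hazard mitigation funding", 30),
    "housing":        ("community housing rehabilitation", 20),
    "recovery":       ("flood recovery assistance program", 10),
}
texts, labels = [], []
for label, (text, count) in samples.items():
    texts += [text] * count
    labels += [label] * count

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=1)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Oversample minority classes on the training set only.
X_bal, y_bal = SMOTE(random_state=42, k_neighbors=3).fit_resample(X_train_vec, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print(classification_report(y_test, clf.predict(X_test_vec)))
```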

Interactive Dashboards and Data Insights:

1. Developed interactive dashboards using Tableau to facilitate strategic data insights.

2. Provided weekly insights that improved decision-making by 30% and utilized SQL databases for efficient data management and client transparency.


Grant Classification:

Successfully classified more than 70 grants for future automation, paving the way for streamlined grant identification processes.


Future Work and Deployment:

Focused on developing an intuitive user interface and integrating the model into operational systems.

Ensured thorough testing and validation in real-world scenarios, along with user training and support.

Established a feedback mechanism to gather user insights for iterative improvements.


Conclusion:

The DSR3P Fund Navigator project showcased my ability to apply advanced data science and machine learning techniques to real-world problems. By automating the grant categorization process, I contributed to making grant identification more accessible and efficient, particularly for low-capacity communities. This project demonstrated my skills in data preprocessing, NLP, machine learning model development, and the creation of interactive data visualization tools, ultimately supporting equitable resource distribution and community development.

Nice Hi-Tech Centre

August 2021 - July 2022


Data Scientist

During my tenure as a Data Scientist at Nice Hi-Tech Centre, a small startup, I developed a robust data infrastructure and a predictive model for detecting potential fraudsters in the Enron scandal using advanced data analysis and machine learning techniques. The project aimed to enhance our fraud detection mechanisms and ensure organizational security. I was responsible for designing and maintaining data pipelines, managing databases, and integrating various data sources, all of which were critical to the success of the machine learning models. The AdaBoostClassifier achieved the highest performance, underscoring the importance of a solid data engineering and data science foundation in developing effective fraud detection systems. This initiative was crucial for advancing the organization's data capabilities and establishing robust security practices.

Enron Fraud Detection Project


Overview:

The Enron Fraud Detection project set out to identify Enron employees who might have committed fraud, based on public financial and email datasets. The goal was to build a predictive model that detects potential fraudsters using advanced data analysis and machine learning techniques. The project was also instrumental in developing our organization's fraud detection mechanisms to prevent similar occurrences.


Machine Learning Pipeline:

The project involved a comprehensive machine learning pipeline to process the data, engineer features, and train predictive models.


Key Steps and Findings:

Defining Problem Statement:

The goal was to identify individuals involved in the Enron fraud scandal using financial and email data.

Obtaining Data:

Public datasets from the Enron scandal, including financial and email data, were used.

Data Exploration:

Explored the dataset to understand its structure, distribution, and key features. Identified missing values and their distribution across features.

Data Preprocessing:

Removed outliers and irrelevant features. Engineered new features, such as the ratio of emails sent to/from persons of interest. Selected important features using the SelectKBest module from sklearn.
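
To make the feature-engineering and feature-selection steps concrete, the sketch below shows the POI e-mail ratio and SelectKBest on a tiny made-up table; the column names follow the public Enron dataset, but the demo data and helper functions are illustrative.

```python
# Illustrative feature engineering (POI e-mail ratio) and SelectKBest selection.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

def add_poi_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Ratio of e-mails sent to persons of interest vs. received from them."""
    df = df.copy()
    df["ratio"] = (df["from_this_person_to_poi"] /
                   df["from_poi_to_this_person"].replace(0, np.nan)).fillna(0)
    return df

def select_features(df: pd.DataFrame, k: int = 10) -> list:
    """Return the names of the k strongest features by ANOVA F-score."""
    X = df.drop(columns=["poi"]).fillna(0)
    y = df["poi"]
    selector = SelectKBest(score_func=f_classif, k=min(k, X.shape[1])).fit(X, y)
    return list(X.columns[selector.get_support()])

if __name__ == "__main__":
    demo = pd.DataFrame({  # made-up rows, not real Enron records
        "salary": [200000, 150000, 900000, 80000, 1200000, 95000],
        "bonus": [100000, 50000, 5000000, 20000, 7000000, 30000],
        "from_this_person_to_poi": [10, 2, 60, 1, 45, 0],
        "from_poi_to_this_person": [5, 4, 30, 2, 0, 3],
        "poi": [0, 0, 1, 0, 1, 0],
    })
    print(select_features(add_poi_ratio(demo), k=3))
```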

Model Training:

Trained multiple classifiers, including DecisionTree, RandomForest, and AdaBoost.

Parameter Tuning:

Used GridSearchCV for hyperparameter tuning to optimize model performance.
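
A minimal sketch of this tuning step; the synthetic dataset and parameter grid are illustrative stand-ins for the engineered Enron features and the grid actually searched.

```python
# GridSearchCV over an AdaBoostClassifier, scored on F1 with stratified CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.85, 0.15], random_state=42)  # stand-in data

param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.1, 0.5, 1.0],
}
search = GridSearchCV(
    estimator=AdaBoostClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```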

Evaluation:

Evaluated models using accuracy, precision, recall, and F1 scores.


Key Findings:

Outlier Removal:

Removed a significant outlier, the 'TOTAL' row, which is a spreadsheet aggregation artifact rather than a real person.

Feature Engineering:

Created a new feature 'ratio' = 'from_this_person_to_poi' / 'from_poi_to_this_person'. Selected the top features using SelectKBest.

Model Performance:

DecisionTreeClassifier: Accuracy: 0.808, Precision: 0.298, Recall: 0.253
RandomForestClassifier: Accuracy: 0.846, Precision: 0.378, Recall: 0.124
AdaBoostClassifier: Accuracy: 0.812, Precision: 0.332, Recall: 0.311
AdaBoostClassifier outperformed other models with the highest average F1 Score of 0.32.


Role and Contributions:

As a Data Scientist at Nice Hi-Tech Centre, my contributions included:


Data Engineering:


Data Pipeline Development: Developed and maintained robust ETL (Extract, Transform, Load) pipelines using Microsoft Azure cloud services, which reduced data inconsistencies by 95%.


Database Management: Utilized SQL databases for efficient data management, ensuring data integrity and client transparency.


Big Data Processing: Employed big data technologies to efficiently handle and process large datasets.


Data Integration: Integrated disparate data sources to create a unified dataset for analysis.



Data Science and Machine Learning:


Model Development: Developed and evaluated advanced models using statistical methods, achieving an F1 score of 0.32.


Hyper-Parameter Tuning: Conducted comprehensive hyper-parameter tuning and cross-validation to ensure model accuracy and reliability.


Visualization: Developed interactive dashboards using Tableau, providing weekly insights that improved decision-making by 30%.



Fraud Detection Strategies:


Worked on developing strategies to ensure robust fraud detection mechanisms within Nice Hi-Tech Centre.


Implemented data analysis practices to prevent fraud and enhance organizational security.


Conclusion:

The Fraud Detection project successfully demonstrated the application of machine learning to identify potential fraudsters. The project achieved significant insights and high model performance by leveraging advanced data preprocessing, feature engineering, and model-tuning techniques. This work contributed to understanding and preventing corporate fraud and helped establish fraud detection strategies within Nice Hi-Tech Centre, ensuring the organization was well-protected against similar risks.

Aptech

July 2019 - October 2019


Machine Learning Intern

During my internship at Aptech, I developed a University Admission Prediction System using machine learning techniques to predict graduate program acceptance probabilities. The project involved data preprocessing, model training with various algorithms, and creating interactive visualizations, ultimately identifying the Multi-layer Perceptron model as the most effective. This initiative enhanced my skills in applying advanced data analysis and machine learning to real-world challenges in education.

Overview of University Admission Prediction System


Abstract

The project addresses the challenge of predicting university admissions, an important concern for prospective graduate students. Utilizing historical applicant data, the project aims to forecast acceptance probabilities by leveraging machine learning techniques. Various algorithms were tested, with the Multi-layer Perceptron model emerging as the most effective.


Introduction

The university admission process faces challenges, including inefficiencies and inaccuracies in predicting acceptance from student profiles. This project introduces a machine learning model that predicts admission likelihood, thereby streamlining the application process.


Literature Survey

Review of existing research and methods in predictive modeling for educational admissions.


System Requirements

Minimum hardware and software requirements included an Intel Core i3 processor, 4 GB of RAM, a 500 GB hard disk, the Python programming language, the PyCharm IDE, Flask, TensorFlow, and Keras.


System Design & Analysis

Detailed architecture and design of the system, including UML diagrams (Class, Data Flow, Sequence, Use Case, Activity Diagrams). Explanation of the software environment and the tools used.


Implementation

Detailed description of the problem being addressed. Explanation of the data preprocessing steps, train-test split, model training using various algorithms, and selection of the best-performing model.
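
A simplified sketch of that flow using scikit-learn's MLPClassifier on synthetic applicant data (the original stack listed TensorFlow/Keras); the feature names and decision rule are invented for illustration.

```python
# Train/test split, scaling, and a Multi-layer Perceptron on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 400
gre = rng.integers(290, 341, n)            # toy GRE scores
gpa = rng.uniform(6.5, 10.0, n)            # toy GPA on a 10-point scale
admitted = ((gre > 315) & (gpa > 8.5)).astype(int)  # invented decision rule

X = np.column_stack([gre, gpa])
X_train, X_test, y_train, y_test = train_test_split(
    X, admitted, test_size=0.2, random_state=42, stratify=admitted)

model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000,
                                    random_state=42))
model.fit(X_train, y_train)
print("test accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
```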


System Testing

Different types of tests performed to ensure the reliability and accuracy of the system, including Unit Testing, Integration Testing, Functional Testing, System Testing, White Box Testing, and Black Box Testing. Test cases and their outcomes are documented to validate the system's functionality.


Input and Output Design

Design considerations for user input and system output to ensure ease of use and clarity.


Results

Screenshots and descriptions of various interfaces and outputs of the system, including user registration, dataset upload, university prediction, and graphical representation of predictions.


Conclusion

Summary of the project's success in developing an accurate and efficient university admission prediction model. Discussion on potential future enhancements, such as incorporating more diverse data and improving model accuracy.


Key Achievements

- Data Preprocessing: Successfully cleaned and prepared the dataset for modeling, handling missing values and outliers effectively.

- Model Development: Implemented multiple machine learning algorithms (Random Forest, SVM, Logistic Regression, Multi-layer Perceptron) and identified the best-performing model.

- Feature Engineering: Created and selected relevant features to improve model accuracy.

- Visualization: Developed interactive visualizations to present prediction results effectively.

- Documentation and Testing: Provided thorough documentation of the system and conducted extensive testing to ensure reliability.


Future Work

- Enhancing the model with additional features such as recommendation letters and personal statements.

- Expanding the dataset to include more diverse student profiles and geographic locations.

- Integrating the model into a user-friendly web interface for broader accessibility.


Conclusion

This comprehensive project showcases my ability to apply machine learning techniques to solve real-world problems in the educational domain, demonstrating my proficiency in data preprocessing, model development, and system implementation.