Keyword Suggestion for In-site Search Engine
Role: Project Owner of Winning Hackathon Project
Data Preparation: Built an easy-to-use data pipeline to extract training and evaluation data from PDF and Word documents, leveraging techniques such as OCR; further developed a dataset cleansing pipeline that filters out invalid data through steps such as de-duplication and typo correction
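A minimal sketch of the extraction and de-duplication steps, assuming pdf2image/pytesseract for OCR and python-docx for Word files (library choices and helper names are illustrative, not the project's actual code):

    # Assumed libraries: pdf2image, pytesseract, python-docx
    from pdf2image import convert_from_path
    import pytesseract
    import docx

    def extract_pdf_text(path):
        # Render each PDF page to an image, then OCR it
        pages = convert_from_path(path)
        return "\n".join(pytesseract.image_to_string(p) for p in pages)

    def extract_docx_text(path):
        return "\n".join(p.text for p in docx.Document(path).paragraphs)

    def dedupe(lines):
        # One of the cleansing steps: drop exact duplicates while preserving order
        seen = set()
        return [l for l in lines if not (l in seen or seen.add(l))]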
Exact-Match-based Method Development: Leveraged Elasticsearch full-text search (BM25/TF-IDF scoring) to index the extracted phrases from the large dataset, then built an auto-complete suggestion system on top of the Elasticsearch indexes
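A hypothetical sketch of the indexing and prefix-suggestion flow using the official elasticsearch Python client (8.x-style calls; index and field names are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

    def index_phrases(phrases):
        for i, phrase in enumerate(phrases):
            es.index(index="phrases", id=i, document={"text": phrase})

    def suggest(prefix, size=10):
        # BM25-scored prefix match as an exact-match auto-complete baseline
        resp = es.search(index="phrases", size=size,
                         query={"match_phrase_prefix": {"text": prefix}})
        return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]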
Deep Model based Method Development: Studied word- and sentence-embedding methods such as Word2Vec and Doc2Vec; fine-tuned a BERT-base model for sentence and phrase vectorization to embed the extracted phrases, then combined cosine similarity with hand-crafted features in a multi-layer perceptron network for keyword suggestion
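An illustrative sketch of the embedding-plus-MLP ranking idea; sentence-transformers stands in for the fine-tuned BERT encoder, and the hand-crafted features shown are placeholders:

    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.neural_network import MLPClassifier

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the fine-tuned BERT

    def features(query, candidate):
        q, c = encoder.encode([query, candidate])
        cos = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
        # Example hand-crafted features: length ratio and token overlap
        overlap = len(set(query.split()) & set(candidate.split()))
        return [cos, len(candidate) / max(len(query), 1), overlap]

    # Multi-layer perceptron over the combined feature vector
    ranker = MLPClassifier(hidden_layer_sizes=(32, 16))
    # ranker.fit(X_train, y_train)  # X_train: feature vectors, y_train: relevance labels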
Recommendation System of Government Opportunities
Role: Technical Leader
Problem Investigation: Investigated the in-site opportunity recommendation system through metrics such as user dwell time and click-through rate; analyzed award history as a signal for opportunity recommendation; gathered user feedback through an anonymous survey, which showed that the existing recommendations did not meet user expectations
Model Design and Development: Designed and developed a user-interest model capturing users' long-term interests, leveraging a TF-IDF model computed over each user's public award history; categorized opportunities into several topics; built a new recommendation system based on user interests and award history that pushes biddable opportunities to government contractors
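A sketch of the TF-IDF interest model; data shapes and field choices are assumptions (the user is represented by their concatenated award-history text):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def recommend(award_history_texts, opportunity_texts, top_k=10):
        vectorizer = TfidfVectorizer(stop_words="english")
        # Fit on the opportunity corpus; embed the user via their award history
        opp_matrix = vectorizer.fit_transform(opportunity_texts)
        user_vec = vectorizer.transform([" ".join(award_history_texts)])
        scores = cosine_similarity(user_vec, opp_matrix).ravel()
        return scores.argsort()[::-1][:top_k]  # indices of the top-k opportunities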
A/B Test: Deployed A/B tests and designed several metrics to measure the gain of the treatment over the baseline, which showed that the new recommendation system achieved higher user engagement and retention
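As one example of how an engagement-rate lift could be checked for significance, a two-proportion z-test via statsmodels (the counts and metric definition are illustrative, not the project's actual analysis):

    from statsmodels.stats.proportion import proportions_ztest

    def ab_significance(clicks_treat, users_treat, clicks_ctrl, users_ctrl):
        # Tests whether the treatment click-through rate exceeds the baseline
        stat, p_value = proportions_ztest(
            count=[clicks_treat, clicks_ctrl],
            nobs=[users_treat, users_ctrl],
            alternative="larger",
        )
        return stat, p_value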
Deep Learning-based Model: Designed and trained a BERT-based Siamese model as a proof of concept, which vectorizes users and opportunities into 100-dimensional vectors and recalls opportunities by cosine similarity between the user and opportunity embeddings
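A simplified sketch of such a Siamese encoder in PyTorch/transformers; the pooling strategy, projection head, and input texts are assumptions:

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    class SiameseEncoder(torch.nn.Module):
        def __init__(self, name="bert-base-uncased", dim=100):
            super().__init__()
            self.bert = AutoModel.from_pretrained(name)
            self.proj = torch.nn.Linear(self.bert.config.hidden_size, dim)  # 100-d output

        def forward(self, **tokens):
            out = self.bert(**tokens).last_hidden_state[:, 0]  # [CLS] pooling (assumed)
            return F.normalize(self.proj(out), dim=-1)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = SiameseEncoder()

    def embed(texts):
        tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            return model(**tokens)

    # Recall by cosine similarity between user and opportunity embeddings:
    # scores = embed(user_profile_texts) @ embed(opportunity_texts).T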
Document Information Extraction
Role: Technical Leader
Entity Extraction System: Built a named-entity extraction system based on Spark NLP to extract entities such as places, key personnel, and clearance requirements from documents
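A skeleton of a Spark NLP NER pipeline of this shape; the pretrained models named here are public stand-ins for the custom model trained on the annotated documents:

    import sparknlp
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
        .setInputCols(["document", "token"]).setOutputCol("embeddings")
    ner = NerDLModel.pretrained("ner_dl", "en") \
        .setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")
    entities = NerConverter().setInputCols(["document", "token", "ner"]).setOutputCol("entities")

    pipeline = Pipeline(stages=[document, token, embeddings, ner, entities])
    # model = pipeline.fit(docs_df); model.transform(docs_df).select("entities").show()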
Data Preparation: Designed and developed a pipeline to facilitate human annotation and gather more high-quality training data
Model Fine-tuning: Implemented an exhaustive search over combinations of data preprocessing methods and model hyperparameters
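A minimal sketch of such an exhaustive search; the option lists and the evaluate() callback are placeholders:

    from itertools import product

    preprocessing_options = ["lowercase", "lowercase+lemmatize", "raw"]  # assumed choices
    learning_rates = [1e-5, 3e-5, 5e-5]
    batch_sizes = [16, 32]

    def grid_search(evaluate):
        best = None
        for prep, lr, bs in product(preprocessing_options, learning_rates, batch_sizes):
            score = evaluate(prep=prep, learning_rate=lr, batch_size=bs)
            if best is None or score > best[0]:
                best = (score, {"prep": prep, "learning_rate": lr, "batch_size": bs})
        return best  # (best score, best configuration)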
Project Leadership: Held brainstorming sessions within the team to surface new ideas; worked with business stakeholders to pivot from a labor-intensive manual approach to an ML approach; piloted the initial POC implementation, which turned into a multi-million-dollar strategic investment in the next fiscal year
Probability of Win
Role: Technical Leader
Data Collection and Normalization: Designed a Spark-based data preprocessing framework on AWS to handle large volumes of unstructured data, such as public government transactions for a given vendor, and normalized the data into a standard format
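An illustrative PySpark normalization job of this kind; the S3 paths and column names are assumptions, not the project's actual schema:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("transaction-normalization").getOrCreate()

    raw = spark.read.json("s3://bucket/raw/transactions/")  # placeholder path
    normalized = (
        raw.withColumn("vendor_name", F.upper(F.trim(F.col("vendor_name"))))
           .withColumn("award_date", F.to_date("award_date", "yyyy-MM-dd"))
           .withColumn("amount", F.col("amount").cast("double"))
           .dropDuplicates(["transaction_id"])
    )
    normalized.write.mode("overwrite").parquet("s3://bucket/normalized/transactions/")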
Model Development: Adopted fuzzy clustering and hierarchical clustering algorithms to cluster the normalized data, then computed the probability of winning a bid from the project clusters and the vendor's past award history
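A simplified sketch of the clustering-plus-win-rate idea; only the hierarchical step is shown (via SciPy), and the probability here is a per-cluster historical win rate rather than the project's full model:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_projects(feature_matrix, n_clusters=20):
        # Ward hierarchical clustering over normalized project features
        Z = linkage(feature_matrix, method="ward")
        return fcluster(Z, t=n_clusters, criterion="maxclust")

    def win_probability(cluster_ids, vendor_won, target_cluster):
        # Fraction of past projects in this cluster that the vendor won
        mask = cluster_ids == target_cluster
        return float(np.mean(vendor_won[mask])) if mask.any() else 0.0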
Data Lake Project
Role: Chief Architect
Data Lake Development: Transformed a legacy Oracle PL/SQL-based data pipeline into a modern Spark-based data lake on AWS
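A sketch of the general migration pattern: read from the legacy Oracle source over JDBC and land the data as Parquet in an S3-backed lake. Connection details, table names, and the partition column are placeholders, and the Oracle JDBC driver must be available on the Spark classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oracle-to-datalake").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//legacy-host:1521/ORCLPDB")  # placeholder
          .option("dbtable", "SOURCE_SCHEMA.TRANSACTIONS")                # placeholder
          .option("user", "reader").option("password", "***")
          .load())

    # Land as partitioned Parquet in the data lake (partition column is assumed)
    df.write.mode("overwrite").partitionBy("FISCAL_YEAR").parquet("s3://datalake/transactions/")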
