CV
Welcome to Kaili Huang's CV. Click the PDF icon to view or download the full resume.
Basics
Name: Kaili Huang
Label: Applied Scientist
Email: kaili@cs.stanford.edu
Url: https://kailihuang.com
Summary: An applied scientist and researcher with a strong background in machine learning and natural language processing.
Education
Work
- 2023.08 - Present Applied Scientist
Microsoft
Working on the Bing Ads team.
- Large language models
- Multi-modality
- Visual-linguistic
- Research
- 2022.06 - 2022.09 Data & Applied Scientist Intern
Microsoft
Worked on the Bing Ads team on multi-modality model pruning. Extended structured pruning methods for Transformer models to CLIP (Contrastive Language-Image Pre-training) for the first time, reducing model size significantly (-40%) with only a slight accuracy drop (-1%).
- Multi-modality
- Model pruning
- Transformers
- 40% model size reduction
- 2020.07 - 2021.08 Machine Learning Engineer
ByteDance
Worked on natural language processing (NLP) in the Language and Information Technology Group (LITG), AI Lab, supervised by Dr. Hang Li. Built a fake-news detection pipeline from scratch, as well as models to detect bot-written articles.
- Natural language processing
- Fake news detection
- Bot-written articles detection
- AI Lab
Publications
- 2025.04 ColBERT-Serve: Efficient Multi-stage Memory-Mapped Scoring
Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025
We study serving retrieval models, particularly late-interaction retrievers such as ColBERT, to many concurrent users under a small budget, where the index may not fit in memory. We present ColBERT-serve, a serving system that applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and enabling deployment on inexpensive servers, and that incorporates a multi-stage architecture with hybrid scoring, reducing ColBERT's query latency and supporting many concurrent queries in parallel.
- 2024.07 Overview of the Ninth Dialog System Technology Challenge: DSTC9
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Overview paper for the Ninth Dialog System Technology Challenge (DSTC9), covering multi-domain task-oriented dialogue systems and cross-lingual dialogue state tracking.
- 2021.02 Multi-domain Task-oriented Dialog Challenge II at DSTC9
AAAI-2021 Dialog System Technology Challenge 9 Workshop
This paper provides an overview of the Multi-Domain Task Completion Dialog Challenge II track at the Ninth Dialog System Technology Challenge (DSTC9). The track introduces two tasks: end-to-end multi-domain task completion and cross-lingual dialog state tracking.
- 2020.10 A Large-Scale Chinese Short-Text Conversation Dataset
Natural Language Processing and Chinese Computing (NLPCC)
We present LCCC, a large-scale cleaned Chinese conversation dataset with a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of the dataset is ensured by a rigorous cleaning pipeline built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs.
- 2020.07 KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)
We propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs. Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0.
- 2020.06 CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset
Transactions of the Association for Computational Linguistics (TACL)
To advance multi-domain (cross-domain) dialogue modeling and alleviate the shortage of Chinese task-oriented datasets, we propose CrossWOZ, the first large-scale Chinese cross-domain Wizard-of-Oz task-oriented dataset. It contains 6K dialogue sessions and 102K utterances across five domains: hotel, restaurant, attraction, metro, and taxi.
Awards
- 2020
NLPCC Best Student Paper
Natural Language Processing and Chinese Computing (NLPCC)
NLPCC is a leading international conference specialized in the fields of Natural Language Processing (NLP) and Chinese Computing (CC).
- 2019
Stanford UGVR Scholar
Stanford University, Tsinghua University
Up to 18 students from China are admitted each year.
- 2014
1st Prize in National Olympiad in Informatics in Provinces
National Olympiad in Informatics
Skills
- Computer Science
- Machine Learning
- Natural Language Processing
- Information Retrieval
- Deep Learning
- Computer Vision
- Multi-modal Models
- Large Language Models
Languages
- English: Fluent
- Chinese: Native speaker