CV

Welcome to Kaili Huang's CV. Click the PDF icon to view or download the full resume.

Basics

Name Kaili Huang
Label Applied Scientist
Email kaili@cs.stanford.edu
Url https://kailihuang.com
Summary An applied scientist and researcher with a strong background in machine learning and natural language processing.

Education

  • 2021.09 - 2023.06

    Stanford, California, US

    MS
    Stanford University
    Computer Science (GPA: 4.0/4.0)
  • 2016.08 - 2020.07

    Beijing, China

    BE
    Tsinghua University
    Industrial Engineering (CS GPA: 3.8/4.0)

Work

  • 2023.08 - Present
    Applied Scientist
    Microsoft
    Working on the Bing Ads team.
    • Large language models
    • Multi-modality
    • Visual-linguistic
    • Research
  • 2022.06 - 2022.09
    Data & Applied Scientist Intern
    Microsoft
    Worked on the Bing Ads team on multi-modal model pruning. Extended structured pruning methods for Transformer models to CLIP (Contrastive Language-Image Pre-training) for the first time, reducing model size significantly (-40%) with only a slight accuracy drop (-1%).
    • Multi-modality
    • Model pruning
    • Transformers
    • 40% model size reduction
  • 2020.07 - 2021.08
    Machine Learning Engineer
    ByteDance
    Worked on natural language processing (NLP) in the Language and Information Technology Group (LITG) at AI Lab, supervised by Dr. Hang Li. Built a fake news detection pipeline from scratch and developed models for detecting bot-written articles.
    • Natural language processing
    • Fake news detection
    • Bot-written article detection
    • AI Lab

Publications

  • 2025.04
    ColBERT-Serve: Efficient Multi-stage Memory-Mapped Scoring
    Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025
    We study serving retrieval models, particularly late interaction retrievers like ColBERT, to many concurrent users under a small budget, where the index may not fit in memory. We present ColBERT-serve, a serving system that applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and permitting deployment on cheap servers, and that incorporates a multi-stage architecture with hybrid scoring, reducing ColBERT's query latency and supporting many concurrent queries in parallel.
  • 2024.07
    Overview of the Ninth Dialog System Technology Challenge: DSTC9
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
    Overview paper for the Ninth Dialog System Technology Challenge (DSTC9), covering multi-domain task-oriented dialogue systems and cross-lingual dialogue state tracking.
  • 2021.02
    Multi-domain Task-oriented Dialog Challenge II at DSTC9
    AAAI-2021 Dialog System Technology Challenge 9 Workshop
    The paper provides an overview of the 'Multi-Domain Task Completion Dialog Challenge II' track at the 9th Dialog System Technology Challenge (DSTC9). Two tasks are introduced in this track: end-to-end multi-domain task completion and cross-lingual dialog state tracking.
  • 2020.10
    A Large-Scale Chinese Short-Text Conversation Dataset
    Natural Language Processing and Chinese Computing (NLPCC)
    We present LCCC, a large-scale cleaned Chinese conversation dataset, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs.
  • 2020.07
    KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)
    We propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs. Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0.
  • 2020.06
    CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset
    Transactions of the Association for Computational Linguistics (TACL)
    To advance multi-domain (cross-domain) dialogue modeling as well as alleviate the shortage of Chinese task-oriented datasets, we propose CrossWOZ, the first large-scale Chinese Cross-Domain Wizard-of-Oz task-oriented dataset. It contains 6K dialogue sessions and 102K utterances for 5 domains, including hotel, restaurant, attraction, metro, and taxi.

Skills

Computer Science
Machine Learning
Natural Language Processing
Information Retrieval
Deep Learning
Computer Vision
Multi-modal Models
Large Language Models

Languages

English
Fluent
Chinese
Native speaker