Publications | Genta Indra Winata

Detailed publications can be found in my Google Scholar profile.

2026

CVPR

M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, and Genta Indra Winata

arXiv preprint arXiv:2512.05959, 2026

ICLR

mR3: Multilingual Rubric-Agnostic Reward Reasoning Models

David Anugraha, Shou-Yi Hung, Zilu Tang, Annie En-Shiun Lee, Derry Tanti Wijaya, and 1 more author

arXiv preprint arXiv:2510.01146, 2026

Nature

A benchmark of expert-level academic questions to assess AI capabilities

Center for AI Safety, Scale AI, and HLE Contributors Consortium

Nature, 2026

arXiv

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, and 6 more authors

arXiv preprint arXiv:2601.18026, 2026

arXiv

PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues

Mohammad Rifqi Farhansyah, Hanif Muhammad Zhafran, Farid Adilazuarda, Shamsuddeen Hassan Muhammad, Maryam Ibrahim Mukhtar, and 4 more authors

arXiv preprint arXiv:2601.17277, 2026

arXiv

Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection

Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, and 4 more authors

arXiv preprint arXiv:2601.09692, 2026

arXiv

Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?

Genta Indra Winata, David Anugraha, Patrick Amadeus Irawan, Anirban Das, Haneul Yoo, and 6 more authors

arXiv preprint arXiv:2601.07153, 2026

2025

AACL-IJCNLP

IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian

Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti, and Genta Indra Winata

In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 2025
arXiv

Vision Language Models are Confused Tourists

Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, and 1 more author

arXiv preprint arXiv:2511.17004, 2025

arXiv

Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs

Stefan Horoi, Sangwoo Cho, Supriyo Chakraborty, Shi-Xiong Zhang, Sambit Sahu, and 2 more authors

arXiv preprint arXiv:2511.10850, 2025

arXiv

Optimizing Reasoning Efficiency through Prompt Difficulty Prediction

Bo Zhao, Berkcan Kapusuzoglu, Kartik Balasubramaniam, Sambit Sahu, Supriyo Chakraborty, and 1 more author

arXiv preprint arXiv:2511.03808, 2025

WMT

SMOL: Professionally Translated Parallel Data for 115 Under-represented Languages

Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry, Geza Kovacs, and 6 more authors

In Proceedings of the Tenth Conference on Machine Translation, 2025
MRL

ENTROPY2VEC: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations

Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, and 3 more authors

In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), 2025
arXiv

SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages

Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi, Fajri Koto, and 4 more authors

arXiv preprint arXiv:2508.07069, 2025

ACL

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, and 6 more authors

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025
MRL

Language Surgery in Multilingual Large Language Models

Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, and 4 more authors

arXiv preprint arXiv:2506.12450, 2025

arXiv

Datasheets Aren’t Enough: DataRubrics for Automated Quality Metrics and Accountability

Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, and 6 more authors

arXiv preprint arXiv:2506.01789, 2025

NeurIPS

T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, and 4 more authors

arXiv preprint arXiv:2505.16986, 2025

arXiv

R3: Robust Rubric-Agnostic Reward Models

David Anugraha, Zilu Tang, Lester James V Miranda, Hanyang Zhao, Mohammad Rifqi Farhansyah, and 3 more authors

arXiv preprint arXiv:2505.13388, 2025

arXiv

Behind Maya: Building a Multilingual Vision Language Model

Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, and 6 more authors

arXiv preprint arXiv:2505.08910, 2025

arXiv

Crosslingual Reasoning through Test-Time Scaling

Zheng-Xin Yong, M Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, and 5 more authors

arXiv preprint arXiv:2505.05408, 2025

JAIR

Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D Yao, and 2 more authors

Journal of Artificial Intelligence Research, 2025
MRL

What Causes Knowledge Loss in Multilingual Language Models?

Maria Khelli, Samuel Cahyawijaya, Ayu Purwarianti, and Genta Indra Winata

arXiv preprint arXiv:2504.20356, 2025

NAACL Findings

ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models

David Anugraha, Genta Indra Winata, Chenyue Li, Patrick Amadeus Irawan, and En-Shiun Annie Lee

In Findings of the Association for Computational Linguistics: NAACL 2025, 2025
arXiv

Fine-Tuning Diffusion Generative Models via Rich Preference Optimization

Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, and 3 more authors

arXiv preprint arXiv:2503.11720, 2025

ACL

Do Language Models Understand Honorific Systems in Javanese?

Mohammad Rifqi Farhansyah, Iwan Darmawan, Adryan Kusumawardhana, Genta Indra Winata, Alham Fikri Aji, and 1 more author

arXiv preprint arXiv:2502.20864, 2025

arXiv

TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning

Frederikus Hudi, Genta Indra Winata, Ruochen Zhang, and Alham Fikri Aji

arXiv preprint arXiv:2502.18431, 2025

ICLR

MMTEB: Massive Multilingual Text Embedding Benchmark

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, and 6 more authors

In The Thirteenth International Conference on Learning Representations, 2025
arXiv

Humanity’s Last Exam

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, and 6 more authors

arXiv preprint arXiv:2501.14249, 2025

COLING

Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models

Patrick Amadeus Irawan, Genta Indra Winata, Samuel Cahyawijaya, and Ayu Purwarianti

In Proceedings of the 31st International Conference on Computational Linguistics, 2025

2024

arXiv

Maya: An Instruction Finetuned Multilingual Multimodal Model

Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, and 6 more authors

arXiv preprint arXiv:2412.07112, 2024

arXiv

A Multi-Agent Dual Dialogue System to Support Mental Health Care Providers

Onno P Kampman, Ye Sheng Phang, Stanley Han, Michael Xing, Xinyi Hong, and 6 more authors

arXiv preprint arXiv:2411.18429, 2024

arXiv

An AI-Assisted Multi-Agent Dual Dialogue System to Support Mental Health Care Providers

Onno P Kampman, Ye Sheng Phang, Stanley Han, Michael Xing, Xinyi Hong, and 6 more authors

arXiv preprint arXiv:2411.18429, 2024

EMNLP

Academics Can Contribute to Domain-Specialized Language Models

Mark Dredze, Genta Indra Winata, Prabhanjan Kambadur, Shijie Wu, Ozan İrsoy, and 4 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
EMNLP

Re-Evaluating Evaluation for Multilingual Summarization

Jessica Forde, Ruochen Zhang, Lintang Sutawika, Alham Aji, Samuel Cahyawijaya, and 5 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
WMT

MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration

David Anugraha, Garry Kuwanto, Lucky Susanto, Derry Tanti Wijaya, and Genta Winata

In Proceedings of the Ninth Conference on Machine Translation, Nov 2024

arXiv

Linguistics Theory Meets LLM: Code-Switched Text Generation via Equivalence Constrained Large Language Models

Garry Kuwanto, Chaitanya Agarwal, Genta Indra Winata, and Derry Tanti Wijaya

arXiv preprint arXiv:2410.22660, Nov 2024
arXiv

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, and 6 more authors

arXiv preprint arXiv:2410.12705, Nov 2024
arXiv

RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization

Hanyang Zhao, Genta Indra Winata, Anirban Das, Shi-Xiong Zhang, David D Yao, and 2 more authors

arXiv preprint arXiv:2410.04203, Nov 2024
arXiv

MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences

Genta Indra Winata, David Anugraha, Lucky Susanto, Garry Kuwanto, and Derry Tanti Wijaya

arXiv preprint arXiv:2410.02381, Nov 2024

COLING

Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models

Patrick Amadeus Irawan, Genta Indra Winata, Samuel Cahyawijaya, and Ayu Purwarianti

arXiv preprint arXiv:2409.14785, Nov 2024

arXiv

Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D Yao, and 2 more authors

arXiv preprint arXiv:2409.11564, Nov 2024

EMNLP

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V Miranda, Jennifer Santoso, and 6 more authors

arXiv preprint arXiv:2406.10118, Nov 2024

arXiv

ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models

David Anugraha, Genta Indra Winata, Chenyue Li, Patrick Amadeus Irawan, and En-Shiun Annie Lee

arXiv preprint arXiv:2406.09334, Nov 2024

EMNLP Findings

MINERS: Multilingual Language Models as Semantic Retrievers

Genta Indra Winata, Ruochen Zhang, and David Ifeoluwa Adelani

arXiv preprint arXiv:2406.07424, Nov 2024

arXiv

Lessons from the Trenches on Reproducible Evaluation of Language Models

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, and 6 more authors

arXiv preprint arXiv:2405.14782, Nov 2024

ACL Findings

SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 14 Languages

Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, and 6 more authors

arXiv preprint arXiv:2402.08638, Nov 2024

ACL

Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, and 6 more authors

arXiv preprint arXiv:2404.06138, Nov 2024

EMNLP Findings

LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization

Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Alham Fikri Aji, Genta Indra Winata, and Ayu Purwarianti

arXiv preprint arXiv:2401.06034, Nov 2024

2023

arXiv

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, and 6 more authors

arXiv preprint arXiv:2211.05100, Nov 2023

SEALP

IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

Muhammad Kautsar, Rahmah Nurdini, Samuel Cahyawijaya, Genta Winata, and Ayu Purwarianti

In Proceedings of the First Workshop in South East Asian Language Processing, Nov 2023

Awarded

Best Paper
AACL

Efficient Zero-Shot Cross-lingual Inference via Retrieval

Genta Winata, Lingjue Xie, Karthik Radhakrishnan, Yifan Gao, and Daniel Preoţiuc-Pietro

In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), Nov 2023
AACL

NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, and 6 more authors

In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Nov 2023

Awarded

Best Resource Paper
Machine Learning

Transfer Learning Application of Self-supervised Learning in ARPES

Sandy Adhitia Ekahana, Genta Indra Winata, Anna Tamai, Radovic Milan, Gabriel Aeppli, and 1 more author

Machine Learning: Science and Technology, Nov 2023
arXiv

Multilingual Few-Shot Learning via Language Model Retrieval

Genta Indra Winata, Liang-Kang Huang, Soumya Vadlamannati, and Yash Chandarana

arXiv preprint arXiv:2306.10964, Nov 2023
EMNLP

GlobalBench: A Benchmark for Global Progress in Natural Language Processing

Yueqi Song, Simran Khanuja, Pengfei Liu, Fahim Faisal, Alissa Ostapenko, and 6 more authors

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Nov 2023
EMNLP

Multilingual Large Language Models Are Not (Yet) Code-Switchers

Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata, and Alham Aji

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Nov 2023
CALCS

Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Zheng Xin Yong, Ruochen Zhang, Jessica Forde, Skyler Wang, Arjun Subramonian, and 6 more authors

In Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching, Nov 2023
ACL Findings

NusaCrowd: Open source initiative for Indonesian NLP resources

Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, and 6 more authors

In Findings of the Association for Computational Linguistics: ACL 2023, Nov 2023
ACL Findings

Multi-lingual and Multi-cultural Figurative Language Understanding

Anubha Kabra, Emmy Liu, Simran Khanuja, Alham Fikri Aji, Genta Winata, and 4 more authors

In Findings of the Association for Computational Linguistics: ACL 2023, Nov 2023
ACL Findings

Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning

Genta Winata, Lingjue Xie, Karthik Radhakrishnan, Shijie Wu, Xisen Jin, and 3 more authors

In Findings of the Association for Computational Linguistics: ACL 2023, Nov 2023
ACL

On “Scientific Debt” in NLP: A Case for More Rigour in Language Model Pre-Training Research

Made Nindyatama Nityasya, Haryo Wibowo, Alham Fikri Aji, Genta Winata, Radityo Eko Prasojo, and 2 more authors

In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Nov 2023
ICAICTA

Implementing Quantization to Indonesian BERT Language Model

Muhammad Ayyub Abdurrahman, Samuel Cahyawijaya, Genta Indra Winata, and Ayu Purwarianti

In 2023 10th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA), Nov 2023
EACL

Towards a Unified Multi-Domain Multilingual Named Entity Recognition Model

Mayank Kulkarni, Daniel Preoţiuc-Pietro, Karthik Radhakrishnan, Genta Indra Winata, Shijie Wu, and 2 more authors

In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Nov 2023
EACL

NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

Genta Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, and 5 more authors

In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Nov 2023

Awarded

Outstanding Paper Award
ACL Findings

The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges

Genta Winata, Alham Fikri Aji, Zheng Xin Yong, and Thamar Solorio

In Findings of the Association for Computational Linguistics: ACL 2023, Nov 2023

2022

ACL

BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting

Zheng-Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, and 6 more authors

arXiv preprint arXiv:2212.09535, Nov 2022

AACL

Cross-lingual Few-Shot Learning on Unseen Languages

Genta Winata, Shijie Wu, Mayank Kulkarni, Thamar Solorio, and Daniel Preoţiuc-Pietro

In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, Nov 2022
SumEval

IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Pascale Fung, and Ayu Purwarianti

In Proceedings of the First Workshop on Scaling Up Multilingual Evaluation, Nov 2022
arXiv

Transfer Learning Application of Self-supervised Learning in ARPES

Sandy Adhitia Ekahana, Genta Indra Winata, Gabriel Aeppli, Radovic Milan, and Ming Shi

arXiv preprint arXiv:2208.10893, Nov 2022

arXiv

NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata, Bryan Wilie, and 6 more authors

arXiv preprint arXiv:2207.10524, Nov 2022

EMNLP Demo

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, and 6 more authors

arXiv preprint arXiv:2206.11249, Nov 2022

Accepted at TMLR

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, and 438 more authors

Nov 2022

DialDoc

Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters

Yan Xu, Etsuko Ishii, Zihan Liu, Genta Indra Winata, Dan Su, and 2 more authors

Accepted at DialDoc, Nov 2022

Awarded

Best Student Paper at DialDoc 2022
ACL

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, and 6 more authors

Accepted at ACL, Nov 2022

LREC

CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

Wenliang Dai, Samuel Cahyawijaya, Tiezheng Yu, Elham J Barezi, Peng Xu, and 6 more authors

Accepted at LREC, Nov 2022

LREC

ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation

Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata, Peng Xu, Xu Yan, and 6 more authors

Accepted at LREC, Nov 2022

2021

arXiv

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Kaustubh D Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, and 6 more authors

arXiv preprint arXiv:2112.02721, Nov 2021

arXiv

Few-Shot Bot: Prompt-Based Learning for Dialogue Systems

Andrea Madotto, Zhaojiang Lin, Genta Indra Winata, and Pascale Fung

arXiv preprint arXiv:2110.08118, Nov 2021

ICAICTA

A Comparative Study on Language Models for Task-Oriented Dialogue Systems

Vinsen Marselino Andreas, Genta Indra Winata, and Ayu Purwarianti

In 2021 8th International Conference on Advanced Informatics: Concepts, Theory and Applications (ICAICTA), Nov 2021

MRL

Language Models are Few-shot Multilingual Learners

Genta Indra Winata, Andrea Madotto, Zhaojiang Lin, Rosanne Liu, Jason Yosinski, and 1 more author

arXiv preprint arXiv:2109.07684, Nov 2021

arXiv

Greenformer: Factorization Toolkit for Efficient Deep Neural Networks

Samuel Cahyawijaya, Genta Indra Winata, Holy Lovenia, Bryan Wilie, Wenliang Dai, and 2 more authors

arXiv preprint arXiv:2109.06762, Nov 2021
RepL4NLP

Preserving Cross-Linguality of Pre-trained Models via Continual Learning

Zihan Liu, Genta Indra Winata, Andrea Madotto, and Pascale Fung

In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), Aug 2021

Abstract

Recently, fine-tuning pre-trained language models (e.g., multilingual BERT) to downstream cross-lingual tasks has shown promising results. However, the fine-tuning process inevitably changes the parameters of the pre-trained model and weakens its cross-lingual ability, which leads to sub-optimal performance. To alleviate this problem, we leverage continual learning to preserve the original cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks. The experimental result shows that our fine-tuning methods can better preserve the cross-lingual ability of the pre-trained model in a sentence retrieval task. Our methods also achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
DialDoc21

CAiRE in DialDoc21: Data Augmentation for Information-Seeking Dialogue System

Etsuko Ishii, Yan Xu, Genta Indra Winata, Zhaojiang Lin, Andrea Madotto, and 3 more authors

DialDoc21, Aug 2021

Awarded

Third Place in the Shared Task
NeurIPS

BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue Modeling

Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, Peng Xu, Feijun Jiang, and 3 more authors

arXiv preprint arXiv:2106.02787, Aug 2021

Interspeech

Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition

Genta Indra Winata, Guangsen Wang, Caiming Xiong, and Steven Hoi

INTERSPEECH, Aug 2021

SIGDIAL

ERICA: An Empathetic Android Companion for Covid-19 Quarantine

Etsuko Ishii, Genta Indra Winata, Samuel Cahyawijaya, Divesh Lala, Tatsuya Kawahara, and 1 more author

SIGDIAL, Aug 2021

RepL4NLP

Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning

Zihan Liu, Genta Indra Winata, Andrea Madotto, and Pascale Fung

RepL4NLP, Aug 2021

RepL4NLP

X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing

Zihan Liu, Genta Indra Winata, Peng Xu, and Pascale Fung

RepL4NLP, Aug 2021

ACL-IJCNLP Findings

Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation

Zihan Liu, Genta Indra Winata, and Pascale Fung

ACL-IJCNLP Findings, Aug 2021

CALCS

Are Multilingual Models Effective in Code-Switching?

Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, and 1 more author

In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, Aug 2021

AAAI

On the Importance of Word Order Information in Cross-lingual Sequence Labeling

Zihan Liu, Genta Indra Winata, Samuel Cahyawijaya, Andrea Madotto, Zhaojiang Lin, and 1 more author

In Proceedings of the AAAI Conference on Artificial Intelligence, Aug 2021

EMNLP

IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

Samuel Cahyawijaya, Genta Indra Winata, Bryan Wilie, Karissa Vincentio, Xiaohong Li, and 6 more authors

In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Aug 2021

arXiv

Nora: The Well-Being Coach

Genta Indra Winata, Holy Lovenia, Etsuko Ishii, Farhad Bin Siddique, Yongsheng Yang, and 1 more author

arXiv preprint arXiv:2106.00410, Aug 2021

2020

EMNLP

Cross-lingual Spoken Language Understanding with Regularized Representation Alignment

Zihan Liu, Genta Indra Winata, Peng Xu, Zhaojiang Lin, and Pascale Fung

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Aug 2020

EMNLP

MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems

Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, and Pascale Fung

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Aug 2020

EMNLP-Findings

Learning Knowledge Bases with Parameters for Task-Oriented Dialogue Systems

Andrea Madotto, Samuel Cahyawijaya, Genta Indra Winata, Yan Xu, Zihan Liu, and 2 more authors

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, Aug 2020

AACL-IJCNLP

IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding

Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, and 6 more authors

In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Aug 2020

arXiv

EmoGraph: Capturing Emotion Correlations using Graph Networks

Peng Xu, Zihan Liu, Genta Indra Winata, Zhaojiang Lin, and Pascale Fung

arXiv preprint arXiv:2008.09378, Aug 2020

ACL

Meta-Transfer Learning for Code-Switched Speech Recognition

Genta Indra Winata, Samuel Cahyawijaya, Zhaojiang Lin, Zihan Liu, Peng Xu, and 1 more author

In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Aug 2020

ACL

Coach: A Coarse-to-Fine Approach for Cross-domain Slot Filling

Zihan Liu, Genta Indra Winata, Peng Xu, and Pascale Fung

In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Aug 2020

arXiv

Variational Transformers for Diverse Response Generation

Zhaojiang Lin, Genta Indra Winata, Peng Xu, Zihan Liu, and Pascale Fung

arXiv preprint arXiv:2003.12738, Aug 2020

ConvAI

XPersona: Evaluating Multilingual Personalized Chatbot

Zhaojiang Lin, Zihan Liu, Genta Indra Winata, Samuel Cahyawijaya, Andrea Madotto, and 3 more authors

arXiv preprint arXiv:2003.07568, Aug 2020

Awarded

Honorable Mention. Nominated as Best Paper
Interspeech

Learning Fast Adaptation on Cross-Accented Speech Recognition

Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, and 2 more authors

Proc. Interspeech, Aug 2020

RepL4NLP

Zero-Resource Cross-Domain Named Entity Recognition

Zihan Liu, Genta Indra Winata, and Pascale Fung

In Proceedings of the 5th Workshop on Representation Learning for NLP, Aug 2020

AAAI

Attention-Informed Mixed-Language Training for Zero-Shot Cross-Lingual Task-Oriented Dialogue Systems

Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung

In Proceedings of the AAAI Conference on Artificial Intelligence, Aug 2020

ICASSP

Lightweight and Efficient End-to-End Speech Recognition Using Low-Rank Transformer

Genta Indra Winata, Samuel Cahyawijaya, Zhaojiang Lin, Zihan Liu, and Pascale Fung

In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Aug 2020

AAAI

CAiRE: An End-to-End Empathetic Chatbot

Zhaojiang Lin, Peng Xu, Genta Indra Winata, Farhad Bin Siddique, Zihan Liu, and 2 more authors

In Proceedings of the AAAI Conference on Artificial Intelligence, Aug 2020

2019

EMNLP-IJCNLP

Zero-shot Cross-lingual Dialogue Systems with Transferable Latent Variables

Zihan Liu, Jamin Shin, Yan Xu, Genta Indra Winata, Peng Xu, and 2 more authors

In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Aug 2019

EMNLP-IJCNLP

Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition

Genta Indra Winata, Zhaojiang Lin, Jamin Shin, Zihan Liu, and Pascale Fung

In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Aug 2019

MRQA

Generalizing Question Answering System with Pre-trained Language Model Fine-tuning

Dan Su, Yan Xu, Genta Indra Winata, Peng Xu, Hyeondey Kim, and 2 more authors

In EMNLP 2019 MRQA Workshop, Aug 2019

CoNLL

Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences

Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung

In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Aug 2019

PACLIC

On the Effectiveness of Low-Rank Matrix Factorization for LSTM Model Compression

Genta Indra Winata, Andrea Madotto, Jamin Shin, Elham J Barezi, and Pascale Fung

In Proceedings of the 33rd Pacific Asia Conference on Language, Information and Computation, Aug 2019

WMT

Incorporating Word and Subword Units in Unsupervised Machine Translation Using Language Model Rescoring

Zihan Liu, Yan Xu, Genta Indra Winata, and Pascale Fung

In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Aug 2019

RepL4NLP

Learning Multilingual Meta-Embeddings for Code-Switching Named Entity Recognition

Genta Indra Winata, Zhaojiang Lin, and Pascale Fung

In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Aug 2019

Awarded

Best Paper Award
SemEval

CAiRE_HKUST at SemEval-2019 Task 3: Hierarchical Attention for Dialogue Emotion Classification

Genta Indra Winata, Andrea Madotto, Zhaojiang Lin, Jamin Shin, Yan Xu, and 2 more authors

In Proceedings of the 13th International Workshop on Semantic Evaluation, Aug 2019

Awarded

Fourth Place in the Shared Task (out of 160 submissions)
ICASSP

Learning Comment Generation by Leveraging User-Generated Data

Zhaojiang Lin, Genta Indra Winata, and Pascale Fung

In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Aug 2019

FinNLP

Learning to Learn Sales Prediction with Social Media Sentiment

Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, Zihan Liu, Yan Xu, and 2 more authors

In Proceedings of the First Workshop on Financial Technology and Natural Language Processing, Aug 2019

2018

arXiv

Towards End-to-End Automatic Code-Switching Speech Recognition

Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung

arXiv preprint arXiv:1810.12620, Aug 2018

arXiv

Learn to Code-Switch: Data Augmentation using Copy Mechanism on Language Modeling

Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung

arXiv preprint arXiv:1810.10254, Aug 2018

CALCS

Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning

Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung

In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Aug 2018

ICASSP

End-to-End Dynamic Query Memory Network for Entity-Value Independent Task-oriented Dialog

Chien-Sheng Wu, Andrea Madotto, Genta Indra Winata, and Pascale Fung

In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Aug 2018

ICASSP

Attention-Based LSTM for Psychological Stress Detection from Spoken Language Using Distant Supervision

Genta Indra Winata, Onno Pepijn Kampman, and Pascale Fung

In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Aug 2018

CALCS

Bilingual Character Representation for Efficiently Addressing Out-of-Vocabulary Words in Code-Switching Named Entity Recognition

Genta Indra Winata, Chien-Sheng Wu, Andrea Madotto, and Pascale Fung

In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Aug 2018

Awarded

Second Place in English-Spanish Shared Task

2017

DSTC6

End-to-End Recurrent Entity Network for Entity-Value Independent Goal-Oriented Dialog Learning

Chien-Sheng Wu, Andrea Madotto, Genta Winata, and Pascale Fung

In Dialog System Technology Challenges Workshop, DSTC6, Aug 2017

Interspeech

Nora the Empathetic Psychologist

Genta Indra Winata, Onno Kampman, Yang Yang, Anik Dey, and Pascale Fung

Proc. Interspeech 2017, Aug 2017

2015

ICEEI

Handling Imbalanced Dataset in Multi-Label Text Categorization using Bagging and Adaptive Boosting

Genta Indra Winata and Masayu Leylia Khodra

In 2015 International Conference on Electrical Engineering and Informatics (ICEEI), Aug 2015

Awarded

Best Student Paper