Are there pre-trained AI models for the Kashmiri language?

Yes. 13+ models support Kashmiri, including IndicTrans2 (AI4Bharat, state-of-the-art MT), NLLB-200 (Meta AI, 200 languages), Kashmiri BERT (kashmiri-llm-bert-base by Omarrran), IndicConformer ASR, mBART-50, IndicBARTSS, BLOOM-560M, and XLS-R Wav2Vec2 for speech. Most are available on Hugging Face.

Where can I find Kashmiri NLP research papers?

KashmirAI Research curates 14+ research papers on Kashmiri NLP, drawn from peer-reviewed venues including Scientific Reports (Nature), IEEE, ACL, and INDIACom, along with arXiv preprints and university publications. Topics cover neural machine translation, text classification, OCR, POS tagging, and morphological analysis.

Is Kashmiri a low-resource language for NLP?

Yes. Kashmiri (ISO 639-1: ks) is spoken by 7+ million people but has very limited digital resources. It uses dual scripts (Perso-Arabic Nastaliq and Devanagari), complex morphology, and code-switching with Urdu/Hindi — all of which make it extremely challenging for general-purpose AI systems. Google Translate does not support Kashmiri.
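Because Kashmiri text can arrive in either script, a preprocessing step often has to decide which one it is looking at. The following is a minimal sketch of that check using Unicode block ranges and only the Python standard library. The function names and the returned labels are assumptions made for this example, and the check identifies the script only, not the language: Urdu text also counts as Perso-Arabic and Hindi text as Devanagari.

```python
from collections import Counter

# Unicode block ranges (inclusive) for the two scripts used to write Kashmiri.
# Arabic, Arabic Supplement, Arabic Extended-A.
ARABIC_RANGES = [(0x0600, 0x06FF), (0x0750, 0x077F), (0x08A0, 0x08FF)]
# Devanagari, Devanagari Extended.
DEVANAGARI_RANGES = [(0x0900, 0x097F), (0xA8E0, 0xA8FF)]


def _in_ranges(ch, ranges):
    """Return True if the character's code point falls in any of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def count_scripts(text):
    """Count characters per script; characters from other scripts are ignored."""
    counts = Counter()
    for ch in text:
        if _in_ranges(ch, ARABIC_RANGES):
            counts["perso-arabic"] += 1
        elif _in_ranges(ch, DEVANAGARI_RANGES):
            counts["devanagari"] += 1
    return counts


def detect_script(text):
    """Label text as 'perso-arabic', 'devanagari', 'mixed', or 'other'."""
    counts = count_scripts(text)
    if not counts:
        return "other"   # no Arabic-script or Devanagari characters at all
    if len(counts) > 1:
        return "mixed"   # both scripts present in the same string
    return next(iter(counts))
```

A label of "mixed" only means both scripts occur in the same string. Code-switching with Urdu or Hindi usually stays within one script, so detecting it would need a language-level model rather than this character-level check.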

How can I contribute to Kashmiri NLP research?

You can contribute by evaluating translations on the KashmirAI platform as a native speaker, submitting datasets or papers to our resource hub, or building tools using the open-source resources listed here. Contact us at kashmirairesearch.online/about.

Kashmiri NLP Resource Hub — Datasets, Papers, Models & Tools | KashmirAI Research

What datasets are available for Kashmiri NLP?

There are 18+ open datasets available for Kashmiri NLP, including the Kashmiri-English 270K parallel corpus (SMUQamar), KS-LIT-3M pretraining dataset (3.1M words), 600K-KS-OCR dataset, Kashmiri Audio Corpus, FLORES-200 benchmark, OpenSLR SLR122, and Wikipedia dumps. These cover machine translation, language modeling, OCR, and speech recognition.

About the Kashmiri NLP Resource Hub

The Kashmiri NLP Resource Hub by KashmirAI Research is the most comprehensive, open-source collection of Kashmiri language AI resources on the internet. We aggregate every publicly available dataset, research paper, pre-trained model, and NLP tool for the Kashmiri language (کٲشُر / कॉशुर, ISO 639-1: ks) — a language spoken by over 7 million people across Kashmir, India, and the global diaspora.

Kashmiri is classified as a low-resource language in NLP, meaning it has very limited training data compared to high-resource languages like English, Chinese, or Hindi. Google Translate does not support Kashmiri. Major AI systems like ChatGPT and Gemini have minimal Kashmiri capability. This makes dedicated research infrastructure — like this hub — essential for advancing Kashmiri language technology.

Kashmiri Language Datasets for Machine Learning

We catalog 18 open datasets for Kashmiri NLP, including parallel corpora (Kashmiri-English sentence pairs for machine translation), monolingual text corpora for language model pretraining, OCR image datasets for Nastaliq script recognition, audio corpora for automatic speech recognition (ASR), and standardized evaluation benchmarks like FLORES-200 and IN22. Key resources include the Kashmiri-English Dataset 270K (the largest open parallel corpus), the KS-LIT-3M pretraining corpus (3.1 million words), and the 600K-KS-OCR dataset with 602,000 synthetic Nastaliq images.

Pre-Trained Models for Kashmiri

We list 13 pre-trained models that support Kashmiri, from large multilingual systems to Kashmiri-specific fine-tuned models. These include IndicTrans2 by AI4Bharat (state-of-the-art for Indian language MT), Meta AI's NLLB-200 (200-language translation), the first dedicated Kashmiri BERT model, IndicConformer for speech recognition, and transformer models like mBART-50, IndicBARTSS, and BLOOM-560M. Most are available for free on Hugging Face.
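As a rough illustration of using one of the listed multilingual models, the sketch below translates English into Kashmiri with NLLB-200 through the Hugging Face transformers library. The checkpoint name facebook/nllb-200-distilled-600M, the helper names nllb_code and translate, and the (language, script) keys are assumptions chosen for this example. kas_Arab and kas_Deva are the FLORES-200 style codes that NLLB-200 uses for Kashmiri in Perso-Arabic and Devanagari script. Running translate requires the transformers and torch packages and downloads the model on first use.

```python
# FLORES-200 style language codes used by NLLB-200, keyed by (language, script).
# The key format is an assumption made for this example.
NLLB_CODES = {
    ("en", "latin"): "eng_Latn",
    ("ks", "perso-arabic"): "kas_Arab",
    ("ks", "devanagari"): "kas_Deva",
}

# Assumed checkpoint; larger NLLB-200 variants are also published.
MODEL_NAME = "facebook/nllb-200-distilled-600M"


def nllb_code(lang, script):
    """Look up the NLLB language code for a (language, script) pair."""
    try:
        return NLLB_CODES[(lang, script)]
    except KeyError:
        raise ValueError(f"no NLLB code configured for {lang!r} in {script!r}") from None


def translate(text, src=("en", "latin"), tgt=("ks", "perso-arabic"), max_new_tokens=128):
    """Translate text with NLLB-200; loads the model on every call for simplicity."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang=nllb_code(*src))
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # Force the decoder to start with the target-language token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(nllb_code(*tgt)),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Calling translate("Where is the library?") would return a Kashmiri rendering in Perso-Arabic script, and passing tgt=("ks", "devanagari") switches the output script. Translation quality for Kashmiri varies, so outputs are best reviewed by a native speaker before use.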

Academic Research Papers on Kashmiri NLP

Our hub curates 14 research papers on Kashmiri NLP, drawn from peer-reviewed venues including Scientific Reports (Nature), IEEE, ACL, and INDIACom, along with arXiv preprints. Research topics span neural machine translation, news classification, OCR for Perso-Arabic script, part-of-speech tagging, morphological analysis, and cross-lingual transfer learning. Each paper listing includes DOI links, author information, and publication venue for easy citation.

Frequently Asked Questions

What datasets are available for Kashmiri NLP?

There are 18+ open datasets available, including the Kashmiri-English 270K parallel corpus, KS-LIT-3M pretraining dataset (3.1M words), 600K-KS-OCR dataset, Kashmiri Audio Corpus, FLORES-200 and IN22 benchmarks, OpenSLR SLR122, and Wikipedia dumps. These cover machine translation, language modeling, OCR, and speech recognition.

Are there pre-trained AI models for Kashmiri?

Yes — 13+ models support Kashmiri, including IndicTrans2 (AI4Bharat), NLLB-200 (Meta), Kashmiri BERT, IndicConformer ASR, mBART-50, and XLS-R Wav2Vec2. Most are free on Hugging Face.

Is Kashmiri supported by Google Translate?

No. As of 2026, Google Translate does not support Kashmiri (کٲشُر). This is one reason why dedicated research platforms like KashmirAI Research exist — to advance Kashmiri language technology where commercial AI has not yet reached.

How can I contribute to Kashmiri NLP?

You can evaluate translations as a native speaker on our platform, submit resources to this hub, or use the open-source datasets and models listed here to train your own Kashmiri AI systems. Visit the About page to get in touch.

Who maintains this resource hub?

This hub is maintained by Faizan Ayoub and the KashmirAI Research team. It is an open-source, community-driven initiative to accelerate AI for the Kashmiri language.

This page is updated regularly. Last updated: April 2026. Resources are verified manually.

Kashmiri NLP Resource Hub

Kashmiri-English Dataset 270K

Kashmiri-English Parallel Corpus (30K)

KS-LIT-3M — 3.1 Million Word Pretraining Dataset

KashmirAI Parallel Corpus

Kashmiri Text Corpus Cleaned (2025)

600K-KS-OCR Dataset

Kashmiri Data Corpus (OpenSLR-122)

Kashmiri Audio Corpus (Segmented)

Kashmiri Spoken Words Dataset

Kashmiri Sample Text Recognition (OCR)

Kashmiri Multilingual Dictionary

kscp — Kashmiri Speech Corpus Processing

Kashmiri Text Dataset Collection

Kaeshir Database

IndicNLP Catalog (incl. NLLB-Seed)

FLORES-200 Benchmark

IN22 Benchmark (IndicTrans2)

Bhashini / ULCA Portal

