If you're a researcher, developer, or linguist looking to work on Kashmiri language AI, you've probably experienced the frustration of scattered resources: datasets buried across HuggingFace profiles, papers spread across arXiv and IEEE, and models with no clear documentation. This guide is our answer to that problem: a single, comprehensive, regularly updated reference to every open-source Kashmiri NLP resource we've been able to verify.
🎯 What This Guide Covers
- 18 Datasets — Parallel corpora, monolingual text, OCR, audio, and benchmarks
- 14 Research Papers — Peer-reviewed publications and preprints from Scientific Reports, IEEE, ACL, arXiv, and more
- 13 Pre-Trained Models — From IndicTrans2 to Kashmiri BERT and ASR models
- 6 Tools & Platforms — Translation APIs, evaluation toolkits, and language assistants
🌍 Why Kashmiri Needs Dedicated NLP Resources
Kashmiri (کٲشُر / कॉशुर, ISO 639-1: ks) is an Indo-Aryan language spoken by over 7 million people primarily in the Kashmir Valley of India. Despite this sizeable speaker population, Kashmiri remains one of the most underrepresented languages in modern AI:
- Google Translate does not support Kashmiri (as of 2026)
- ChatGPT and Gemini have minimal Kashmiri language capability
- Kashmiri uses dual writing systems — Perso-Arabic Nastaliq (RTL) and Devanagari (LTR)
- Extensive code-switching with Urdu and Hindi in everyday speech
- Complex morphology with pronominal clitics and verb agreement patterns
These challenges make it difficult for general-purpose NLP systems to handle Kashmiri well. Dedicated, curated resources are essential, and that's exactly what our Resource Hub provides.
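The dual-script point has a practical consequence: a pipeline usually needs to know which script a piece of text is in before it can tokenize, transliterate, or render it. Here is a minimal sketch of one way to check, using only Python's standard library. The function names `script_of` and `dominant_script` are ours, invented for illustration, and reading the script from each character's Unicode name is a simple heuristic, not a full script-identification method.

```python
import unicodedata
from collections import Counter
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Classify one letter by the script prefix of its Unicode character name.

    Non-letters (digits, spaces, punctuation, combining vowel marks)
    return None so they do not influence the vote.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("ARABIC"):
        return "perso-arabic"
    if name.startswith("DEVANAGARI"):
        return "devanagari"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def dominant_script(text: str) -> str:
    """Return the script that most letters in `text` belong to."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else "unknown"


print(dominant_script("کٲشُر"))     # perso-arabic
print(dominant_script("कॉशुर"))     # devanagari
print(dominant_script("Kashmiri"))  # latin
```

A majority vote is enough to route whole sentences to the right preprocessing branch. Code-switched text that mixes scripts inside one sentence would need a per-token version of the same idea.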
📦 Kashmiri Datasets: The Complete List
As of April 2026, we have verified 18 publicly available datasets for Kashmiri NLP. The table below lists 13 of them:
| Resource | Source | Key Detail |
|---|---|---|
| Kashmiri-English Dataset 270K | HuggingFace | Largest parallel corpus — 270K pairs |
| KS-LIT-3M | HuggingFace | 3.1M word pretraining corpus |
| Kashmiri-English Parallel Corpus (30K) | HuggingFace | 30K+ pairs, multi-directory |
| KashmirAI Parallel Corpus | KashmirAI | Human-evaluated MQM quality scores |
| Kashmiri Text Corpus (Cleaned 2025) | HuggingFace | Deduplicated, preprocessed |
| Kashmiri Audio Corpus (Segmented) | HuggingFace | 1,955 ASR-ready segments |
| 600K-KS-OCR Dataset | HuggingFace | 602K synthetic Nastaliq images |
| Kashmiri Spoken Words Dataset | HuggingFace | 12 frequent words + audio |
| Kashmiri Sample Text Recognition | HuggingFace | 5K OCR samples |
| Kashmiri Data Corpus (SLR122) | OpenSLR | Transcribed audio recordings |
| Kashmiri Wikipedia Dump | HuggingFace | Full Kashmiri Wikipedia text |
| FLORES-200 Benchmark | GitHub/Meta | Standardized MT evaluation |
| IN22 Benchmark | GitHub/AI4Bharat | India-specific MT eval |
The Kashmiri-English 270K corpus by SMUQamar is the cornerstone for most neural MT research. For language model pretraining, the KS-LIT-3M dataset (3.1 million words / 16.4 million characters) is indispensable. For OCR, the 600K-KS-OCR dataset provides 602,000 synthetic images across three Nastaliq typefaces.
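Whichever parallel corpus you start from, it is worth running a light cleaning pass before training. The sketch below is a generic example and is not the pipeline used to build any dataset listed above. The function names and the `max_ratio` threshold are assumptions chosen for illustration. It normalizes Unicode to NFC, collapses whitespace, drops empty or badly length-mismatched pairs, and removes exact duplicates.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs, max_ratio=3.0):
    """Filter (source, target) sentence pairs.

    Drops pairs where either side is empty, where one side is more than
    `max_ratio` times longer than the other (in characters), or where the
    normalized pair has already been seen.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Placeholder strings stand in for real Kashmiri-English sentence pairs.
raw = [
    ("source  sentence one", "target sentence one"),
    ("source sentence one", "target sentence one"),   # duplicate after normalization
    ("", "orphan target"),                             # empty source
    ("hi", "a target that is far too long for it"),    # length mismatch
]
print(len(clean_pairs(raw)))  # 1
```

NFC normalization matters for both Kashmiri scripts because the same visible character can be encoded as one precomposed code point or as a base letter plus combining marks, and exact-match deduplication would otherwise miss those pairs.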
🤖 Pre-Trained Models Supporting Kashmiri
We have verified 13 models that support Kashmiri or can be fine-tuned for it, ranging from massive multilingual systems to Kashmiri-specific fine-tuned models. The table below lists 10 of them:
| Resource | Source | Key Detail |
|---|---|---|
| IndicTrans2 (AI4Bharat) | HuggingFace | State-of-the-art Indian language MT |
| NLLB-200 (Meta AI) | HuggingFace | 200-language translation, 1.3B & 3.3B |
| Kashmiri BERT | HuggingFace | First Kashmiri-specific BERT model |
| IndicConformer ASR | HuggingFace | First dedicated Kashmiri ASR model |
| mBART-large-50 | HuggingFace | Multilingual seq2seq, 50 languages |
| IndicBARTSS | HuggingFace | Indic-focused BART for generation |
| IndicBERTv2 | HuggingFace | Embeddings backbone for Indian langs |
| BLOOM-560M | HuggingFace | Zero-shot classification via cross-lingual transfer |
| XLS-R 300M (Wav2Vec2) | HuggingFace | Fine-tunable for Kashmiri speech |
| Llama3 8B Kashmiri | HuggingFace | Fine-tuned for Kashmiri QA/translation |
For machine translation, IndicTrans2 and NLLB-200 are the strongest starting points. For text classification and NLU, Kashmiri BERT and IndicBERTv2 provide powerful embeddings. For speech recognition, IndicConformer and XLS-R 300M are fine-tunable with relatively small amounts of labeled Kashmiri audio.
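To make the translation starting point concrete, here is a hedged sketch of English to Kashmiri translation with NLLB-200 through the Hugging Face `transformers` library. It assumes `transformers` and PyTorch are installed, and it uses the distilled 600M checkpoint only because it is small enough for a quick test; the 1.3B and 3.3B checkpoints from the table load the same way. The `translate` helper and the `NLLB_CODES` dictionary are our own illustrative names. The language codes are the FLORES-200 codes that NLLB uses for English and for Kashmiri in each script.

```python
# FLORES-200 language codes used by NLLB-200.
NLLB_CODES = {
    "english": "eng_Latn",
    "kashmiri_arabic": "kas_Arab",      # Perso-Arabic script
    "kashmiri_devanagari": "kas_Deva",  # Devanagari script
}


def translate(sentences, src="english", tgt="kashmiri_arabic",
              model_name="facebook/nllb-200-distilled-600M"):
    """Translate a list of sentences with an NLLB-200 checkpoint."""
    # Imported inside the function so this file can be loaded
    # without the heavy dependency installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=NLLB_CODES[src])
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(
        **batch,
        # NLLB selects the output language via the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(NLLB_CODES[tgt]),
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


if __name__ == "__main__":
    # Downloads the checkpoint on first run.
    print(translate(["Where is the library?"]))
```

Switching `tgt` to `"kashmiri_devanagari"` requests Devanagari output from the same model, which is a quick way to compare quality across the two scripts.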
📄 Key Research Papers
14 papers, most of them peer-reviewed, form the academic backbone of Kashmiri NLP. The table below lists 7 of the most impactful:
| Resource | Source | Key Detail |
|---|---|---|
| IndicTrans2 (Gala et al.) | Transactions on Machine Learning Research (TMLR), 2023 | State-of-the-art Indic MT covering Kashmiri |
| Deep Neural Architectures for Kashmiri-English MT | Scientific Reports (Nature), 2025 | Transformer + GRU + Attention comparisons |
| Kashmiri News Classification | Scientific Reports (Nature), 2025 | ParsBERT achieves F1=0.98 |
| NLLB-200 (Costa-jussà et al.) | arXiv 2022 | 200-language translation benchmark |
| 600K-KS-OCR Corpus Paper | arXiv 2026 | First large-scale Kashmiri OCR research |
| Low-Resource Indian MT with DPO | IEEE 2025 | DPO alignment for Indian MT |
| POS Tagging with CRF (80–94%) | Academic 2023 | First systematic Kashmiri POS tagging |
All papers are linked with DOIs and direct URLs in our full Resource Hub.
🛠️ Tools & Platforms
Beyond datasets and models, several tools enable practical Kashmiri NLP work:
- KashmiriGPT — AI chatbot for Kashmiri language with Roman + Perso-Arabic input
- KashmirAI Evaluation Platform — Human-in-the-loop MQM evaluation for translation quality
- Bhashini APIs — Official Indian government translation, ASR, and TTS APIs (free)
- Kashmiri WordNet — Lexical database with synsets for semantic NLP
- sacreBLEU — Standardized MT evaluation (BLEU, ChrF++, TER)
- Apertium — Open-source rule-based MT platform
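As an example of how one of these tools fits into a workflow, the sketch below scores a set of system translations with sacreBLEU's metric classes. It assumes the `sacrebleu` package (version 2 or later) is installed. The `score_translations` wrapper is a name we made up for illustration, and it assumes a single reference translation per sentence.

```python
def score_translations(hypotheses, references):
    """Return corpus-level BLEU, chrF++ and TER for one reference per sentence."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    # Imported here so the length check above works even without sacrebleu.
    from sacrebleu.metrics import BLEU, CHRF, TER

    # sacreBLEU expects a list of reference streams; we have exactly one.
    refs = [references]
    return {
        "BLEU": BLEU().corpus_score(hypotheses, refs).score,
        # word_order=2 turns plain chrF into chrF++.
        "chrF++": CHRF(word_order=2).corpus_score(hypotheses, refs).score,
        "TER": TER().corpus_score(hypotheses, refs).score,
    }


if __name__ == "__main__":
    # Placeholder English strings; in practice these would be Kashmiri outputs.
    hyps = ["the cat sat on the mat", "it is raining today"]
    refs = ["the cat sat on the mat", "it rains today"]
    print(score_translations(hyps, refs))
```

For a morphologically rich language like Kashmiri, character-level metrics such as chrF++ are often reported alongside BLEU, since BLEU's word-level matching penalizes small inflectional differences heavily.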
🚀 How to Get Started with Kashmiri NLP
If you're new to Kashmiri language AI, here's a recommended learning path:
- Understand the Language — Read about Kashmiri's dual script system, morphological complexity, and sociolinguistic context on Wikipedia and in the research papers above.
- Start with IndicTrans2 — Use AI4Bharat's IndicTrans2 as your MT baseline; it has the best out-of-the-box Kashmiri performance of any open model.
- Fine-tune on Kashmiri Data — Download the 270K parallel corpus and fine-tune IndicTrans2 or mBART-50 using LoRA/QLoRA on Kaggle's free GPU.
- Evaluate Properly — Use sacreBLEU for automatic metrics and our KashmirAI platform for human evaluation with native speakers.
- Contribute Back — Open-source your models on HuggingFace and submit them to our Resource Hub to help the community.
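For the fine-tuning step in the path above, the following sketch shows how a LoRA adapter could be attached to mBART-50 with the `peft` library. It assumes `transformers`, `peft`, and PyTorch are installed. The hyperparameter values and the `build_lora_model` helper are illustrative assumptions, not settings taken from any paper or model in this guide, and the training loop itself is left out.

```python
# Illustrative LoRA hyperparameters; tune these for your own data.
LORA_HPARAMS = {
    "r": 16,                 # rank of the low-rank update matrices
    "lora_alpha": 32,        # scaling factor applied to the update
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],  # attention projections in mBART
}


def build_lora_model(model_name="facebook/mbart-large-50-many-to-many-mmt"):
    """Load a seq2seq checkpoint and wrap it with a trainable LoRA adapter."""
    # Imported inside the function so this file loads without the dependencies.
    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import AutoModelForSeq2SeqLM

    base = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, **LORA_HPARAMS)
    model = get_peft_model(base, config)

    # Reports how few parameters are trainable compared with the full model.
    model.print_trainable_parameters()
    return model


if __name__ == "__main__":
    # Downloads the checkpoint on first run.
    build_lora_model()
```

Because only the adapter weights are updated, this setup is what makes fine-tuning feasible on a single free-tier GPU. The returned model can be passed to a standard `transformers` training loop in place of the base model.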
🌐 Why This Hub Matters
Kashmiri is listed in the Eighth Schedule of the Constitution of India but remains digitally marginalized. Of the world's approximately 7,000 languages, fewer than 100 have meaningful AI support. Every new dataset, model, and tool brings Kashmiri closer to digital equality.
Our Resource Hub is open-source, free, and community-driven. We verify every resource manually and update the collection regularly. If you know of a resource we're missing, get in touch.
Explore the Full Resource Hub
All 50+ resources in a searchable, filterable interface — with direct links to HuggingFace, GitHub, and paper DOIs.