The Complete Guide to Kashmiri NLP Resources: Datasets, Models, Papers & Tools (2026)

If you're a researcher, developer, or linguist looking to work on Kashmiri language AI, you've probably experienced the frustration of scattered resources: datasets buried across HuggingFace profiles, papers spread across arXiv and IEEE, and models with no clear documentation. This guide is our answer to that problem: a single, comprehensive, regularly updated reference to every open-source Kashmiri NLP resource we've been able to verify.

🎯 What This Guide Covers

18 Datasets: Parallel corpora, monolingual text, OCR, audio, and benchmarks
14 Research Papers: Publications and preprints from Scientific Reports, IEEE, TMLR, arXiv, and more
13 Pre-Trained Models: From IndicTrans2 to Kashmiri BERT and ASR models
6 Tools & Platforms: Translation APIs, evaluation toolkits, and language assistants

🌍 Why Kashmiri Needs Dedicated NLP Resources

Kashmiri (کٲشُر / कॉशुर, ISO 639-1: ks) is an Indo-Aryan language spoken by over 7 million people primarily in the Kashmir Valley of India. Despite this sizeable speaker population, Kashmiri remains one of the most underrepresented languages in modern AI:

Google Translate does not support Kashmiri (as of 2026)
ChatGPT and Gemini have minimal Kashmiri language capability
Kashmiri uses dual writing systems — Perso-Arabic Nastaliq (RTL) and Devanagari (LTR)
Extensive code-switching with Urdu and Hindi in everyday speech
Complex morphology with pronominal clitics and verb agreement patterns

These challenges make it difficult for general-purpose NLP systems to handle Kashmiri well. Dedicated, curated resources are essential, and that is exactly what our Resource Hub provides.
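
Because Kashmiri text can arrive in either script, a common first preprocessing step is to route each line by script. The following is a minimal sketch using only the Python standard library. The function name `detect_script`, the 80% dominance threshold, and the selection of Unicode blocks are illustrative assumptions, not part of any resource listed in this guide.

```python
from collections import Counter

# Unicode blocks covering the two scripts used for Kashmiri.
# The block selection is an assumption made for this sketch.
ARABIC_RANGES = [
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFC),  # Arabic Presentation Forms-B (excluding U+FEFF)
]
DEVANAGARI_RANGES = [
    (0x0900, 0x097F),  # Devanagari
    (0xA8E0, 0xA8FF),  # Devanagari Extended
]


def _in_ranges(codepoint, ranges):
    """Return True if the code point falls inside any (low, high) range."""
    return any(low <= codepoint <= high for low, high in ranges)


def detect_script(text, threshold=0.8):
    """Classify text as 'perso-arabic', 'devanagari', 'other', 'mixed', or 'unknown'.

    A script wins only if it accounts for at least `threshold` of the
    script-bearing characters; otherwise the text is reported as 'mixed'.
    """
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if _in_ranges(cp, ARABIC_RANGES):
            counts["perso-arabic"] += 1
        elif _in_ranges(cp, DEVANAGARI_RANGES):
            counts["devanagari"] += 1
        elif ch.isalpha():
            counts["other"] += 1  # e.g. Latin letters in romanized text
    if not counts:
        return "unknown"  # digits, punctuation, or empty input only
    script, n = counts.most_common(1)[0]
    return script if n / sum(counts.values()) >= threshold else "mixed"


if __name__ == "__main__":
    for sample in ["کٲشُر", "कॉशुर", "kashmiri", "کٲشُر कॉशुर"]:
        print(sample, "->", detect_script(sample))
```

A check like this only identifies the script, not the language: Urdu shares the Perso-Arabic block and Hindi shares Devanagari, so code-switched text still needs a proper language identifier.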

📦 Kashmiri Datasets: The Complete List

The table below lists 13 of the 18 verified, publicly available datasets for Kashmiri NLP as of April 2026; the full list is in our Resource Hub:

Resource	Source	Key Detail
Kashmiri-English Dataset 270K	HuggingFace	Largest parallel corpus — 270K pairs
KS-LIT-3M	HuggingFace	3.1M word pretraining corpus
Kashmiri-English Parallel Corpus (30K)	HuggingFace	30K+ pairs, multi-directory
KashmirAI Parallel Corpus	KashmirAI	Human-evaluated MQM quality scores
Kashmiri Text Corpus (Cleaned 2025)	HuggingFace	Deduplicated, preprocessed
Kashmiri Audio Corpus (Segmented)	HuggingFace	1,955 ASR-ready segments
600K-KS-OCR Dataset	HuggingFace	602K synthetic Nastaliq images
Kashmiri Spoken Words Dataset	HuggingFace	12 frequent words + audio
Kashmiri Sample Text Recognition	HuggingFace	5K OCR samples
Kashmiri Data Corpus (SLR122)	OpenSLR	Transcribed audio recordings
Kashmiri Wikipedia Dump	HuggingFace	Full Kashmiri Wikipedia text
FLORES-200 Benchmark	GitHub/Meta	Standardized MT evaluation
IN22 Benchmark	GitHub/AI4Bharat	India-specific MT eval

The Kashmiri-English 270K corpus by SMUQamar is the cornerstone for most neural MT research. For language model pretraining, the KS-LIT-3M dataset (3.1 million words / 16.4 million characters) is indispensable. For OCR, the 600K-KS-OCR dataset provides 602,000 synthetic images across three Nastaliq typefaces.
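
Before training on any parallel corpus, it usually helps to normalize and deduplicate the sentence pairs. This guide does not describe how the listed corpora were cleaned, so the following is a generic sketch under stated assumptions: the function names, the NFC normalization choice, and the length-ratio cutoff of 3.0 are all illustrative, and the sample pairs are Latin-script placeholders rather than real corpus entries.

```python
import unicodedata


def normalize(sentence):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def clean_pairs(pairs, max_ratio=3.0):
    """Filter (source, target) sentence pairs.

    Drops pairs where either side is empty, where the character-length
    ratio exceeds `max_ratio` (a rough misalignment signal), and exact
    duplicates after normalization. Input order is preserved.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    # Placeholder data; real pairs would be English and Kashmiri sentences.
    sample = [
        ("Hello  world", "target one"),
        ("Hello world", "target one"),  # duplicate after normalization
        ("", "orphan target"),          # empty source side
        ("a", "a very long target sentence"),  # suspicious length ratio
    ]
    print(clean_pairs(sample))
```

The length-ratio threshold is worth tuning on a sample of the actual data, since Kashmiri and English sentence lengths will not match one to one.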

🤖 Pre-Trained Models Supporting Kashmiri

Thirteen verified models support Kashmiri, ranging from massive multilingual systems to Kashmiri-specific fine-tuned models. The table below lists ten of them:

Resource	Source	Key Detail
IndicTrans2 (AI4Bharat)	HuggingFace	State-of-the-art Indian language MT
NLLB-200 (Meta AI)	HuggingFace	200-language translation, 1.3B & 3.3B
Kashmiri BERT	HuggingFace	First Kashmiri-specific BERT model
IndicConformer ASR	HuggingFace	First dedicated Kashmiri ASR model
mBART-large-50	HuggingFace	Multilingual seq2seq, 50 languages
IndicBARTSS	HuggingFace	Indic-focused BART for generation
IndicBERTv2	HuggingFace	Embeddings backbone for Indian langs
BLOOM-560M	HuggingFace	Zero-shot classification via cross-lingual transfer
XLS-R 300M (Wav2Vec2)	HuggingFace	Fine-tunable for Kashmiri speech
Llama3 8B Kashmiri	HuggingFace	Fine-tuned for Kashmiri QA/translation

For machine translation, IndicTrans2 and NLLB-200 are the strongest starting points. For text classification and NLU, Kashmiri BERT and IndicBERTv2 provide powerful embeddings. For speech recognition, IndicConformer and XLS-R 300M are fine-tunable with relatively small amounts of labeled Kashmiri audio.
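
As one way to try an MT baseline, the sketch below translates English into Kashmiri with NLLB-200 through the Hugging Face `transformers` library. It assumes `transformers` and PyTorch are installed and uses the smaller `facebook/nllb-200-distilled-600M` checkpoint as a lightweight default, which is a choice made for this example rather than a recommendation from the table above. The helper names `kashmiri_code` and `translate_en_to_ks` are hypothetical; `kas_Arab` and `kas_Deva` are the FLORES-200 codes NLLB-200 uses for the two Kashmiri scripts.

```python
# FLORES-200 language codes used by NLLB-200 for Kashmiri.
KASHMIRI_CODES = {"arabic": "kas_Arab", "devanagari": "kas_Deva"}


def kashmiri_code(script):
    """Map a script name to the NLLB-200 target-language code."""
    try:
        return KASHMIRI_CODES[script.lower()]
    except KeyError:
        raise ValueError(
            f"unknown script {script!r}; expected one of {sorted(KASHMIRI_CODES)}"
        ) from None


def translate_en_to_ks(text, script="arabic",
                       model_name="facebook/nllb-200-distilled-600M"):
    """Translate one English sentence into Kashmiri in the requested script.

    The model is downloaded on first use. transformers is imported lazily
    so the helpers above remain usable without it installed.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # NLLB selects the output language via the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(kashmiri_code(script)),
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Calling `translate_en_to_ks("Where is the library?")` downloads the weights on first use and returns a Perso-Arabic translation; passing `script="devanagari"` switches the target code. For repeated calls, load the tokenizer and model once outside the function instead of on every call.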

📄 Key Research Papers

Fourteen papers and preprints form the academic backbone of Kashmiri NLP. Here are the most impactful:

Resource	Source	Key Detail
IndicTrans2 (Gala et al.)	TMLR 2023	State-of-the-art Indic MT covering Kashmiri
Deep Neural Architectures for Kashmiri-English MT	Scientific Reports (Nature Portfolio), 2025	Transformer + GRU + Attention comparisons
Kashmiri News Classification	Scientific Reports (Nature Portfolio), 2025	ParsBERT achieves F1=0.98
NLLB-200 (Costa-jussà et al.)	arXiv 2022	200-language translation benchmark
600K-KS-OCR Corpus Paper	arXiv 2026	First large-scale Kashmiri OCR research
Low-Resource Indian MT with DPO	IEEE 2025	DPO alignment for Indian MT
POS Tagging with CRF (80–94%)	Academic 2023	First systematic Kashmiri POS tagging

All papers are linked with DOIs and direct URLs in our full Resource Hub.

🛠️ Tools & Platforms

Beyond datasets and models, several tools enable practical Kashmiri NLP work:

KashmiriGPT: AI chatbot for the Kashmiri language with Roman and Perso-Arabic input
KashmirAI Evaluation Platform: Human-in-the-loop MQM evaluation for translation quality
Bhashini APIs: Official Indian government translation, ASR, and TTS APIs (free)
Kashmiri WordNet: Lexical database with synsets for semantic NLP
sacreBLEU: Standardized MT evaluation (BLEU, chrF++, TER)
Apertium: Open-source rule-based MT platform

🚀 How to Get Started with Kashmiri NLP

If you're new to Kashmiri language AI, here's a recommended learning path:

Understand the Language: Read about Kashmiri's dual script system, morphological complexity, and sociolinguistic context on Wikipedia and in the research papers above.
Start with IndicTrans2: Use AI4Bharat's IndicTrans2 as your MT baseline. Kashmiri is one of its supported languages, so it works out of the box, and in our assessment it is the strongest open model for Kashmiri without further fine-tuning.
Fine-tune on Kashmiri Data: Download the 270K parallel corpus and fine-tune IndicTrans2 or mBART-50 using LoRA/QLoRA on Kaggle's free GPU.
Evaluate Properly: Use sacreBLEU for automatic metrics and our KashmirAI platform for human evaluation with native speakers.
Contribute Back: Open-source your models on HuggingFace and submit them to our Resource Hub to help the community.
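
For the evaluation step, sacreBLEU is the tool to use for any reported numbers. To show what a character-level metric in the chrF family measures, and why it suits a morphologically rich language, here is a simplified pure-Python sketch. It is not sacreBLEU's implementation and its scores will not match sacreBLEU's exactly; the function names and the choices of n-gram orders 1 to 6 and beta = 2 are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score in the range [0.0, 1.0].

    Precision and recall are averaged over every n-gram order for which
    both strings have at least one n-gram, then combined into an F-beta
    score. beta=2 weights recall more heavily than precision.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("kashmir", "kashmiri"))
```

Because the score is built from character n-grams, a hypothesis that gets the stem right but the inflection wrong still earns partial credit, which a word-level metric would not give.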

🌐 Why This Hub Matters

Kashmiri is part of India's Eighth Schedule but remains digitally marginalized. Of the world's approximately 7,000 languages, fewer than 100 have meaningful AI support. Every new dataset, model, and tool brings Kashmiri closer to digital equality.

Our Resource Hub is open-source, free, and community-driven. We verify every resource manually and update the collection regularly. If you know of a resource we're missing, get in touch.

🚀 Explore the Full Resource Hub

All 50+ resources in a searchable, filterable interface — with direct links to HuggingFace, GitHub, and paper DOIs.
