🤗
HuggingFace · ⭐ Featured

Kashmiri-English Dataset 270K

SMUQamar

Largest open Kashmiri-English parallel corpus with ~270,000 sentence pairs built from digitized literary texts, manually authored dialogues, and filtered legacy data. Foundation of the first NMT benchmark for Kashmiri.

Parallel Corpus · 270K Pairs · NMT Benchmark
🤗
HuggingFace

Kashmiri-English Parallel Corpus (30K)

SMUQamar

30,000+ Kashmiri→English sentence pairs organized into raw, cleaned, and processed directories. A widely used open parallel corpus for Kashmiri MT research.

Parallel Corpus · Machine Translation · 30K+ Pairs
📚
HuggingFace · ⭐ Featured

KS-LIT-3M — 3.1 Million Word Pretraining Dataset

Omarrran (HNM)

Meticulously cleaned 3.1-million-word (16.4M characters) Kashmiri text corpus specifically designed for pretraining LLMs from scratch. The largest open Kashmiri text resource.

LLM Pretraining · 3.1M Words · CC-BY-4.0
🔬
KashmirAI · ⭐ Featured

KashmirAI Parallel Corpus

KashmirAI Research

Our curated Kashmiri→English parallel corpus with human-evaluated quality scores (MQM framework). Built for MT training and benchmarking.

Parallel Corpus · Human-Evaluated · MQM
📝
HuggingFace

Kashmiri Text Corpus Cleaned (2025)

Omarrran (HNM)

Cleaned and processed Kashmiri text dataset designed for linguistic research and model training. Preprocessed and deduplicated for quality.

Text Corpus · Cleaned · 2025
🔤
HuggingFace

600K-KS-OCR Dataset

Omarrran (HNM)

~602,000 synthetic word-level images for Kashmiri OCR in three typefaces (Naskh, Nastaleeq, Nakash). Includes ground-truth transcriptions compatible with CRNN and TrOCR.

OCR · 602K Images · Perso-Arabic · CC-BY-4.0
🎙️
OpenSLR

Kashmiri Data Corpus (OpenSLR-122)

OpenSLR

Transcribed audio recordings from native Kashmiri speakers for Automatic Speech Recognition (ASR) development. The foundational Kashmiri speech dataset. GPL-3.0 licensed.

Speech · ASR · Audio · GPL-3.0
🔊
HuggingFace

Kashmiri Audio Corpus (Segmented)

programindz

1,955 segmented speech samples (16 kHz, 16-bit, mono WAV) derived from OpenSLR-122. Ready-to-use format for ASR training pipelines.

Speech · 1.9K Segments · 16 kHz · GPL-3.0
🗣️
HuggingFace

Kashmiri Spoken Words Dataset

Omarrran

Processed audio data of 12 frequently spoken Kashmiri words. Designed for spoken word recognition and voice command research.

Speech · Word Recognition · Audio
📷
HuggingFace

Kashmiri Sample Text Recognition (OCR)

Omarrran

5,000 samples of Kashmiri text images paired with text labels for OCR model training and testing.

OCR · 5K Samples · Image-Text
🌐
HuggingFace

Kashmiri Multilingual Dictionary

Omarrran

Multilingual dictionary dataset with entries in English, Kashmiri, Urdu, Chinese, and Turkish — including example sentences for cross-lingual research.

Dictionary · Multilingual · 5 Languages
⚙️
GitHub

kscp — Kashmiri Speech Corpus Processing

erstan

Tools for processing audio and text data from the OpenSLR Kashmiri Data Corpus (OpenSLR-122). Includes preprocessing pipelines, data loaders, and segmentation utilities.

Speech Processing · Tools · ASR Pipeline
📝
GitHub

Kashmiri Text Dataset Collection

mzmmoazam

Data and tools to collect Kashmiri text from various online sources and dictionaries. Includes word pronunciations, PDFs, HTML files, and CSV data.

Text Corpus · Web Scraping · Data Collection
📖
GitHub

Kaeshir Database

izan-majeed

Open-source database of Kashmiri words with English meanings. Installable via pip (`pip install kashmiri`). Great for lexicon building and dictionary apps.

Dictionary · Lexicon · pip install · Python
🇮🇳
GitHub

IndicNLP Catalog (incl. NLLB-Seed)

AI4Bharat

Comprehensive catalog of NLP resources for Indian languages including Kashmiri. Tracks datasets like NLLB-Seed parallel data and connects to FLORES-200 benchmarks.

Indic NLP · NLLB · Catalog · 22 Languages
🌍
GitHub

FLORES-200 Benchmark

Meta AI (FAIR)

Multilingual MT evaluation benchmark covering 200+ languages including Kashmiri. Human-translated sentences drawn from Wikimedia projects (Wikinews, Wikijunior, and Wikivoyage) for standardized evaluation.

Benchmark · 200+ Languages · Evaluation
📊
GitHub

IN22 Benchmark (IndicTrans2)

AI4Bharat

India-specific multi-domain evaluation benchmark (IN22-Gen + IN22-Conv) for 22 scheduled Indian languages. The standard for evaluating Indian MT models.

Benchmark · Indian Languages · Multi-Domain
🏛️
Government

Bhashini / ULCA Portal

MeitY, Govt. of India

Official Government of India platform for Indian language datasets and models under the National Language Translation Mission. Central repository for Kashmiri digital resources.

Official · Indian Languages · Government
🤝

Know a Resource We're Missing?

Help us build the most comprehensive Kashmiri NLP resource hub. If you know of a dataset, paper, model, or tool that should be listed here, we'd love to hear from you.

About the Kashmiri NLP Resource Hub

The Kashmiri NLP Resource Hub by KashmirAI Research aims to be the most comprehensive open-source collection of Kashmiri language AI resources on the internet. We aggregate publicly available datasets, research papers, pre-trained models, and NLP tools for the Kashmiri language (کٲشُر / कॉशुर, ISO 639-1: ks), which is spoken by over 7 million people in Kashmir, elsewhere in India, and across the global diaspora.

Kashmiri is classified as a low-resource language in NLP, meaning it has very limited training data compared to high-resource languages like English, Chinese, or Hindi. Google Translate does not support Kashmiri. Major AI systems like ChatGPT and Gemini have minimal Kashmiri capability. This makes dedicated research infrastructure — like this hub — essential for advancing Kashmiri language technology.

Kashmiri Language Datasets for Machine Learning

We catalog 18 open datasets for Kashmiri NLP, including parallel corpora (Kashmiri-English sentence pairs for machine translation), monolingual text corpora for language model pretraining, OCR image datasets for Perso-Arabic script recognition, audio corpora for automatic speech recognition (ASR), and standardized evaluation benchmarks like FLORES-200 and IN22. Key resources include the Kashmiri-English Dataset 270K (the largest open parallel corpus), the KS-LIT-3M pretraining corpus (3.1 million words), and the 600K-KS-OCR dataset with roughly 602,000 synthetic word-level images in three typefaces.
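
The segmented audio corpus listed above is described as 16 kHz, 16-bit, mono WAV. Before feeding any downloaded audio into an ASR pipeline, it can be worth confirming that each file really matches that format. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` and the file path in the usage note are hypothetical and not part of any listed dataset or tool.

```python
import wave


def check_wav_format(path, rate=16000, sample_width_bytes=2, channels=1):
    """Return a list of mismatches between a WAV file and the expected format.

    Defaults correspond to 16 kHz, 16-bit (2 bytes per sample), mono audio.
    An empty list means the file matches.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {rate} Hz"
            )
        if wav.getsampwidth() != sample_width_bytes:
            problems.append(
                f"sample width is {wav.getsampwidth()} bytes, "
                f"expected {sample_width_bytes} bytes"
            )
        if wav.getnchannels() != channels:
            problems.append(
                f"channel count is {wav.getnchannels()}, expected {channels}"
            )
    return problems
```

Calling `check_wav_format("segment_0001.wav")` on a conforming file should return an empty list; any returned strings describe what would need resampling or conversion first.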

Pre-Trained Models for Kashmiri

We list 13 pre-trained models that support Kashmiri, from large multilingual systems to Kashmiri-specific fine-tuned models. These include IndicTrans2 by AI4Bharat (state-of-the-art for Indian language MT), Meta AI's NLLB-200 (200-language translation), the first dedicated Kashmiri BERT model, IndicConformer for speech recognition, and transformer models like mBART-50, IndicBARTSS, and BLOOM-560M. Most are available for free on Hugging Face.

Academic Research Papers on Kashmiri NLP

Our hub curates 14 papers on Kashmiri NLP, drawn from peer-reviewed venues including Scientific Reports (Nature), IEEE, ACL, and INDIACom, as well as arXiv preprints. Research topics span neural machine translation, news classification, OCR for Perso-Arabic script, part-of-speech tagging, morphological analysis, and cross-lingual transfer learning. Each paper listing includes DOI links, author information, and publication venue for easy citation.

Frequently Asked Questions

What datasets are available for Kashmiri NLP?

There are 18+ open datasets available, including the Kashmiri-English 270K parallel corpus, the KS-LIT-3M pretraining dataset (3.1M words), the 600K-KS-OCR dataset, the Kashmiri Audio Corpus, the FLORES-200 and IN22 benchmarks, OpenSLR-122, and Wikipedia dumps. These cover machine translation, language modeling, OCR, and speech recognition.

Are there pre-trained AI models for Kashmiri?

Yes — 13+ models support Kashmiri, including IndicTrans2 (AI4Bharat), NLLB-200 (Meta), Kashmiri BERT, IndicConformer ASR, mBART-50, and XLS-R Wav2Vec2. Most are free on Hugging Face.

Is Kashmiri supported by Google Translate?

No. As of 2026, Google Translate does not support Kashmiri (کٲشُر). This is one reason why dedicated research platforms like KashmirAI Research exist — to advance Kashmiri language technology where commercial AI has not yet reached.

How can I contribute to Kashmiri NLP?

You can evaluate translations as a native speaker on our platform, submit resources to this hub, or use the open-source datasets and models listed here to train your own Kashmiri AI systems. Visit the About page to get in touch.
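
For readers who want to train on one of the parallel corpora, a light cleaning pass is a common first step. Several corpora listed on this page are described as cleaned or deduplicated, but the sketch below is not the method used by any of those datasets. It is a generic illustration, assuming the corpus has already been loaded as a list of (Kashmiri, English) string pairs; `normalize` and `clean_pairs` are hypothetical helper names.

```python
import unicodedata


def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs):
    """Drop empty and exact-duplicate sentence pairs, preserving order.

    `pairs` is an iterable of (source, target) strings. Duplicates are
    detected after normalization, so pairs that differ only in spacing
    or Unicode composition count as the same pair.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        key = (normalize(source), normalize(target))
        if not key[0] or not key[1]:
            continue  # skip pairs with an empty side
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append(key)
    return cleaned
```

NFC normalization matters for Perso-Arabic text because the same visible word can be encoded with precomposed or combining characters; normalizing first keeps such variants from being counted as distinct pairs.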

Who maintains this resource hub?

This hub is maintained by Faizan Ayoub and the KashmirAI Research team. It is an open-source, community-driven initiative to accelerate AI for the Kashmiri language.

This page is updated regularly. Last updated: April 2026. Resources are verified manually.