The most comprehensive collection of Kashmiri language AI resources — every dataset, research paper, model, and tool, curated for researchers, developers, and language enthusiasts.
Largest open Kashmiri-English parallel corpus with ~270,000 sentence pairs built from digitized literary texts, manually authored dialogues, and filtered legacy data. Foundation of the first NMT benchmark for Kashmiri.
30,000+ Kashmiri→English sentence pairs organized into raw, cleaned, and processed directories. A widely used open parallel corpus for Kashmiri MT research.
Meticulously cleaned 3.1-million-word (16.4M characters) Kashmiri text corpus specifically designed for pretraining LLMs from scratch. The largest open Kashmiri text resource.
Our curated Kashmiri→English parallel corpus with human-evaluated quality scores (MQM framework). Built for MT training and benchmarking.
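As a rough illustration of how MQM-style quality scores are commonly derived, the sketch below turns annotated error severities into a penalty per 100 words. The severity weights and the `mqm_penalty` helper are assumptions made for this example; the dataset's actual scoring scheme may differ.

```python
# Assumed severity weights for illustration only; MQM deployments choose
# their own values, and this dataset's weights are not stated here.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def mqm_penalty(error_severities, word_count):
    """Return the weighted error penalty per 100 words (lower is better).

    error_severities: list of severity labels, one per annotated error.
    word_count: number of words in the evaluated translation.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    total = sum(SEVERITY_WEIGHTS[severity] for severity in error_severities)
    return 100 * total / word_count


if __name__ == "__main__":
    # One minor and one major error in a 20-word translation.
    print(mqm_penalty(["minor", "major"], 20))
```

Normalizing by length keeps long and short segments comparable, which matters when scores are used to filter or rank sentence pairs.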
Cleaned and processed Kashmiri text dataset designed for linguistic research and model training. Preprocessed and deduplicated for quality.
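A minimal sketch of what line-level deduplication can look like for text of this kind, assuming exact-match removal after Unicode NFC normalization and whitespace collapsing. The `normalize_line` and `deduplicate` helpers are hypothetical names for this example and do not describe the dataset's own pipeline.

```python
import unicodedata


def normalize_line(line):
    """Apply NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", line).split())


def deduplicate(lines):
    """Return normalized lines with empties and repeats removed, order kept."""
    seen = set()
    unique = []
    for line in lines:
        key = normalize_line(line)
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


if __name__ == "__main__":
    sample = ["کٲشُر  زَبان", "کٲشُر زَبان", "", "Kashmiri text"]
    print(deduplicate(sample))
```

Normalizing before comparison matters for Perso-Arabic text, where the same visible string can be encoded with different sequences of combining marks.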
~602,000 synthetic word-level images for Kashmiri OCR in three typefaces (Naskh, Nastaleeq, Nakash). Includes ground-truth transcriptions compatible with CRNN and TrOCR.
Transcribed audio recordings from native Kashmiri speakers for Automatic Speech Recognition (ASR) development. The foundational Kashmiri speech dataset. GPL-3.0 licensed.
1,955 segmented speech samples (16 kHz, 16-bit, mono WAV) derived from OpenSLR-122. Ready-to-use format for ASR training pipelines.
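Before feeding samples into an ASR pipeline, it can help to confirm that each file really matches the stated format (16 kHz, 16-bit, mono WAV). The sketch below uses only Python's standard `wave` module; the `check_wav_format` helper and the example file path are hypothetical.

```python
import wave


def check_wav_format(path, rate=16000, sample_width_bytes=2, channels=1):
    """Return True if the WAV file matches the expected rate, depth, and channels.

    sample_width_bytes=2 corresponds to 16-bit samples.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getsampwidth() == sample_width_bytes
            and wav.getnchannels() == channels
        )


if __name__ == "__main__":
    import sys

    # Usage: python check_wav.py sample.wav  (the file name is a placeholder)
    for wav_path in sys.argv[1:]:
        status = "ok" if check_wav_format(wav_path) else "unexpected format"
        print(f"{wav_path}: {status}")
```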
Processed audio data of 12 frequently spoken Kashmiri words. Designed for spoken word recognition and voice command research.
5,000 samples of Kashmiri text images paired with text labels for OCR model training and testing.
Multilingual dictionary dataset with entries in English, Kashmiri, Urdu, Chinese, and Turkish — including example sentences for cross-lingual research.
Tools for processing audio and text data from the OpenSLR Kashmiri Data Corpus. Includes preprocessing pipelines, data loaders, and segmentation utilities.
Data and tools to collect Kashmiri text from various online sources and dictionaries. Includes word pronunciations, PDFs, HTML files, and CSV data.
Open-source database of Kashmiri words with English meanings. Installable via pip (`pip install kashmiri`). Great for lexicon building and dictionary apps.
Comprehensive catalog of NLP resources for Indian languages including Kashmiri. Tracks datasets like NLLB-Seed parallel data and connects to FLORES-200 benchmarks.
Multilingual MT evaluation benchmark covering 200+ languages including Kashmiri. Human-translated sentences from Wikipedia for standardized evaluation.
India-specific multi-domain evaluation benchmark (IN22-Gen + IN22-Conv) for 22 scheduled Indian languages. The standard for evaluating Indian MT models.
Official Government of India platform for Indian language datasets and models under the National Language Translation Mission. Central repository for Kashmiri digital resources.
Help us build the most comprehensive Kashmiri NLP resource hub. If you know of a dataset, paper, model, or tool that should be listed here, we'd love to hear from you.
The Kashmiri NLP Resource Hub by KashmirAI Research is the most comprehensive, open-source collection of Kashmiri language AI resources on the internet. We aggregate every publicly available dataset, research paper, pre-trained model, and NLP tool for the Kashmiri language (کٲشُر / कॉशुर, ISO 639-1: ks) — a language spoken by over 7 million people across Kashmir, India, and the global diaspora.
Kashmiri is classified as a low-resource language in NLP, meaning it has very limited training data compared to high-resource languages like English, Chinese, or Hindi. Google Translate does not support Kashmiri. Major AI systems like ChatGPT and Gemini have minimal Kashmiri capability. This makes dedicated research infrastructure — like this hub — essential for advancing Kashmiri language technology.
We catalog 18 open datasets for Kashmiri NLP, including parallel corpora (Kashmiri-English sentence pairs for machine translation), monolingual text corpora for language model pretraining, OCR image datasets for Nastaliq script recognition, audio corpora for automatic speech recognition (ASR), and standardized evaluation benchmarks like FLORES-200 and IN22. Key resources include the Kashmiri-English Dataset 270K (the largest open parallel corpus), the KS-LIT-3M pretraining corpus (3.1 million words), and the 600K-KS-OCR dataset with ~602,000 synthetic word-level images in three typefaces.
We list 13 pre-trained models that support Kashmiri, from large multilingual systems to Kashmiri-specific fine-tuned models. These include IndicTrans2 by AI4Bharat (state-of-the-art for Indian language MT), Meta AI's NLLB-200 (200-language translation), the first dedicated Kashmiri BERT model, IndicConformer for speech recognition, and transformer models like mBART-50, IndicBARTSS, and BLOOM-560M. Most are available for free on Hugging Face.
Our hub curates 14 research papers on Kashmiri NLP from venues including Scientific Reports (Nature Portfolio), IEEE, ACL, and INDIACom, along with arXiv preprints. Research topics span neural machine translation, news classification, OCR for Perso-Arabic script, part-of-speech tagging, morphological analysis, and cross-lingual transfer learning. Each paper listing includes DOI links, author information, and publication venue for easy citation.
There are 18+ open datasets available, including the Kashmiri-English 270K parallel corpus, KS-LIT-3M pretraining dataset (3.1M words), 600K-KS-OCR dataset, Kashmiri Audio Corpus, FLORES-200 and IN22 benchmarks, OpenSLR SLR122, and Wikipedia dumps. These cover machine translation, language modeling, OCR, and speech recognition.
Yes, 13+ pre-trained models support Kashmiri, including IndicTrans2 (AI4Bharat), NLLB-200 (Meta), Kashmiri BERT, IndicConformer ASR, mBART-50, and XLS-R Wav2Vec2. Most are free on Hugging Face.
No. As of 2026, Google Translate does not support Kashmiri (کٲشُر). This is one reason dedicated research platforms like KashmirAI Research exist: to advance Kashmiri language technology where commercial AI has not yet reached.
You can evaluate translations as a native speaker on our platform, submit resources to this hub, or use the open-source datasets and models listed here to train your own Kashmiri AI systems. Visit the About page to get in touch.
This hub is maintained by Faizan Ayoub and the KashmirAI Research team. It is an open-source, community-driven initiative to accelerate AI for the Kashmiri language.
This page is updated regularly. Last updated: April 2026. Resources are verified manually.