We are currently building and quality-filtering the corpus. Public release on Hugging Face is planned for Q2 2026.
30,000+ sentence pairs across formal, conversational, and literary domains
Kashmiri (ks) → English (en), covering Nastaliq and Roman script variants
Human-evaluated on our MQM (Multidimensional Quality Metrics) evaluation platform, with inter-annotator agreement tracking
Creative Commons Attribution 4.0 (CC BY 4.0): free to use for research and commercial applications, with attribution
Will be released on Hugging Face Datasets for easy access via the datasets library
JSON and TSV formats with source, reference, metadata, and quality scores
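To give a rough sense of how the TSV format described above might be consumed, here is a minimal sketch. The column names (source, reference, domain, script, quality_score), the 0 to 1 score scale, and the placeholder rows are all assumptions made for illustration; the final schema has not been published. Once the corpus is on Hugging Face, the datasets library would be the expected way to load it, but the repository name has not been announced, so the sketch reads an in-memory TSV instead.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class SentencePair:
    """One parallel record. Field names are hypothetical, not the released schema."""
    source: str           # Kashmiri sentence
    reference: str        # English reference translation
    domain: str           # e.g. formal, conversational, literary
    script: str           # e.g. nastaliq, roman
    quality_score: float  # assumed scale: 0.0 to 1.0, higher is better


def read_pairs(tsv_text: str) -> list[SentencePair]:
    """Parse tab-separated text with a header row into SentencePair records."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        SentencePair(
            source=row["source"],
            reference=row["reference"],
            domain=row["domain"],
            script=row["script"],
            quality_score=float(row["quality_score"]),
        )
        for row in reader
    ]


def filter_by_quality(pairs: list[SentencePair], min_score: float) -> list[SentencePair]:
    """Keep only pairs whose quality score is at least min_score."""
    return [p for p in pairs if p.quality_score >= min_score]


# Placeholder rows only: the text and scores below are not real corpus data.
SAMPLE_TSV = (
    "source\treference\tdomain\tscript\tquality_score\n"
    "<Kashmiri sentence 1>\t<English reference 1>\tformal\tnastaliq\t0.92\n"
    "<Kashmiri sentence 2>\t<English reference 2>\tconversational\troman\t0.61\n"
    "<Kashmiri sentence 3>\t<English reference 3>\tliterary\tnastaliq\t0.85\n"
)

if __name__ == "__main__":
    pairs = read_pairs(SAMPLE_TSV)
    for pair in filter_by_quality(pairs, min_score=0.8):
        print(pair.domain, pair.script, pair.quality_score)
```

Filtering on a per-pair score is one plausible use of the quality metadata, for example to build a high-confidence evaluation subset; the threshold of 0.8 is arbitrary.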
Kashmiri is spoken by over 7 million people but has no publicly available large-scale parallel corpus. This absence means that Kashmiri is excluded from multilingual NLP benchmarks, machine translation leaderboards, and commercial translation APIs. Without data, there can be no progress.
Our dataset will be the first resource of its kind — enabling researchers worldwide to train, evaluate, and compare Kashmiri translation systems on a standardized benchmark. Every translation evaluation submitted through our platform directly contributes to building and validating this dataset.
Native Kashmiri speakers can contribute by evaluating translations on our platform.
Start Evaluating →
Reach out to be notified when the dataset is publicly released on Hugging Face.
Contact Us →