📦 Open Dataset

Kashmiri Parallel Corpus

The first open Kashmiri→English parallel corpus for NLP research: curated, quality-filtered, and human-evaluated. It will be released on Hugging Face for the global research community.

🔬
Dataset in active development

We are currently building and quality-filtering the corpus. Public release on Hugging Face is planned for Q2 2026.

📊

Scale

30,000+ sentence pairs across formal, conversational, and literary domains

🌐

Languages

Kashmiri (ks) → English (en), covering Nastaliq and Roman script variants

Quality

Human-evaluated using our MQM evaluation platform with inter-annotator agreement tracking

📝

License

Creative Commons Attribution 4.0 (CC BY 4.0): free to use for research and commercial applications, with attribution

🤗

Distribution

Will be released on Hugging Face Datasets for easy access via the datasets library

📄

Format

JSON and TSV files, each with source, reference, metadata, and quality-score fields
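
To make the planned format concrete, here is a minimal sketch of how a TSV export with source, reference, metadata, and quality-score fields might be read and filtered. The schema has not been published, so the column names (`source`, `reference`, `metadata`, `quality_score`), the JSON-encoded metadata column, the 0 to 1 score scale, and the placeholder sentences are all assumptions made for illustration. The helper names `SentencePair`, `read_pairs`, and `filter_by_quality` are hypothetical.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass
class SentencePair:
    source: str           # Kashmiri sentence
    reference: str        # English reference translation
    metadata: dict        # assumed keys: "domain", "script"
    quality_score: float  # assumed to lie in [0, 1]


def read_pairs(tsv_text: str) -> list[SentencePair]:
    """Parse TSV text with a header row into SentencePair records."""
    # QUOTE_NONE keeps the double quotes inside the JSON metadata column intact.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        SentencePair(
            source=row["source"],
            reference=row["reference"],
            metadata=json.loads(row["metadata"]),
            quality_score=float(row["quality_score"]),
        )
        for row in reader
    ]


def filter_by_quality(pairs: list[SentencePair], threshold: float) -> list[SentencePair]:
    """Keep only pairs whose quality score is at or above the threshold."""
    return [p for p in pairs if p.quality_score >= threshold]


# Placeholder rows, not real corpus data.
SAMPLE_TSV = "\n".join([
    "source\treference\tmetadata\tquality_score",
    'KS_SENTENCE_1\tEN_REFERENCE_1\t{"domain": "formal", "script": "nastaliq"}\t0.92',
    'KS_SENTENCE_2\tEN_REFERENCE_2\t{"domain": "conversational", "script": "roman"}\t0.61',
])

pairs = read_pairs(SAMPLE_TSV)
high_quality = filter_by_quality(pairs, threshold=0.8)
```

Keeping the metadata as a parsed dictionary would let users select by domain or script variant without changing the record type, whatever keys the final release ends up using.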

Why This Dataset Matters

Kashmiri is spoken by over 7 million people but has no publicly available large-scale parallel corpus. This absence means that Kashmiri is excluded from multilingual NLP benchmarks, machine translation leaderboards, and commercial translation APIs. Without data, there can be no progress.

Our dataset will be the first resource of its kind — enabling researchers worldwide to train, evaluate, and compare Kashmiri translation systems on a standardized benchmark. Every translation evaluation submitted through our platform directly contributes to building and validating this dataset.

🗣️

Help Build the Dataset

Native Kashmiri speakers can contribute by evaluating translations on our platform.

Start Evaluating →

📧

Get Notified at Release

Reach out to be notified when the dataset is publicly released on Hugging Face.

Contact Us →