Building the Kashmiri Parallel Corpus: Methodology & Challenges

A parallel corpus — a collection of text in one language aligned with its translation in another — is the foundation of any machine translation system. For Kashmiri, creating this resource from scratch is one of the most challenging and impactful contributions we can make to the AI research community.

Why There Is No Existing Kashmiri Corpus

Unlike languages with centuries of print tradition or a large internet presence, Kashmiri has historically been primarily an oral language. Formal writing in Kashmiri has become widespread only in the last few decades, and most existing digital Kashmiri text falls into one of four categories:

Social media posts with inconsistent spelling and heavy code-switching
Scanned PDFs of old newspapers or books (not machine-readable)
Small academic datasets with fewer than 5,000 sentence pairs
Non-parallel text, where the Kashmiri and English are not direct translations of each other

Our Data Pipeline

Source Collection

Gathering raw Kashmiri text from J&K government documents, educational materials, digital archives, and manually transcribed audio sources.

Alignment

Matching Kashmiri sentences with English translations using automated alignment tools and manual verification.
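
The automated part of this step can be illustrated with a simplified length-based aligner. This is a minimal sketch and not the project's actual tool: the function name `align`, the `skip_cost` value, and the use of character length as the similarity signal are all assumptions made for illustration.

```python
import math

def align(src, tgt, skip_cost=1.0):
    """Monotonic 1-to-1 sentence alignment by dynamic programming.

    src, tgt: lists of non-empty sentences in document order.
    Returns a list of (src_index, tgt_index) pairs. Sentences that
    have no plausible partner are skipped at a cost of `skip_cost`
    (a hypothetical value that would need tuning on real data).
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                # Matching is cheap when the two lengths are similar.
                ratio = max(len(src[i - 1]), 1) / max(len(tgt[j - 1]), 1)
                c = cost[i - 1][j - 1] + abs(math.log(ratio))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "match"
            if i:
                c = cost[i - 1][j] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "skip_src"
            if j:
                c = cost[i][j - 1] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "skip_tgt"

    # Walk back from the end to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i or j:
        step = back[i][j]
        if step == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Character length is only a rough signal across two different scripts, so pairs produced this way would still go through the manual verification described above.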

Script Normalization

Standardizing character encodings, handling diacritics in the Perso-Arabic script (typically rendered in the Nastaliq style), and cleaning Devanagari variants.
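
A minimal sketch of what such a normalization pass might look like, using only the Python standard library. The function name `normalize_kashmiri` and the specific choices below (which invisible characters to strip, and folding Arabic yeh and kaf into the forms common in Perso-Arabic text) are assumptions for illustration and would need checking against the orthography the project settles on. Vowel diacritics are deliberately left in place.

```python
import re
import unicodedata

# Invisible characters that often leak in from web pages and PDFs:
# zero-width space, left-to-right mark, right-to-left mark, BOM.
_REMOVE = dict.fromkeys(map(ord, "\u200b\u200e\u200f\ufeff"))
# Tatweel (kashida) is decorative elongation with no linguistic content.
_REMOVE[0x0640] = None

# Assumed letter folding: Arabic yeh -> Farsi yeh, Arabic kaf -> keheh.
_FOLD = {0x064A: 0x06CC, 0x0643: 0x06A9}

def normalize_kashmiri(text: str) -> str:
    # NFC gives each composable character sequence one canonical form.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_REMOVE).translate(_FOLD)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same function to every source before alignment and deduplication means that visually identical strings also compare equal as byte sequences.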

Quality Filtering

Removing duplicates, length-mismatched pairs, and sentences with excessive code-switching that would confuse translation models.
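
The three filters named here can be sketched as follows. The thresholds `max_len_ratio` and `max_latin`, the function names, and the use of Latin-script letters on the Kashmiri side as a proxy for code-switching are all hypothetical choices, not the project's documented settings.

```python
import unicodedata

def latin_ratio(text: str) -> float:
    """Share of alphabetic characters that are Latin-script letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for c in letters if "LATIN" in unicodedata.name(c, ""))
    return latin / len(letters)

def filter_pairs(pairs, max_len_ratio=2.5, max_latin=0.3):
    """Keep (kashmiri, english) pairs that pass all three checks."""
    seen = set()
    kept = []
    for ks, en in pairs:
        ks, en = ks.strip(), en.strip()
        if not ks or not en:
            continue
        # 1. Exact duplicates (English compared case-insensitively).
        key = (ks, en.lower())
        if key in seen:
            continue
        seen.add(key)
        # 2. Length mismatch, measured in whitespace-separated tokens.
        n_ks, n_en = len(ks.split()), len(en.split())
        if max(n_ks, n_en) / min(n_ks, n_en) > max_len_ratio:
            continue
        # 3. Too much Latin script on the Kashmiri side.
        if latin_ratio(ks) > max_latin:
            continue
        kept.append((ks, en))
    return kept
```

Dividing `len(kept)` by the number of input pairs gives a pass rate for whatever thresholds are chosen.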

Human Validation

Native Kashmiri speakers rate sentence pairs through our evaluation platform, flagging severe translation errors.
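
One way ratings like these could be aggregated is sketched below. The rating schema (a 1 to 5 score plus a severe-error flag), the status labels, and the thresholds are all hypothetical; the evaluation platform's actual data model is not described here.

```python
from collections import defaultdict
from statistics import mean

def summarize_ratings(ratings, min_raters=2, min_mean=3.5):
    """Assign one status per sentence pair.

    ratings: iterable of (pair_id, score, severe_error) tuples, where
    score is assumed to be 1-5 and severe_error is a bool.
    """
    by_pair = defaultdict(list)
    for pair_id, score, severe in ratings:
        by_pair[pair_id].append((score, severe))

    summary = {}
    for pair_id, votes in by_pair.items():
        scores = [score for score, _ in votes]
        if any(severe for _, severe in votes):
            # A single severe-error flag is enough to reject the pair.
            status = "rejected"
        elif len(votes) < min_raters:
            status = "needs_more_ratings"
        elif mean(scores) >= min_mean:
            status = "accepted"
        else:
            status = "needs_review"
        summary[pair_id] = status
    return summary
```

Treating a single severe-error flag as decisive is a conservative design choice: it trades some recall for a lower chance of shipping a badly mistranslated pair.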

Current Status & Planned Release

Sentence pairs collected: 30K+
Quality filter pass rate: ~85%
Hugging Face release: Q2 2026
Open license: CC-BY 4.0

We are targeting a release of 30,000+ high-quality sentence pairs on Hugging Face Datasets under a CC-BY 4.0 license, which allows free academic and commercial use with attribution. See our dataset page for more details.

Contribute to the Corpus

Every evaluation you submit helps validate and improve our parallel corpus. Native Kashmiri speakers welcome.
