A parallel corpus, a collection of text in one language aligned with its translation in another, is the foundation of any data-driven machine translation system. For Kashmiri, creating this resource from scratch is one of the most challenging and impactful contributions we can make to the AI research community.
Why There Is No Existing Kashmiri Corpus
Unlike languages with centuries of printing tradition or a large internet presence, Kashmiri has historically been a largely oral language. Formal writing in Kashmiri has become widespread only in the last few decades, and most existing digital Kashmiri text falls into one of four categories:
- Social media posts with inconsistent spelling and heavy code-switching
- Scanned PDFs of old newspapers or books, which are not machine-readable
- Small academic datasets with fewer than 5,000 sentence pairs
- Non-parallel text, where the Kashmiri and English texts are not direct translations of each other
Our Data Pipeline
Gathering raw Kashmiri text from J&K government documents, educational materials, digital archives, and manually transcribed audio sources.
Matching Kashmiri sentences with English translations using automated alignment tools and manual verification.
Standardizing character encodings, handling diacritics in the Perso-Arabic (Nastaliq) script, and cleaning Devanagari variants.
Removing duplicates, length-mismatched pairs, and sentences with excessive code-switching that would confuse translation models.
Native Kashmiri speakers rate sentence pairs through our evaluation platform, flagging severe translation errors.
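The normalization and filtering steps above can be illustrated with a minimal Python sketch. This is not our production pipeline. The function names (`normalize`, `latin_ratio`, `filter_pairs`), the thresholds (`max_len_ratio`, `max_latin_ratio`), and the set of invisible characters that gets stripped are all assumptions chosen for illustration. The share of Latin letters on the Kashmiri side is used only as a crude proxy for code-switching with English.

```python
import re
import unicodedata

# Invisible characters that commonly leak in from web or PDF extraction:
# soft hyphen, zero-width space, and byte-order mark (an assumed, minimal set).
# Zero-width joiners/non-joiners are left alone because they can be meaningful
# in Perso-Arabic and Devanagari text.
_INVISIBLE = re.compile("[\u00ad\u200b\ufeff]")


def normalize(text: str) -> str:
    """Apply Unicode NFC, drop invisible debris, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = _INVISIBLE.sub("", text)
    return " ".join(text.split())


def latin_ratio(text: str) -> float:
    """Fraction of letters that are ASCII Latin (rough code-switching proxy)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if ch.isascii())
    return latin / len(letters)


def filter_pairs(pairs, max_len_ratio=3.0, max_latin_ratio=0.3):
    """Keep (kashmiri, english) pairs that pass three simple checks.

    1. Not an exact duplicate after normalization (English compared
       case-insensitively).
    2. Word counts differ by at most a factor of max_len_ratio.
    3. At most max_latin_ratio of the Kashmiri letters are Latin.
    """
    seen = set()
    kept = []
    for ks, en in pairs:
        ks, en = normalize(ks), normalize(en)
        if not ks or not en:
            continue  # one side is empty after cleaning

        key = (ks, en.lower())
        if key in seen:
            continue  # duplicate pair
        seen.add(key)

        n_ks, n_en = len(ks.split()), len(en.split())
        if max(n_ks, n_en) / min(n_ks, n_en) > max_len_ratio:
            continue  # lengths too different to be a plausible translation

        if latin_ratio(ks) > max_latin_ratio:
            continue  # Kashmiri side is mostly Latin-script text

        kept.append((ks, en))
    return kept
```

Calling `filter_pairs` on a list of `(kashmiri, english)` tuples returns the cleaned pairs that survive all three checks. A word-count ratio is a blunt instrument, and a Latin-letter count will not catch code-switching with Urdu or Hindi written in the same script, so a filter like this complements the native-speaker review rather than replacing it.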
Current Status & Planned Release
We are targeting a release of 30,000+ high-quality sentence pairs on Hugging Face Datasets under a CC BY 4.0 license, which permits both academic and commercial use with attribution. See our dataset page for more details.
Contribute to the Corpus
Every evaluation you submit helps validate and improve our parallel corpus. Native Kashmiri speakers welcome.
Start Evaluating →