Low-Resource Neural Machine Translation for Kashmiri: LLM Fine-Tuning with Native Speaker Evaluation

Faizan Ayoub

KashmirAI Research works on low-resource machine translation for the Kashmiri language. The project combines three components: Large Language Models (LLMs) fine-tuned with LoRA (Low-Rank Adaptation), a curated Kashmiri-English parallel corpus, and a human evaluation framework in which native speakers validate AI-generated translations to improve linguistic accuracy.

Research Overview

This work investigates LLM-based machine translation for Kashmiri (کٲشُر), a low-resource endangered language spoken by over 7 million people in the Kashmir Valley. We are building tools and models that can translate between Kashmiri and English, with the goal of advancing language technology for this underserved language.

The research is currently in progress. Detailed methodology, results, and analysis will be shared upon publication of the associated conference paper. For further details, please contact us below.

1. Background

The Kashmiri Language

Kashmiri (ISO 639-3: kas) is a Dardic language within the Indo-Aryan branch of the Indo-European family, primarily spoken in the Kashmir Valley of Jammu & Kashmir, India. It is written in the Perso-Arabic script, usually rendered in the Nastaliq style, which poses challenges for NLP systems designed primarily for Latin-script or Devanagari-script languages.

Despite being an official language of the Indian Union Territory of Jammu & Kashmir and being spoken by millions, Kashmiri remains severely underserved in the NLP ecosystem, with few digital linguistic resources available.

The Low-Resource Challenge

Low-resource machine translation remains one of the hardest problems in NLP. Models trained on insufficient data exhibit well-documented failure modes, such as hallucinated content and weak generalization beyond the domains seen in training. Our work aims to study these challenges specifically for Kashmiri.

2. Research Focus Areas

Parallel Corpus Construction

Aggregating and quality-filtering Kashmiri-English parallel data from multiple open sources.
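
The project's actual filtering rules have not been published, so the following is only a minimal sketch of what quality-filtering a Kashmiri-English corpus could look like. It assumes three common heuristics: exact-duplicate removal after Unicode normalization, a token length-ratio cap, and a script check that expects the Kashmiri side to be mostly Perso-Arabic. All function names and thresholds here are hypothetical.

```python
import unicodedata

# Hypothetical thresholds for illustration; the project's real filters are not published.
MAX_TOKENS = 200
MAX_LENGTH_RATIO = 3.0
MIN_ARABIC_SHARE = 0.5


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def arabic_share(text: str) -> float:
    """Fraction of alphabetic characters that lie in the main Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0600" <= ch <= "\u06FF" for ch in letters) / len(letters)


def keep_pair(kashmiri: str, english: str) -> bool:
    """Return True if a sentence pair passes the length and script heuristics."""
    ks_len, en_len = len(kashmiri.split()), len(english.split())
    if not (1 <= ks_len <= MAX_TOKENS and 1 <= en_len <= MAX_TOKENS):
        return False
    if max(ks_len, en_len) / min(ks_len, en_len) > MAX_LENGTH_RATIO:
        return False
    # The source should be mostly Perso-Arabic script; the target should be mostly not.
    return (arabic_share(kashmiri) >= MIN_ARABIC_SHARE
            and arabic_share(english) < MIN_ARABIC_SHARE)


def filter_corpus(pairs):
    """Normalize pairs, drop exact duplicates, and keep pairs that pass keep_pair."""
    seen, kept = set(), []
    for kashmiri, english in pairs:
        pair = (normalize(kashmiri), normalize(english))
        if pair in seen or not keep_pair(*pair):
            continue
        seen.add(pair)
        kept.append(pair)
    return kept


# Toy input: the language's own name, a duplicate, and a pair with a Latin-script source.
sample = [("کٲشُر", "Kashmiri"), ("کٲشُر", "Kashmiri"), ("Kashmiri", "Kashmiri")]
print(len(filter_corpus(sample)))
```

Heuristics like these are cheap to run over data aggregated from several sources; anything they cannot catch, such as fluent but wrong translations, would still need model-based or human checks.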

LLM Fine-Tuning

Exploring how modern large language models can be adapted for Kashmiri translation tasks.
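
To make the idea behind LoRA concrete: the pretrained weight matrix W stays frozen, and only two small matrices A and B are trained, so the layer computes W x + (alpha / r) * B (A x). The sketch below illustrates this in plain Python. The dimensions, rank, and alpha are illustrative assumptions rather than the project's configuration, and the helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, rank):
    """Parameter count of B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only: one 4096 x 4096 projection adapted at rank 8.
d_out = d_in = 4096
rank = 8
full = d_out * d_in
adapter = lora_trainable_params(d_out, d_in, rank)
print(f"full: {full:,}  adapter: {adapter:,}  share: {adapter / full:.2%}")
```

Because B is conventionally initialized to zero, the adapted layer starts out identical to the pretrained one, and training only moves the small adapter. That makes fine-tuning a multi-billion-parameter model feasible on modest hardware, which matters when the training corpus is as small as it is for Kashmiri.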

Human Evaluation

Building a community-driven evaluation platform with native Kashmiri speakers.

Baseline Comparisons

Comparing our approach against existing multilingual translation models.

⚙️ Research Specifications / AI Summary

Task

Kashmiri→English Neural Machine Translation (NMT)

Architecture

Mistral-7B, LLaMA-3, and IndicTrans2, fine-tuned via LoRA

Corpus Size

30,000+ parallel sentence pairs across multiple domains

Evaluation Metrics

BLEU, ChrF++, and Human Likert Scale Assessment

Infrastructure

Custom Human-in-the-loop Evaluation Platform (Next.js & Supabase)
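
Of the automatic metrics listed above, ChrF++ is built on character n-gram overlap between a system translation and a reference. The sketch below is a deliberately simplified character n-gram F-score in that spirit: it averages precision and recall over n-gram orders 1 to 6 and combines them with beta = 2, so recall is weighted more heavily. It omits the word n-grams that ChrF++ adds and is not expected to reproduce scores from a reference implementation such as sacreBLEU, which should be used for reported results. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_fscore(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified character n-gram F-score in [0, 1]; higher means more overlap."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# Toy English strings for illustration; these are not outputs from the project.
print(char_fscore("the cat sat", "the cat sat on the mat"))
```

Character-level matching gives partial credit for near-miss word forms, which is one reason such metrics are often reported alongside BLEU.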

3. Human Evaluation Platform

A core component of this research is human evaluation by native Kashmiri speakers. We built this platform — KashmirAI Research — specifically for this purpose. Evaluators rate translations from anonymized systems on:

Adequacy (1-5) — How much meaning from the source is preserved
Fluency (1-5) — How natural and grammatically correct the translation sounds
Overall Preference — Which system produced the better translation
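
As a rough illustration of how ratings on these three criteria could be aggregated, the sketch below averages adequacy and fluency per anonymized system and computes each system's share of preference votes. The record layout, function names, and toy values are assumptions made for this example; they are not the platform's schema or any result from the study.

```python
from collections import Counter, defaultdict
from statistics import mean


def summarize_ratings(ratings):
    """Average 1-5 adequacy and fluency scores per anonymized system."""
    scores = defaultdict(lambda: {"adequacy": [], "fluency": []})
    for rating in ratings:
        for criterion in ("adequacy", "fluency"):
            value = rating[criterion]
            if not 1 <= value <= 5:
                raise ValueError(f"{criterion} must be between 1 and 5, got {value}")
            scores[rating["system"]][criterion].append(value)
    return {
        system: {
            "adequacy": mean(values["adequacy"]),
            "fluency": mean(values["fluency"]),
            "ratings": len(values["adequacy"]),
        }
        for system, values in scores.items()
    }


def preference_shares(preferences):
    """Share of overall-preference votes won by each system."""
    counts = Counter(preferences)
    total = sum(counts.values())
    return {system: count / total for system, count in counts.items()}


# Made-up toy ratings, purely to show the data flow.
toy_ratings = [
    {"system": "A", "adequacy": 4, "fluency": 5},
    {"system": "A", "adequacy": 2, "fluency": 3},
    {"system": "B", "adequacy": 5, "fluency": 4},
]
print(summarize_ratings(toy_ratings))
print(preference_shares(["A", "B", "A", "A"]))
```

Keeping systems anonymized as opaque labels until after aggregation helps prevent evaluators' expectations about a particular model from influencing their scores.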

Contribute as an Evaluator →

Publication Status

🔬 Research in Progress — Paper Forthcoming

Full results, metrics, and analysis will be published in an upcoming conference paper.

📞 Contact for Further Details

For inquiries about this research, collaboration opportunities, or early access to findings:

📱 +91 7006718915

Faizan Ayoub — Lead Researcher, KashmirAI Research