Datasets

Community-built datasets for Telugu and Indian language AI research, collected at the grassroots level across regions and communities.


Swecha Gonthuka Dataset

స్వేఛ్ఛ గోంతుక

Releasing Soon

Collected as part of the Swecha Gonthuka initiative by Swecha.org, a Telugu free software organisation. Volunteers across Telugu-speaking regions recorded speech samples through a structured collection platform.

Statistics

Total Duration: 1,200+ hrs | Volunteers: 20,000+ | Language: Telugu (te-IN)

Collection

Format

Audio format: WAV (PCM 16-bit) | Sample rate: 16 kHz | Channels: Mono | Transcription: Telugu Unicode text

Each data point pairs a WAV audio file with its corresponding Telugu transcription. Audio is sampled at 16 kHz, the standard input requirement for wav2vec2-based models.
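Because the dataset has not been released yet, the following is only a minimal sketch of how a downloaded file could be checked against the format described above (16-bit PCM, 16 kHz, mono). It uses Python's standard `wave` module. The names `check_wav` and `make_silent_wav` are hypothetical helpers written for this illustration and are not part of any Swecha tooling.

```python
import io
import wave

# Expected properties, taken from the format description above.
EXPECTED_SAMPLE_RATE = 16_000  # 16 kHz
EXPECTED_CHANNELS = 1          # mono
EXPECTED_SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit PCM


def check_wav(source):
    """Return a list of mismatches between a WAV file and the expected format.

    `source` may be a path string or a binary file-like object.
    An empty list means the file matches. The `wave` module raises
    `wave.Error` if the input is not a WAV file it can parse.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channel count is {wav.getnchannels()}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width is {wav.getsampwidth() * 8} bits")
    return problems


def make_silent_wav(sample_rate, channels, seconds=1):
    """Build an in-memory 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * sample_rate * channels * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(check_wav(make_silent_wav(16_000, 1)))  # matching file
    print(check_wav(make_silent_wav(44_100, 2)))  # wrong rate and channels
```

Running a check like this before training helps catch files that would need resampling or down-mixing first.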

Supported Tasks

Automatic Speech Recognition (ASR): Training or fine-tuning models (e.g. wav2vec2, Whisper) to convert Telugu speech to text.

Cultural Data

Releasing Soon

Cultural Data aims to capture the ground-level spectrum of cultures, languages, and dialects, including the nuances and patterns that vary across regions. The corpus is built by collecting diverse types of material at the grassroots level, working directly with communities.

Media Types

Audio, Video, Image, Text, Documents (PDF)

Metadata

Corpus Metadata: Title, Description, Geo Location, Language (Indic Languages), Release Rights

Contributor Metadata: Date of Birth, Gender, Language Proficiencies (Indic Languages), Places Lived (Geo, Tagged), Short Biography, Current Place, From Place
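As an illustration only, the metadata fields listed above could be modelled as two record types. The published schema is not yet available, so every field type here is an assumption (for example, storing Geo Location as a latitude/longitude pair and languages as short codes), and `to_json` is a hypothetical helper, not part of any released tooling.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class CorpusMetadata:
    """Fields named under 'Corpus Metadata'; the types are assumptions."""
    title: str
    description: str
    geo_location: tuple[float, float]  # assumed (latitude, longitude)
    language: str                      # assumed language code, e.g. "te"
    release_rights: str


@dataclass
class ContributorMetadata:
    """Fields named under 'Contributor Metadata'; the types are assumptions."""
    date_of_birth: date
    gender: str
    language_proficiencies: list[str]
    places_lived: list[str]            # assumed place names or geo tags
    short_biography: str
    current_place: str
    from_place: str


def to_json(record):
    """Serialise a metadata record, keeping Telugu text readable.

    Dates are not JSON-native, so `default=str` writes them as YYYY-MM-DD.
    """
    return json.dumps(asdict(record), ensure_ascii=False, default=str)


if __name__ == "__main__":
    # Placeholder values for illustration; not real dataset entries.
    item = CorpusMetadata(
        title="Example recording",
        description="Placeholder entry",
        geo_location=(17.385, 78.4867),
        language="te",
        release_rights="placeholder",
    )
    print(to_json(item))
```

Keeping corpus-level and contributor-level fields in separate records mirrors the split in the list above and makes it easier to handle personal fields, such as date of birth, separately from the content itself.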

Telugu Documents

Releasing Soon

Viswam.AI is researching the digitisation and processing of Telugu printed materials (books, magazines, newspapers, and other documents) spanning multiple decades and domains. The corpus is estimated at approximately 50 lakh (5 million) pages and represents a significant resource for Telugu language AI research.

Statistics

Pages: ~50 Lakh | Primary Language: Telugu

Source Types

Books, Magazines, Newspapers, Documents

Research Applications

Rooted in Community, Built for the Global South

Language is more than just text: it is history, humor, culture, and identity. At ViswamAI, we believe that truly effective AI must understand the lived context of the people it serves. We develop high-quality, ethically sourced datasets that capture the genuine linguistic wealth, regional dialects, and cultural nuances of South Indian languages.

Our Data Ecosystem: A Collaborative Approach to Data Dignity

Standard AI datasets are often blindly scraped from the internet, erasing regional identity and colloquial truths. ViswamAI takes a grassroots, socio-technical approach. We collaborate across four distinct pillars to ensure our data is authentic, clean, and culturally grounded:

Grassroots Student Engagement

Through targeted college internships, skill-development drives, and hands-on hackathons like our annual Summer of AI and AI Days, we work with tech-forward youth to collect, format, and validate data while building real-world AI applications.

Expert Linguistic Curation

We partner closely with local linguists, language scholars, and literary groups. Their expert oversight ensures that complex cultural nuances, historical context, idioms, and grammatical integrity are preserved.

Crowdsourced Regional Volunteerism

To capture language as it is actually lived, our volunteer network reaches out directly to the general population across diverse demographics in Telangana and neighboring states. This helps us document varied oral traditions, rural accents, and localized dialects.

Academic & Industry Synergy

We collaborate with research institutions and industry leaders to maintain rigorous data standards, robust bias mitigation frameworks, and state-of-the-art data engineering practices.

Upcoming Datasets:

Viswam ASR

Releasing Soon

🤝 Contribute to the ViswamAI Ecosystem

Whether you are a student looking to join our next hackathon, a linguist passionate about heritage preservation, or an academic researcher, there is a place for you.

Explore Internships & Hackathons | Partner as an Expert/Volunteer | Get Notified on Dataset Releases

Related: Swecha Gonthuka ASR Research
