Datasets

Community-built datasets for Telugu and Indian language AI research, collected at the grassroots level across regions and communities.

Swecha Gonthuka Dataset

స్వేఛ్ఛ గోంతుక
Releasing Soon
Collected as part of the Swecha Gonthuka initiative by Swecha.org, a Telugu free software organisation. Volunteers across Telugu-speaking regions recorded speech samples through a structured collection platform.

Statistics

Total Duration: 1,200+ hrs
Volunteers: 20,000+
Language: Telugu (te-IN)

Collection

Format

Audio format: WAV (PCM 16-bit) | Sample rate: 16 kHz | Channels: Mono | Transcription: Telugu Unicode text

Each data point pairs a WAV audio file with its corresponding Telugu transcription. Audio is sampled at 16 kHz, the standard input requirement for wav2vec2-based models.
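Because the dataset has not been released yet, its file layout is not known. As a minimal sketch, assuming each recording is an ordinary WAV file on disk, the stated format (16-bit PCM, 16 kHz, mono) could be checked with Python's standard `wave` module. The function and constant names below are hypothetical and are not part of any Swecha or Viswam tooling.

```python
import wave

# Expected properties, taken from the format description above.
EXPECTED_SAMPLE_RATE = 16_000  # Hz
EXPECTED_CHANNELS = 1          # mono
EXPECTED_SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit PCM


def check_wav_format(path):
    """Return a list of mismatches between a WAV file and the stated format.

    An empty list means the file matches. `path` is a string file path.
    wave.open raises wave.Error for files it cannot parse as PCM WAV.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channel count is {wav.getnchannels()}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width is {wav.getsampwidth()} bytes")
    return problems


def duration_seconds(path):
    """Return the length of a WAV file in seconds (frames / frame rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

At these settings, uncompressed audio takes 16,000 samples × 2 bytes = 32,000 bytes per second, or about 115.2 MB per hour, so 1,200 hours comes to roughly 138 GB before any file headers or compression.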

Supported Tasks

Automatic Speech Recognition (ASR): Training or fine-tuning models (e.g. wav2vec2, Whisper) to convert Telugu speech to text.

Cultural Data

Releasing Soon
Cultural Data aims to capture the ground-level spectrum of cultures, languages, and dialects, along with their regional patterns and nuances. The corpus is built by collecting diverse content types at the grassroots level, working directly with communities.

Categories

Fables, Events, Music, Places, Food, Literature, Architecture, Skills, Images, Flora & Fauna, Education, Vegetation, People, Culture, Folk Tales, Folk Songs, Traditional Skills, Local Cultural History, Local History, Food & Agriculture, Newspapers Older Than 1980s, Medical Camp, Internship, Stand-Up, Mathematics

Media Types

Audio, Video, Image, Text, Documents (PDF)

Metadata

Corpus Metadata: Title, Description, Geo Location, Language (Indic Languages), Release Rights
Contributor Metadata: Date of Birth, Gender, Language Proficiencies (Indic Languages), Places Lived (Geo, Tagged), Short Biography, Current Place, From Place
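The two metadata groups above could be modelled as simple record types. The sketch below is only an illustration: the field names mirror the listed attributes, but the Python types, the (latitude, longitude) representation of Geo Location, the language codes, and all example values are assumptions, since the actual schema has not been published.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class CorpusMetadata:
    title: str
    description: str
    geo_location: tuple[float, float]  # assumed to be (latitude, longitude)
    language: str                      # assumed to be a language code such as "te"
    release_rights: str                # assumed to be a free-text or licence label


@dataclass
class ContributorMetadata:
    date_of_birth: date
    gender: str
    language_proficiencies: list[str]  # assumed list of language codes
    places_lived: list[str]            # assumed list of place names
    short_biography: str
    current_place: str
    from_place: str


# Made-up example values, for illustration only.
corpus_item = CorpusMetadata(
    title="Example folk song recording",
    description="Placeholder description of a contributed item.",
    geo_location=(17.385, 78.4867),
    language="te",
    release_rights="placeholder-licence",
)
contributor = ContributorMetadata(
    date_of_birth=date(1990, 1, 1),
    gender="unspecified",
    language_proficiencies=["te", "hi"],
    places_lived=["Hyderabad", "Warangal"],
    short_biography="Placeholder biography.",
    current_place="Hyderabad",
    from_place="Warangal",
)

# asdict converts each record to a plain dict, e.g. for serialisation.
corpus_record = asdict(corpus_item)
contributor_record = asdict(contributor)
```

Keeping contributor details in a separate record from the corpus item is a design choice made here for clarity. How the real platform links the two is not stated in the source.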

Telugu Documents

Releasing Soon
Viswam.AI is researching the digitisation and processing of Telugu printed materials (books, magazines, newspapers, and other documents) spanning multiple decades and domains. The corpus is estimated at about 50 lakh (5 million) pages and represents a significant resource for Telugu language AI research.

Statistics

Pages: ~50 Lakh
Primary Language: Telugu

Source Types

Books, Magazines, Newspapers, Documents

Research Applications

Rooted in Community, Built for the Global South

Language is more than just text—it is history, humor, culture and identity. At ViswamAI, we believe that truly effective AI must understand the lived context of the people it serves. We develop high-quality, ethically sourced datasets that capture the genuine linguistic wealth, regional dialects, and cultural nuances of South Indian languages.

Our Data Ecosystem: A Collaborative Approach to Data Dignity

Standard AI datasets are often blindly scraped from the internet, erasing regional identity and colloquial truths. ViswamAI takes a grassroots, socio-technical approach. We collaborate across four distinct pillars to ensure our data is authentic, clean, and culturally grounded:
Grassroots Student Engagement
Through targeted college internships, skill-development drives, and hands-on hackathons like our annual Summer of AI and AI Days, we work with tech-forward youth to collect, format, and validate data while building real-world AI applications.
Expert Linguistic Curation
We partner closely with local linguists, language scholars, and literary groups. Their expert oversight ensures that complex cultural nuances, historical context, idioms, and grammatical integrity are preserved.
Crowdsourced Regional Volunteerism
To capture language as it is actually lived, our volunteer network reaches out directly to the general population across diverse demographics in Telangana and neighboring states. This helps us document varied oral traditions, rural accents, and localized dialects.
Academic & Industry Synergy
We collaborate with research institutions and industry leaders to maintain rigorous data standards, robust bias mitigation frameworks, and state-of-the-art data engineering practices.

Upcoming Datasets:

Viswam ASR

Releasing Soon

🤝 Contribute to the ViswamAI Ecosystem

Whether you are a student looking to join our next hackathon, a linguist passionate about heritage preservation, or an academic researcher, there is a place for you.