Datasets
Community-built datasets for Telugu and Indian language AI research, collected at the grassroots level across regions and communities.
Swecha Gonthuka Dataset
స్వేఛ్ఛ గోంతుక
Releasing Soon
The dataset was collected as part of the Swecha Gonthuka initiative by Swecha.org, a Telugu free software organisation. Volunteers across Telugu-speaking regions recorded speech samples through a structured collection platform.
Statistics
Total Duration: 1200+ hrs | Volunteers: 20,000+ | Language: Telugu (te-IN)
Collection
Format
Audio format: WAV (PCM 16-bit) | Sample rate: 16 kHz | Channels: Mono | Transcription: Telugu Unicode text
Each data point pairs a WAV audio file with its corresponding Telugu transcription. Audio is sampled at 16 kHz, the standard input requirement for wav2vec2-based models.
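As a rough illustration of the format described above, the following Python sketch checks whether a WAV file matches the stated specification (16 kHz, mono, 16-bit PCM). It relies only on the standard-library wave module. The function name check_wav_format and the file name in the usage note are hypothetical; they are not part of any released dataset tooling.

```python
import wave

# Values taken from the format description: 16 kHz, mono, 16-bit PCM.
EXPECTED_SAMPLE_RATE = 16_000  # Hz
EXPECTED_CHANNELS = 1          # mono
EXPECTED_SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit


def check_wav_format(path):
    """Return a list of mismatches between a WAV file and the stated spec.

    An empty list means the file matches on all three properties.
    (Hypothetical helper, shown for illustration only.)
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()

    if rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels found, expected {EXPECTED_CHANNELS}")
    if width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"sample width is {width} bytes, expected {EXPECTED_SAMPLE_WIDTH}")
    return problems
```

Calling check_wav_format("sample.wav") on a conforming file would return an empty list. A file recorded at a different rate would need to be resampled before being used with a model that expects 16 kHz input.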
Supported Tasks
Automatic Speech Recognition (ASR): Training or fine-tuning models (e.g. wav2vec2, Whisper) to convert Telugu speech to text.
Cultural Data
Releasing Soon
The Cultural Data corpus aims to capture the ground-level range of cultures, languages, and dialect patterns across regions. It is built by collecting diverse types of material at the grassroots level, working directly with communities.
Categories
Fables, Events, Music, Places, Food, Literature, Architecture, Skills, Images, Flora & Fauna, Education, Vegetation, People, Culture, Folk Tales, Folk Songs, Traditional Skills, Local Cultural History, Local History, Food & Agriculture, Newspapers Older Than 1980s, Medical Camp, Internship, Stand-Up, Mathematics
Media Types
Audio, Video, Image, Text, Documents (PDF)
Metadata
Corpus Metadata: Title, Description, Geo Location, Language (Indic Languages), Release Rights
Contributor Metadata: Date of Birth, Gender, Language Proficiencies (Indic Languages), Places Lived (Geo, Tagged), Short Biography, Current Place, From Place
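One way to picture the two metadata groups above is as a pair of record types. The Python sketch below is illustrative only: the field names mirror the lists above, but the class names, field types, and default values are assumptions, and the released schema may differ.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple


@dataclass
class CorpusMetadata:
    """Per-item metadata (field names from the list above; types assumed)."""
    title: str
    description: str
    language: str                                      # an Indic language, e.g. "te"
    release_rights: str
    geo_location: Optional[Tuple[float, float]] = None  # assumed (latitude, longitude)


@dataclass
class ContributorMetadata:
    """Per-contributor metadata (field names from the list above; types assumed)."""
    date_of_birth: Optional[str] = None                 # assumed ISO 8601 date string
    gender: Optional[str] = None
    language_proficiencies: List[str] = field(default_factory=list)
    places_lived: List[str] = field(default_factory=list)  # assumed geo-tagged place names
    short_biography: Optional[str] = None
    current_place: Optional[str] = None
    from_place: Optional[str] = None


# Placeholder values, for illustration only.
item = CorpusMetadata(
    title="Example title",
    description="Example description",
    language="te",
    release_rights="Example rights statement",
)
print(asdict(item))
```

Keeping contributor details in a separate record from item details would make it easier to share corpus items without exposing personal fields such as date of birth, though whether the project stores them this way is not stated.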
Telugu Documents
Releasing Soon
Viswam.AI is researching the digitisation and processing of Telugu printed materials (books, magazines, newspapers, and other documents) spanning multiple decades and domains. The corpus is estimated at approximately 50 lakh (5 million) pages and represents a significant resource for Telugu language AI research.
Statistics
Pages: ~50 Lakh | Primary Language: Telugu
Source Types
Books, Magazines, Newspapers, Documents
Research Applications
Rooted in Community, Built for the Global South
Language is more than just text: it is history, humor, culture, and identity. At ViswamAI, we believe that truly effective AI must understand the lived context of the people it serves. We develop high-quality, ethically sourced datasets that capture the genuine linguistic wealth, regional dialects, and cultural nuances of South Indian languages.
Our Data Ecosystem: A Collaborative Approach to Data Dignity
Standard AI datasets are often blindly scraped from the internet, erasing regional identity and colloquial truths. ViswamAI takes a grassroots, socio-technical approach. We collaborate across four distinct pillars to ensure our data is authentic, clean, and culturally grounded:
Grassroots Student Engagement
Through targeted college internships, skill-development drives, and hands-on hackathons like our annual Summer of AI and AI Days, we work with tech-forward youth to collect, format, and validate data while building real-world AI applications.
Expert Linguistic Curation
We partner closely with local linguists, language scholars, and literary groups. Their expert oversight ensures that complex cultural nuances, historical context, idioms, and grammatical integrity are preserved.
Crowdsourced Regional Volunteerism
To capture language as it is actually lived, our volunteer network reaches out directly to the general population across diverse demographics in Telangana and neighboring states. This helps us document varied oral traditions, rural accents, and localized dialects.
Academic & Industry Synergy
We collaborate with research institutions and industry leaders to maintain rigorous data standards, robust bias mitigation frameworks, and state-of-the-art data engineering practices.
Upcoming Datasets:
Viswam ASR
Releasing Soon
Contribute to the ViswamAI Ecosystem
Whether you are a student looking to join our next hackathon, a linguist passionate about heritage preservation, or an academic researcher, there is a place for you.