Datasets
Community-built datasets for Telugu and Indian language AI research, collected at the grassroots level across regions and communities.
Swecha Gonthuka Dataset
స్వేఛ్ఛ గోంతుక
Collection
Collected as part of the Swecha Gonthuka initiative by Swecha.org, a Telugu free software organisation. Volunteers across Telugu-speaking regions recorded speech samples through a structured collection platform.
- 20,000+ volunteers contributed recordings across Telugu-speaking regions.
- Structured to include speakers from varied demographics, age groups, and regions.
- Multiple Telugu dialects and accents are represented to improve model generalisability.
- Built incrementally through coordinated community drives.
Statistics
Format
Each data point pairs a WAV audio file with its corresponding Telugu transcription. Audio is sampled at 16 kHz, the standard input requirement for wav2vec2-based models.
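A quick sanity check on one audio/transcription pair might look like the sketch below. It uses only Python's standard `wave` module and assumes mono 16-bit PCM files. The helper names `write_silent_wav` and `check_pair`, the file name `sample.wav`, and the sample transcription are hypothetical and are not part of the dataset or its tooling.

```python
import tempfile
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 16_000  # sampling rate stated in the format description


def write_silent_wav(path, seconds=1.0, rate=EXPECTED_RATE_HZ):
    """Write mono 16-bit silence; a stand-in for a real recording."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


def check_pair(wav_path, transcript):
    """Report whether a WAV file and its transcription look usable."""
    with wave.open(str(wav_path), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return {
        "sample_rate_ok": rate == EXPECTED_RATE_HZ,
        "duration_s": frames / rate,
        "has_transcript": bool(transcript.strip()),
    }


# Demo on a synthetic file; the transcription is a placeholder word.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.wav"
    write_silent_wav(sample)
    report = check_pair(sample, "నమస్కారం")
    print(report)
```

Recordings that fail the sample-rate check would need resampling before being passed to a model that expects 16 kHz input.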
Supported Tasks
Access
Releasing Soon
The Swecha Gonthuka dataset will be made available on Hugging Face with gated access. Access will be reviewed to ensure use aligns with the Viswam.AI Dataset License for local-language technology.
Cultural Data
Releasing Soon
What is Cultural Data?
Cultural Data aims to capture the ground-level range of cultures, languages, and dialect patterns found across regions. The corpus is built by collecting diverse types of material at the grassroots level, working directly with communities.
Categories (25 types)
Full category label specifications: see the datasets handbook.
Media Types
Metadata
- Title
- Description
- Geo Location
- Language (Indic Languages)
- Release Rights
- Date of Birth
- Gender
- Language Proficiencies (Indic Languages)
- Places Lived (Geo, Tagged)
- Short Biography
- Current Place
- From Place
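The metadata fields listed above could be modelled as a single record type, as in the minimal sketch below. The class name `CulturalDataMetadata`, the field types, the flat layout, and the `missing_fields` helper are all assumptions made for illustration, and the example values are placeholders. The actual schema is not specified here.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import Optional


@dataclass
class CulturalDataMetadata:
    """Hypothetical record with one attribute per listed metadata field."""
    title: str = ""
    description: str = ""
    geo_location: str = ""
    language: str = ""  # an Indic language
    release_rights: str = ""
    date_of_birth: Optional[date] = None
    gender: str = ""
    language_proficiencies: list[str] = field(default_factory=list)
    places_lived: list[str] = field(default_factory=list)
    short_biography: str = ""
    current_place: str = ""
    from_place: str = ""

    def missing_fields(self) -> list[str]:
        """Names of fields that are still empty or unset."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values, for illustration only.
record = CulturalDataMetadata(
    title="Harvest song",
    language="Telugu",
    places_lived=["Guntur"],
)
missing = record.missing_fields()
print(missing)
```

A check like `missing_fields` is one way to flag incomplete submissions before they enter a corpus.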
Telugu Documents
Releasing Soon
Overview
Viswam.AI is researching the digitisation and processing of Telugu printed materials (books, magazines, newspapers, and other documents) spanning multiple decades and domains. The corpus is estimated at approximately 50 lakh (5 million) pages and represents a significant resource for Telugu language AI research.
Scale
Source Types
Research Applications
- Optical Character Recognition (OCR) for Telugu script.
- Large-scale pretraining corpora for Telugu language models.
- Document understanding and information extraction.
- Historical language and script analysis.