Datasets

Community-built datasets for Telugu and Indian language AI research, collected at the grassroots level across regions and communities.

Swecha Gonthuka Dataset

స్వేఛ్ఛ గోంతుక

Collection

The dataset was collected as part of the Swecha Gonthuka initiative by Swecha.org, a Telugu free software organisation. Volunteers across Telugu-speaking regions recorded speech samples through a structured collection platform.

  • 20,000+ volunteers contributed recordings across Telugu-speaking regions.
  • Structured to include speakers from varied demographics, age groups, and regions.
  • Multiple Telugu dialects are represented to improve model generalisability.
  • Built incrementally through coordinated community drives.

Statistics

Total duration 1,200+ hrs
Volunteers 20,000+
Language Telugu (te-IN)

Format

Audio format WAV (PCM 16-bit)
Sample rate 16 kHz
Channels Mono
Transcription Telugu Unicode text

Each data point pairs a WAV audio file with its corresponding Telugu transcription. Audio is sampled at 16 kHz, the standard input requirement for wav2vec2-based models.
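The format described above can be checked mechanically before training. The following is a minimal sketch using only Python's standard `wave` module; the function names (`check_wav`, `contains_telugu`, `raw_pcm_bytes`) are illustrative and not part of any released tooling, and the Telugu check assumes transcriptions use the Telugu Unicode block (U+0C00 to U+0C7F).

```python
import wave

# Expected values taken from the Format table above.
EXPECTED_RATE_HZ = 16_000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM

# Telugu Unicode block (U+0C00 to U+0C7F).
TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F


def check_wav(path):
    """Return a list of mismatches between a WAV file and the documented format."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {wav.getsampwidth()} bytes")
    return problems


def contains_telugu(text):
    """True if at least one character falls in the Telugu Unicode block."""
    return any(TELUGU_START <= ord(ch) <= TELUGU_END for ch in text)


def raw_pcm_bytes(hours):
    """Uncompressed audio size for `hours` in the documented format (headers ignored)."""
    bytes_per_second = (
        EXPECTED_RATE_HZ * EXPECTED_CHANNELS * EXPECTED_SAMPLE_WIDTH_BYTES
    )
    return int(hours * 3600 * bytes_per_second)
```

An empty list from `check_wav` means the file matches the documented format. By the same arithmetic, `raw_pcm_bytes(1200)` gives 138,240,000,000 bytes, so 1,200 hours of audio in this format is roughly 138 GB uncompressed.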

Supported Tasks

Automatic Speech Recognition (ASR)
Training or fine-tuning models (e.g. wav2vec2, Whisper) to convert Telugu speech to text.

Access

Releasing Soon

The Swecha Gonthuka dataset will be made available on Hugging Face with gated access. Access will be reviewed to ensure use aligns with the Viswam.AI Dataset License for local-language technology.

Cultural Data

Releasing Soon

What is Cultural Data?

Cultural Data aims to capture the full range of cultures, languages, and dialect patterns as they exist on the ground across regions. The corpus is built by collecting diverse types of material at the grassroots level, working directly with communities.

Categories (25 types)

Fables
Events
Music
Places
Food
Literature
Architecture
Skills
Images
Flora & Fauna
Education
Vegetation
People
Culture
Folk Tales
Folk Songs
Traditional Skills
Local Cultural History
Local History
Food & Agriculture
Newspapers Older Than 1980s
Medical Camp
Internship
Stand-Up
Mathematics

Full category label specifications: see the datasets handbook.

Media Types

Audio
Video
Image
Text
Documents (PDF)

Metadata

Corpus Metadata
  • Title
  • Description
  • Geo Location
  • Language (Indic Languages)
  • Release Rights
Contributor Metadata
  • Date of Birth
  • Gender
  • Language Proficiencies (Indic Languages)
  • Places Lived (Geo, Tagged)
  • Short Biography
  • Current Place
  • From Place
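The metadata fields listed above could be modelled as typed records. This is a sketch only: the class names, field names, and types are assumptions derived from the field labels, and the `Place` layout (name plus optional coordinates) is a guess at what "Geo, Tagged" might mean, not the project's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Place:
    """A geo-tagged place; the name/latitude/longitude layout is an assumption."""
    name: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None


@dataclass
class CorpusMetadata:
    """One record per contributed item (hypothetical field names)."""
    title: str
    description: str
    geo_location: Place
    language: str          # an Indic language, e.g. "te-IN"
    release_rights: str


@dataclass
class ContributorMetadata:
    """One record per contributor (hypothetical field names)."""
    date_of_birth: date
    gender: str
    language_proficiencies: List[str]
    short_biography: str
    current_place: Place
    from_place: Place
    places_lived: List[Place] = field(default_factory=list)


# Illustrative values only; not taken from the dataset.
record = CorpusMetadata(
    title="Example folk song",
    description="Placeholder description",
    geo_location=Place("Example village"),
    language="te-IN",
    release_rights="placeholder",
)
```

Keeping places as a separate type lets the same structure serve the corpus geo location and the contributor's current, origin, and previously lived places.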

Telugu Documents

Releasing Soon

Overview

Viswam.AI is researching the digitisation and processing of Telugu printed materials (books, magazines, newspapers, and other documents) spanning multiple decades and domains. The corpus is estimated at 50 lakh (5 million) pages and represents a significant resource for Telugu language AI research.

Scale

Pages ~50 lakh
Primary language Telugu

Source Types

Books
Magazines
Newspapers
Documents

Research Applications

  • Optical Character Recognition (OCR) for Telugu script.
  • Large-scale pretraining corpora for Telugu language models.
  • Document understanding and information extraction.
  • Historical language and script analysis.