Laughing Matters: A Goldmine of Indonesian Comedy Data
In comedy, laughter is the ultimate reward. Now imagine a treasure trove of it, neatly organized and ready for study. That is what a recent data collection project focused on Indonesian stand-up comedy has produced.
The Dataset
The project collected its data from Kompas TV's YouTube channel. Over 3,900 videos were analyzed, resulting in a dataset with:
- 2.8 million words
- 6,124 sentences
- 17,394 instances of audience laughter (carefully annotated)
Data Details
This data is not just raw text. Each entry includes:
- Video title
- URL
- Original and cleaned transcripts
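As a rough illustration, one entry could be modeled as a small record. The class name and field names below are hypothetical: the fields are listed above, but the dataset's actual column names and file format are not, so treat this as a sketch rather than the real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptEntry:
    """One dataset entry; field names are assumed, not taken from the dataset."""
    title: str
    url: str
    original_transcript: str
    cleaned_transcript: str


# Placeholder values for illustration only, not real dataset content.
entry = TranscriptEntry(
    title="Example title",
    url="https://example.com/watch?v=PLACEHOLDER",
    original_transcript="00:01 <c>Selamat</c> malam",
    cleaned_transcript="Selamat malam",
)
```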
The cleaning process involved:
- Removing timestamps
- Removing tags
- Normalizing whitespace
With that noise removed, the cleaned transcripts are well suited to natural language processing (NLP) tasks.
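The three cleaning steps can be sketched in Python as follows. The regular expressions are assumptions about what caption-style transcripts typically contain (clock-style timestamps and angle-bracket tags); the project's actual cleaning rules are not described here, so this is only a minimal approximation.

```python
import re

# Assumed patterns; the real transcript format may differ.
# Matches clock-style timestamps such as 01:23, 00:01:23 or 00:01:23.456.
TIMESTAMP_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?(?:[.,]\d{1,3})?\b")
# Matches angle-bracket markup such as <c> or </c>.
TAG_RE = re.compile(r"<[^>]+>")
# Matches any run of spaces, tabs or newlines.
WHITESPACE_RE = re.compile(r"\s+")


def clean_transcript(raw: str) -> str:
    """Remove timestamps and tags, then collapse whitespace."""
    text = TIMESTAMP_RE.sub(" ", raw)
    text = TAG_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


# Made-up sample line, not taken from the dataset.
sample = "00:00:01.000 <c>Selamat</c>   malam\n00:00:03.500 semuanya"
cleaned = clean_transcript(sample)
```

Replacing each match with a space rather than an empty string keeps neighbouring words from being glued together; the final whitespace pass then collapses the extra spaces.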
Importance of the Dataset
So, why is this dataset important? It opens up new avenues for research in:
- Humor detection
- Speech emotion recognition
- Cultural studies
For example, researchers can use this data to train models that can predict when laughter is likely to occur. This is particularly valuable for low-resource languages like Indonesian, where such datasets are rare.
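To make the laughter-prediction idea concrete, here is a minimal sketch of how training labels might be derived from an annotated transcript. It assumes, hypothetically, that laughter appears inline as a `[Tertawa]` marker; the dataset's real annotation scheme is not described here and may be quite different. Each sentence gets label 1 if laughter immediately follows it, and 0 otherwise.

```python
import re

# Hypothetical inline marker; the dataset's actual annotation may differ.
PIECE_RE = re.compile(
    r"(?P<laugh>\[tertawa\])|(?P<text>[^.!?\[]+[.!?]?)",
    re.IGNORECASE,
)


def label_sentences(transcript: str) -> list[tuple[str, int]]:
    """Return (sentence, label) pairs; label is 1 when laughter follows."""
    labeled: list[list] = []
    for match in PIECE_RE.finditer(transcript):
        if match.group("laugh"):
            # A laughter marker flags the sentence just before it.
            if labeled:
                labeled[-1][1] = 1
            continue
        sentence = match.group("text").strip()
        if sentence:
            labeled.append([sentence, 0])
    return [(sentence, label) for sentence, label in labeled]


# Made-up example text, not taken from the dataset.
pairs = label_sentences(
    "Saya baru pulang kampung. Ibu tanya kapan nikah. [Tertawa] Saya jawab besok."
)
```

Pairs like these could feed any binary text classifier. The naive split on sentence-ending punctuation is a simplification, since spoken transcripts often lack reliable punctuation.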
Accessibility and Impact
The dataset is openly accessible on Mendeley Data and adheres to ethical standards and platform policies. It fills a significant gap in Indonesian language corpora, especially in the entertainment and humor domain. This makes it a valuable resource for both academic research and applied projects in:
- Computational linguistics
- Human-centered AI
Conclusion
In essence, this dataset is a laugh-out-loud opportunity for researchers to dive deep into the world of Indonesian comedy and uncover the secrets behind what makes us laugh.