Laughing Matters: A Goldmine of Indonesian Comedy Data

In the world of comedy, laughter is the ultimate reward. Now, imagine a treasure trove of laughter, all neatly organized and ready for study. This is exactly what a recent data collection project has achieved, focusing on Indonesian stand-up comedy. The project gathered its material from Kompas TV's YouTube channel. Over 3,900 videos were analyzed, resulting in a dataset packed with 2.8 million words and 6,124 sentences. But what makes this dataset truly special is the 17,394 instances of audience laughter that have been carefully annotated. This data is not just raw text. Each entry includes the video title, the URL, and both the original and the cleaned transcript. The cleaning process involved removing timestamps and tags and normalizing whitespace, which leaves text that is ready for natural language processing (NLP) tasks.
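To make the cleaning step concrete, here is a minimal Python sketch of what such a pass might look like. It is not the project's actual code. The timestamp and tag formats (WebVTT-style cue timings, bracketed clock times, angle-bracket markup) and the function name clean_transcript are assumptions made for illustration, since the dataset description does not say which formats were handled.

```python
import re

# Assumed formats (not specified by the dataset description):
# WebVTT-style cue timings such as "00:00:01.000 --> 00:00:04.000"
CUE_TIMING = re.compile(
    r"\d{1,2}:\d{2}(?::\d{2})?[.,]\d{3}\s*-->\s*\d{1,2}:\d{2}(?::\d{2})?[.,]\d{3}"
)
# Angle-bracket markup such as "<c>" or "</c>"
TAG = re.compile(r"<[^>]+>")
# Leftover clock times such as "[00:05]" or "01:02:03"
TIMESTAMP = re.compile(r"\[?\d{1,2}:\d{2}(?::\d{2})?(?:[.,]\d{1,3})?\]?")
WHITESPACE = re.compile(r"\s+")


def clean_transcript(raw: str) -> str:
    """Hypothetical helper: strip timestamps and tags, then normalize whitespace."""
    text = CUE_TIMING.sub(" ", raw)    # remove cue timing lines first
    text = TAG.sub(" ", text)          # remove markup tags
    text = TIMESTAMP.sub(" ", text)    # remove any remaining clock times
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = (
        "00:00:01.000 --> 00:00:04.000\n"
        "<c>Selamat</c>   malam\n"
        "[00:05] semuanya [Tertawa]"
    )
    print(clean_transcript(sample))  # prints: Selamat malam semuanya [Tertawa]
```

In this sketch, bracketed text that is not a clock time, such as the made-up [Tertawa] marker in the sample, is deliberately left in place. How the real dataset encodes its laughter annotations is not stated here, so that choice is also an assumption.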

So, why is this dataset important? It opens up new avenues for research in humor detection, speech emotion recognition, and even cultural studies. For example, researchers can use the annotated laughter to train models that predict when an audience is likely to laugh. This is particularly valuable for low-resource languages like Indonesian, where such datasets are rare. The dataset is openly accessible on Mendeley Data and adheres to ethical standards and platform policies. It fills a significant gap in Indonesian language corpora, especially in the entertainment and humor domain, which makes it a valuable resource for both academic research and applied projects in computational linguistics and human-centered AI. In essence, this dataset is a laugh-out-loud opportunity for researchers to dive deep into the world of Indonesian comedy and uncover the secrets behind what makes us laugh.
