Article

Obsidian Source: Drafts / Audio Tokenization

Summary

Pending synthesis from local Obsidian source.

Original source title: Audio Tokenization

Extracted Preview

Undergraduate research at NITW. Let that sink in. The problem seems genuinely good, and we can make something out of it if we put some work in.

Problem Statement

The task at hand is to classify the mood (or vibe, if you will) of an audio file using some form of Transformer architecture, either by first converting the audio to text with something like Whisper, or by working on the raw audio with Wav2Vec or something similar. We can build a classification model and train it from scratch if we can gather data (or adapt an existing model for our own case).
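
As a concrete sketch of the second option (classifying raw audio directly), here is what a Wav2Vec-based mood classifier might look like with HuggingFace transformers. The checkpoint name is a real one, but the mood label set is an assumption, and the classification head is randomly initialized, so it would still need fine-tuning on labelled data:

```python
# Minimal sketch: Wav2Vec2 encoder + classification head for mood labels.
# MOODS is an assumed label set, not from any dataset.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

MOODS = ["happy", "sad", "energetic", "calm"]  # placeholder label set

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=len(MOODS)
)  # head is newly initialized; fine-tune before trusting predictions

def classify(waveform, sample_rate=16_000):
    # waveform: 1-D float array of raw audio, resampled to 16 kHz
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return MOODS[logits.argmax(dim=-1).item()]
```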

Whisper

Whisper by OpenAI is a speech processing system that works out of the box (earlier approaches used unsupervised methods for audio encoding, but decoding remained a bit of a task in itself).
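
A minimal sketch of that out-of-the-box behaviour with the openai-whisper package; the model size and file path are placeholders:

```python
# One call gives a transcript; no fine-tuning step needed.
import whisper

model = whisper.load_model("base")       # placeholder model size
result = model.transcribe("clip.mp3")    # placeholder file path
print(result["text"])
```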

A Transformer seq2seq model is trained on various speech processing tasks. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, which allows a single model to replace many stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers (or classification targets).
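
Those task specifiers surface through the decoding options in the openai-whisper package. A sketch following the project README, with the file name as a placeholder; the `task` and `language` options become the special tokens in the decoder prompt:

```python
# Lower-level decoding path: the task/language options map onto the
# <|transcribe|> / <|en|> specifier tokens in the multitask format.
import whisper

model = whisper.load_model("base")

audio = whisper.load_audio("clip.mp3")   # placeholder file path
audio = whisper.pad_or_trim(audio)       # Whisper works on 30-second windows
mel = whisper.log_mel_spectrogram(audio).to(model.device)

options = whisper.DecodingOptions(task="transcribe", language="en")
result = whisper.decode(model, mel, options)
print(result.text)
```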

Our case?

A problem of Music Emotion Detection:

For our case, we might need labelled music samples annotated with mood:

https://github.com/fdlm/listening-moods
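
To turn that repo into training data, something like the following might work. This is only a sketch, assuming the labels ship as a CSV mapping track identifiers to mood tags; the file name and column names here are placeholders, so check the repo's actual layout first:

```python
# Hypothetical loader for a mood-labelled track list. The file name and
# column name below are assumptions, not the repo's documented format.
import pandas as pd

labels = pd.read_csv("moods.csv")   # placeholder path
print(labels.columns.tolist())      # inspect the real schema first

# assumed "mood" column: check the class balance before training
print(labels["mood"].value_counts())
```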

A longer way involves using the spotipy package to fetch 30-second previews of songs, then running them through some form of audio seq2seq model to classify them into moods. More planning is required for this step; a rough sketch of the fetching half follows.
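
A hedged sketch of that route with spotipy: the search query is a placeholder, and Spotify API credentials are assumed to be set in the standard SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET environment variables:

```python
# Fetch a track's ~30-second preview MP3 via the Spotify Web API.
import requests
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

results = sp.search(q="Bohemian Rhapsody", type="track", limit=1)  # placeholder query
track = results["tracks"]["items"][0]

# preview_url points at a ~30-second MP3, but can be None for some tracks
if track["preview_url"]:
    audio = requests.get(track["preview_url"]).content
    with open("preview.mp3", "wb") as f:
        f.write(audio)
```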

Integration Notes

  • Source folder: /home/yashs/Documents/Docs/Obsidian/Research-Notes
  • Local source: /home/yashs/Documents/Docs/Obsidian/Research-Notes/Drafts/Audio Tokenization.md
  • Raw copy: raw/obsidian/research-notes/Drafts/Audio Tokenization.md

Links Created Or Updated

Open Questions