
When You Need Word-Level Timestamps from Whisper - whisperX or whisper-timestamped

비아 VIA 2023. 6. 18. 22:39

Last time, I wrote about extracting text from speech with the Whisper API.

https://jellyfishdeveloper.tistory.com/entry/python%EC%9C%BC%EB%A1%9C-%EC%9C%A0%ED%8A%9C%EB%B8%8C-%EC%9E%90%EB%A7%89-%EB%8B%A4%EC%9A%B4%EB%A1%9C%EB%93%9C-%EB%B2%88%EC%97%AD%ED%95%B4%EB%B3%B4%EA%B8%B0-feat-ChatGPT-Whisper-api

Downloading and translating YouTube subtitles with Python (feat. ChatGPT, Whisper API)


Issue

The speech-to-text quality was excellent, but the Whisper API had one drawback: it returns timestamps only per sentence (segment), with no word-level timestamps.

For example, take the sentence "ChatGPT is an artificial intelligence chatbot developed by OpenAI based on the company's Generative Pre-trained Transformer (GPT) series of large language models."

The API gives a timestamp only for the start and the end of that sentence, so all you learn is something like "starts at 00:00:01, ends at 00:00:10".

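In my case, I wanted to generate subtitles, so I needed word-level timestamps.

Word-level timestamps make subtitle timing more precise.

They also help because Whisper sometimes ends a segment at an awkward point in the middle of a sentence, which made my translations unnatural or simply wrong.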
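To make the second point concrete, here is a minimal sketch of how word-level timestamps let you rebuild sentence boundaries yourself before translating. The word list and its field names (`text`, `start`, `end`) are made up for illustration, and splitting on `.`, `!`, `?` is a naive rule that would break on abbreviations such as "Mr.".

```python
# Hypothetical word list; the field names ("text", "start", "end") are assumed.
words = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "there.", "start": 0.4, "end": 0.9},
    {"text": "How", "start": 1.2, "end": 1.4},
    {"text": "are", "start": 1.4, "end": 1.5},
    {"text": "you?", "start": 1.5, "end": 1.9},
]


def ends_sentence(token):
    """Naive check: the token ends with ., ! or ? (ignoring closing quotes/brackets)."""
    return token.rstrip("\"')]").endswith((".", "!", "?"))


def group_into_sentences(words):
    """Group consecutive words into sentences, keeping each sentence's start and end time."""
    groups, current = [], []
    for w in words:
        current.append(w)
        if ends_sentence(w["text"]):
            groups.append(current)
            current = []
    if current:  # trailing words without final punctuation
        groups.append(current)
    return [
        {
            "text": " ".join(w["text"] for w in g),
            "start": g[0]["start"],
            "end": g[-1]["end"],
        }
        for g in groups
    ]


sentences = group_into_sentences(words)
for s in sentences:
    print(f"{s['start']:.2f} -> {s['end']:.2f}  {s['text']}")
```

Each rebuilt sentence keeps the start time of its first word and the end time of its last word, so it can be translated as a whole and still be placed on the timeline.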

Solution

That is how I found WhisperX and whisper-timestamped.

Both are open-source projects built on top of Whisper, and both provide word-level timestamps.

WhisperX

https://github.com/m-bain/whisperX

GitHub - m-bain/whisperX: WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)


First, install whisperX into the virtual environment you have already created.

pip install git+https://github.com/m-bain/whisperx.git

You also need to install ffmpeg, which handles all kinds of video and audio processing.

The link below has setup instructions.

https://github.com/openai/whisper#setup

GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision
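Since a missing ffmpeg is an easy thing to trip over, a quick check like the one below can save some debugging. It only uses the standard library; the helper name `find_ffmpeg` is my own, and the assumption is that the audio loader looks for an `ffmpeg` executable on PATH.

```python
import shutil


def find_ffmpeg():
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


ffmpeg_path = find_ffmpeg()
if ffmpeg_path is None:
    print("ffmpeg not found on PATH; install it before loading audio")
else:
    print(f"ffmpeg found at {ffmpeg_path}")
```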


I tested with device set to cpu.

The final result contains the word-level timestamps produced by whisperX's alignment step.

import whisperx

device = "cpu"  # use "cuda" if a GPU is available
audio_file = "audio.mp3"
batch_size = 16  # reduce if low on memory
compute_type = "int8"  # float16 is not supported on CPU; use "float16" on a GPU

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment: segment-level timestamps only

# 2. Align whisper output to get word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"])  # after alignment: each segment carries its words
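Once alignment is done, the words usually need to be pulled out of the segments. The sketch below works on a hand-written sample shaped the way I understand the aligned output to look (a `words` list per segment with `word`, `start`, `end`, `score`); the values are invented, and the helper name `timed_words` is mine. As far as I know, tokens the alignment model cannot handle (numbers, for instance) can come back without timing, so the code skips those instead of assuming every word has a `start`.

```python
# Hand-written sample shaped like result["segments"] after alignment.
# This is an assumption for illustration, not real whisperX output.
aligned_segments = [
    {
        "start": 0.5,
        "end": 2.1,
        "text": " It was 2014.",
        "words": [
            {"word": "It", "start": 0.5, "end": 0.7, "score": 0.91},
            {"word": "was", "start": 0.8, "end": 1.1, "score": 0.88},
            {"word": "2014."},  # no timing: this token could not be aligned
        ],
    },
]


def timed_words(segments):
    """Flatten segments into (word, start, end) tuples, skipping words without timing."""
    flat = []
    for seg in segments:
        for w in seg.get("words", []):
            if "start" in w and "end" in w:
                flat.append((w["word"], w["start"], w["end"]))
    return flat


flat = timed_words(aligned_segments)
for word, start, end in flat:
    print(f"{start:6.2f} {end:6.2f}  {word}")
```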

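The project is still young, so I ran into more errors than I expected while getting it to work, and the documentation and other material were thin, which made things harder.

It made me wonder whether developers become open-source contributors exactly this way: they hit errors like these, lose patience, and decide "I might as well fix it myself!"

Digging through the repo's issues and pull requests every time I hit an error was fun in its own way.

whisperX still has a long todo list. (It made me want to contribute myself if I get the chance!)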

whisper-timestamped

whisper-timestamped also supports word-level timestamps.

It has fewer features than whisperX, but there is less to install, so it feels more lightweight, which I liked.

https://github.com/linto-ai/whisper-timestamped

GitHub - linto-ai/whisper-timestamped: Multilingual Automatic Speech Recognition with word-level timestamps and confidence


whisper-timestamped also requires ffmpeg. (See the link above.)

pip3 install git+https://github.com/linto-ai/whisper-timestamped

You can install additional libraries depending on the features you need, and I liked that the README also explains how to install a lighter version.

Usage is similar to whisper and whisperX.

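import json

import whisper_timestamped as whisper

# Load the audio file as a waveform
audio = whisper.load_audio("AUDIO.wav")

# "tiny" is the smallest model, which is enough for a quick CPU test
model = whisper.load_model("tiny", device="cpu")

# Transcribe; the language is fixed to French for this sample
result = whisper.transcribe(model, audio, language="fr")

# ensure_ascii=False keeps accented characters readable in the output
print(json.dumps(result, indent=2, ensure_ascii=False))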

The result looks like this. Again, it includes word-level timestamps, so you can build finer-grained features on top of it.

{
  "text": " Bonjour! Est-ce que vous allez bien?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.5,
      "end": 1.2,
      "text": " Bonjour!",
      "tokens": [ 25431, 2298 ],
      "temperature": 0.0,
      "avg_logprob": -0.6674491882324218,
      "compression_ratio": 0.8181818181818182,
      "no_speech_prob": 0.10241222381591797,
      "confidence": 0.51,
      "words": [
        {
          "text": "Bonjour!",
          "start": 0.5,
          "end": 1.2,
          "confidence": 0.51
        }
      ]
    },
    {
      "id": 1,
      "seek": 200,
      "start": 2.02,
      "end": 4.48,
      "text": " Est-ce que vous allez bien?",
      "tokens": [ 50364, 4410, 12, 384, 631, 2630, 18146, 3610, 2506, 50464 ],
      "temperature": 0.0,
      "avg_logprob": -0.43492694334550336,
      "compression_ratio": 0.7714285714285715,
      "no_speech_prob": 0.06502953916788101,
      "confidence": 0.595,
      "words": [
        {
          "text": "Est-ce",
          "start": 2.02,
          "end": 3.78,
          "confidence": 0.441
        },
        {
          "text": "que",
          "start": 3.78,
          "end": 3.84,
          "confidence": 0.948
        },
        {
          "text": "vous",
          "start": 3.84,
          "end": 4.0,
          "confidence": 0.935
        },
        {
          "text": "allez",
          "start": 4.0,
          "end": 4.14,
          "confidence": 0.347
        },
        {
          "text": "bien?",
          "start": 4.14,
          "end": 4.48,
          "confidence": 0.998
        }
      ]
    }
  ],
  "language": "fr"
}
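As one example of what the word-level data makes possible, here is a minimal sketch that turns a result shaped like the JSON above into SRT subtitle cues of a few words each. The `result` dict is a trimmed copy of that output; the helper names `srt_time` and `words_to_srt` and the three-words-per-cue limit are my own choices for illustration, not part of whisper-timestamped.

```python
# Trimmed copy of the output above: only the fields this sketch needs.
result = {
    "segments": [
        {"words": [{"text": "Bonjour!", "start": 0.5, "end": 1.2}]},
        {"words": [
            {"text": "Est-ce", "start": 2.02, "end": 3.78},
            {"text": "que", "start": 3.78, "end": 3.84},
            {"text": "vous", "start": 3.84, "end": 4.0},
            {"text": "allez", "start": 4.0, "end": 4.14},
            {"text": "bien?", "start": 4.14, "end": 4.48},
        ]},
    ]
}


def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(result, max_words=3):
    """Build SRT text with at most max_words per cue, never crossing a segment boundary."""
    cues = []
    for seg in result["segments"]:
        seg_words = seg["words"]
        for i in range(0, len(seg_words), max_words):
            chunk = seg_words[i:i + max_words]
            start = srt_time(chunk[0]["start"])
            end = srt_time(chunk[-1]["end"])
            text = " ".join(w["text"] for w in chunk)
            cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


srt = words_to_srt(result)
print(srt)
```

Each cue starts at its first word and ends at its last word, which is exactly the precision that segment-level timestamps cannot give.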

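In my own experience, its accuracy seemed slightly lower than whisperX's, and because accuracy mattered for the feature I was building, I ended up using whisperX.

Still, I think whisper-timestamped is more than good enough when you want something simple to use.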

Outro

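This looks like nothing more than using a simple open-source library, but because these projects are so new, I could not avoid all sorts of errors.

Each time, I searched the issues, read the conversations of developers who had hit the same problems, and enjoyed working toward a solution.

The libraries we rely on today must have gone through the same trial and error before they became stable.

I also saw how powerful it is when problems are solved not by a single person or company but by collective intelligence.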
