How to Build an AI Character with ChatGPT

Uncategorized · 2024. 5. 25. 22:02

Hello.

I once posted a video to YouTube in which

I built an AI modeled on an anime character.

https://www.youtube.com/watch?v=oB8C4FcZDD0

★★ If you haven't seen the video yet, please watch it and leave a comment and a like!!! ★★
----

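<Update>

After this post, the video pulled in a lot of views,

so I seized the moment and uploaded two more videos to leech off it like a mosquito,

but they flopped miserably T_T

If they ever make a comeback, I am willing to post the code for them too,

so please give my other similar videos some love as well. Adieu!!!

Two AIs talking to each other
- Compared with the TTS used in this project, the TTS in that one has higher voice quality and a longer generation time, and is easier to train and to set up.

https://youtu.be/0eWRZG3_z0w

Building a chatbot in a VR space
- Compared with this project, the TTS in that one is easier to set up, with lower voice quality and a shorter generation time.

https://youtu.be/w982zt5s9Jg

If you check these out too, please leave a like and a comment!!!

I also accept paid support through YouTube comments~~~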

---------

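Originally I had no intention of releasing the source code, because there was too much to clean up,

but quite a lot of people asked for it, so I have tidied up the code and am posting it here.

Some parts will run fine even on an old Windows machine or a MacBook,

but some steps along the way need Ubuntu and a high-end NVIDIA GPU,

so don't try to do everything; pick the parts you can actually run!!

Let's get it!!!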

Converting audio files straight to text with the Whisper model

https://github.com/openai/whisper

Install the Whisper model on Python 3.8-3.11.

pip install -U openai-whisper

You also need a program called ffmpeg.

Install it in whichever way suits your environment.

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
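
Whisper calls ffmpeg as an external program, so a missing install usually only shows up later as a confusing error at transcription time. The snippet below is a minimal sketch for checking up front that ffmpeg is reachable on PATH. The helper name find_ffmpeg is my own and is not part of Whisper.

```python
import shutil


def find_ffmpeg(executable="ffmpeg"):
    """Return the full path of the ffmpeg binary, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg was not found on PATH; install it with one of the commands above.")
    else:
        print(f"ffmpeg found at {path}")
```

If this reports that ffmpeg is missing right after installing it, opening a new terminal so that PATH is reloaded is worth trying first.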

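If you only want to transcribe audio files, this is already enough,

but to save microphone audio as a WAV file and then transcribe it, install pyaudio as well. The wave module ships with Python, so it does not need to be installed with pip.

pip install pyaudio

Now run the code below.

It converts Japanese spoken into the microphone into text.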

import pyaudio
import wave
import whisper

model = whisper.load_model("medium")

def transcribe_directly():
    sample_rate = 16000
    bits_per_sample = 16
    chunk_size = 1024
    audio_format = pyaudio.paInt16
    channels = 1

    def callback(in_data, frame_count, time_info, status):
        wav_file.writeframes(in_data)
        return None, pyaudio.paContinue

    # Open the wave file for writing
    wav_file = wave.open('./output.wav', 'wb')
    wav_file.setnchannels(channels)
    wav_file.setsampwidth(bits_per_sample // 8)
    wav_file.setframerate(sample_rate)

    # Initialize PyAudio
    audio = pyaudio.PyAudio()

    # Start recording audio
    stream = audio.open(format=audio_format,
                        channels=channels,
                        rate=sample_rate,
                        input=True,
                        frames_per_buffer=chunk_size,
                        stream_callback=callback)

    input("Press Enter to stop recording...")
    # Stop and close the audio stream
    stream.stop_stream()
    stream.close()
    audio.terminate()

    # Close the wave file
    wav_file.close()
    result = model.transcribe("./output.wav", language="ja")
    return result['text']

text = transcribe_directly()
print(text)

To recognize Korean, use result = model.transcribe("./output.wav", language="ko");

to recognize English, use result = model.transcribe("./output.wav", language="en").
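
When the transcript comes back empty or strange, it helps to confirm that output.wav really contains audio in the format the recorder was configured for (mono, 16-bit, 16000 Hz). The following is a small sketch using only the standard wave module. The helper names describe_wav and write_silence are hypothetical, and write_silence exists only so the example can run without a microphone.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate, duration_seconds)."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_width = wav_file.getsampwidth()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_width, sample_rate, duration


def write_silence(path, seconds=1.0, sample_rate=16000):
    """Write a silent mono 16-bit WAV with the same settings as the recorder."""
    frame_count = int(seconds * sample_rate)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16 bits = 2 bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * frame_count)


if __name__ == "__main__":
    write_silence("silence_check.wav", seconds=1.0)
    print(describe_wav("silence_check.wav"))
```

Calling describe_wav("./output.wav") after a recording shows at a glance whether the file is only a fraction of a second long, which would suggest the microphone stream never delivered data.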

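Creating an AI character with ChatGPT

Let's build the character with ChatGPT.

The ChatGPT API is pay-as-you-go, so set up billing

and add some credit before using it.

5 dollars is plenty to play around with.

I will use the Assistants API to set it up as Frieren.

https://platform.openai.com/assistants/

Go there, and once you have added a little credit to your OpenAI account,

add an assistant and enter Instructions like the following.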

Your job is to imagine that you are Frieren, whose character link is https://frieren.fandom.com/wiki/Frieren. As Frieren, you need to embody her calm and composed nature while speaking in Japanese. 


Using the detailed character traits and typical behaviors extracted from the character setting document (frieren.pdf), along with analysis of dialogue patterns from output.txt, create responses that emulate Frieren's speech and behavior.  Also refer to 'dialog.txt' file, which is the dialogue between Frieren and other person. 

Her dialog style is like below:
- "まあいいや"
- "今日の買い出し当番私だったのに寝坊しちゃったから"
- "直接の感謝じゃないよ この村の人たちはヒンメルを信じていたんだ"
- "まあいいや"
- "友人から預かった子を七に送るつもりはないよ"
Note that she do not speak in honorific terms. She usually finish her sentence with 'いるよ', 'たんだ', or 'からね'.


Please respond as Frieren would, using her Japanese language skills and knowledge of her character abilities and experiences. Consider my personality, thoughts, and feelings as Frieren, and provide insight and nuance in your responses. Do not quote within your answer. Be concise in your words.

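The instructions are made up of the following parts.

Refer to https://frieren.fandom.com/wiki/Frieren
Reflect her quiet personality
Speak in Japanese
Refer to the PDF describing the character setting and the TXT files containing her lines
Follow the Japanese speech samples, do not use honorifics, and end sentences with 'いるよ', 'たんだ', or 'からね'
Speak as if you were Frieren
Do not speak at length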
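
If you want to reuse this structure for a different character, the parts listed above can be assembled from a few variables. This is only a rough sketch: build_instructions is a hypothetical helper of mine, and its wording is a shortened paraphrase of my prompt, not the exact text I entered.

```python
def build_instructions(character, wiki_url, language, files, sample_lines, endings):
    """Assemble an Instructions text from the components listed above."""
    quoted_lines = "\n".join(f'- "{line}"' for line in sample_lines)
    quoted_endings = ", ".join(f"'{ending}'" for ending in endings)
    parts = [
        f"Imagine that you are {character}, whose character link is {wiki_url}.",
        f"Embody her calm and composed nature while speaking in {language}.",
        f"Refer to the attached files: {', '.join(files)}.",
        f"Her dialog style is like below:\n{quoted_lines}",
        f"Do not use honorifics. Usually end sentences with {quoted_endings}.",
        f"Respond as {character} would. Be concise in your words.",
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_instructions(
        "Frieren",
        "https://frieren.fandom.com/wiki/Frieren",
        "Japanese",
        ["frieren.pdf", "output.txt", "dialog.txt"],
        ["まあいいや", "じゃあ次"],
        ["いるよ", "たんだ", "からね"],
    ))
```

The printed text can be pasted into the Instructions field and then adjusted by hand.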

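There was not much material to work with, so getting it to imitate the character took a fair amount of trial and error.

Keep the following in mind when writing your own instructions.

Mixing languages does not seem to matter. The wiki and the instructions were in English, the character setting in Korean, and the speech samples in Japanese, and it still worked well. There is no need to make them match.
Feeding in the dialogue collection and the character setting does not make it imitate the character completely on its own. That is why I also added directions such as 'do not use honorifics', added example Japanese sentences, and included a rough summary of her personality.
The 'do not speak at length' directive matters. Without it, the replies ramble on and on.

Writing directives like these well from scratch is hard, so
I went to https://www.feedough.com/ai-prompt-generator/ and

gave it the prompt 'Act as Frieren whose character link is https://frieren.fandom.com/wiki/Frieren' to produce a rough first draft, then worked from there.

I am fluent in Japanese, so I talked to it in Japanese,

but you could probably change the directive to something like 'answer in Korean' and play with it that way.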

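Put the three files below into File search and try saying a few things in the Playground.

https://drive.google.com/file/d/1DWyrR-l4FdbCofZF7FEn4t943_6DJCOm/view?usp=sharing

The PDF is just scraped from Namuwiki, so there is nothing special about it, but I collected output.txt and dialog.txt by hand.

I tracked down each line the character says in the anime and played it back,

then transcribed the playback into Japanese one line at a time with the microphone-to-text code from earlier.

Once it works well in the Playground,

note down the assistant's ID for use in the code.

The code for the conversational chatbot is as follows.

First install the openai Python bindings:

pip install openai

It is barely different from the official Assistants API example, but here is the code. Fill in api_key and assistant_id with your own values. The re.sub call at the end strips the bracketed 【...】 markers, which are the file citation annotations that can appear in replies when File search is used.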

import time

from openai import OpenAI
import re


client = OpenAI(
  api_key=''
)
assistant = client.beta.assistants.retrieve(
    assistant_id=''
)
thread = client.beta.threads.create()

def wait_on_run(run, thread):
    while run.status == "queued" or run.status == "in_progress":
        run = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id,
        )
        time.sleep(0.5)
    return run

while True:
    content = input('Enter your message:')
    message = client.beta.threads.messages.create(
        thread_id=thread.id,
        role='user',
        content=content
    )



    # Execute our run
    run = client.beta.threads.runs.create(
        thread_id=thread.id,
        assistant_id=assistant.id,
    )

    # Wait for completion
    wait_on_run(run, thread)
    # Retrieve all the messages added after our last user message
    messages = client.beta.threads.messages.list(
        thread_id=thread.id, order="asc", after=message.id
    )
    response_text = ""
    for message in messages:
        for c in message.content:
            response_text += c.text.value
    clean_text = re.sub('【.*?】', '', response_text)
    print(clean_text)
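
One weakness of wait_on_run above is that it loops forever if a run never leaves the queued or in_progress state, and it does not tell you whether the run actually completed. Below is a minimal sketch of the same polling loop with a timeout. The helper name wait_until_done and its parameters are my own and not part of the openai package. It takes any function that returns the current status string, so it can be tried without calling the API.

```python
import time


def wait_until_done(fetch_status, timeout=60.0, interval=0.5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it leaves 'queued'/'in_progress' or time runs out."""
    deadline = clock() + timeout
    status = fetch_status()
    while status in ("queued", "in_progress"):
        if clock() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout} seconds")
        sleep(interval)
        status = fetch_status()
    return status


if __name__ == "__main__":
    # Simulated status sequence standing in for repeated API calls.
    fake_statuses = iter(["queued", "in_progress", "completed"])
    print(wait_until_done(lambda: next(fake_statuses), interval=0.0))
```

With the client from the chatbot code, fetch_status could be a lambda that returns client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id).status, and the reply would only be read when the returned status is "completed".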

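Creating an AI TTS voice with the XTTS library

Warning!!!

The difficulty jumps sharply from here. Don't follow along carelessly. I will skip the general development know-how entirely.

https://github.com/coqui-ai/TTS

We will fine-tune with this library.

Before that, you need to collect audio clips one by one:

prepare WAV files of the character speaking.

You have to collect these yourself; I will not be providing the WAV files.

Consider working out how to get them part of the adventure.

In my case I cut out every line up to episode 6 of the anime and collected 447 clips.

Download a high-quality audio source

and use Ultimate Vocal Remover to strip out all the background music.

https://ultimatevocalremover.com/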


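Select MDX23C-InstVoc HQ as the MDX model, process the audio,

and then collect clean voice clips with no noise. If another character's dialogue, rustling, or footsteps are mixed into a clip, discard it without hesitation.

Leave out lines where the character sounds emotional, such as angry or surprised,

and collect only calm lines with as little emotion as possible.

You also need one more file that lists the line spoken in each collected clip.

Its contents look like this (clip ID, transcript, and normalized transcript, separated by |).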

output0|どうだった?|どうだった?
output1|逃げとこ。|逃げとこ。
output10|殴ったの?|殴ったの?
output100|元からでしょ|元からでしょ
output101|街中だと見えにくいね|街中だと見えにくいね
output102|じゃあ次|じゃあ次
output103|50年後|50年後
output104|もっと綺麗に見える場所知ってるから案内するよ|もっと綺麗に見える場所知ってるから案内するよ
output105|じゃあ私はここで|じゃあ私はここで
output106|魔法の収集を続けるよ|魔法の収集を続けるよ
output107|100年くらいは中央諸国を巡る予定だから|100年くらいは中央諸国を巡る予定だから
output108|そう…困ったな…召喚に使うのに…|そう…困ったな…召喚に使うのに…
output109|そういえば魔王城で拾ったやつヒンメルに預けたままだっけ|そういえば魔王城で拾ったやつヒンメルに預けたままだっけ
output11|クソババアか|クソババアか
output110|わかんない|わかんない
output111|もうすぐエーラ流星の時期だし|もうすぐエーラ流星の時期だし
output112|ついでに取りに行くか|ついでに取りに行くか
output113|確かここら辺…|確かここら辺…
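
A malformed row in this file tends to surface only when training starts, so a quick check beforehand can save a rented GPU hour. The sketch below validates rows in the format shown above. check_metadata is a hypothetical helper of mine, and the rule that the clip ID must not carry a .wav extension is my understanding of how the ljspeech formatter builds file paths, so treat it as an assumption.

```python
import csv
import io


def check_metadata(text):
    """Return a list of (line_number, problem) for pipe-delimited metadata rows."""
    problems = []
    seen_ids = set()
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    for number, row in enumerate(reader, start=1):
        if len(row) != 3:
            problems.append((number, f"expected 3 fields, found {len(row)}"))
            continue
        clip_id, transcript, normalized = row
        if clip_id.lower().endswith(".wav"):
            problems.append((number, "clip ID should not include the .wav extension"))
        if not transcript.strip() or not normalized.strip():
            problems.append((number, "empty transcript"))
        if clip_id in seen_ids:
            problems.append((number, f"duplicate clip ID {clip_id}"))
        seen_ids.add(clip_id)
    return problems


if __name__ == "__main__":
    sample = "output0|どうだった?|どうだった?\noutput1.wav|逃げとこ。|逃げとこ。\n"
    for line_number, problem in check_metadata(sample):
        print(f"line {line_number}: {problem}")
```

To check a real file, read it with open(path, encoding="utf-8").read() and pass the text in.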

A script like the following can generate the transcript file with Whisper. The first column is the file name without the .wav extension, matching the sample above.

import os
import csv
import whisper

# Load the Whisper model once
model = whisper.load_model("large")

def transcribe_audio_to_csv(directory):
    # Get a list of all .wav files in the directory
    wav_files = sorted(f for f in os.listdir(directory) if f.endswith('.wav'))
    # Open the CSV file for writing (copy it to metadata.csv for training)
    with open('output.csv', 'w', newline='', encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile, delimiter='|')

        # Process each .wav file
        for wav_file in wav_files:
            # Transcribe the audio file
            result = model.transcribe(os.path.join(directory, wav_file), language="ja", fp16=False)
            transcribed_text = result["text"]

            # Write the clip ID (file name without .wav) and the transcript twice
            clip_id = os.path.splitext(wav_file)[0]
            print(f"Transcribed {wav_file}: {transcribed_text}")
            writer.writerow([clip_id, transcribed_text, transcribed_text])

# Use the function
transcribe_audio_to_csv('./freiren_dataset')

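Once the clips are collected, it is time to train. The XTTS library's documentation says it supports only Python 3.7 or later and below 3.11, on Ubuntu 18 to 20.

Training on Windows or macOS comes with no guarantee that the environment will work, so do the training on Ubuntu.

I rented an Ubuntu server on vast.ai for training.

Pytorch 2.2.0 cuda 12.1 devel
50 GB disk space
1x RTX 4090

That is the configuration I used, and the rental price was 0.4 dollars per hour.

Delete the instance as soon as training finishes.

Once you have rented an instance, set up the environment with the following command.

pip install tts mecab-python3 cutlet unidic-lite

Then run the training code. It expects the clips in /workspace/wavs and the transcript file at /workspace/metadata.csv.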

import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.layers.xtts.trainer.gpt_trainer import GPTArgs, GPTTrainer, GPTTrainerConfig, XttsAudioConfig
from TTS.utils.manage import ModelManager

# Logging parameters
RUN_NAME = "GPT_XTTS_v2.0_freiren_FT"
PROJECT_NAME = "XTTS_trainer"
DASHBOARD_LOGGER = "tensorboard"
LOGGER_URI = None

# Set here the path that the checkpoints will be saved. Default: ./run/training/
OUT_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "run", "training")
print(OUT_PATH)

# Training Parameters
OPTIMIZER_WD_ONLY_ON_WEIGHTS = True  # for multi-gpu training please make it False
START_WITH_EVAL = True  # if True it will start with evaluation
BATCH_SIZE = 3  # set here the batch size
GRAD_ACUMM_STEPS = 84  # set here the grad accumulation steps
# Note: we recommend that BATCH_SIZE * GRAD_ACUMM_STEPS need to be at least 252 for more efficient training. You can increase/decrease BATCH_SIZE but then set GRAD_ACUMM_STEPS accordingly.

# Define here the dataset that you want to use for the fine-tuning on.
config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="freiren",
    path="/workspace",
    meta_file_train="/workspace/metadata.csv",
    language="ja",
)

# Add here the configs of the datasets
DATASETS_CONFIG_LIST = [config_dataset]

# Define the path where XTTS v2.0.1 files will be downloaded
CHECKPOINTS_OUT_PATH = os.path.join(OUT_PATH, "XTTS_v2.0_original_model_files/")
os.makedirs(CHECKPOINTS_OUT_PATH, exist_ok=True)


# DVAE files
DVAE_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/dvae.pth"
MEL_NORM_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/mel_stats.pth"

# Set the path to the downloaded files
DVAE_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(DVAE_CHECKPOINT_LINK))
MEL_NORM_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(MEL_NORM_LINK))


# download DVAE files if needed
if not os.path.isfile(DVAE_CHECKPOINT) or not os.path.isfile(MEL_NORM_FILE):
    print(" > Downloading DVAE files!")
    ModelManager._download_model_files([MEL_NORM_LINK, DVAE_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True)


# Download XTTS v2.0 checkpoint if needed
TOKENIZER_FILE_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/vocab.json"
XTTS_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/model.pth"

# XTTS transfer learning parameters: you need to provide the paths of the XTTS model checkpoint that you want to fine-tune.
TOKENIZER_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(TOKENIZER_FILE_LINK))  # vocab.json file
XTTS_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_CHECKPOINT_LINK))  # model.pth file
print(XTTS_CHECKPOINT)

# download XTTS v2.0 files if needed
if not os.path.isfile(TOKENIZER_FILE) or not os.path.isfile(XTTS_CHECKPOINT):
    print(" > Downloading XTTS v2.0 files!")
    ModelManager._download_model_files(
        [TOKENIZER_FILE_LINK, XTTS_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True
    )


# Training sentences generations
SPEAKER_REFERENCE = [
    "/workspace/wavs/output49.wav",
    "/workspace/wavs/output50.wav"# speaker reference to be used in training test sentences
]
LANGUAGE = config_dataset.language

def main():
    # init args and config
    model_args = GPTArgs(
        max_conditioning_length=132300,  # 6 secs
        min_conditioning_length=66150,  # 3 secs
        debug_loading_failures=False,
        max_wav_length=255995,  # ~11.6 seconds
        max_text_length=200,
        mel_norm_file=MEL_NORM_FILE,
        dvae_checkpoint=DVAE_CHECKPOINT,
        xtts_checkpoint=XTTS_CHECKPOINT,  # checkpoint path of the model that you want to fine-tune
        tokenizer_file=TOKENIZER_FILE,
        gpt_num_audio_tokens=1026,
        gpt_start_audio_token=1024,
        gpt_stop_audio_token=1025,
        gpt_use_masking_gt_prompt_approach=True,
        gpt_use_perceiver_resampler=True,
    )
    # define audio config
    audio_config = XttsAudioConfig(sample_rate=22050, dvae_sample_rate=22050, output_sample_rate=24000)
    # training parameters config
    config = GPTTrainerConfig(
        output_path=OUT_PATH,
        model_args=model_args,
        run_name=RUN_NAME,
        project_name=PROJECT_NAME,
        run_description="""
            GPT XTTS training
            """,
        dashboard_logger=DASHBOARD_LOGGER,
        logger_uri=LOGGER_URI,
        audio=audio_config,
        batch_size=BATCH_SIZE,
        batch_group_size=48,
        eval_batch_size=BATCH_SIZE,
        num_loader_workers=8,
        eval_split_max_size=256,
        print_step=50,
        plot_step=100,
        log_model_step=1000,
        save_step=10000,
        save_n_checkpoints=1,
        save_checkpoints=True,
        # target_loss="loss",
        print_eval=False,
        # Optimizer values like tortoise, pytorch implementation with modifications to not apply WD to non-weight parameters.
        optimizer="AdamW",
        optimizer_wd_only_on_weights=OPTIMIZER_WD_ONLY_ON_WEIGHTS,
        optimizer_params={"betas": [0.9, 0.96], "eps": 1e-8, "weight_decay": 1e-2},
        lr=5e-06,  # learning rate
        lr_scheduler="StepLR",
        # it was adjusted accordingly for the new step scheme
        lr_scheduler_params={"step_size": 50, "gamma": 0.5, "last_epoch": -1},
        test_sentences=[
            {
                "text": "やったー！アクア様登場だよ！",
                "speaker_wav": SPEAKER_REFERENCE,
                "language": LANGUAGE,
            },
            {
                "text": "水の女神様、アクアさまが貴方にサービス♪ 今日は何か素敵なことをしてあげようかしら？",
                "speaker_wav": SPEAKER_REFERENCE,
                "language": LANGUAGE,
            },
            {
                "text": "もちろん、アクア様のことだから、どんな困難だってバッチリ対処してみせるわ。",
                "speaker_wav": SPEAKER_REFERENCE,
                "language": LANGUAGE,
            },
            {
                "text": "冒険？もちろん、冒険ってのは私の得意分野よ。",
                "speaker_wav": SPEAKER_REFERENCE,
                "language": LANGUAGE,
            },
            {
                "text": "これからもみんなに祝福!",
                "speaker_wav": SPEAKER_REFERENCE,
                "language": LANGUAGE,
            },
            {
                "text": "この子きっと消えちゃうわよ!",
                "speaker_wav": SPEAKER_REFERENCE,
                "language": LANGUAGE,
            },
            {
                "text": "ちょっとウィズあんたこそなんとかできないの?",
                "speaker_wav": SPEAKER_REFERENCE,
                "language": LANGUAGE,
            },
            {
                "text": "この分だと任せといても大丈夫よ",
                "speaker_wav": SPEAKER_REFERENCE,
                "language": LANGUAGE,
            },
        ],
    )

    # init the model from config
    model = GPTTrainer.init_from_config(config)
    
    # load training samples
    train_samples, eval_samples = load_tts_samples(
        DATASETS_CONFIG_LIST,
        eval_split=True,
        eval_split_max_size=config.eval_split_max_size,
        eval_split_size=config.eval_split_size,
    )

    # init the trainer and 🚀
    trainer = Trainer(
        TrainerArgs(
            restore_path=None,  # xtts checkpoint is restored via xtts_checkpoint key so no need of restore it using Trainer restore_path parameter
            skip_train_epoch=False,
            start_with_eval=START_WITH_EVAL,
            grad_accum_steps=GRAD_ACUMM_STEPS,
        ),
        config,
        output_path=OUT_PATH,
        model=model,
        train_samples=train_samples,
        eval_samples=eval_samples,
    )
    trainer.fit()


if __name__ == "__main__":
    main()
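
Two sets of numbers in the training script are easy to get wrong when you change them: the BATCH_SIZE and GRAD_ACUMM_STEPS pair, whose product the recipe comment recommends keeping at 252 or more, and the audio lengths, which are given in samples at 22050 Hz. The arithmetic can be sketched as below. Both helper names are hypothetical and not part of the TTS library.

```python
import math


def grad_accum_steps(batch_size, target_effective_batch=252):
    """Smallest step count such that batch_size * steps >= target_effective_batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(target_effective_batch / batch_size)


def samples_to_seconds(num_samples, sample_rate=22050):
    """Convert a length in audio samples to seconds."""
    return num_samples / sample_rate


if __name__ == "__main__":
    for batch_size in (2, 3, 4, 6):
        print(batch_size, grad_accum_steps(batch_size))
    for length in (66150, 132300, 255995):
        print(length, round(samples_to_seconds(length), 1))
```

If the GPU runs out of memory, lowering BATCH_SIZE and raising GRAD_ACUMM_STEPS by this rule keeps the effective batch size at the recommended level.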

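If you open TensorBoard while it is running, you can listen to audio samples generated as training progresses.

How hard training is depends entirely on whether the audio you supplied is high quality and free of noise.

After about 2 hours of training the voice comes out reasonably well, so

stop the program

and download the files from the run folder.

The .pth files are the trained models.

Copy vocab.json separately from XTTS_v2.0_original_model_files into the same folder.

Pick one of the models and rename it to model.pth,

then use the code below to turn text into speech.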

import os
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading model...")
config = XttsConfig()
config.load_json("./workspace/run/training/GPTTrain/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="./workspace/run/training/GPTTrain/", use_deepspeed=False)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["./workspace/wavs/output0.wav"])

print("Inference...")
out = model.inference(
    "この分だと任せといても大丈夫よ 魔法の収集を続けるよ 100年くらいは中央諸国を巡る予定だから",
    "ja",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7, # Add custom parameters here
)
torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)

In my case I used a FastAPI server so that the server performs the speech synthesis. The code is as follows.

from fastapi import FastAPI
from fastapi.responses import FileResponse
import subprocess
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts


# Load the fine-tuned model once at startup so every request reuses it
config = XttsConfig()
config.load_json("./workspace/run/training/GPTTrain/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_dir="./workspace/run/training/GPTTrain/",
    use_deepspeed=True
)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["./workspace/wavs/output0.wav"]
)

app = FastAPI()

@app.get("/tts")
def get_wav(query):
    print(query)
    print("Inference...")
    out = model.inference(
        query,
        "ja",
        gpt_cond_latent,
        speaker_embedding,
        temperature=0.7,  # Add custom parameters here
    )
    torchaudio.save(
        "xtts.wav",
        torch.tensor(out["wav"]).unsqueeze(0),
        24000
    )
    command = ["ffmpeg", "-y", "-i", "xtts.wav", "-ar",
               "44100", "-ac", "1", "-acodec", "pcm_s16le", "output_wav.wav"]
    subprocess.run(command, check=True)
    return FileResponse('./output_wav.wav', media_type='audio/wav')
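
Once the server is running, a client only has to put the text into the query parameter of /tts and save the WAV that comes back. Here is a minimal client sketch using only the standard library. The helper names are my own, and the address http://localhost:8000 is an assumption, so replace it with wherever your server actually listens.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(base_url, text):
    """Return the /tts URL with the text percent-encoded as the 'query' parameter."""
    return f"{base_url.rstrip('/')}/tts?{urlencode({'query': text})}"


def fetch_tts(base_url, text, out_path="downloaded.wav", timeout=120):
    """Request speech for the text and save the returned WAV bytes to out_path."""
    with urlopen(build_tts_url(base_url, text), timeout=timeout) as response:
        data = response.read()
    with open(out_path, "wb") as wav_file:
        wav_file.write(data)
    return out_path


if __name__ == "__main__":
    # Assumed address; only the URL is printed so this runs without a server.
    print(build_tts_url("http://localhost:8000", "魔法の収集を続けるよ"))
```

Encoding the text through urlencode matters because Japanese text and spaces are not valid in a raw URL.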

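Imitating the character with VB Cable and VTube Studio

Now let's implement what the video shows: I speak in Japanese and the character answers.

VTube Studio https://denchisoft.com/

Download it from Steam,

https://youtu.be/rHHJHiWg66c?si=PTpxik9wILA6rWNn

and install the 2D Frieren Live model.

VTube Studio has a feature that moves the character's mouth in time with microphone audio.

If you use this feature as-is, disaster strikes: the character's mouth moves when I speak rather than when the generated voice is played.

To prevent that, we install VB Cable, a virtual audio cable.

The character's voice is played into VB Cable, and VTube Studio, listening on VB Cable as its microphone, moves the mouth from that.

In VTube Studio, set the microphone to VB-Cable,

and set the inputs of Mouth Open and Mouth Form to 'VoiceVolumePlusMouthOpen' and 'VoiceFrequencyPlusMouthSmile' respectively.

Next the voice has to be sent to VB Cable's input,

which means finding out which PyAudio device index VB Cable's input has.

The code below lists every device with its index. VB Cable's input is a playback device, so look for it among the lines labeled 출력 (output).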

import pyaudio
import wave
import subprocess

p = pyaudio.PyAudio()

devices = p.get_device_count()
# Iterate through all devices
for i in range(devices):
    # Get the device info
    device_info = p.get_device_info_by_index(i)
    # Check if this device is a microphone (an input device)
    if device_info.get('maxInputChannels') > 0:
        print(f"입력: {device_info.get('name')} , Device Index: {device_info.get('index')}")
    else:
        print(f"출력: {device_info.get('name')} , Device Index: {device_info.get('index')}")

On my machine it is index 5. The index differs from machine to machine, so use the value printed on yours.
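
Because the index can change when audio devices are added or removed, looking the device up by name is less fragile than hard-coding a number. The sketch below does that over PyAudio's device-info dictionaries. find_output_device and list_devices are hypothetical helpers of mine, and the name fragment "CABLE Input" is what VB Cable's playback device is called on a typical Windows install, which is an assumption about your system.

```python
def find_output_device(devices, name_fragment):
    """Return the index of the first playback device whose name contains name_fragment."""
    wanted = name_fragment.lower()
    for info in devices:
        has_output = info.get("maxOutputChannels", 0) > 0
        if has_output and wanted in str(info.get("name", "")).lower():
            return info["index"]
    return None


def list_devices():
    """Collect PyAudio's device-info dictionaries (requires pyaudio)."""
    import pyaudio

    p = pyaudio.PyAudio()
    try:
        return [p.get_device_info_by_index(i) for i in range(p.get_device_count())]
    finally:
        p.terminate()


if __name__ == "__main__":
    # Made-up device list so the lookup can be tried without audio hardware.
    sample_devices = [
        {"index": 0, "name": "Microphone", "maxInputChannels": 2, "maxOutputChannels": 0},
        {"index": 5, "name": "CABLE Input (VB-Audio Virtual Cable)",
         "maxInputChannels": 0, "maxOutputChannels": 2},
    ]
    print(find_output_device(sample_devices, "CABLE Input"))
```

On a real machine, find_output_device(list_devices(), "CABLE Input") gives the value to pass as output_device_index.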

Based on this, if the voice is played to VB Cable's input and to the speakers at the same time, the character's mouth in VTube Studio moves in sync with it.

import pyaudio
import wave
import subprocess

p = pyaudio.PyAudio()

devices = p.get_device_count()
# Iterate through all devices
for i in range(devices):
    # Get the device info
    device_info = p.get_device_info_by_index(i)
    # Check if this device is a microphone (an input device)
    if device_info.get('maxInputChannels') > 0:
        print(f"입력: {device_info.get('name')} , Device Index: {device_info.get('index')}")
    else:
        print(f"출력: {device_info.get('name')} , Device Index: {device_info.get('index')}")


def play_wav_file(filename):
    # Convert to 44.1 kHz mono 16-bit PCM before playback
    command = ["ffmpeg", "-y", "-i", filename, "-ar", "44100", "-ac", "1", "-acodec", "pcm_s16le", "output_wav.wav"]
    subprocess.run(command, check=True)
    wav_file = wave.open("output_wav.wav", 'rb')

    # Create a PyAudio instance
    p = pyaudio.PyAudio()

    # Open one stream to VB Cable (index 5 on my machine) and one to the default speakers
    stream = p.open(format=p.get_format_from_width(wav_file.getsampwidth()),
                    channels=wav_file.getnchannels(),
                    rate=wav_file.getframerate(),
                    output_device_index=5,
                    output=True)
    streamTwo = p.open(format=p.get_format_from_width(wav_file.getsampwidth()),
                    channels=wav_file.getnchannels(),
                    rate=wav_file.getframerate(),
                    output=True)

    # Read data from the file
    data = wav_file.readframes(1024)

    # Play the file by streaming the data to both devices
    while data:
        stream.write(data)
        streamTwo.write(data)
        data = wav_file.readframes(1024)

    # Close both streams, the file, and the PyAudio instance
    stream.stop_stream()
    stream.close()
    streamTwo.stop_stream()
    streamTwo.close()
    wav_file.close()
    p.terminate()

# Use the function
play_wav_file('xtts.wav')

Putting it all together

Combining everything so far into one program gives the following code.

import pyaudio
import wave
import requests
import re

# Speech recognition here goes through the OpenAI API (whisper-1), so no local Whisper model is loaded
import time
from openai import OpenAI

client = OpenAI(
  api_key=''
)
assistant = client.beta.assistants.retrieve(
    assistant_id=''
)
thread = client.beta.threads.create()

def wait_on_run(run, thread):
    while run.status == "queued" or run.status == "in_progress":
        run = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id,
        )
        time.sleep(0.5)
    return run


def get_response(content):
    message = client.beta.threads.messages.create(
        thread_id=thread.id,
        role='user',
        content=content
    )



    # Execute our run
    run = client.beta.threads.runs.create(
        thread_id=thread.id,
        assistant_id=assistant.id,
    )

    # Wait for completion
    wait_on_run(run, thread)
    # Retrieve all the messages added after our last user message
    messages = client.beta.threads.messages.list(
        thread_id=thread.id, order="asc", after=message.id
    )
    response_text = ""
    for message in messages:
        for c in message.content:
            response_text += c.text.value
    clean_text = re.sub('【.*?】', '', response_text)
    return clean_text

def play_wav_file(filename):
    # Open the file
    wav_file = wave.open(filename, 'rb')

    # Create a PyAudio instance
    p = pyaudio.PyAudio()

    # Open a stream
    stream = p.open(format=p.get_format_from_width(wav_file.getsampwidth()),
                    channels=wav_file.getnchannels(),
                    rate=wav_file.getframerate(),
                    output_device_index=6,  # VB Cable's index when this ran; use the index printed on your machine
                    output=True)
    streamTwo = p.open(format=p.get_format_from_width(wav_file.getsampwidth()),
                       channels=wav_file.getnchannels(),
                       rate=wav_file.getframerate(),
                       output=True)

    # Read data from the file
    data = wav_file.readframes(1024)

    # Play the file by streaming the data
    while data:
        stream.write(data)
        # streamTwo.write(data)
        data = wav_file.readframes(1024)

    # Close the stream and terminate the PyAudio instance
    stream.stop_stream()
    stream.close()
    streamTwo.stop_stream()
    streamTwo.close()
    p.terminate()

def make_tts(content):
    params = {
        'query': content,
    }
    response = requests.get('http://79.116.154.98:35040/tts', params=params)

    # Check if the request was successful
    if response.status_code == 200:
        # Write the content to a file and play it
        with open('downloaded.wav', 'wb') as file:
            file.write(response.content)
        play_wav_file('downloaded.wav')
    else:
        # Do not play a stale or missing file when the request fails
        print(f"Request failed with status code {response.status_code}")

def transcribe_directly():
    sample_rate = 16000
    bits_per_sample = 16
    chunk_size = 1024
    audio_format = pyaudio.paInt16
    channels = 1

    def callback(in_data, frame_count, time_info, status):
        wav_file.writeframes(in_data)
        return None, pyaudio.paContinue

    wav_file = wave.open('output.wav', 'wb')
    wav_file.setnchannels(channels)
    wav_file.setsampwidth(bits_per_sample // 8)
    wav_file.setframerate(sample_rate)

    audio = pyaudio.PyAudio()
    input("Press Enter to start recording...")
    stream = audio.open(format=audio_format,
                        channels=channels,
                        rate=sample_rate,
                        input=True,
                        frames_per_buffer=chunk_size,
                        input_device_index=0,
                        stream_callback=callback)

    input("Press Enter to stop recording...")
    stream.stop_stream()
    stream.close()
    audio.terminate()

    wav_file.close()

    # Send the recording to the OpenAI transcription API; the file is closed afterwards
    with open('output.wav', "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="text",
            language="ja"
        )
    return transcription




while True:
  content = transcribe_directly()
  print(content)

  response = get_response(content)
  print(response)
  make_tts(response)
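
The loop above is three steps chained together: listen, get a reply, speak. One thing it does not handle is an empty transcription, for example when Enter is pressed twice without speaking, in which case an empty message is still sent to the assistant. The sketch below shows one turn of the loop with that guard added. run_turn is a hypothetical helper, and skipping empty input is my own design choice and not something the original code does.

```python
def run_turn(listen, think, speak):
    """Run one listen -> reply -> speak cycle; return the reply, or None if nothing was heard."""
    heard = listen().strip()
    if not heard:
        return None  # nothing recognized, so skip the API call and the TTS request
    reply = think(heard)
    if reply.strip():
        speak(reply)
    return reply


if __name__ == "__main__":
    # Stand-in functions so the flow can be tried without a microphone or API key.
    spoken = []
    print(run_turn(lambda: "こんにちは", lambda text: f"{text}、フリーレンだよ", spoken.append))
    print(spoken)
```

With the functions defined above, the body of the main loop would become run_turn(transcribe_directly, get_response, make_tts).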

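That wraps up the code for building an AI character with ChatGPT.

If you have any questions,

leave a comment on the YouTube video

AI 프리렌 - ChatGPT로 캐릭터 인격 만들기 (AI Frieren: Building a Character Personality with ChatGPT)

and I will check it and reply.

Thank you!!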

License: attribution required

About me: 세상 끝 조그만 서가