markliou's murmur: Exception: Vocab size mismatch (model has 61952, but XXX/tokenizer.model has 61875). Add the --pad-vocab option and try again.

Sunday, January 14, 2024

Exception: Vocab size mismatch (model has 61952, but XXX/tokenizer.model has 61875). Add the --pad-vocab option and try again.

Scenario

This error message appears while converting a model to GGUF with the llama.cpp tools.

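Explanation

The error occurs because the tokenizer model and the vocabulary file disagree on the output size. Usually the vocabulary changed during custom fine-tuning: a different corpus was used, and customizing the tokens altered the vocabulary. The person doing the fine-tuning did not notice, so the config files no longer match each other.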

The workaround is to pad the vocabulary file. The vocabulary and tokenizer model information can be found in:

config.json: contains "vocab_size"
tokenizer.json: contains the number of tokens the tokenizer model outputs
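
To see the two numbers side by side, a small script can read both files. This is a minimal sketch, not part of llama.cpp: the helper name vocab_sizes and the example folder are made up for illustration, and it assumes tokenizer.json follows the usual Hugging Face tokenizers layout (a "model" → "vocab" entry plus an optional "added_tokens" list).

```python
import json
from pathlib import Path


def vocab_sizes(model_dir):
    """Return (vocab_size from config.json, token count from tokenizer.json).

    Hypothetical helper; assumes the Hugging Face tokenizers JSON layout.
    """
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    tok = json.loads((model_dir / "tokenizer.json").read_text(encoding="utf-8"))

    vocab = tok["model"]["vocab"]
    if isinstance(vocab, dict):
        # BPE / WordPiece style: {"token": id, ...}
        ids = set(vocab.values())
    else:
        # Unigram style: [["token", score], ...], IDs are the list positions
        ids = set(range(len(vocab)))

    # Added tokens may repeat IDs already in the base vocab, so use a set
    ids.update(t["id"] for t in tok.get("added_tokens", []))
    return config["vocab_size"], len(ids)


# Usage (the folder name is an assumption):
# model_size, tokenizer_size = vocab_sizes("Breeze-7B-Instruct-v0.1")
# print(model_size - tokenizer_size)  # number of padding tokens needed
```

The difference between the two values is the number of padding tokens to add in the solution below.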

Solution

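1. The model can be downloaded the same way as before, through a script (example: https://github.com/markliou/tool_scripts/blob/master/python/llama.cpp_tools/down_hf_model_snapshot.py )

2. Change the vocabulary size by adding padding tokens directly to the vocabulary. This involves rewriting JSON, so it is easier to use an existing Python tool: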

from pathlib import Path
from transformers import AutoTokenizer

pad_no = 61952 - 61875  # difference between the model's vocab size and the tokenizer's token count (77 here)
tokenizer_model_name = 'Breeze-7B-Instruct-v0.1'  # the model to convert

model_path = 'output'  # folder that receives the rebuilt tokenizer files
new_tokens = [f"<pad{i}>" for i in range(pad_no)]  # one padding token per missing entry

tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

tokenizer.add_tokens(new_tokens)  # append the padding tokens to the vocabulary

tokenizer.save_pretrained(Path(model_path))
tokenizer.save_vocabulary(model_path)

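The example above processes the model named in tokenizer_model_name and writes the rebuilt JSON files to the output folder. Next, check the generated JSON for anything abnormal. Once it looks correct, copy everything in output into the model's folder; for a model such as NousResearch/Llama-2-7b-hf, that is the NousResearch/Llama-2-7b-hf folder.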
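
One way to check the rebuilt files before copying them is to confirm that the token count matches the model's vocab_size and that the token IDs have no gaps. The following is a rough sketch under assumptions: check_tokenizer_json is a hypothetical helper, and it expects output/tokenizer.json to use the usual Hugging Face tokenizers layout.

```python
import json
from pathlib import Path


def check_tokenizer_json(path, expected_size):
    """Return a list of problems found in a tokenizer.json (empty if none).

    Hypothetical helper; assumes the Hugging Face tokenizers JSON layout.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    vocab = data["model"]["vocab"]
    if isinstance(vocab, dict):
        ids = set(vocab.values())          # {"token": id, ...}
    else:
        ids = set(range(len(vocab)))       # [["token", score], ...]
    ids.update(t["id"] for t in data.get("added_tokens", []))

    problems = []
    if len(ids) != expected_size:
        problems.append(f"expected {expected_size} tokens, found {len(ids)}")
    # With N distinct IDs, a gap-free vocabulary uses exactly 0 .. N-1
    missing = sorted(set(range(len(ids))) - ids)
    if missing:
        problems.append(f"token IDs are not contiguous; first missing ID: {missing[0]}")
    return problems


# Usage, with the numbers from the error message in this post:
# print(check_tokenizer_json("output/tokenizer.json", 61952))
```

An empty list means the padded tokenizer now has as many entries as the model expects.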

3. After that, run llama.cpp/convert.py to do the conversion.
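
The conversion call can be sketched from Python as follows. The helper build_convert_command and the paths in the usage comment are illustrative assumptions. The --pad-vocab flag is the one named in the error message itself, and --outfile is convert.py's output-path option; run python llama.cpp/convert.py --help to confirm what your checkout supports.

```python
import sys


def build_convert_command(model_dir, outfile, pad_vocab=False,
                          script="llama.cpp/convert.py"):
    """Build the argument list for a convert.py run (hypothetical helper)."""
    cmd = [sys.executable, script, str(model_dir), "--outfile", str(outfile)]
    if pad_vocab:
        # The alternative suggested by the error message: let convert.py pad
        cmd.append("--pad-vocab")
    return cmd


# Usage (paths are assumptions):
# import subprocess
# cmd = build_convert_command("Breeze-7B-Instruct-v0.1", "breeze-7b.gguf")
# subprocess.run(cmd, check=True)
```

If the tokenizer was already padded in step 2, pad_vocab can stay False.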

ref:

https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b/discussions/1

https://github.com/ggerganov/llama.cpp/discussions/2948
