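Update the translation module documentation, adding setup steps and API usage examples. Strengthen type annotations and make function return types explicit. Improve error handling and code readability.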

misyaguziya
2025-10-09 17:30:48 +09:00
parent 7d24b3839c
commit b26129af68
4 changed files with 273 additions and 125 deletions


@@ -1,3 +1,95 @@
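## Translation module (models.translation)
This document summarizes the recent changes made under `models/translation`, together with setup steps, API usage, the testing approach, and troubleshooting.
### Overview
- Module responsibilities: provides the high-level `Translator` class that translates text, the language-code mappings, and utilities for downloading and verifying CTranslate2 weights and tokenizers.
- Goal of the changes: type annotations and docstrings were added, and the download/verification logic in `translation_utils.py` was replaced with a simpler, more robust implementation. This makes the first-time setup procedure clearer.
### Main changes (summary)
- `translation_translator.py`: added type annotations and docstrings. The module has external dependencies, but they are guarded so that an exception does not break the module.
- `translation_languages.py`: added a description of the language-code mappings.
- `translation_utils.py`: replaced with an implementation that verifies weight files (SHA-256 hash comparison), extracts the zip archive, fetches the tokenizer via `transformers.AutoTokenizer`, and supports a download-progress callback.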
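### Installation (dependencies)
Some of the packages below are optional. For the packages needed for development or a minimal working setup, follow the project-wide requirements.
Main packages used:
- `requests`: download handling
- `transformers`: tokenizer retrieval (`AutoTokenizer`)
- `ctranslate2`: needed when using CTranslate2 (runtime only; mocking is recommended in tests)
Suggested install command (optional):
```powershell
pip install requests transformers ctranslate2
```
External API wrappers such as DeepL and `translators` are optional. In CI and local tests, mock them to verify behavior.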
### First-time setup / preparing the weights
Utility functions provided by `translation_utils.py`:
- `checkCTranslate2Weight(root: str, weight_type: str = "small") -> bool`
  - Checks whether the required files exist under `root/weights/ctranslate2/<model_dir>` and match the known hashes.
- `downloadCTranslate2Weight(root: str, weight_type: str = "small", callback: Optional[Callable[[float], None]] = None, end_callback: Optional[Callable[[], None]] = None) -> None`
  - Downloads the weights as a ZIP archive and extracts them.
  - `callback(progress: float)` can be used for progress notifications in the range 0.0 to 1.0.
  - `end_callback()` is called when processing finishes, on success or failure.
- `downloadCTranslate2Tokenizer(root: str, weight_type: str = "small") -> None`
  - Uses `transformers.AutoTokenizer.from_pretrained` to download and cache the tokenizer (stored under `cache_dir`).
Example call (simple):
```python
from models.translation import translation_utils as tu

# Root directory (for example, the project root)
root = "."
if not tu.checkCTranslate2Weight(root, "small"):
    tu.downloadCTranslate2Weight(root, "small", callback=lambda p: print(f"{p*100:.1f}%"))
tu.downloadCTranslate2Tokenizer(root, "small")
```
Note: the larger model (`large`) takes more time and disk space to download.
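As a sketch of how the `callback` parameter might be used, the helper below builds a progress reporter that emits a message only when the percentage crosses a new step. `make_progress_reporter` and its `step` parameter are hypothetical names for illustration, not part of `translation_utils.py`; the only assumption about the real API is that the callback receives a float between 0.0 and 1.0.

```python
from typing import Callable, List


def make_progress_reporter(emit: Callable[[str], None], step: int = 10) -> Callable[[float], None]:
    """Return a callback(progress) that emits at most once per `step` percent.

    Hypothetical helper: it only assumes progress is a float in 0.0..1.0.
    """
    last_bucket = -1

    def report(progress: float) -> None:
        nonlocal last_bucket
        # Clamp defensively in case the reported total size is slightly off.
        percent = int(max(0.0, min(1.0, progress)) * 100)
        bucket = percent // step
        if bucket > last_bucket:
            last_bucket = bucket
            emit(f"{bucket * step}%")

    return report


# Simulated download: four equally sized chunks.
messages: List[str] = []
reporter = make_progress_reporter(messages.append, step=25)
for written in (1, 2, 3, 4):
    reporter(written / 4)
```

With the real utilities, a reporter like this could be passed as `callback=reporter` to `downloadCTranslate2Weight`.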
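### API usage example (a simple `Translator` example)
Below is a simple, expected way to use `Translator` (see `translation_translator.py` for the implementation).
```python
from models.translation.translation_translator import Translator

tr = Translator()
# Signature: translate(translator_name, source_language, target_language, target_country, message).
# Languages are passed by name ("English") and mapped to backend codes internally.
# "Japan" is an illustrative target_country value.
result = tr.translate("Google", "English", "Japanese", "Japan", "Hello")
if result:
    print(result)
else:
    print("Translation failed")
```
Return values and errors: for compatibility with the existing code base, some paths return `False` on failure. Check the type of the return value before using the result.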
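Because `translate` may return either a string or `False`, callers might want to normalize the result in one place. The following is a minimal sketch under that assumption; `translate_or_default` is a hypothetical wrapper, and `fake_ok` / `fake_fail` are stand-ins for `Translator.translate` so that the example runs without any backend installed.

```python
from typing import Any, Callable


def translate_or_default(translate: Callable[..., Any], *args: Any, default: str = "") -> str:
    """Call `translate` and always return a string.

    Hypothetical wrapper: assumes failure is signalled by False (or any
    other non-string value), as the legacy return convention describes.
    """
    result = translate(*args)
    # Use isinstance rather than truthiness so that an empty translation
    # is still distinguished from the False failure marker.
    if isinstance(result, str):
        return result
    return default


# Stand-ins for Translator.translate, used only to exercise both paths.
def fake_ok(*_args: Any) -> Any:
    return "こんにちは"


def fake_fail(*_args: Any) -> Any:
    return False


ok = translate_or_default(fake_ok, "Google", "English", "Japanese", "Japan", "Hello")
failed = translate_or_default(fake_fail, "Google", "English", "Japanese", "Japan", "Hello", default="Hello")
```

Falling back to the original message (as in `failed`) is one possible policy; an application could equally surface an error to the user.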
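### Testing approach
- External services (DeepL, web translation wrappers, ctranslate2, transformers) are mocked in unit tests.
- Recommended: use `pytest` and `unittest.mock`, and add tests that verify the success and failure paths of `Translator.translate`.
A simple test design:
- Happy path: translation through ctranslate2 is invoked correctly (the mock returns the expected response).
- Fallback path: when ctranslate2 is unavailable, a different translation route is taken (mocked).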
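The success and failure paths described above might be tested along the following lines. Since the real `Translator` pulls in optional dependencies, this sketch uses a tiny stand-in class, `MiniTranslator` (hypothetical, defined here only so the example is self-contained), that mimics the "return `False` on exception" convention. In the repository, the same assertions would target the real class with its backends patched.

```python
from typing import Any
from unittest.mock import Mock


class MiniTranslator:
    """Hypothetical stand-in that mirrors the False-on-failure convention."""

    def __init__(self, backend: Any) -> None:
        self.backend = backend  # injected so that tests can pass a mock

    def translate(self, source_language: str, target_language: str, message: str) -> Any:
        if source_language == target_language:
            return message
        try:
            return self.backend.translate(message, source_language, target_language)
        except Exception:
            return False


def test_success_path() -> None:
    backend = Mock()
    backend.translate.return_value = "こんにちは"
    tr = MiniTranslator(backend)
    assert tr.translate("English", "Japanese", "Hello") == "こんにちは"
    backend.translate.assert_called_once_with("Hello", "English", "Japanese")


def test_failure_path() -> None:
    backend = Mock()
    backend.translate.side_effect = RuntimeError("backend unavailable")
    tr = MiniTranslator(backend)
    assert tr.translate("English", "Japanese", "Hello") is False


def test_same_language_short_circuits() -> None:
    backend = Mock()
    tr = MiniTranslator(backend)
    assert tr.translate("English", "English", "Hello") == "Hello"
    backend.translate.assert_not_called()
```

Functions named `test_*` are collected automatically by `pytest`; no network access or model weights are needed.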
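### Troubleshooting
- `ModuleNotFoundError` (e.g. `sudachidict_full`): a dictionary required by transliteration or another module is missing. Install the package in question or disable that feature.
- Hash mismatch: the downloaded files may be corrupted. Delete the affected files and download them again.
- If the `transformers` tokenizer cannot be fetched, check the network and the permissions of the cache directory.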
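To investigate a hash mismatch by hand, a file's SHA-256 digest can be computed in blocks and compared with the expected value. The sketch below is self-contained; `sha256_of` and `matches` are illustrative names, and the temporary file stands in for a real weight file such as `model.bin`.

```python
import hashlib
import os
import tempfile


def sha256_of(file_path: str, block_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(file_path: str, expected_hex: str) -> bool:
    """True if the file exists and its digest equals `expected_hex`."""
    return os.path.exists(file_path) and sha256_of(file_path) == expected_hex.lower()


# Demonstration on a throwaway file rather than a real weight file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as f:
        f.write(b"hello")
    good = matches(sample, hashlib.sha256(b"hello").hexdigest())
    bad = matches(sample, "0" * 64)
    missing = matches(os.path.join(tmp, "absent.bin"), "0" * 64)
```

Reading in fixed-size blocks keeps memory use flat, which matters for weight files that can be several gigabytes.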
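### Changelog
- 2025-10-09: added type annotations and docstrings; reimplemented `translation_utils.py` to tidy up the download/verification logic.
---
This document is a concise reference. If needed, run examples and more detailed troubleshooting steps (sample command output, how to collect logs, and so on) will be added.
# models/translation: detailed design
Component files: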

models/translation/translation_languages.py

@@ -1,4 +1,13 @@
-translation_lang = {}
+"""Language code mappings for supported translation backends.
+
+Provides `translation_lang` mapping keyed by backend name with `source` and
+`target` maps used by `Translator.getLanguageCode`.
+"""
+
+from typing import Dict
+
+translation_lang: Dict[str, Dict[str, Dict[str, str]]] = {}
+
 dict_deepl_languages = {
     "Arabic":"ar",
     "Bulgarian":"bg",
@@ -37,10 +46,7 @@ dict_deepl_languages = {
     "Chinese Simplified":"zh",
     "Chinese Traditional":"zh"
 }
-translation_lang["DeepL"] = {
-    "source":dict_deepl_languages,
-    "target":dict_deepl_languages,
-}
+translation_lang["DeepL"] = {"source": dict_deepl_languages, "target": dict_deepl_languages}
 dict_deepl_api_source_languages = {
     "Japanese":"ja",
@@ -109,10 +115,7 @@ dict_deepl_api_target_languages = {
     "Chinese Simplified":"zh",
     "Chinese Traditional":"zh"
 }
-translation_lang["DeepL_API"] = {
-    "source": dict_deepl_api_source_languages,
-    "target": dict_deepl_api_target_languages,
-}
+translation_lang["DeepL_API"] = {"source": dict_deepl_api_source_languages, "target": dict_deepl_api_target_languages}
 dict_google_languages = {
     "Japanese":"ja",
@@ -179,10 +182,7 @@ dict_google_languages = {
     # "Basque":"eu",
     "Irish":"ga"
 }
-translation_lang["Google"] = {
-    "source":dict_google_languages,
-    "target":dict_google_languages,
-}
+translation_lang["Google"] = {"source": dict_google_languages, "target": dict_google_languages}
 dict_bing_languages = {
     "Japanese":"ja",
@@ -247,10 +247,7 @@ dict_bing_languages = {
     "Punjabi":"pa",
     "Irish":"ga"
 }
-translation_lang["Bing"] = {
-    "source":dict_bing_languages,
-    "target":dict_bing_languages,
-}
+translation_lang["Bing"] = {"source": dict_bing_languages, "target": dict_bing_languages}
 dict_papago_languages = {
     "German": "de",
@@ -270,10 +267,7 @@ dict_papago_languages = {
     "Chinese Traditional":"zh-TW",
 }
-translation_lang["Papago"] = {
-    "source":dict_papago_languages,
-    "target":dict_papago_languages,
-}
+translation_lang["Papago"] = {"source": dict_papago_languages, "target": dict_papago_languages}
 dict_ctranslate2_languages = {
     "English": "en",
@@ -378,7 +372,4 @@ dict_ctranslate2_languages = {
     "Sundanese": "su"
 }
-translation_lang["CTranslate2"] = {
-    "source":dict_ctranslate2_languages,
-    "target":dict_ctranslate2_languages,
-}
+translation_lang["CTranslate2"] = {"source": dict_ctranslate2_languages, "target": dict_ctranslate2_languages}

models/translation/translation_translator.py

@@ -4,6 +4,7 @@ try:
     from translators import translate_text as other_web_Translator
     ENABLE_TRANSLATORS = True
 except Exception:
+    other_web_Translator = None  # type: ignore
     ENABLE_TRANSLATORS = False
 from .translation_languages import translation_lang
@@ -14,22 +15,37 @@ import transformers
 from utils import errorLogging, getBestComputeType
 import warnings
+from typing import Any, Optional, Tuple
 warnings.filterwarnings("ignore")
-# Translator
-class Translator():
-    def __init__(self):
-        self.deepl_client = None
-        self.ctranslate2_translator = None
-        self.ctranslate2_tokenizer = None
-        self.is_loaded_ctranslate2_model = False
-        self.is_changed_translator_parameters = False
-        self.is_enable_translators = ENABLE_TRANSLATORS
-    def authenticationDeepLAuthKey(self, authkey):
+class Translator:
+    """High-level translator facade.
+    This class wraps multiple backends (DeepL, DeepL API, Google, Bing, Papago,
+    and CTranslate2 local models). Optional dependencies may be unavailable at
+    runtime; methods degrade gracefully and return False or an empty string on
+    failure (kept compatible with existing behavior).
+    """
+    def __init__(self) -> None:
+        self.deepl_client: Optional[DeepLClient] = None
+        self.ctranslate2_translator: Any = None
+        self.ctranslate2_tokenizer: Any = None
+        self.is_loaded_ctranslate2_model: bool = False
+        self.is_changed_translator_parameters: bool = False
+        self.is_enable_translators: bool = ENABLE_TRANSLATORS
+    def authenticationDeepLAuthKey(self, authkey: str) -> bool:
+        """Authenticate DeepL API with the provided key.
+        Returns True on success, False on failure.
+        """
         result = True
         try:
             self.deepl_client = DeepLClient(authkey)
+            # quick smoke test
             self.deepl_client.translate_text(" ", target_lang="EN-US")
         except Exception:
             errorLogging()
@@ -37,7 +53,12 @@ class Translator():
             result = False
         return result
-    def changeCTranslate2Model(self, path, model_type, device="cpu", device_index=0, compute_type="auto"):
+    def changeCTranslate2Model(self, path: str, model_type: str, device: str = "cpu", device_index: int = 0, compute_type: str = "auto") -> None:
+        """Load a CTranslate2 model from weights.
+        This sets internal translator/tokenizer objects and flips
+        ``is_loaded_ctranslate2_model`` on success.
+        """
         self.is_loaded_ctranslate2_model = False
         directory_name = ctranslate2_weights[model_type]["directory_name"]
         tokenizer = ctranslate2_weights[model_type]["tokenizer"]
@@ -52,7 +73,7 @@ class Translator():
             device_index=device_index,
             compute_type=compute_type,
             inter_threads=1,
-            intra_threads=4
+            intra_threads=4,
         )
         try:
             self.ctranslate2_tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer, cache_dir=tokenizer_path)
@@ -62,17 +83,21 @@ class Translator():
             self.ctranslate2_tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer, cache_dir=tokenizer_path)
         self.is_loaded_ctranslate2_model = True
-    def isLoadedCTranslate2Model(self):
+    def isLoadedCTranslate2Model(self) -> bool:
         return self.is_loaded_ctranslate2_model
-    def isChangedTranslatorParameters(self):
+    def isChangedTranslatorParameters(self) -> bool:
         return self.is_changed_translator_parameters
-    def setChangedTranslatorParameters(self, is_changed):
+    def setChangedTranslatorParameters(self, is_changed: bool) -> None:
         self.is_changed_translator_parameters = is_changed
-    def translateCTranslate2(self, message, source_language, target_language):
-        result = False
+    def translateCTranslate2(self, message: str, source_language: str, target_language: str) -> Any:
+        """Translate using a loaded CTranslate2 model.
+        Returns a string on success or False on failure (keeps legacy behavior).
+        """
+        result: Any = False
         if self.is_loaded_ctranslate2_model is True:
             try:
                 self.ctranslate2_tokenizer.src_lang = source_language
@@ -86,7 +111,11 @@ class Translator():
         return result
     @staticmethod
-    def getLanguageCode(translator_name, target_country, source_language, target_language):
+    def getLanguageCode(translator_name: str, target_country: str, source_language: str, target_language: str) -> Tuple[str, str]:
+        """Resolve a friendly language name to translator-specific codes.
+        Returns (source_code, target_code).
+        """
         match translator_name:
             case "DeepL_API":
                 if target_language == "English":
@@ -101,66 +130,63 @@ class Translator():
                         target_language = "Portuguese Brazilian"
             case _:
                 pass
-        source_language=translation_lang[translator_name]["source"][source_language]
-        target_language=translation_lang[translator_name]["target"][target_language]
+        source_language = translation_lang[translator_name]["source"][source_language]
+        target_language = translation_lang[translator_name]["target"][target_language]
         return source_language, target_language
-    def translate(self, translator_name, source_language, target_language, target_country, message):
+    def translate(self, translator_name: str, source_language: str, target_language: str, target_country: str, message: str) -> Any:
+        """Translate `message` using the named translator backend.
+        Returns translated string on success, or False on failure. When
+        source_language == target_language the original message is returned.
+        """
         try:
             if source_language == target_language:
                 return message
-            result = ""
+            result: Any = ""
             source_language, target_language = self.getLanguageCode(translator_name, target_country, source_language, target_language)
             match translator_name:
                 case "DeepL":
-                    if self.is_enable_translators is True:
+                    if self.is_enable_translators is True and other_web_Translator is not None:
                         result = other_web_Translator(
                             query_text=message,
                             translator="deepl",
                             from_language=source_language,
                             to_language=target_language,
                         )
                 case "DeepL_API":
                     if self.is_enable_translators is True:
                         if self.deepl_client is None:
                             result = False
                         else:
-                            result = self.deepl_client.translate_text(
-                                message,
-                                source_lang=source_language,
-                                target_lang=target_language,
-                            ).text
+                            result = self.deepl_client.translate_text(message, source_lang=source_language, target_lang=target_language).text
                 case "Google":
-                    if self.is_enable_translators is True:
+                    if self.is_enable_translators is True and other_web_Translator is not None:
                         result = other_web_Translator(
                             query_text=message,
                             translator="google",
                             from_language=source_language,
                             to_language=target_language,
                         )
                 case "Bing":
-                    if self.is_enable_translators is True:
+                    if self.is_enable_translators is True and other_web_Translator is not None:
                         result = other_web_Translator(
                             query_text=message,
                             translator="bing",
                             from_language=source_language,
                             to_language=target_language,
                         )
                 case "Papago":
-                    if self.is_enable_translators is True:
+                    if self.is_enable_translators is True and other_web_Translator is not None:
                         result = other_web_Translator(
                             query_text=message,
                             translator="papago",
                             from_language=source_language,
                             to_language=target_language,
-                        )
-                case "CTranslate2":
-                    result = self.translateCTranslate2(
-                        message=message,
-                        source_language=source_language,
-                        target_language=target_language,
                         )
+                case "CTranslate2":
+                    result = self.translateCTranslate2(message=message, source_language=source_language, target_language=target_language)
         except Exception:
             errorLogging()
             result = False

models/translation/translation_utils.py

@@ -3,13 +3,22 @@ from zipfile import ZipFile
 from os import path as os_path
 from os import makedirs as os_makedirs
 from requests import get as requests_get
-from typing import Callable
+from typing import Callable, Optional
 import hashlib
 import transformers
 from utils import errorLogging
+"""Utilities for downloading and verifying CTranslate2 weights and tokenizers.
+This module provides a small, dependency-light set of helpers used by the
+translation layer. It purposely keeps behavior resilient: network errors are
+logged (via utils.errorLogging) and the functions return/complete without
+raising, which matches the repository's defensive style.
+"""
 ctranslate2_weights = {
-    "small": { # M2M-100 418M-parameter model
+    "small": {
         "url": "https://github.com/misyaguziya/VRCT-weights/releases/download/v1.0/m2m100_418m.zip",
         "directory_name": "m2m100_418m",
         "tokenizer": "facebook/m2m100_418M",
@@ -17,9 +26,9 @@ ctranslate2_weights = {
             "model.bin": "e7c26a9abb5260abd0268fbe3040714070dec254a990b4d7fd3f74c5230e3acb",
             "sentencepiece.model": "d8f7c76ed2a5e0822be39f0a4f95a55eb19c78f4593ce609e2edbc2aea4d380a",
             "shared_vocabulary.txt": "bd440aa21b8ca3453fc792a0018a1f3fe68b3464aadddd4d16a4b72f73c86d8c",
-        }
+        },
     },
-    "large": { # M2M-100 1.2B-parameter model
+    "large": {
         "url": "https://github.com/misyaguziya/VRCT-weights/releases/download/v1.0/m2m100_12b.zip",
         "directory_name": "m2m100_12b",
         "tokenizer": "facebook/m2m100_1.2b",
@@ -27,77 +36,107 @@ ctranslate2_weights = {
             "model.bin": "abb7bf4ba7e5e016b6e3ed480c752459b2f783ac8fca372e7587675e5bf3a919",
             "sentencepiece.model": "d8f7c76ed2a5e0822be39f0a4f95a55eb19c78f4593ce609e2edbc2aea4d380a",
             "shared_vocabulary.txt": "bd440aa21b8ca3453fc792a0018a1f3fe68b3464aadddd4d16a4b72f73c86d8c",
-        }
+        },
     },
 }
-def calculate_file_hash(file_path, block_size=65536):
+def calculate_file_hash(file_path: str, block_size: int = 65536) -> str:
     hash_object = hashlib.sha256()
-    with open(file_path, 'rb') as file:
-        for block in iter(lambda: file.read(block_size), b''):
+    with open(file_path, "rb") as f:
+        for block in iter(lambda: f.read(block_size), b""):
             hash_object.update(block)
     return hash_object.hexdigest()
-def checkCTranslate2Weight(root, weight_type="small"):
-    weight_directory_name = ctranslate2_weights[weight_type]["directory_name"]
-    hash_data = ctranslate2_weights[weight_type]["hash"]
-    files = [
-        "model.bin",
-        "sentencepiece.model",
-        "shared_vocabulary.txt"
-    ]
-    path = os_path.join(root, "weights", "ctranslate2")
-    # check already downloaded
-    already_downloaded = False
-    if all(os_path.exists(os_path.join(path, weight_directory_name, file)) for file in files):
-        # check hash
-        for file in files:
-            original_hash = hash_data[file]
-            current_hash = calculate_file_hash(os_path.join(path, weight_directory_name, file))
-            if original_hash != current_hash:
-                break
-        already_downloaded = True
-    return already_downloaded
-def downloadCTranslate2Weight(root, weight_type="small", callback=None, end_callback=None):
-    url = ctranslate2_weights[weight_type]["url"]
-    filename = "weight.zip"
-    path = os_path.join(root, "weights", "ctranslate2")
-    os_makedirs(path, exist_ok=True)
-    if checkCTranslate2Weight(root, weight_type) is False:
+def checkCTranslate2Weight(root: str, weight_type: str = "small") -> bool:
+    """Return True if the requested weight files exist and match their hashes.
+    This function intentionally avoids raising: callers use the boolean to
+    decide whether to (re)download weights.
+    """
+    weight_info = ctranslate2_weights.get(weight_type)
+    if weight_info is None:
+        return False
+    weight_directory_name = weight_info["directory_name"]
+    hash_data = weight_info["hash"]
+    files = ["model.bin", "sentencepiece.model", "shared_vocabulary.txt"]
+    base_path = os_path.join(root, "weights", "ctranslate2")
+    # quick existence check
+    for f in files:
+        p = os_path.join(base_path, weight_directory_name, f)
+        if not os_path.exists(p):
+            return False
+    # verify hashes
+    for f in files:
+        p = os_path.join(base_path, weight_directory_name, f)
         try:
-            with tempfile.TemporaryDirectory() as tmp_path:
-                res = requests_get(url, stream=True)
-                file_size = int(res.headers.get('content-length', 0))
-                total_chunk = 0
-                with open(os_path.join(tmp_path, filename), 'wb') as file:
-                    for chunk in res.iter_content(chunk_size=1024*2000):
-                        file.write(chunk)
-                        if isinstance(callback, Callable):
-                            total_chunk += len(chunk)
-                            callback(total_chunk/file_size)
-                with ZipFile(os_path.join(tmp_path, filename)) as zf:
-                    zf.extractall(path)
+            if calculate_file_hash(p) != hash_data[f]:
+                return False
         except Exception:
             errorLogging()
-    if isinstance(end_callback, Callable):
-        end_callback()
-def downloadCTranslate2Tokenizer(path, weight_type="small"):
-    directory_name = ctranslate2_weights[weight_type]["directory_name"]
-    tokenizer = ctranslate2_weights[weight_type]["tokenizer"]
-    tokenizer_path = os_path.join(path, "weights", "ctranslate2", directory_name, "tokenizer")
+            return False
+    return True
+def downloadCTranslate2Weight(root: str, weight_type: str = "small", callback: Optional[Callable[[float], None]] = None, end_callback: Optional[Callable[[], None]] = None) -> None:
+    """Download and extract ctranslate2 weights for the given type.
+    callback receives a float between 0 and 1 for progress when available.
+    end_callback is invoked after success or failure to allow caller cleanup.
+    """
+    weight_info = ctranslate2_weights.get(weight_type)
+    if weight_info is None:
+        return
+    url = weight_info["url"]
+    filename = "weight.zip"
+    dst_path = os_path.join(root, "weights", "ctranslate2")
+    os_makedirs(dst_path, exist_ok=True)
+    if checkCTranslate2Weight(root, weight_type):
+        if callable(end_callback):
+            end_callback()
+        return
     try:
-        os_makedirs(tokenizer_path, exist_ok=True)
-        transformers.AutoTokenizer.from_pretrained(tokenizer, cache_dir=tokenizer_path)
+        with tempfile.TemporaryDirectory() as tmp_path:
+            res = requests_get(url, stream=True, timeout=30)
+            total = int(res.headers.get("content-length", 0) or 0)
+            written = 0
+            out_path = os_path.join(tmp_path, filename)
+            with open(out_path, "wb") as out:
+                for chunk in res.iter_content(chunk_size=1024 * 1024):
+                    if not chunk:
+                        continue
+                    out.write(chunk)
+                    written += len(chunk)
+                    if callable(callback) and total:
+                        try:
+                            callback(written / total)
+                        except Exception:
+                            errorLogging()
+            with ZipFile(out_path) as zf:
+                zf.extractall(dst_path)
     except Exception:
         errorLogging()
-        tokenizer_path = os_path.join("./weights", "ctranslate2", directory_name, "tokenizer")
-        transformers.AutoTokenizer.from_pretrained(tokenizer, cache_dir=tokenizer_path)
+    finally:
+        if callable(end_callback):
+            end_callback()
+def downloadCTranslate2Tokenizer(root: str, weight_type: str = "small") -> None:
+    """Ensure a tokenizer for the requested weight is available (cached).
+    This will attempt to download the tokenizer via Hugging Face's transformers
+    and cache it under the weights directory. It logs failures instead of
+    raising to keep runtime resilient during startup.
+    """
+    weight_info = ctranslate2_weights.get(weight_type)
+    if weight_info is None:
+        return
+    directory_name = weight_info["directory_name"]
+    tokenizer_name = weight_info["tokenizer"]
+    tokenizer_cache = os_path.join(root, "weights", "ctranslate2", directory_name, "tokenizer")
+    try:
+        os_makedirs(tokenizer_cache, exist_ok=True)
+        transformers.AutoTokenizer.from_pretrained(tokenizer_name, cache_dir=tokenizer_cache)
+    except Exception:
+        errorLogging()