vLLM deployment of Qwen2.5 returns 404 (solved)
Launching the Qwen inference service with vLLM returned a 404. Investigation showed that the URL to call differs from Qwen2's.
Three ways to call the Qwen2.5 inference service are listed below; all have been tested and work.
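Before trying the calls, a quick sanity check saves time: the vLLM OpenAI-compatible server exposes GET /v1/models, and the "id" it returns is the exact string to pass as "model" below (a wrong path or model name is a common source of 404s). A minimal sketch in Python, assuming the server from example 1 on port 9904:

import requests

# List the models the server actually serves; use the returned "id" verbatim
# as the "model" field in the requests below.
resp = requests.get("http://127.0.0.1:9904/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print("Served model:", model["id"])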
1. Shell script: call with curl
curl http://127.0.0.1:9904/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "/model/Qwen2.5-72B-Instruct",
    "messages": [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 512
}'
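The JSON that comes back from the chat endpoint nests the generated text under choices[0].message.content; the plain completions endpoint in example 3 returns choices[0].text instead, which is easy to trip over when switching endpoints. A small extraction sketch (the dict below is a trimmed, illustrative response, not real output):

# Trimmed shape of a chat completion response (values are illustrative).
data = {
    "choices": [
        {"message": {"role": "assistant", "content": "Large language models are..."}}
    ]
}
# Chat responses nest the text under message.content, not .text.
print(data["choices"][0]["message"]["content"])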
2. Python script: call via the OpenAI client
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:9902/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="/model/QwQ-32B-Preview",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    # Parameters outside the OpenAI spec (like repetition_penalty) go in extra_body.
    extra_body={
        "repetition_penalty": 1.05,
    },
)
print("Chat response:", chat_response.json())
3. Python script: call via requests
import requests

# Server host and port.
host = "localhost"
port = 9902
api_url = f"http://{host}:{port}/v1/completions"  # OpenAI-compatible completions path

# Request parameters.
payload = {
    "model": "/model/QwQ-32B-Preview",        # model name (the path the server was launched with)
    "prompt": "Describe autumn in Beijing",   # input prompt
    "max_tokens": 512,                        # maximum generation length
    "temperature": 0.7,                       # temperature; controls output diversity
    "top_p": 0.95                             # nucleus sampling probability
}

# Send the POST request.
response = requests.post(api_url, json=payload)

# Check the response.
if response.status_code == 200:
    result = response.json()
    print("##################### Result: ", result)
    print("##################### Generated text:", result["choices"][0]["text"])
else:
    print("Error:", response.status_code, response.text)