Deploying a large language model (LLM) inference endpoint on a serverless platform such as Azure Functions looks like a perfect match at first glance: on-demand scaling, no server operations burden, seamless integration with the cloud ecosystem. In real projects, however, this seemingly simple architecture hides serious cost and stability risks. An unprotected, unoptimized LLM Function App endpoint can quickly turn into a financial disaster under a traffic spike or malicious requests: every invocation consumes compute and may also incur expensive third-party model API charges.
The concrete challenge we faced: in a highly cost-sensitive production environment, how do we keep the elasticity of Azure Functions while building a solid front layer for the backend LLM service, one with advanced caching, fine-grained traffic control, and unified observability?
Option A: A pure Azure-native solution
The most direct approach is to rely entirely on components within the Azure ecosystem. In practice that means putting Azure API Management (APIM) in front of Azure Functions as the gateway.
graph TD
    Client --> APIM[Azure API Management];
    APIM -- Inbound Policies --> FunctionApp[Azure Function LLM Endpoint];
    FunctionApp -- OpenAI/LLM API Call --> LLM;
    FunctionApp --> APIM;
    APIM -- Outbound Policies --> Client;
Advantages:
- Fully managed services: both APIM and Functions are PaaS offerings, which greatly reduces the load on the operations team. There is no underlying infrastructure to maintain, patch, or scale.
- Tight integration: native integration with Azure AD, Application Insights, and Log Analytics works well, so authentication and basic monitoring are relatively straightforward to configure.
- Built-in policy support: APIM ships with caching policies (`cache-lookup`, `cache-store`) and a rate-limiting policy (`rate-limit-by-key`).
Disadvantages and pitfalls:
- Rigid cost model: APIM pricing is tiered. Scenarios that need high performance and VNet integration effectively require the Premium tier, a sizable fixed cost that runs counter to the pay-per-use philosophy of serverless.
- Limited caching: APIM cache keys are normally built from the URL path, query parameters, or headers. For an LLM service the essential part of the request is the `prompt` in the POST body. Using a hash of the request body as the cache key in APIM requires complex, potentially slow inbound policy XML, which is clumsy and hard to maintain in practice. A common mistake is to overlook this and end up with a cache that never hits.
- Inflexible rate limiting: `rate-limit-by-key` works, but richer scenarios such as "user A on plan 1 gets 10 requests per minute, user B on plan 2 gets 60 per minute, and total system QPS must stay under 1000" again require intricate policy configuration and offer little room for dynamic adjustment.
- Stack lock-in: this option ties the entire API lifecycle to Azure. If a hybrid-cloud deployment is needed later, or part of the backend moves to another platform (such as Kubernetes), the migration cost is very high.
In real projects we found APIM caching to be close to useless for an LLM workload where requests are mostly POST and the key information lives in the body. Every similar request passed straight through APIM and hit the expensive Function App, so the risk of runaway cost was high.
Option B: A hybrid architecture with an APISIX gateway and a serverless backend
To address these problems we evaluated a second option: put a self-hosted, high-performance API gateway, Apache APISIX, at the traffic entry point, with Azure Functions behind it. APISIX can run on Azure Kubernetes Service (AKS) or on virtual machines.
graph TD
    subgraph Azure Cloud
        FunctionApp[Azure Function LLM Endpoint]
        AKS[AKS Cluster]
    end
    Client --> APISIX[APISIX Ingress / Gateway on AKS];
    subgraph APISIX Processing
        direction LR
        A[Auth Plugin] --> B(Rate Limit Plugin) --> C{Cache Plugin};
    end
    APISIX --> C;
    C -- Cache Hit --> APISIX_Response[Serve from Cache];
    C -- Cache Miss --> FunctionApp;
    FunctionApp -- LLM API Call --> LLM[External LLM Service];
    FunctionApp --> APISIX_Cache_Write[Write to Cache];
    APISIX_Cache_Write --> APISIX_Response;
    APISIX_Response --> Client;
Advantages:
- Flexibility and performance: APISIX is built on Nginx and LuaJIT and is extremely fast. Its plugin mechanism is fully dynamic: plugins can be updated through the API in real time, with no service restart.
- Powerful caching: APISIX's `proxy-cache` plugin accepts Nginx variables as cache keys, so a hash of the request body can be made part of the key. This is exactly what an LLM service needs and solves the caching pain point cleanly.
- Fine-grained, distributed rate limiting: the `limit-req`, `limit-conn`, and `limit-count` plugins can be combined, and Redis-backed distributed limiting keeps policies accurate as the APISIX cluster scales out. Complex limits keyed on dynamic dimensions such as JWT claims or API keys are easy to express (a Redis-backed sketch follows this list).
- Cloud-native, open-source ecosystem: APISIX is an Apache Software Foundation top-level project and integrates cleanly with Prometheus, SkyWalking, Fluentd, and other cloud-native tooling, which lets us build a unified observability platform decoupled from any single cloud vendor.
- Cost efficiency: the AKS cluster does add operational cost, but precise caching saves a large amount of LLM invocation spend. Past a certain traffic level the total cost of ownership (TCO) is far lower than the APIM Premium option.
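As referenced in the rate-limiting point above, here is a minimal sketch of what Redis-backed distributed limiting can look like on a route. It assumes a Redis instance reachable at `redis.internal:6379`; the hostname and quota values are placeholders, not recommendations:

```yaml
# Sketch only: 60 requests per minute per consumer, counted in Redis so that
# every APISIX replica shares the same counters.
- name: limit-count
  enable: true
  config:
    count: 60                    # allowed requests per window
    time_window: 60              # window length in seconds
    key_type: "var"
    key: "consumer_name"         # one bucket per authenticated consumer
    policy: "redis"              # keep counters in Redis instead of local memory
    redis_host: "redis.internal" # placeholder host
    redis_port: 6379
    redis_database: 0
    rejected_code: 429
```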
Disadvantages to weigh:
- Operational responsibility: the team owns deployment, maintenance, and high availability of the APISIX cluster. Responsibility shifts from PaaS toward IaaS/CaaS, which requires corresponding Kubernetes operations experience.
- A new component: the architecture gains another moving part, which brings a learning curve and a potential point of failure.
Final decision:
Because our workload contains a large volume of repetitive, high-frequency queries (for example customer-support knowledge-base Q&A and extraction of data in fixed formats), the cost savings from caching were decisive. The request-body-based caching that APISIX offers has no effective substitute in Option A, so we chose Option B. The added operational cost was judged acceptable in exchange for full control over system behavior and long-term architectural flexibility.
Core implementation and code
The key code and configuration for building this architecture follows.
1. Azure Function (Python v2 programming model)
This is the backend LLM service, implemented in Python. The code is written to be robust, with logging, configuration management, and error handling.
function_app.py:
import azure.functions as func
import openai
import logging
import os
import json
from hashlib import md5
# --- Configuration ---
# A common mistake is hardcoding keys. Always use environment variables.
# In Azure Functions, these are configured in "Application settings".
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
OPENAI_ENDPOINT = os.environ.get("OPENAI_ENDPOINT")
OPENAI_API_VERSION = os.environ.get("OPENAI_API_VERSION", "2023-07-01-preview")
OPENAI_MODEL_NAME = os.environ.get("OPENAI_MODEL_NAME", "gpt-4")
# Initialize the Azure OpenAI client
# It's a best practice to initialize clients outside the function handler
# to reuse connections across multiple invocations.
try:
client = openai.AzureOpenAI(
api_key=OPENAI_API_KEY,
azure_endpoint=OPENAI_ENDPOINT,
api_version=OPENAI_API_VERSION,
)
except Exception as e:
logging.error(f"Failed to initialize AzureOpenAI client: {e}")
client = None
# --- Function App Definition ---
app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)
@app.route(route="generate", methods=["POST"])
def generate_completion(req: func.HttpRequest) -> func.HttpResponse:
"""
HTTP trigger function to generate text completion using an LLM.
"""
# Generate a unique invocation ID for tracing purposes.
invocation_id = req.headers.get("X-Request-ID", "N/A")
logging.info(f"Invocation ID [{invocation_id}]: Python HTTP trigger function processed a request.")
if not client:
logging.error(f"Invocation ID [{invocation_id}]: OpenAI client is not initialized.")
return func.HttpResponse(
"Internal Server Error: LLM service is not configured.",
status_code=500,
mimetype="application/json"
)
try:
req_body = req.get_json()
except ValueError:
logging.error(f"Invocation ID [{invocation_id}]: Invalid JSON in request body.")
return func.HttpResponse(
json.dumps({"error": "Invalid JSON format"}),
status_code=400,
mimetype="application/json"
)
prompt = req_body.get("prompt")
if not prompt or not isinstance(prompt, str):
logging.warning(f"Invocation ID [{invocation_id}]: 'prompt' field is missing or not a string.")
return func.HttpResponse(
json.dumps({"error": "'prompt' must be a non-empty string"}),
status_code=400,
mimetype="application/json"
)
# --- Core Logic ---
try:
logging.info(f"Invocation ID [{invocation_id}]: Sending request to LLM for prompt hash: {md5(prompt.encode()).hexdigest()}")
response = client.chat.completions.create(
model=OPENAI_MODEL_NAME,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
],
temperature=0.7,
max_tokens=800,
)
completion_text = response.choices[0].message.content
logging.info(f"Invocation ID [{invocation_id}]: Successfully received completion from LLM.")
# Structure the response for the client
response_body = {
"data": {
"completion": completion_text,
"model": response.model,
"usage": {
"prompt_tokens": response.usage.prompt_tokens,
"completion_tokens": response.usage.completion_tokens,
"total_tokens": response.usage.total_tokens,
}
},
"status": "success"
}
return func.HttpResponse(
json.dumps(response_body),
status_code=200,
mimetype="application/json"
)
except openai.APIError as e:
# Handle specific OpenAI errors
        logging.error(f"Invocation ID [{invocation_id}]: OpenAI API returned an error: {e}")
return func.HttpResponse(
json.dumps({"error": "The LLM service failed to process the request.", "details": str(e)}),
status_code=502, # Bad Gateway, as we are a gateway to the LLM service
mimetype="application/json"
)
except Exception as e:
# Generic error handler
logging.error(f"Invocation ID [{invocation_id}]: An unexpected error occurred: {e}", exc_info=True)
return func.HttpResponse(
json.dumps({"error": "An internal server error occurred."}),
status_code=500,
mimetype="application/json"
)
host.json: make sure detailed application logging is enabled.
{
"version": "2.0",
"logging": {
"applicationInsights": {
"samplingSettings": {
"isEnabled": true,
"excludedTypes": "Request"
}
},
"logLevel": {
"default": "Information"
}
},
"extensions": {
"http": {
"routePrefix": "api"
}
}
}
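Before putting a gateway in front of it, it is worth smoke-testing the Function directly. A minimal sketch, assuming the app is published at `https://your-function-app-name.azurewebsites.net` with the `api` route prefix shown above and a function key exported as `FUNCTION_KEY` (URL and key are placeholders):

```python
# smoke_test.py - call the Function endpoint directly, bypassing the gateway.
import json
import os

import requests

FUNCTION_URL = "https://your-function-app-name.azurewebsites.net/api/generate"  # placeholder
FUNCTION_KEY = os.environ["FUNCTION_KEY"]  # the Function App's function key

resp = requests.post(
    FUNCTION_URL,
    headers={
        "x-functions-key": FUNCTION_KEY,   # supplied by APISIX in the real setup
        "X-Request-ID": "smoke-test-001",  # picked up by the function for log correlation
    },
    json={"prompt": "Summarize the benefit of response caching in one sentence."},
    timeout=60,
)

print(resp.status_code)
print(json.dumps(resp.json(), indent=2, ensure_ascii=False))
```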
2. APISIX core configuration (declarative YAML)
Assuming APISIX is already deployed on AKS, we use CRDs such as `ApisixRoute` and `ApisixUpstream` for declarative configuration.
llm-service.yaml:
# llm-service.yaml
apiVersion: apisix.apache.org/v2
kind: ApisixUpstream
metadata:
name: azure-function-llm-upstream
spec:
  # External service: point APISIX at the Function App's public hostname over HTTPS.
  # (externalNodes requires apisix-ingress-controller >= 1.6; on older versions
  # the upstream has to be created through the Admin API instead.)
  externalNodes:
    - type: Domain
      name: your-function-app-name.azurewebsites.net
      port: 443
      weight: 100
  scheme: https
  # Azure's front end routes on the Host header, so the gateway must send the
  # upstream node's own hostname ('node'), not the client's original Host.
  passHost: node
  # Reusing connections (HTTP keep-alive) toward the Function App cuts TLS
  # handshake overhead; tune the upstream keepalive pool if your controller
  # version exposes it.
---
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
name: route-llm-service
spec:
http:
- name: llm-generate-rule
match:
hosts:
- "api.yourdomain.com"
paths:
- "/v1/llm/generate"
      # Reference the external upstream defined above.
      upstreams:
        - name: azure-function-llm-upstream
      # The public path /v1/llm/generate is rewritten to the Function's internal
      # path /api/generate by the proxy-rewrite plugin below.
plugins:
        # --- Plugin 1: Authentication ---
        # We assume each caller's API key arrives in the 'X-API-KEY' header.
        # For production, jwt-auth is usually a better choice.
        - name: key-auth
          enable: true
          config:
            # Route-side config only tells APISIX where to read the key from.
            # The keys themselves are defined on Consumer objects (see the
            # ApisixConsumer sketch further below), never on the route.
            header: "X-API-KEY"
        # --- Plugin 2: Caching - the core of cost control ---
        - name: proxy-cache
          enable: true
          config:
            # The cache key combines several variables:
            # $consumer_name keeps entries per-user, $uri scopes them per route,
            # and a hash of the POST body distinguishes different prompts.
            # Note: $md5_of_request_body is not a built-in Nginx/APISIX variable;
            # it is assumed to be populated by a small custom plugin or a
            # serverless-pre-function step that reads and hashes the request body.
            cache_key:
              - "$consumer_name"
              - "$uri"
              - "$md5_of_request_body"
            # proxy-cache only caches GET/HEAD by default; POST must be listed
            # explicitly or the LLM endpoint will never be cached.
            cache_method: ["POST"]
            cache_ttl: 3600             # seconds; depends on business logic
            cache_zone: llm_cache_zone  # must be pre-declared in APISIX's config.yaml
            cache_http_status: [200]    # only cache successful responses
        # --- Plugin 3: Rate limiting - the resilience layer ---
        - name: limit-req
          enable: true
          config:
            # limit-req is a leaky bucket in requests per second: 10 req/s
            # sustained with a burst of 5. For per-minute quotas, use
            # limit-count instead (see the Redis-backed sketch earlier).
            rate: 10
            burst: 5
            # Key on the consumer name so every authenticated user gets an
            # independent bucket.
            key_type: "var"
            key: "consumer_name"
            rejected_code: 429
            rejected_msg: "Too many requests. Please try again later."
        # --- Plugin 4: Request rewriting & header injection ---
        - name: proxy-rewrite
          enable: true
          config:
            # Rewrite the public path to the Function App's internal route.
            uri: "/api/generate"
            headers:
              set:
                # Azure Functions requires a function key; injecting it at the
                # gateway keeps the credential out of client hands. Store it
                # securely (e.g. a Kubernetes Secret), not in plain YAML.
                x-functions-key: "{{AZURE_FUNCTION_KEY_FROM_SECRET}}"
                # Forward a unique request ID for end-to-end tracing.
                X-Request-ID: "$request_id"
In this configuration:
- `ApisixUpstream` defines the backend service, i.e. the address of the Azure Function App. `passHost: node` matters here: Azure's front end routes on the Host header, so the gateway must send the Function App's own hostname rather than the client's original host.
- `proxy-cache` is the core of cost control. How the `cache_key` is built is critical: combining the consumer name, the URI, and a hash of the request body ensures that only an identical prompt from the same user hits the cache.
- `limit-req` gives each API consumer an independent request-rate bucket, so a single abusive user cannot drag down the whole service or run up unexpected charges.
- `proxy-rewrite` rewrites the public path to the internal Function route and injects `x-functions-key` before the request goes upstream. This is an important security practice: the backend credential is decoupled from clients and managed centrally at the gateway.
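Two supporting pieces referenced in the comments above are worth sketching. First, the `llm_cache_zone` used by `proxy-cache` must be declared in APISIX's static `config.yaml` before it can be referenced. A minimal sketch with placeholder sizes and path; the exact layout can differ slightly between APISIX releases, so check your version's `config-default.yaml`:

```yaml
# config.yaml (APISIX static configuration) - declare the disk-backed cache zone.
apisix:
  proxy_cache:
    zones:
      - name: llm_cache_zone        # must match cache_zone in the proxy-cache plugin
        memory_size: 50m            # shared memory for cache keys and metadata
        disk_size: 5G               # on-disk space for cached LLM responses
        disk_path: "/tmp/llm_cache" # placeholder path
        cache_levels: "1:2"
```

Second, because `key-auth` keys belong on consumers rather than routes, each tenant gets its own `ApisixConsumer` object. A sketch with placeholder names; in practice the key should come from a Kubernetes Secret rather than plain YAML:

```yaml
# consumer-tenant-a.yaml - one consumer per tenant.
apiVersion: apisix.apache.org/v2
kind: ApisixConsumer
metadata:
  name: tenant-a
spec:
  authParameter:
    keyAuth:
      value:
        key: "tenant-a-api-key"   # the value clients send in the X-API-KEY header
```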
Extensibility and limits of the architecture
This hybrid pattern extends well. If another team later ships an LLM model on AWS Lambda that we want to A/B test, all it takes is one more upstream and one more route rule in APISIX; the `traffic-split` plugin then splits traffic by ratio, completely transparently to clients (a sketch follows). That is hard to achieve elegantly with a single-vendor stack.
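A sketch of what that split could look like, appended to the route's plugin list; the Lambda hostname and the 90/10 ratio are placeholder assumptions:

```yaml
# Sketch only: send ~10% of traffic to a hypothetical Lambda-backed model and
# let the remaining ~90% fall through to the route's default upstream.
- name: traffic-split
  enable: true
  config:
    rules:
      - weighted_upstreams:
          - upstream:
              type: roundrobin
              scheme: https
              pass_host: node
              nodes:
                "lambda-llm.example.com:443": 1   # placeholder endpoint
            weight: 10
          - weight: 90   # no upstream given: use the route's own upstream
```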
That said, the architecture has clear boundaries. First, the caching strategy only pays off for highly repetitive, idempotent requests. For highly personalized, multi-turn conversational AI the request body differs on almost every call, the cache hit rate approaches zero, and the `proxy-cache` plugin loses most of its value. In that case cost control has to shift to smarter model routing (simple questions to a small model, complex ones to a large model, as sketched below) or to prompt engineering that compresses token usage.
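As a rough illustration of that routing idea, here is a minimal sketch of a heuristic router that could sit inside the Function handler; the length threshold and the model names (`gpt-4o-mini` / `gpt-4o`) are placeholder assumptions, not a tuned policy:

```python
import os

# Hypothetical model router: a cheap model for short or simple prompts,
# the expensive model only when the prompt looks complex.
SMALL_MODEL = os.environ.get("SMALL_MODEL_NAME", "gpt-4o-mini")  # placeholder deployments
LARGE_MODEL = os.environ.get("LARGE_MODEL_NAME", "gpt-4o")

def pick_model(prompt: str) -> str:
    """Very naive complexity heuristic; replace with rules or a classifier."""
    looks_complex = len(prompt) > 800 or any(
        kw in prompt.lower() for kw in ("step by step", "analyze", "compare")
    )
    return LARGE_MODEL if looks_complex else SMALL_MODEL

# Inside generate_completion(), the call would then become:
# response = client.chat.completions.create(
#     model=pick_model(prompt),
#     messages=[...],
# )
```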
Second, the APISIX cluster itself is not free to run. However fast it is, a highly available APISIX deployment needs a supporting etcd cluster, a monitoring and alerting stack (Prometheus/Grafana), and real operations expertise. For a very low-traffic, cost-insensitive internal application, the simpler Azure Functions + APIM combination may well be more cost-effective overall. The value of this architecture shows once the service scales and the requirements on performance, cost, and control rise; at that point it provides a professional and flexible solution.