Deploying a large language model (LLM) inference endpoint on a serverless platform such as Azure Functions looks like a perfect match at first glance: on-demand scaling, no server operations burden, seamless integration with the cloud ecosystem. In real projects, however, this seemingly simple architecture hides serious cost and stability risks. An unprotected, unoptimized LLM Function App endpoint can quickly turn into a financial disaster under a traffic spike or malicious requests: every invocation consumes compute and may also incur expensive third-party model API charges.
The concrete challenge we faced: in a highly cost-sensitive production environment, how do we keep the elasticity of Azure Functions while building a solid front layer for the backend LLM service, one with advanced caching, fine-grained traffic control, and unified observability?
Option A: A pure Azure-native solution
The most direct approach is to rely entirely on components within the Azure ecosystem. In practice that means putting Azure API Management (APIM) in front of Azure Functions as the gateway.
graph TD
    Client --> APIM[Azure API Management];
    APIM -- Inbound Policies --> FunctionApp[Azure Function LLM Endpoint];
    FunctionApp -- OpenAI/LLM API Call --> LLM;
    FunctionApp --> APIM;
    APIM -- Outbound Policies --> Client;
Advantages:
- Fully managed services: both APIM and Functions are PaaS offerings, which greatly reduces the load on the operations team. There is no underlying infrastructure to maintain, patch, or scale.
- Tight integration: native integration with Azure AD, Application Insights, and Log Analytics works well, so authentication and basic monitoring are relatively straightforward to configure.
- Built-in policy support: APIM ships with caching policies (`cache-lookup`, `cache-store`) and a rate-limiting policy (`rate-limit-by-key`).
Disadvantages and pitfalls:
- Rigid cost model: APIM pricing is tiered. Scenarios that need high performance and VNet integration effectively require the Premium tier, a sizable fixed cost that runs counter to the pay-per-use philosophy of serverless.
- Limited caching: APIM cache keys are normally built from the URL path, query parameters, or headers. For an LLM service the essential part of the request is the `prompt` in the POST body. Using a hash of the request body as the cache key in APIM requires complex, potentially slow inbound policy XML, which is clumsy and hard to maintain in practice. A common mistake is to overlook this and end up with a cache that never hits.
- Inflexible rate limiting: `rate-limit-by-key` works, but richer scenarios such as "user A on plan 1 gets 10 requests per minute, user B on plan 2 gets 60 per minute, and total system QPS must stay under 1000" again require intricate policy configuration and offer little room for dynamic adjustment.
- Stack lock-in: this option ties the entire API lifecycle to Azure. If a hybrid-cloud deployment is needed later, or part of the backend moves to another platform (such as Kubernetes), the migration cost is very high.
In real projects we found APIM caching to be close to useless for an LLM workload where requests are mostly POST and the key information lives in the body. Every similar request passed straight through APIM and hit the expensive Function App, so the risk of runaway cost was high.
Option B: A hybrid architecture with an APISIX gateway and a serverless backend
To address these problems we evaluated a second option: put a self-hosted, high-performance API gateway, Apache APISIX, at the traffic entry point, with Azure Functions behind it. APISIX can run on Azure Kubernetes Service (AKS) or on virtual machines.
graph TD
    subgraph Azure Cloud
        FunctionApp[Azure Function LLM Endpoint]
        AKS[AKS Cluster]
    end
    Client --> APISIX[APISIX Ingress / Gateway on AKS];
    subgraph APISIX Processing
        direction LR
        A[Auth Plugin] --> B(Rate Limit Plugin) --> C{Cache Plugin};
    end
    APISIX --> C;
    C -- Cache Hit --> APISIX_Response[Serve from Cache];
    C -- Cache Miss --> FunctionApp;
    FunctionApp -- LLM API Call --> LLM[External LLM Service];
    FunctionApp --> APISIX_Cache_Write[Write to Cache];
    APISIX_Cache_Write --> APISIX_Response;
    APISIX_Response --> Client;
Advantages:
- Flexibility and performance: APISIX is built on Nginx and LuaJIT and is extremely fast. Its plugin mechanism is fully dynamic: plugins can be updated through the API in real time, with no service restart.
- Powerful caching: APISIX's `proxy-cache` plugin accepts Nginx variables as cache keys, so a hash of the request body can be made part of the key. This is exactly what an LLM service needs and solves the caching pain point cleanly.
- Fine-grained, distributed rate limiting: the `limit-req`, `limit-conn`, and `limit-count` plugins can be combined, and Redis-backed distributed limiting keeps policies accurate as the APISIX cluster scales out. Complex limits keyed on dynamic dimensions such as JWT claims or API keys are easy to express (a Redis-backed sketch follows this list).
- Cloud-native, open-source ecosystem: APISIX is an Apache Software Foundation top-level project and integrates cleanly with Prometheus, SkyWalking, Fluentd, and other cloud-native tooling, which lets us build a unified observability platform decoupled from any single cloud vendor.
- Cost efficiency: the AKS cluster does add operational cost, but precise caching saves a large amount of LLM invocation spend. Past a certain traffic level the total cost of ownership (TCO) is far lower than the APIM Premium option.
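As referenced in the rate-limiting point above, here is a minimal sketch of what Redis-backed distributed limiting can look like on a route. It assumes a Redis instance reachable at `redis.internal:6379`; the hostname and quota values are placeholders, not recommendations:

```yaml
# Sketch only: 60 requests per minute per consumer, counted in Redis so that
# every APISIX replica shares the same counters.
- name: limit-count
  enable: true
  config:
    count: 60                    # allowed requests per window
    time_window: 60              # window length in seconds
    key_type: "var"
    key: "consumer_name"         # one bucket per authenticated consumer
    policy: "redis"              # keep counters in Redis instead of local memory
    redis_host: "redis.internal" # placeholder host
    redis_port: 6379
    redis_database: 0
    rejected_code: 429
```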
Disadvantages to weigh:
- Operational responsibility: the team owns deployment, maintenance, and high availability of the APISIX cluster. Responsibility shifts from PaaS toward IaaS/CaaS, which requires corresponding Kubernetes operations experience.
- A new component: the architecture gains another moving part, which brings a learning curve and a potential point of failure.
Final decision:
Because our workload contains a large volume of repetitive, high-frequency queries (for example customer-support knowledge-base Q&A and extraction of data in fixed formats), the cost savings from caching were decisive. The request-body-based caching that APISIX offers has no effective substitute in Option A, so we chose Option B. The added operational cost was judged acceptable in exchange for full control over system behavior and long-term architectural flexibility.
Core implementation and code
The key code and configuration for building this architecture follows.
1. Azure Function (Python v2 programming model)
This is the backend LLM service, implemented in Python. The code is written to be robust, with logging, configuration management, and error handling.
function_app.py:
import azure.functions as func
import openai
import logging
import os
import json
from hashlib import md5
# --- Configuration ---
# A common mistake is hardcoding keys. Always use environment variables.
# In Azure Functions, these are configured in "Application settings".
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
OPENAI_ENDPOINT = os.environ.get("OPENAI_ENDPOINT")
OPENAI_API_VERSION = os.environ.get("OPENAI_API_VERSION", "2023-07-01-preview")
OPENAI_MODEL_NAME = os.environ.get("OPENAI_MODEL_NAME", "gpt-4")
# Initialize the Azure OpenAI client
# It's a best practice to initialize clients outside the function handler
# to reuse connections across multiple invocations.
try:
client = openai.AzureOpenAI(
api_key=OPENAI_API_KEY,
azure_endpoint=OPENAI_ENDPOINT,
api_version=OPENAI_API_VERSION,
)
except Exception as e:
logging.error(f"Failed to initialize AzureOpenAI client: {e}")
client = None
# --- Function App Definition ---
app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)
@app.route(route="generate", methods=["POST"])
def generate_completion(req: func.HttpRequest) -> func.HttpResponse:
"""
HTTP trigger function to generate text completion using an LLM.
"""
# Generate a unique invocation ID for tracing purposes.
invocation_id = req.headers.get("X-Request-ID", "N/A")
logging.info(f"Invocation ID [{invocation_id}]: Python HTTP trigger function processed a request.")
if not client:
logging.error(f"Invocation ID [{invocation_id}]: OpenAI client is not initialized.")
return func.HttpResponse(
"Internal Server Error: LLM service is not configured.",
status_code=500,
mimetype="application/json"
)
try:
req_body = req.get_json()
except ValueError:
logging.error(f"Invocation ID [{invocation_id}]: Invalid JSON in request body.")
return func.HttpResponse(
json.dumps({"error": "Invalid JSON format"}),
status_code=400,
mimetype="application/json"
)
prompt = req_body.get("prompt")
if not prompt or not isinstance(prompt, str):
logging.warning(f"Invocation ID [{invocation_id}]: 'prompt' field is missing or not a string.")
return func.HttpResponse(
json.dumps({"error": "'prompt' must be a non-empty string"}),
status_code=400,
mimetype="application/json"
)
# --- Core Logic ---
try:
logging.info(f"Invocation ID [{invocation_id}]: Sending request to LLM for prompt hash: {md5(prompt.encode()).hexdigest()}")
response = client.chat.completions.create(
model=OPENAI_MODEL_NAME,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
],
temperature=0.7,
max_tokens=800,
)
completion_text = response.choices[0].message.content
logging.info(f"Invocation ID [{invocation_id}]: Successfully received completion from LLM.")
# Structure the response for the client
response_body = {
"data": {
"completion": completion_text,
"model": response.model,
"usage": {
"prompt_tokens": response.usage.prompt_tokens,
"completion_tokens": response.usage.completion_tokens,
"total_tokens": response.usage.total_tokens,
}
},
"status": "success"
}
return func.HttpResponse(
json.dumps(response_body),
status_code=200,
mimetype="application/json"
)
except openai.APIError as e:
# Handle specific OpenAI errors
        logging.error(f"Invocation ID [{invocation_id}]: OpenAI API returned an error: {e}")
return func.HttpResponse(
json.dumps({"error": "The LLM service failed to process the request.", "details": str(e)}),
status_code=502, # Bad Gateway, as we are a gateway to the LLM service
mimetype="application/json"
)
except Exception as e:
# Generic error handler
logging.error(f"Invocation ID [{invocation_id}]: An unexpected error occurred: {e}", exc_info=True)
return func.HttpResponse(
json.dumps({"error": "An internal server error occurred."}),
status_code=500,
mimetype="application/json"
)
host.json: make sure detailed application logging is enabled.
{
"version": "2.0",
"logging": {
"applicationInsights": {
"samplingSettings": {
"isEnabled": true,
"excludedTypes": "Request"
}
},
"logLevel": {
"default": "Information"
}
},
"extensions": {
"http": {
"routePrefix": "api"
}
}
}
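Before putting a gateway in front of it, it is worth smoke-testing the Function directly. A minimal sketch, assuming the app is published at `https://your-function-app-name.azurewebsites.net` with the `api` route prefix shown above and a function key exported as `FUNCTION_KEY` (URL and key are placeholders):

```python
# smoke_test.py - call the Function endpoint directly, bypassing the gateway.
import json
import os

import requests

FUNCTION_URL = "https://your-function-app-name.azurewebsites.net/api/generate"  # placeholder
FUNCTION_KEY = os.environ["FUNCTION_KEY"]  # the Function App's function key

resp = requests.post(
    FUNCTION_URL,
    headers={
        "x-functions-key": FUNCTION_KEY,   # supplied by APISIX in the real setup
        "X-Request-ID": "smoke-test-001",  # picked up by the function for log correlation
    },
    json={"prompt": "Summarize the benefit of response caching in one sentence."},
    timeout=60,
)

print(resp.status_code)
print(json.dumps(resp.json(), indent=2, ensure_ascii=False))
```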
2. APISIX core configuration (declarative YAML)
Assuming APISIX is already deployed on AKS, we use CRDs such as `ApisixRoute` and `ApisixUpstream` for declarative configuration.
llm-service.yaml:
# llm-service.yaml
apiVersion: apisix.apache.org/v2
kind: ApisixUpstream
metadata:
name: azure-function-llm-upstream
spec:
  # External service: point APISIX at the Function App's public hostname over HTTPS.
  # (externalNodes requires apisix-ingress-controller >= 1.6; on older versions
  # the upstream has to be created through the Admin API instead.)
  externalNodes:
    - type: Domain
      name: your-function-app-name.azurewebsites.net
      port: 443
      weight: 100
  scheme: https
  # Azure's front end routes on the Host header, so the gateway must send the
  # upstream node's own hostname ('node'), not the client's original Host.
  passHost: node
  # Reusing connections (HTTP keep-alive) toward the Function App cuts TLS
  # handshake overhead; tune the upstream keepalive pool if your controller
  # version exposes it.
---
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
name: route-llm-service
spec:
http:
- name: llm-generate-rule
match:
hosts:
- "api.yourdomain.com"
paths:
- "/v1/llm/generate"
      # Reference the external upstream defined above.
      upstreams:
        - name: azure-function-llm-upstream
      # The public path /v1/llm/generate is rewritten to the Function's internal
      # path /api/generate by the proxy-rewrite plugin below.
plugins:
        # --- Plugin 1: Authentication ---
        # We assume each caller's API key arrives in the 'X-API-KEY' header.
        # For production, jwt-auth is usually a better choice.
        - name: key-auth
          enable: true
          config:
            # Route-side config only tells APISIX where to read the key from.
            # The keys themselves are defined on Consumer objects (see the
            # ApisixConsumer sketch further below), never on the route.
            header: "X-API-KEY"
        # --- Plugin 2: Caching - the core of cost control ---
        - name: proxy-cache
          enable: true
          config:
            # The cache key combines several variables:
            # $consumer_name keeps entries per-user, $uri scopes them per route,
            # and a hash of the POST body distinguishes different prompts.
            # Note: $md5_of_request_body is not a built-in Nginx/APISIX variable;
            # it is assumed to be populated by a small custom plugin or a
            # serverless-pre-function step that reads and hashes the request body.
            cache_key:
              - "$consumer_name"
              - "$uri"
              - "$md5_of_request_body"
            # proxy-cache only caches GET/HEAD by default; POST must be listed
            # explicitly or the LLM endpoint will never be cached.
            cache_method: ["POST"]
            cache_ttl: 3600             # seconds; depends on business logic
            cache_zone: llm_cache_zone  # must be pre-declared in APISIX's config.yaml
            cache_http_status: [200]    # only cache successful responses
        # --- Plugin 3: Rate limiting - the resilience layer ---
        - name: limit-req
          enable: true
          config:
            # limit-req is a leaky bucket in requests per second: 10 req/s
            # sustained with a burst of 5. For per-minute quotas, use
            # limit-count instead (see the Redis-backed sketch earlier).
            rate: 10
            burst: 5
            # Key on the consumer name so every authenticated user gets an
            # independent bucket.
            key_type: "var"
            key: "consumer_name"
            rejected_code: 429
            rejected_msg: "Too many requests. Please try again later."
        # --- Plugin 4: Request rewriting & header injection ---
        - name: proxy-rewrite
          enable: true
          config:
            # Rewrite the public path to the Function App's internal route.
            uri: "/api/generate"
            headers:
              set:
                # Azure Functions requires a function key; injecting it at the
                # gateway keeps the credential out of client hands. Store it
                # securely (e.g. a Kubernetes Secret), not in plain YAML.
                x-functions-key: "{{AZURE_FUNCTION_KEY_FROM_SECRET}}"
                # Forward a unique request ID for end-to-end tracing.
                X-Request-ID: "$request_id"
In this configuration:
- `ApisixUpstream` defines the backend service, i.e. the address of the Azure Function App. `passHost: node` matters here: Azure's front end routes on the Host header, so the gateway must send the Function App's own hostname rather than the client's original host.
- `proxy-cache` is the core of cost control. How the `cache_key` is built is critical: combining the consumer name, the URI, and a hash of the request body ensures that only an identical prompt from the same user hits the cache.
- `limit-req` gives each API consumer an independent request-rate bucket, so a single abusive user cannot drag down the whole service or run up unexpected charges.
- `proxy-rewrite` rewrites the public path to the internal Function route and injects `x-functions-key` before the request goes upstream. This is an important security practice: the backend credential is decoupled from clients and managed centrally at the gateway.
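Two supporting pieces referenced in the comments above are worth sketching. First, the `llm_cache_zone` used by `proxy-cache` must be declared in APISIX's static `config.yaml` before it can be referenced. A minimal sketch with placeholder sizes and path; the exact layout can differ slightly between APISIX releases, so check your version's `config-default.yaml`:

```yaml
# config.yaml (APISIX static configuration) - declare the disk-backed cache zone.
apisix:
  proxy_cache:
    zones:
      - name: llm_cache_zone        # must match cache_zone in the proxy-cache plugin
        memory_size: 50m            # shared memory for cache keys and metadata
        disk_size: 5G               # on-disk space for cached LLM responses
        disk_path: "/tmp/llm_cache" # placeholder path
        cache_levels: "1:2"
```

Second, because `key-auth` keys belong on consumers rather than routes, each tenant gets its own `ApisixConsumer` object. A sketch with placeholder names; in practice the key should come from a Kubernetes Secret rather than plain YAML:

```yaml
# consumer-tenant-a.yaml - one consumer per tenant.
apiVersion: apisix.apache.org/v2
kind: ApisixConsumer
metadata:
  name: tenant-a
spec:
  authParameter:
    keyAuth:
      value:
        key: "tenant-a-api-key"   # the value clients send in the X-API-KEY header
```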
Extensibility and limits of the architecture
This hybrid pattern extends well. If another team later ships an LLM model on AWS Lambda that we want to A/B test, all it takes is one more upstream and one more route rule in APISIX; the `traffic-split` plugin then splits traffic by ratio, completely transparently to clients (a sketch follows). That is hard to achieve elegantly with a single-vendor stack.
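A sketch of what that split could look like, appended to the route's plugin list; the Lambda hostname and the 90/10 ratio are placeholder assumptions:

```yaml
# Sketch only: send ~10% of traffic to a hypothetical Lambda-backed model and
# let the remaining ~90% fall through to the route's default upstream.
- name: traffic-split
  enable: true
  config:
    rules:
      - weighted_upstreams:
          - upstream:
              type: roundrobin
              scheme: https
              pass_host: node
              nodes:
                "lambda-llm.example.com:443": 1   # placeholder endpoint
            weight: 10
          - weight: 90   # no upstream given: use the route's own upstream
```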
That said, the architecture has clear boundaries. First, the caching strategy only pays off for highly repetitive, idempotent requests. For highly personalized, multi-turn conversational AI the request body differs on almost every call, the cache hit rate approaches zero, and the `proxy-cache` plugin loses most of its value. In that case cost control has to shift to smarter model routing (simple questions to a small model, complex ones to a large model, as sketched below) or to prompt engineering that compresses token usage.
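As a rough illustration of that routing idea, here is a minimal sketch of a heuristic router that could sit inside the Function handler; the length threshold and the model names (`gpt-4o-mini` / `gpt-4o`) are placeholder assumptions, not a tuned policy:

```python
import os

# Hypothetical model router: a cheap model for short or simple prompts,
# the expensive model only when the prompt looks complex.
SMALL_MODEL = os.environ.get("SMALL_MODEL_NAME", "gpt-4o-mini")  # placeholder deployments
LARGE_MODEL = os.environ.get("LARGE_MODEL_NAME", "gpt-4o")

def pick_model(prompt: str) -> str:
    """Very naive complexity heuristic; replace with rules or a classifier."""
    looks_complex = len(prompt) > 800 or any(
        kw in prompt.lower() for kw in ("step by step", "analyze", "compare")
    )
    return LARGE_MODEL if looks_complex else SMALL_MODEL

# Inside generate_completion(), the call would then become:
# response = client.chat.completions.create(
#     model=pick_model(prompt),
#     messages=[...],
# )
```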
Second, the APISIX cluster itself is not free to run. However fast it is, a highly available APISIX deployment needs a supporting etcd cluster, a monitoring and alerting stack (Prometheus/Grafana), and real operations expertise. For a very low-traffic, cost-insensitive internal application, the simpler Azure Functions + APIM combination may well be more cost-effective overall. The value of this architecture shows once the service scales and the requirements on performance, cost, and control rise; at that point it provides a professional and flexible solution.