Optimization process

Evaluation

Optimization through model

The selection of model directly impacts your accuracy, latency, and cost, with smaller models providing lower latency and cost.

The first step in optimization is to achieve good enough accuracy for the production use case. Use the most powerful model to engineer a prompt and RAG setup that produces satisfactory results (accuracy, consistent behaviour). Collect prompt-result pairs for evaluations, few-shot learning, or fine-tuning.

Then, try a smaller model using zero-shot or few-shot learning (with the collected prompt-result pairs). Check whether it maintains accuracy while reducing cost and latency.

If the accuracy is not satisfactory, you may try to fine-tune the smaller model with the prompt-result pairs generated earlier.
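For illustration, a minimal evaluation sketch, assuming an OpenAI-style client; the model name, `collected_pairs`, and the toy `score` helper are placeholders you would replace with your own:

```python
from openai import OpenAI

client = OpenAI()

def score(answer: str, reference: str) -> float:
    """Toy scorer: exact match. Swap in a rubric or LLM-as-judge in practice."""
    return float(answer.strip() == reference.strip())

def evaluate_candidate(model: str, pairs: list[dict]) -> float:
    """Replay collected prompt-result pairs against a smaller candidate model.

    pairs: [{"prompt": ..., "reference": ...}, ...] collected with the large model.
    """
    hits = 0.0
    for pair in pairs:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": pair["prompt"]}],
        )
        answer = response.choices[0].message.content
        hits += score(answer, pair["reference"])
    return hits / len(pairs)

# accuracy = evaluate_candidate("smaller-model-name", collected_pairs)
```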

Optimization through task and pipeline design

Do you need the LLM in the first place?

Consider whether the task:

  1. Should be done in the first place. Don’t automate something that should not be done at all.
  2. Could be solved with classical methods that are a better fit

Consider also the constraints on input and output:

  • If your output is highly constrained, hard-code it
  • If your input is highly constrained, generate a few responses in advance and show them to the user, without showing the same response more than once (see the sketch after this list)
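A minimal sketch of the second point, assuming a hypothetical fixed set of intents, each with a few pre-generated responses, so no model call is made at request time:

```python
import random

# Hypothetical intents and pre-generated responses.
PREGENERATED = {
    "greeting": ["Hi! How can I help?", "Hello! What can I do for you?"],
    "goodbye": ["Thanks for stopping by!", "Goodbye, see you next time!"],
}
already_shown: dict[str, set[str]] = {intent: set() for intent in PREGENERATED}

def respond(intent: str) -> str:
    """Pick a pre-generated response, avoiding repeats until all have been shown."""
    unused = [r for r in PREGENERATED[intent] if r not in already_shown[intent]]
    if not unused:                      # every response shown once; reset the tracker
        already_shown[intent].clear()
        unused = PREGENERATED[intent]
    choice = random.choice(unused)
    already_shown[intent].add(choice)
    return choice
```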

Minimizing output tokens

Generation of output tokens takes most of the task time (inference time scales almost linearly with the number of output tokens). Ask the model to provide shorter answers (“be concise”, “in 3 sentences”, “in 20 words”, etc.).

Providing examples or fine-tuning for shorter answers may also help.

With structured outputs, consider shortening function names, omitting named arguments, coalescing parameters, and the like.

Consider using max_tokens and stop sequences to interrupt the generation early.
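A minimal sketch, assuming the OpenAI Chat Completions API (other providers expose equivalent parameters); the model name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()

# Cap the answer at 60 output tokens and stop as soon as the model starts
# a second list item, whichever comes first.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the report in one sentence."}],
    max_tokens=60,
    stop=["\n2."],
)
print(response.choices[0].message.content)
```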

Minimizing input tokens

Processing of input tokens contributes far less to task time (a 50% cut in prompt length yields only about a 1-5% latency improvement).

If you work with lengthy prompts, you may try to fine-tune the model, filter the context input (prune RAG results, clean HTML), or leverage prompt caching.
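A minimal sketch of context filtering; `clean_html` and `prune_context` are hypothetical helpers, and the thresholds are arbitrary:

```python
import re

def clean_html(html: str) -> str:
    """Crude tag stripper; a real pipeline might use a proper HTML parser."""
    return re.sub(r"<[^>]+>", " ", html)

def prune_context(scored_chunks: list[tuple[float, str]],
                  top_k: int = 3, max_chars: int = 4000) -> str:
    """Keep only the top-scoring retrieved chunks and cap the total context length."""
    best = sorted(scored_chunks, key=lambda item: item[0], reverse=True)[:top_k]
    context = "\n\n".join(clean_html(text) for _, text in best)
    return context[:max_chars]
```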

Minimizing number of requests

You may consider reducing the number of requests made. For example, a pipeline may have multiple steps, each making a separate request. You can combine the instructions into a single prompt, divided into sections that represent the steps, to reduce the number of requests.

Leverage structured output to get a separate response for each step.
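A minimal sketch of combining three pipeline steps into one request, assuming an OpenAI-style client with JSON mode; the model name and ticket text are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()

ticket_text = "The app crashes when I upload a photo."  # example input

# One request carries all pipeline steps as named sections; the model returns
# a single JSON object with one field per step instead of three round trips.
combined_prompt = f"""You will process the ticket below in three steps.

Step 1 - classify: label the ticket as "bug", "feature" or "question".
Step 2 - summarize: summarize the ticket in one sentence.
Step 3 - reply: draft a short reply to the customer.

Return a JSON object with the keys "classification", "summary" and "reply".

Ticket:
{ticket_text}
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",                      # placeholder model name
    messages=[{"role": "user", "content": combined_prompt}],
    response_format={"type": "json_object"},  # structured output, one field per step
)
steps = json.loads(response.choices[0].message.content)
print(steps["classification"], steps["summary"], steps["reply"])
```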

Parallelize steps

If your pipeline allows it, parallelize calls to the model. If your pipeline is strictly sequential, you may also try speculative execution.

If Step 2 depends on Step 1, and the result of Step 1 can be assumed (for example, it is a classification task with a dominant class), execute both steps at the same time: Step 2 assumes an answer from Step 1 (for example, the most common class) and generates its result based on that assumption. If the actual result from Step 1 turns out to be different, cancel the speculative call and regenerate the Step 2 response with the actual result.
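A minimal sketch of speculative execution, assuming the async OpenAI client and a hypothetical two-step ticket pipeline (classify, then reply); model names and prompts are placeholders:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def classify(ticket: str) -> str:
    """Step 1: classify the ticket as "bug" or "question"."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": f'Answer with one word, "bug" or "question": {ticket}'}],
    )
    return response.choices[0].message.content.strip().lower()

async def draft_reply(ticket: str, category: str) -> str:
    """Step 2: draft a reply, conditioned on the category from Step 1."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a short reply to this {category} report: {ticket}"}],
    )
    return response.choices[0].message.content

async def handle(ticket: str) -> str:
    assumed = "bug"  # the most common class, assumed while Step 1 is still running
    category_task = asyncio.create_task(classify(ticket))
    speculative_task = asyncio.create_task(draft_reply(ticket, assumed))
    category = await category_task
    if category == assumed:
        return await speculative_task  # speculation paid off: no extra latency
    speculative_task.cancel()          # wrong guess: discard and regenerate
    return await draft_reply(ticket, category)

# asyncio.run(handle("The app crashes when I upload a photo."))
```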

Responsiveness to user actions

You can also improve the perceived speed of your solution. Consider:

  • Streaming responses (see the sketch after this list)
  • Chunking the input, processing it chunk by chunk, and displaying the results for each chunk as they arrive
  • Showing the model’s progress, for example with a progress bar or a log of what is happening behind the scenes
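A minimal streaming sketch, assuming the OpenAI Chat Completions API; the model name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()

# Print tokens as they are generated instead of waiting for the full answer;
# perceived latency drops even though total generation time is similar.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain prompt caching in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```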

Optimization through prompt caching

When a prompt contains repetitive content, you may leverage prompt caching to reduce latency and cost. The cache matches only exact prefixes of the prompt, so the repetitive content should be placed at the beginning. Note that different types of content can be cached (messages, images, tool definitions, etc.) and that the cache is periodically refreshed.
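A minimal sketch of the prefix-first layout, assuming an OpenAI-style API with automatic prefix caching; the model name and the static prompt content are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Static, repeated content (instructions, reference material, tool definitions)
# goes first so successive requests share the same cacheable prefix; only the
# short, user-specific part varies at the end of the prompt.
STATIC_SYSTEM_PROMPT = "You are a support assistant.\n\n<long reference document here>"

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # identical across requests
            {"role": "user", "content": question},                # varying suffix
        ],
    )
    return response.choices[0].message.content
```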

Optimization through predicted outputs

When a large part of the response is known beforehand (e.g. a long form that requires input only in specific fields), you may try to leverage predicted outputs to reduce latency.
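A minimal sketch, assuming OpenAI’s predicted outputs parameter (other providers may differ); the model name, file, and edit request are placeholders:

```python
from openai import OpenAI

client = OpenAI()

with open("config.py") as f:  # the file we want lightly edited
    original_code = f.read()

# Most of the answer is the unchanged file; passing it as a prediction lets the
# model confirm the known tokens quickly and generate only the edited parts.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; predicted outputs require a supporting model
    messages=[{
        "role": "user",
        "content": f"Change the default port to 8080 and return the full file:\n\n{original_code}",
    }],
    prediction={"type": "content", "content": original_code},
)
print(response.choices[0].message.content)
```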

Improving infrastructure

You may also try to run the inference on faster, more powerful hardware.

Optimization of inference

LLM solutions can also be optimized through inference optimization.