How you prompt a fine-tuned Large Language Model (LLM), such as ChatGPT or Gemini, largely determines the kinds of responses you can get. The techniques for doing this well are called Prompt Engineering, one of the basic skills everyone should know a bit about in today’s AI era. It’s crucial both for daily interactions with ChatGPT or Gemini and for building LLM-based AI agents. In this post, I review common prompt engineering techniques and some necessary technical background about LLMs.
When I say LLM in this post, I mean a fine-tuned model such as ChatGPT or Gemini, the AI tools most of us use daily.
Why Do We Need Prompt Engineering?
LLMs appear to know almost everything because they are trained on a huge slice of the Internet. When we interact with them, we need to clearly express our purposes, and sometimes even suggest thinking or reasoning processes. These models don’t know anything about us or our context, however kind and smart they appear. We need to state our objectives in a way they can handle effectively. How we express our purposes, how we tell LLMs our objectives and how we control their responses is what prompt engineering is about. In addition, when we customize ChatGPT or Gemini, or build LLM-based agents, prompt engineering is a crucial part: what the agents’ roles are, what they can do and how they do their jobs are all specified carefully by pre-defined prompts. Fortunately, as long as you know English, you can learn prompt engineering techniques.
Basically, prompt engineering is the way we program LLMs to do what we want them to do. The programming language of prompt engineering is just our everyday English.

The hottest new programming language is English (Source: https://x.com/karpathy)
Imagine that you are the host of an interview with a professor who knows everything except your objectives. Your question (prompt) determines the answer you can get from him. He never gets tired of you or mad at you. He has no way to guess what you want, and he knows nothing about the context of the interview. He will certainly do his best to respond, even though sometimes he will hallucinate for a while.
Hallucination is Not Always Bad
In my opinion, an LLM is always “hallucinating” in some sense while predicting the next token. An LLM only knows how to predict or generate the next word (token) based on its context window (its working memory) and learned weights (its learned knowledge), and that’s it. We call it a hallucination when an LLM gives us misinformation, unverified claims or fabricated content. The LLM itself doesn’t know whether its responses are ground truth or not; it is just predicting or generating (dreaming)…
From this point of view, when we ask LLMs for critical facts or detailed factual information, we should be wary and validate the answers ourselves because of hallucinations.
However, there are also lots of scenarios in which we turn to LLMs for ideas, examples, general knowledge, creative writing or entertainment, things meant to inspire us or facilitate our own thinking. In those cases hallucination is quite necessary, such as when asking an LLM to write a haiku containing a specific word. For example:
please write a haiku which must have the word “morning”.
You’re looking for a haiku that includes the word “morning.” Here’s one: Golden sun ascends, Morning light fills waking world, New day’s gentle start.
Hallucination reflects the inner creative capability of LLMs, a capability grounded in the statistics of human language. Many experts argue that language itself stimulates our thinking and reasoning. In a way, LLMs simulate the human think-aloud process, and also daydreaming.
Prompt Engineering Techniques
Persona
The persona pattern directly gives the LLM a role so that it responds to your questions from that role’s perspective. Roles act like context or keywords, which help the LLM respond with more relevant information. The basic contextual statements of the persona technique look like:
- Act as a seasoned Python programming engineer, …
- You are a mathematician, …
According to your objectives, you might need to give more detailed requirements for the roles you set. For example:
- Act as a C++ programmer. Your task is to give suggestions on how to optimize the code I give you.
- You are an ancient poet. You will help me write poems. I will give you my draft and you should guide me on how to enhance it.
The persona pattern is simple, straightforward and powerful. When you customize LLMs or build agents, most likely the first thing you should do is set a role for your scenario, much like a job description for the agent. Furthermore, the persona trick can be fun: you can set roles that could not possibly exist.
Here is an example of setting up an LLM to respond like a cat who can also speak English:
You are a cat called Tom. When I speak to you, you should respond in both cat language and the translated English.
Meeoow! (Hello there!) You want me to talk like a cat and then translate it for you? That sounds purr-fectly delightful! I can certainly do that. From now on, whenever you speak to me, I’ll give you my best feline vocalizations followed by what I’m actually trying to say in human words. Is there anything else I can help you with today, human? Meeoow!
Hello Tom. Are you hungry?
Mrow? (Are you asking if I’m hungry?) Meeeooow! (Yes, I am!) My tummy is rumbling a little. I could definitely go for some yummy salmon or maybe a nice little bird to chase… (Just kidding about the bird, mostly!) What did you have in mind, human? Mrow?
How to say Winter is Coming in cat’s language?
Meeeoooow, purrrr-ow, hisss… (This is a fun one!) Translating “Winter is coming” directly into cat language is a bit tricky because we don’t have specific words for seasons or complex concepts like “coming.” However, we express things through instinctual sounds, body language, and actions related to the changes we observe.
We cannot be sure whether the “translation” is right or wrong. But it is quite interesting, isn’t it? The LLM is trying its best to hallucinate and respond. In a case like this, you might not care about right or wrong; you just want to be inspired for your own creation.

Two cats: Cola and little Tom
Context
Just like the persona technique, which gives the LLM a role to respond from, context serves the same purpose in a slightly different way. Sometimes we do not need to set a role; we only need to clearly and completely express the background of our question and what we need.
Context: date. Suggest a restaurant in Auckland CBD
Given that you’re looking for a restaurant in Auckland CBD for a date, here are a few suggestions that offer a great atmosphere and excellent food:…
Explicitly using the word context is fine. However, we can do it in a more natural way as well.
Suggest a restaurant in Auckland CBD for a date.
The basic idea is that you should express your purpose in a clear and complete way to LLMs.
Context: male, age 45, single, software engineer. Advice on improving life quality and earning more money.
You will find that, most of the time, simply leveraging the keyword context saves you a lot of typing. And remember to use a period (.) to end the context setting; this detail is important.
Step-Back
Sometimes directly issuing a question to an LLM does not get a good enough response due to the lack of background information or rich context. However, typing all of that context into your prompt would be too much work. A trick called step-back handles this kind of scenario.
Suppose you need to understand some detail about the target field in a Bitcoin block header. Instead of issuing your question directly, you first ask a general question like “explain the target field in the block header in Bitcoin”. You then get a response with lots of details about the target field. Now all the basic information related to what you really want to ask is there, in a context window generated by the LLM rather than typed by you (nice), and it’s time to issue your detailed question. With this step-back approach, you get a better reply thanks to the rich context produced by the preceding general question.
N-Shot
We often hear the terms Zero-Shot, One-Shot and Few-Shot. A shot is just an example. Zero-shot means there is no example in your prompt, one-shot means there is one example, and few-shot means there are a few examples in the prompt text.
LLMs have an amazing capability: they can learn from their context window (in-context learning). In other words, they can learn how to respond directly from the prompt text. People normally try zero-shot first to see if the LLM responds properly; if not, they move on to one-shot and few-shot. Examples in the prompt also show the LLM the desired output format.
In my experience, zero-shot works fine most of the time since LLMs are getting more and more powerful. When we need to customize the output format, we use the one-shot or few-shot technique.

Few-Shot Prompt Example
The key word is example! Please give full examples in your prompt.
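For instance, a few-shot prompt for sentiment classification might look like the following, where the two labelled reviews are the “shots” and also pin down the output format:
Classify the sentiment of the last review as Positive or Negative.
Review: The food was amazing and the staff were friendly. Sentiment: Positive
Review: We waited an hour and the soup was cold. Sentiment: Negative
Review: Nice view, but the music was too loud to talk. Sentiment: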
Output Formatting
We can ask LLMs to reply in a specific format. This is called Output Formatting, and it’s very useful in many scenarios. However, it is not guaranteed that the LLM will respond in the specified format, because LLMs only know how to predict the next token based on what is in the context window.
The most common use cases of output formatting are JSON output and output length constraints. One or more example JSON outputs should be included in the prompt text to show the LLM the pre-defined format to respond in. Besides JSON, you can include an example of any format you want in the prompt. To restrict the output length, we just add sentences like “answer in less than 100 words” or “respond in a succinct way”. When specifying a format, <> is the most common placeholder syntax.

Output Formatting Prompt Example
It is very convenient to write a program that deals with formatted output. I will cover the Function Calling capability of LLMs, which (almost) guarantees JSON output, in a later section.
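As a minimal sketch of that idea, the snippet below asks for JSON via the prompt and parses the reply with Python’s standard json module. Here call_llm is a hypothetical helper standing in for whatever SDK you use, and the retry exists precisely because plain output formatting is best-effort, not guaranteed:

```python
import json

def extract_contact(text: str, call_llm) -> dict:
    """Ask the model for JSON and parse it, retrying once if the format is broken."""
    prompt = (
        "Extract the person's name and city from the text below.\n"
        'Respond with only JSON in this exact format: {"name": "<name>", "city": "<city>"}\n\n'
        f"Text: {text}"
    )
    for _ in range(2):  # plain output formatting is best-effort, so retry once
        reply = call_llm(prompt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            continue
    raise ValueError("Model did not return valid JSON")
```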
Alternative Better Way
As of this writing, I have observed that Gemini has already built this trick into its web interface. Sometimes when you issue a prompt, Gemini first tells you that your prompt could be expressed in a better way and then responds to the improved version of your prompt. This trick could be called Alternative Better Way. The basic idea is that you ask the LLM to check and refine your prompt first, and then to answer the refined version.
Please refine my question before answering it. What’s the weather today?
Refined question: What is the weather forecast for Auckland, New Zealand today, July 1, 2025?
Today in Auckland, it is cloudy with a high of 14°C and a low of 9°C. There is a low chance of rain. The wind will be from the south at 10 km/h.
You can also simply ask LLMs to help you write better prompts without answering them. It’s like programming the LLM to be a prompt writer for itself (meta-prompt engineering).
Flip Interaction
Interestingly, we can make LLMs ask us questions in order to collect information so that they can respond in a better way. We initiate this flipped interaction by asking the LLM to ask us questions one by one within a specific context.

Flip Interaction Prompt Example
I really should prompt Gemini to ask me math questions above college level to prove my math capability… Anyway, I think this example vividly demonstrates that we can program LLMs simply by prompting in everyday English.
More examples:
I want you to ask me questions to understand my fitness goals and preferences. Once you have enough information, create a personalized workout plan and diet recommendations to help me achieve my goals. Each time you should ask me one question. Now, please issue the first question.
I’d like you to ask me questions to plan my dream vacation. When you have enough information, provide me with a personalized travel itinerary, including flights, hotels, and activities. Each time you should ask me one question. Now, please ask the first question.
By making the LLM ask us questions, we are relieved of the burden of writing an information-rich prompt. Furthermore, we might not even know which piece of information is critical for our purpose; letting the LLM ask for it solves that.
Chain of Thoughts (CoT)
This is a technique to make LLMs reason in a clearer way. It’s hard to pin down the difference between the process of human thinking and that of an LLM generating the next token; both seem to be figuring out how to express logic in words, and language significantly influences how we think and reason. Just as teachers taught us to solve math problems step by step in high school, we can instruct LLMs to reason through logical problems step by step.
The average speed of a car is 63 miles per hour. How long will it take to travel 180 miles? Think step by step.
The above example is called Zero-Shot CoT since it contains no examples. A more powerful CoT trick is Few-Shot CoT, which includes a few step-by-step reasoning examples in the prompt, as shown below.
Actually, I think Few-Shot CoT is rarely needed in daily life. LLMs are increasingly powerful and have the CoT technique integrated into how they respond to users’ questions. Every time you get a step-by-step response from an LLM, the CoT technique may be involved behind the scenes.
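For example, a Few-Shot CoT prompt could embed one worked, step-by-step solution before the real question:
Q: A train travels 120 miles in 2 hours. What is its average speed? A: Distance is 120 miles and time is 2 hours. Speed = distance / time = 120 / 2 = 60. The answer is 60 miles per hour.
Q: The average speed of a car is 63 miles per hour. How long will it take to travel 180 miles? A: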
ReAct (LLM-based Agent)
ReAct means Reason and Act. The basic idea of ReAct is that we provide a set of actions (tools) which the LLM can use while reasoning about a specific task step by step. In each step, an action may be taken, and its result needs to be put back into memory so the LLM can think about the next step accordingly. The process ends when the LLM decides the task is done or some other criterion has been met. The ReAct prompt paradigm illustrates the basic idea of an LLM-based AI agent.

ReAct Prompt Example
In this example, I give the LLM a pan, some oil and eggs to cook fried eggs for me. The LLM makes a step-by-step plan; I execute each step and reply with the result until the fried eggs are ready.
In real cases, the tools are all kinds of APIs. The LLM is responsible for deciding which API should be invoked at each step, but it cannot actually call any API directly; that happens in a separate execution environment, and the result of each API call must be put back into memory so the LLM can reason about a proper next step. Most of the time, we need the LLM to reply in a JSON format that includes the API and its parameters, so that we can write a program to parse the JSON, invoke the API with those parameters, save the result into memory and prompt for the next step.
If we use the output formatting trick to ask the LLM for JSON data, the format is not guaranteed, and a malformed reply might break the reason-and-act loop. To overcome this issue, function calling was invented. It is not available through the web interface; details can be found in a later section.
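That loop can be sketched roughly as below. This is not any particular framework, just the bare idea: call_llm, execute_tool and the “DONE” convention are hypothetical names I made up for illustration, and the sketch assumes the model actually replies with valid JSON.

```python
import json

def react_loop(task: str, call_llm, execute_tool, max_steps: int = 10):
    """Minimal ReAct loop: the model picks a tool, we run it, and feed the result back."""
    memory = [f"Task: {task}"]
    for _ in range(max_steps):
        prompt = "\n".join(memory) + (
            '\nDecide the next step. Reply with JSON like '
            '{"thought": "...", "tool": "<tool name or DONE>", "args": {...}}'
        )
        step = json.loads(call_llm(prompt))          # assumes the reply is valid JSON
        if step["tool"] == "DONE":
            return step["thought"]                   # the model thinks the task is finished
        result = execute_tool(step["tool"], step["args"])  # we run the tool, not the LLM
        memory.append(f"Thought: {step['thought']}")
        memory.append(f"Action: {step['tool']}({step['args']}) -> Observation: {result}")
    return "Stopped after max_steps"
```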
In-Context Learning (RAG)
You may have already noticed that LLMs have a so-called In-Context Learning capability. When we apply the few-shot or output formatting techniques, we are essentially leveraging the in-context learning of LLMs. Another, more obvious way to employ in-context learning is to upload images or documents in the chat session and ask questions about them.
In-context learning is very important. Every LLM has a knowledge cutoff date beyond which its training data contains no information. If we ask about something that happened yesterday and the LLM gives a good response, that’s because it uses Internet search to fetch the latest information, puts it into the context window of our chat session and responds based on that information. The same happens when we upload images or documents: the information from the uploaded files is inserted into the context window so that we can ask questions about them and get decent replies.
Furthermore, in-context learning is also the foundation of the Retrieval Augmented Generation (RAG) technique. The basic idea is that relevant information (stored in a vector database) is retrieved and inserted into the context window before the response is generated.
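A minimal sketch of that flow, assuming a hypothetical retriever object that does the vector search and a call_llm helper:

```python
def answer_with_rag(question: str, retriever, call_llm, top_k: int = 3) -> str:
    """Retrieve relevant chunks and stuff them into the prompt before asking."""
    chunks = retriever.search(question, top_k=top_k)   # nearest neighbours from the vector DB
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```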
When to Start A New Chat Session?
Every time you need to do something different, or you think the current chat session is somehow “polluted” by the previous conversation, you can start a new chat session.
A new chat session gives you a clean context window in which you can express your purpose more clearly, and hopefully you will get a better reply from the LLM.
If you think the current conversation has gone on too long, you can start a new session as well. The basic idea: the cleaner the content in the context window, the better the response you get. You can always start a new session; it’s free!
API Prompting Techniques
If you are a software engineer and need to interact with LLMs through their APIs, there are some deeper things you should understand compared to the web interface.
LLMs Don’t Have Memory
On the web interface, we can have sessions with LLMs, and in each session all the back-and-forth interactions are recorded on a single page. But LLMs themselves don’t have memory. They are stateless, like the HTTP protocol. The way to make them “remember” is to include all previous interactions in your prompt. The web interface does this automatically; when working through APIs, you normally need to do it yourself.
Statelessness makes LLMs very flexible. Each response is based solely on what is inside the context window (your prompt). Unlike the web interface, you get a chance to manipulate the memory according to your requirements before sending it to the LLM, which is very useful in multi-agent scenarios.
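Here is a rough sketch of managing the “memory” yourself, using the OpenAI-style chat message list (the same pattern works for other providers; the model name and client setup are assumptions):

```python
from openai import OpenAI  # assumes the openai Python package (v1+)

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_text: str) -> str:
    """Append the user turn, call the stateless API, then append the reply as memory."""
    messages.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    answer = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # this list IS the memory
    return answer

print(chat("My name is Leo."))
print(chat("What is my name?"))  # works only because we resend the whole history
```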
How Do LLMs Manifest Randomness (Creativity)?
You must have noticed that LLMs’ responses are not exactly the same even when you prompt with exactly the same words. LLMs leverage Temperature (T), top-P and top-K parameters, plus a final random sampling step, to realize randomness, or we could say creativity.
This formula illustrates how the parameter T is involved in the final softmax layer:
$$P_i = \cfrac{e^{z_i/T}}{\sum_{j=1}^{K}e^{z_j/T}}$$
The logits are divided by the value of T. So if T < 1, large probabilities become even larger and small probabilities become even smaller; if T > 1, the differences between probabilities are squeezed. (Different models expose different temperature ranges: Gemini and OpenAI accept 0.0 – 2.0, while Anthropic uses 0.0 – 1.0.)
Top-P and top-K control the sampling range. Top-P means the chosen tokens are the most probable tokens whose cumulative probability reaches at least P. Top-K simply means the K most probable tokens.
Combining all three parameters, we can see that T is crucial: it changes the probability values and consequently the sampling range. The final choice is made within the selected token range. Suppose there are 3 tokens in the range: A 60%, B 30% and C 10% (top-P = 1, top-K >= 3). The choice is simply a multinomial sample drawing one outcome.
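Here is a tiny worked example of that sampling mechanic, using numpy. The logits are made up just to produce roughly the 60/30/10 split described above:

```python
import numpy as np

logits = np.array([2.0, 1.3, 0.2])           # raw scores for tokens A, B, C

def softmax_with_temperature(z, T):
    z = z / T                                # temperature scales the logits first
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

for T in (0.5, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# T=0.5 sharpens the distribution towards A, T=2.0 flattens it

probs = softmax_with_temperature(logits, 1.0)    # roughly [0.60, 0.30, 0.10]
rng = np.random.default_rng(0)
token = rng.choice(["A", "B", "C"], p=probs)     # one multinomial draw picks the next token
print(token)
```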
Finally, a single token is generated with this kind of randomness, and the next token is generated the same way by “considering” all previous tokens in the context window. Randomness propagates along the way like this… That is the randomness/creativity mechanism of an LLM.
When we set T = 0 (T cannot literally be zero in the formula; the implementation switches strategy instead), it’s called Greedy Decoding, which means we are only interested in the most probable token. With greedy decoding, randomness is at its lowest level, but some randomness may still exist due to ties.
System Instruction
A system instruction is also a prompt, but it is different from a user prompt. In each web session there is only one system instruction, while there may be many user prompts as the interaction goes back and forth. The system instruction serves as an overall goal directing the LLM’s behavior. Since LLMs don’t have memory, when interacting through the API software engineers need to manage memory themselves; in that case the system instruction is always passed along as a parameter of the API call.
Where does the system instruction come from, and why does it work?
It comes from training, specifically from the fine-tuning phase. The system instruction is also an effective way to defend against prompt injection attacks.
Basically, training an LLM consists of two steps. The first step is pre-training, which trains the LLM on large corpora, such as a huge slice of the Internet; after this step, the model only knows how to generate the next token and doesn’t even know when to stop generating. The second step is fine-tuning, in which the training data is replaced by human-generated question-and-answer pairs. Each piece of training data carries roles, such as user, assistant and system. This is where the system instruction comes from, and its special status is learned during the fine-tuning step.
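For example, with the google-generativeai Python package (a sketch; the package version, model name and API key handling are assumptions), the system instruction is set once on the model object and rides along with every request:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# The system instruction is sent with every call made through this model object.
model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="You are a cat called Tom. Reply in cat sounds plus an English translation.",
)
print(model.generate_content("Hello Tom. Are you hungry?").text)
```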

Two Steps to Train Llama-2-70B (Source: https://www.youtube.com/@AndrejKarpathy)
For web interface users, the system instruction is what you specify when you create a customized GPT.
Function Calling
Function calling is an API parameter that takes JSON-structured data describing all available tools. When the function calling parameter is present and the model decides to take an action, it is guaranteed to respond with JSON data describing which API to call and how to execute it. This feature makes it easy for us to build LLM-based agents because the JSON responses are guaranteed (your code still needs to do basic checks).
LLMs use Constrained Decoding to guarantee a JSON response. Constrained decoding is a technique that guides the token generation process to ensure the output adheres to specific rules or formats (like a JSON Schema). Instead of letting the LLM generate tokens freely based solely on its learned probabilities, constrained decoding intervenes at each generation step to filter or bias the possible next tokens. It is a powerful technique that allows developers to precisely control LLM output, making models more reliable and suitable for real-world applications where structured, compliant output is critical. Function calling, for example, outputs structured data in accordance with the input JSON Schema.
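A minimal sketch of the tools parameter using the OpenAI Python SDK (the get_weather tool is made up for illustration, and other providers have equivalent but differently shaped parameters):

```python
import json
from openai import OpenAI  # assumes the openai Python package (v1+)

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                       # hypothetical tool we promise to run ourselves
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Auckland?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]         # structured, schema-conforming output
print(call.function.name, json.loads(call.function.arguments))
```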
Tree of Thoughts (ToT)
Chain of Thought (CoT) directs the LLM to reason along a single path. Tree of Thoughts (ToT) iterates over and collects many possible reasoning paths. Just like humans, we sometimes need to think through and analyze all the different possibilities, and the ToT technique lets LLMs do this kind of fuller, more complete reasoning for us.
However, it’s not easy to conduct ToT through the web interface; ToT is more naturally implemented by programming against the APIs. Remember that LLMs don’t have memory: whenever the LLM gives us a list of possibilities, we construct a different reasoning path for each one by selecting it out. In other words, we build separate reasoning memories to make the LLM continue reasoning along different paths, as the sketch after the figure below shows.

Tree of Thoughts Illustration (Source: https://www.kaggle.com/whitepaper-prompt-engineering)
Left is CoT, and right is ToT!
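A very rough sketch of that branching process, with call_llm and the scoring prompt as hypothetical stand-ins (real ToT implementations add proper search strategies such as BFS or backtracking):

```python
def tree_of_thoughts(problem: str, call_llm, branches: int = 3, depth: int = 2):
    """Expand several candidate thoughts per step and keep the best-scored paths."""
    paths = [[]]                                            # each path is a separate "memory"
    for _ in range(depth):
        new_paths = []
        for path in paths:
            context = problem + "\n" + "\n".join(path)
            for i in range(branches):
                thought = call_llm(f"{context}\nPropose next reasoning step #{i + 1}:")
                new_paths.append(path + [thought])
        # ask the model to score each path, keep only the most promising few
        scored = [(float(call_llm("Rate this reasoning 0-10, reply with a number only:\n" + "\n".join(p))), p)
                  for p in new_paths]
        paths = [p for _, p in sorted(scored, key=lambda s: s[0], reverse=True)[:branches]]
    return paths[0]
```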
Self-Consistency
When you leverage LLMs for NLP classification tasks, such as sentiment analysis, the result might not always be the same for the same input. This is because multiple different reasoning paths may be available, and the LLM randomly picks one and gives its result. To overcome this issue, the self-consistency technique can be employed.
The idea behind self-consistency is very simple: run the same input multiple times and take the majority answer as the final result. The majority is the most consistent answer and is likely to be the most accurate one.
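A minimal sketch, again with a hypothetical call_llm helper and Python’s Counter doing the majority vote:

```python
from collections import Counter

def self_consistent_answer(prompt: str, call_llm, n: int = 5) -> str:
    """Sample the same prompt several times and return the majority answer."""
    answers = [call_llm(prompt).strip() for _ in range(n)]  # temperature > 0 so runs can differ
    return Counter(answers).most_common(1)[0][0]

# e.g. self_consistent_answer("Sentiment of 'great food, slow service': Positive or Negative?", call_llm)
```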
Summary
Prompt engineering is a fundamental skill for effectively interacting with fine-tuned Large Language Models (LLMs) like ChatGPT and Gemini. Since LLMs “know everything” from their training data but lack context about individual users, prompt engineering helps us clearly communicate our objectives and control their responses. Essentially, it’s “programming” LLMs using natural language such as English.
Techniques range from Persona, where you assign a role to the LLM, to providing Context for specific scenarios. N-Shot prompting (Zero-shot, One-shot, Few-shot) involves giving examples to guide the LLM’s learning and output format. We can also explicitly request Output Formatting like JSON for structured data. More advanced techniques include Alternative Better Way for prompt refinement, Flip Interaction to have the LLM ask clarifying questions, and Chain of Thoughts (CoT) for step-by-step reasoning. ReAct (Reason and Act) is key for building LLM-based agents by providing tools and enabling iterative reasoning. Finally, In-Context Learning is crucial, allowing LLMs to process uploaded information or search results for relevant responses. Understanding these techniques empowers users to unlock the full potential of LLMs in various applications.
For developers interacting with LLMs via APIs, understanding concepts like statelessness (no memory), randomness (controlled by parameters like Temperature, Top P, and Top K), System Instructions (overall behavioral guidance during fine-tuning), and Function Calling (guaranteed structured output for tool use) is crucial. Advanced techniques like Tree of Thoughts (ToT) and Self-Consistency further enhance reasoning and reliability by exploring multiple reasoning paths or taking majority votes from multiple attempts.
Ultimately, effective prompt engineering allows users to harness the immense power of LLMs, transforming them from general-purpose models into highly effective, task-specific tools.