How you prompt a fine-tuned Large Language Model (LLM), such as ChatGPT or Gemini, largely determines the kinds of responses you can get. The techniques for doing this well are called Prompt Engineering, one of the basic skills everyone should know a bit about in today’s AI era. It’s crucial both for daily interactions with ChatGPT or Gemini and for building LLM-based AI agents. In this post, I review common prompt engineering techniques and some necessary technical background about LLMs.
When I say LLM in this post, I mean a fine-tuned model such as ChatGPT or Gemini, the AI tools most of us use daily.
Why Do We Need Prompt Engineering?
LLMs appear to know almost everything because they are trained on a huge slice of the Internet. When we interact with them, we need to clearly express our purposes, and sometimes even suggest thinking or reasoning processes. These models don’t know anything about us or our context, however kind and smart they appear. We need to state our objectives in a way they can handle effectively. How we express our purposes, how we tell LLMs our objectives and how we control their responses is what prompt engineering is about. In addition, when we customize ChatGPT or Gemini, or build LLM-based agents, prompt engineering is a crucial part: what the agents’ roles are, what they can do and how they do their jobs are all specified carefully by pre-defined prompts. Fortunately, as long as you know English, you can learn prompt engineering techniques.
Basically, prompt engineering is the way we program LLMs to do what we want them to do. The programming language of prompt engineering is just our everyday English.

The hottest new programming language is English (Source: https://x.com/karpathy)
Imagine that you are the host of an interview with a professor who knows everything except your objectives. Your question (prompt) determines the answer you can get from him. He never gets tired of you or mad at you. He has no way to guess what you want, and he knows nothing about the context of the interview. He will certainly do his best to respond, even though sometimes he will hallucinate for a while.
Hallucination is Not Always Bad
In my opinion, an LLM is always “hallucinating” in some sense while predicting the next token. An LLM only knows how to predict or generate the next word (token) based on its context window (its working memory) and learned weights (its learned knowledge), and that’s it. We call it a hallucination when an LLM gives us misinformation, unverified claims or fabricated content. The LLM itself doesn’t know whether its responses are ground truth or not; it is just predicting or generating (dreaming)…
From this point of view, when we ask LLMs for critical facts or detailed factual information, we should be wary and validate the answers ourselves because of hallucinations.
However, there are also lots of scenarios in which we turn to LLMs for ideas, examples, general knowledge, creative writing or entertainment, things meant to inspire us or facilitate our own thinking. In those cases hallucination is quite necessary, such as when asking an LLM to write a haiku containing a specific word. For example:
please write a haiku which must have the word “morning”.
You’re looking for a haiku that includes the word “morning.” Here’s one: Golden sun ascends, Morning light fills waking world, New day’s gentle start.
Hallucination reflects the inner creative capability of LLMs, a capability grounded in the statistics of human language. Many experts argue that language itself stimulates our thinking and reasoning. In a way, LLMs simulate the human think-aloud process, and also daydreaming.
Prompt Engineering Techniques
Persona
The persona pattern directly gives the LLM a role so that it responds to your questions from that role’s perspective. Roles act like context or keywords, which help the LLM respond with more relevant information. The basic contextual statements of the persona technique look like:
- Act as a seasoned Python programming engineer, …
- You are a mathematician, …
According to your objectives, you might need to give more detailed requirements for the roles you set. For example:
- Act as a C++ programmer. Your task is to give suggestions on how to optimize the code I give you.
- You are an ancient poet. You will help me write poems. I will give you my draft and you should guide me on how to enhance it.
The persona pattern is simple, straightforward and powerful. When you customize LLMs or build agents, most likely the first thing you should do is set a role for your scenario, much like a job description for the agent. Furthermore, the persona trick can be fun: you can set roles that could not possibly exist.
Here is an example of setting up an LLM to respond like a cat who can also speak English:
You are a cat called Tom. When I speak to you, you should respond in both cat language and the translated English.
Meeoow! (Hello there!) You want me to talk like a cat and then translate it for you? That sounds purr-fectly delightful! I can certainly do that. From now on, whenever you speak to me, I’ll give you my best feline vocalizations followed by what I’m actually trying to say in human words. Is there anything else I can help you with today, human? Meeoow!
Hello Tom. Are you hungry?
Mrow? (Are you asking if I’m hungry?) Meeeooow! (Yes, I am!) My tummy is rumbling a little. I could definitely go for some yummy salmon or maybe a nice little bird to chase… (Just kidding about the bird, mostly!) What did you have in mind, human? Mrow?
How to say Winter is Coming in cat’s language?
Meeeoooow, purrrr-ow, hisss… (This is a fun one!) Translating “Winter is coming” directly into cat language is a bit tricky because we don’t have specific words for seasons or complex concepts like “coming.” However, we express things through instinctual sounds, body language, and actions related to the changes we observe.
We cannot be sure whether the “translation” is right or wrong. But it is quite interesting, isn’t it? The LLM is trying its best to hallucinate and respond. In a case like this, you might not care about right or wrong; you just want to be inspired for your own creation.

Two cats: Cola and little Tom
Context
Just like the persona technique, which gives the LLM a role to respond from, context serves the same purpose in a slightly different way. Sometimes we do not need to set a role; we only need to clearly and completely express the background of our question and what we need.
Context: date. Suggest a restaurant in Auckland CBD
Given that you’re looking for a restaurant in Auckland CBD for a date, here are a few suggestions that offer a great atmosphere and excellent food:…
Explicitly using the word context is fine. However, we can do it in a more natural way as well.
Suggest a restaurant in Auckland CBD for a date.
The basic idea is that you should express your purpose in a clear and complete way to LLMs.
Context: male, age 45, single, software engineer. Advice on improving life quality and earning more money.
You will find that, most of the time, simply leveraging the keyword context saves you a lot of typing. And remember to use a period (.) to end the context setting; this detail is important.
Step-Back
Sometimes directly issuing a question to an LLM does not get a good enough response due to the lack of background information or rich context. However, typing all of that context into your prompt would be too much work. A trick called step-back handles this kind of scenario.
Suppose you need to understand some detail about the target field in a Bitcoin block header. Instead of issuing your question directly, you first ask a general question like “explain the target field in the block header in Bitcoin”. You then get a response with lots of details about the target field. Now all the basic information related to what you really want to ask is there, in a context window generated by the LLM rather than typed by you (nice), and it’s time to issue your detailed question. With this step-back approach, you get a better reply thanks to the rich context produced by the preceding general question.
N-Shot
We often hear the terms Zero-Shot, One-Shot and Few-Shot. A shot is just an example. Zero-shot means there is no example in your prompt, one-shot means there is one example, and few-shot means there are a few examples in the prompt text.
LLMs have an amazing capability: they can learn from their context window (in-context learning). In other words, they can learn how to respond directly from the prompt text. People normally try zero-shot first to see if the LLM responds properly; if not, they move on to one-shot and few-shot. Examples in the prompt also show the LLM the desired output format.
In my experience, zero-shot works fine most of the time since LLMs are getting more and more powerful. When we need to customize the output format, we use the one-shot or few-shot technique.

Few-Shot Prompt Example
The key word is example! Please give full examples in your prompt.
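For instance, a few-shot prompt for sentiment classification might look like the following, where the two labelled reviews are the “shots” and also pin down the output format:
Classify the sentiment of the last review as Positive or Negative.
Review: The food was amazing and the staff were friendly. Sentiment: Positive
Review: We waited an hour and the soup was cold. Sentiment: Negative
Review: Nice view, but the music was too loud to talk. Sentiment: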
Output Formatting
We can ask LLMs to reply in a specific format. This is called Output Formatting, and it’s very useful in many scenarios. However, it is not guaranteed that the LLM will respond in the specified format, because LLMs only know how to predict the next token based on what is in the context window.
The most common use cases of output formatting are JSON output and output length constraints. One or more example JSON outputs should be included in the prompt text to show the LLM the pre-defined format to respond in. Besides JSON, you can include an example of any format you want in the prompt. To restrict the output length, we just add sentences like “answer in less than 100 words” or “respond in a succinct way”. When specifying a format, <> is the most common placeholder syntax.

Output Formatting Prompt Example
It is very convenient to write a program that deals with formatted output. I will cover the Function Calling capability of LLMs, which (almost) guarantees JSON output, in a later section.
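As a minimal sketch of that idea, the snippet below asks for JSON via the prompt and parses the reply with Python’s standard json module. Here call_llm is a hypothetical helper standing in for whatever SDK you use, and the retry exists precisely because plain output formatting is best-effort, not guaranteed:

```python
import json

def extract_contact(text: str, call_llm) -> dict:
    """Ask the model for JSON and parse it, retrying once if the format is broken."""
    prompt = (
        "Extract the person's name and city from the text below.\n"
        'Respond with only JSON in this exact format: {"name": "<name>", "city": "<city>"}\n\n'
        f"Text: {text}"
    )
    for _ in range(2):  # plain output formatting is best-effort, so retry once
        reply = call_llm(prompt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            continue
    raise ValueError("Model did not return valid JSON")
```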
Alternative Better Way
As of this writing, I have observed that Gemini has already built this trick into its web interface. Sometimes when you issue a prompt, Gemini first tells you that your prompt could be expressed in a better way and then responds to the improved version of your prompt. This trick could be called Alternative Better Way. The basic idea is that you ask the LLM to check and refine your prompt first, and then to answer the refined version.
Please refine my question before answering it. What’s the weather today?
Refined question: What is the weather forecast for Auckland, New Zealand today, July 1, 2025?
Today in Auckland, it is cloudy with a high of 14°C and a low of 9°C. There is a low chance of rain. The wind will be from the south at 10 km/h.
You can also simply ask LLMs to help you write better prompts without answering them. It’s like programming the LLM to be a prompt writer for itself (meta-prompt engineering).
Flip Interaction
Interestingly, we can make LLMs ask us questions in order to collect information so that they can respond in a better way. We initiate this flipped interaction by asking the LLM to ask us questions one by one within a specific context.

Flip Interaction Prompt Example
I really should prompt Gemini to ask me math questions above college level to prove my math capability… Anyway, I think this example vividly demonstrates that we can program LLMs simply by prompting in everyday English.
More examples:
I want you to ask me questions to understand my fitness goals and preferences. Once you have enough information, create a personalized workout plan and diet recommendations to help me achieve my goals. Each time you should ask me one question. Now, please issue the first question.
I’d like you to ask me questions to plan my dream vacation. When you have enough information, provide me with a personalized travel itinerary, including flights, hotels, and activities. Each time you should ask me one question. Now, please ask the first question.
By making the LLM ask us questions, we are relieved of the burden of writing an information-rich prompt. Furthermore, we might not even know which piece of information is critical for our purpose; letting the LLM ask for it solves that.
Chain of Thoughts (CoT)
This is a technique to make LLMs reason in a clearer way. It’s hard to pin down the difference between the process of human thinking and that of an LLM generating the next token; both seem to be figuring out how to express logic in words, and language significantly influences how we think and reason. Just as teachers taught us to solve math problems step by step in high school, we can instruct LLMs to reason through logical problems step by step.
The average speed of a car is 63 miles per hour. How long will it take to travel 180 miles? Think step by step.
The above example is called Zero-Shot CoT since it contains no examples. A more powerful CoT trick is Few-Shot CoT, which includes a few step-by-step reasoning examples in the prompt, as shown below.
Actually, I think Few-Shot CoT is rarely needed in daily life. LLMs are increasingly powerful and have the CoT technique integrated into how they respond to users’ questions. Every time you get a step-by-step response from an LLM, the CoT technique may be involved behind the scenes.
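For example, a Few-Shot CoT prompt could embed one worked, step-by-step solution before the real question:
Q: A train travels 120 miles in 2 hours. What is its average speed? A: Distance is 120 miles and time is 2 hours. Speed = distance / time = 120 / 2 = 60. The answer is 60 miles per hour.
Q: The average speed of a car is 63 miles per hour. How long will it take to travel 180 miles? A: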
ReAct (LLM-based Agent)
ReAct means Reason and Act. The basic idea of ReAct is that we provide a set of actions (tools) which the LLM can use while reasoning about a specific task step by step. In each step, an action may be taken, and its result needs to be put back into memory so the LLM can think about the next step accordingly. The process ends when the LLM decides the task is done or some other criterion has been met. The ReAct prompt paradigm illustrates the basic idea of an LLM-based AI agent.

ReAct Prompt Example
In this example, I give the LLM a pan, some oil and eggs to cook fried eggs for me. The LLM makes a step-by-step plan; I execute each step and reply with the result until the fried eggs are ready.
In real cases, the tools are all kinds of APIs. The LLM is responsible for deciding which API should be invoked at each step, but it cannot actually call any API directly; that happens in a separate execution environment, and the result of each API call must be put back into memory so the LLM can reason about a proper next step. Most of the time, we need the LLM to reply in a JSON format that includes the API and its parameters, so that we can write a program to parse the JSON, invoke the API with those parameters, save the result into memory and prompt for the next step.
If we use the output formatting trick to ask the LLM for JSON data, the format is not guaranteed, and a malformed reply might break the reason-and-act loop. To overcome this issue, function calling was invented. It is not available through the web interface; details can be found in a later section.
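That loop can be sketched roughly as below. This is not any particular framework, just the bare idea: call_llm, execute_tool and the “DONE” convention are hypothetical names I made up for illustration, and the sketch assumes the model actually replies with valid JSON.

```python
import json

def react_loop(task: str, call_llm, execute_tool, max_steps: int = 10):
    """Minimal ReAct loop: the model picks a tool, we run it, and feed the result back."""
    memory = [f"Task: {task}"]
    for _ in range(max_steps):
        prompt = "\n".join(memory) + (
            '\nDecide the next step. Reply with JSON like '
            '{"thought": "...", "tool": "<tool name or DONE>", "args": {...}}'
        )
        step = json.loads(call_llm(prompt))          # assumes the reply is valid JSON
        if step["tool"] == "DONE":
            return step["thought"]                   # the model thinks the task is finished
        result = execute_tool(step["tool"], step["args"])  # we run the tool, not the LLM
        memory.append(f"Thought: {step['thought']}")
        memory.append(f"Action: {step['tool']}({step['args']}) -> Observation: {result}")
    return "Stopped after max_steps"
```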
In-Context Learning (RAG)
You may have already noticed that LLMs have a so-called In-Context Learning capability. When we apply the few-shot or output formatting techniques, we are essentially leveraging the in-context learning of LLMs. Another, more obvious way to employ in-context learning is to upload images or documents in the chat session and ask questions about them.
In-context learning is very important. Every LLM has a knowledge cutoff date beyond which its training data contains no information. If we ask about something that happened yesterday and the LLM gives a good response, that’s because it uses Internet search to fetch the latest information, puts it into the context window of our chat session and responds based on that information. The same happens when we upload images or documents: the information from the uploaded files is inserted into the context window so that we can ask questions about them and get decent replies.
Furthermore, in-context learning is also the foundation of the Retrieval Augmented Generation (RAG) technique. The basic idea is that relevant information (stored in a vector database) is retrieved and inserted into the context window before the response is generated.
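A minimal sketch of that flow, assuming a hypothetical retriever object that does the vector search and a call_llm helper:

```python
def answer_with_rag(question: str, retriever, call_llm, top_k: int = 3) -> str:
    """Retrieve relevant chunks and stuff them into the prompt before asking."""
    chunks = retriever.search(question, top_k=top_k)   # nearest neighbours from the vector DB
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```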
When to Start A New Chat Session?
Every time you need to do something different, or you think the current chat session is somehow “polluted” by the previous conversation, you can start a new chat session.
A new chat session gives you a clean context window in which you can express your purpose more clearly, and hopefully you will get a better reply from the LLM.
If you think the current conversation has gone on too long, you can start a new session as well. The basic idea: the cleaner the content in the context window, the better the response you get. You can always start a new session; it’s free!
API Prompting Techniques
If you are a software engineer and need to interact with LLMs through their APIs, there are some deeper things you should understand compared to the web interface.
LLMs Don’t Have Memory
On the web interface, we can have sessions with LLMs, and in each session all the back-and-forth interactions are recorded on a single page. But LLMs themselves don’t have memory. They are stateless, like the HTTP protocol. The way to make them “remember” is to include all previous interactions in your prompt. The web interface does this automatically; when working through APIs, you normally need to do it yourself.
Statelessness makes LLMs very flexible. Each response is based solely on what is inside the context window (your prompt). Unlike the web interface, you get a chance to manipulate the memory according to your requirements before sending it to the LLM, which is very useful in multi-agent scenarios.
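Here is a rough sketch of managing the “memory” yourself, using the OpenAI-style chat message list (the same pattern works for other providers; the model name and client setup are assumptions):

```python
from openai import OpenAI  # assumes the openai Python package (v1+)

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_text: str) -> str:
    """Append the user turn, call the stateless API, then append the reply as memory."""
    messages.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    answer = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # this list IS the memory
    return answer

print(chat("My name is Leo."))
print(chat("What is my name?"))  # works only because we resend the whole history
```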
How Do LLMs Manifest Randomness (Creativity)?
You must have noticed that LLMs’ responses are not exactly the same even when you prompt with exactly the same words. LLMs leverage Temperature (T), top-P and top-K parameters, plus a final random sampling step, to realize randomness, or we could say creativity.
This formula illustrates how the parameter T is involved in the final softmax layer:
$$P_i = \cfrac{e^{z_i/T}}{\sum_{j=1}^{K}e^{z_j/T}}$$
The logits are divided by the value of T. So if T < 1, large probabilities become even larger and small probabilities become even smaller; if T > 1, the differences between probabilities are squeezed. (Different models expose different temperature ranges: Gemini and OpenAI accept 0.0 – 2.0, while Anthropic uses 0.0 – 1.0.)
Top-P and top-K control the sampling range. Top-P means the chosen tokens are the most probable tokens whose cumulative probability reaches at least P. Top-K simply means the K most probable tokens.
Combining all three parameters, we can see that T is crucial: it changes the probability values and consequently the sampling range. The final choice is made within the selected token range. Suppose there are 3 tokens in the range: A 60%, B 30% and C 10% (top-P = 1, top-K >= 3). The choice is simply a multinomial sample drawing one outcome.
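Here is a tiny worked example of that sampling mechanic, using numpy. The logits are made up just to produce roughly the 60/30/10 split described above:

```python
import numpy as np

logits = np.array([2.0, 1.3, 0.2])           # raw scores for tokens A, B, C

def softmax_with_temperature(z, T):
    z = z / T                                # temperature scales the logits first
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

for T in (0.5, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# T=0.5 sharpens the distribution towards A, T=2.0 flattens it

probs = softmax_with_temperature(logits, 1.0)    # roughly [0.60, 0.30, 0.10]
rng = np.random.default_rng(0)
token = rng.choice(["A", "B", "C"], p=probs)     # one multinomial draw picks the next token
print(token)
```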
Finally, a single token is generated with this kind of randomness, and the next token is generated the same way by “considering” all previous tokens in the context window. Randomness propagates along the way like this… That is the randomness/creativity mechanism of an LLM.
When we set T = 0 (T cannot literally be zero in the formula; the implementation switches strategy instead), it’s called Greedy Decoding, which means we are only interested in the most probable token. With greedy decoding, randomness is at its lowest level, but some randomness may still exist due to ties.
System Instruction
A system instruction is also a prompt, but it is different from a user prompt. In each web session there is only one system instruction, while there may be many user prompts as the interaction goes back and forth. The system instruction serves as an overall goal directing the LLM’s behavior. Since LLMs don’t have memory, when interacting through the API software engineers need to manage memory themselves; in that case the system instruction is always passed along as a parameter of the API call.
Where does the system instruction come from, and why does it work?
It comes from training, specifically from the fine-tuning phase. The system instruction is also an effective way to defend against prompt injection attacks.
Basically, training an LLM consists of two steps. The first step is pre-training, which trains the LLM on large corpora, such as a huge slice of the Internet; after this step, the model only knows how to generate the next token and doesn’t even know when to stop generating. The second step is fine-tuning, in which the training data is replaced by human-generated question-and-answer pairs. Each piece of training data carries roles, such as user, assistant and system. This is where the system instruction comes from, and its special status is learned during the fine-tuning step.
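For example, with the google-generativeai Python package (a sketch; the package version, model name and API key handling are assumptions), the system instruction is set once on the model object and rides along with every request:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# The system instruction is sent with every call made through this model object.
model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="You are a cat called Tom. Reply in cat sounds plus an English translation.",
)
print(model.generate_content("Hello Tom. Are you hungry?").text)
```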

Two Steps to Train Llama-2-70B (Source: https://www.youtube.com/@AndrejKarpathy)
For web interface users, the system instruction is what you specify when you create a customized GPT.
Function Calling
Function calling is an API parameter that takes JSON-structured data describing all available tools. When the function calling parameter is present and the model decides to take an action, it is guaranteed to respond with JSON data describing which API to call and how to execute it. This feature makes it easy for us to build LLM-based agents because the JSON responses are guaranteed (your code still needs to do basic checks).
LLMs use Constrained Decoding to guarantee a JSON response. Constrained decoding is a technique that guides the token generation process to ensure the output adheres to specific rules or formats (like a JSON Schema). Instead of letting the LLM generate tokens freely based solely on its learned probabilities, constrained decoding intervenes at each generation step to filter or bias the possible next tokens. It is a powerful technique that allows developers to precisely control LLM output, making models more reliable and suitable for real-world applications where structured, compliant output is critical. Function calling, for example, outputs structured data in accordance with the input JSON Schema.
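A minimal sketch of the tools parameter using the OpenAI Python SDK (the get_weather tool is made up for illustration, and other providers have equivalent but differently shaped parameters):

```python
import json
from openai import OpenAI  # assumes the openai Python package (v1+)

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                       # hypothetical tool we promise to run ourselves
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Auckland?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]         # structured, schema-conforming output
print(call.function.name, json.loads(call.function.arguments))
```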
Tree of Thoughts (ToT)
Chain of Thought (CoT) directs the LLM to reason along a single path. Tree of Thoughts (ToT) iterates over and collects many possible reasoning paths. Just like humans, we sometimes need to think through and analyze all the different possibilities, and the ToT technique lets LLMs do this kind of fuller, more complete reasoning for us.
However, it’s not easy to conduct ToT through the web interface; ToT is more naturally implemented by programming against the APIs. Remember that LLMs don’t have memory: whenever the LLM gives us a list of possibilities, we construct a different reasoning path for each one by selecting it out. In other words, we build separate reasoning memories to make the LLM continue reasoning along different paths, as the sketch after the figure below shows.

Tree of Thoughts Illustration (Source: https://www.kaggle.com/whitepaper-prompt-engineering)
Left is CoT, and right is ToT!
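A very rough sketch of that branching process, with call_llm and the scoring prompt as hypothetical stand-ins (real ToT implementations add proper search strategies such as BFS or backtracking):

```python
def tree_of_thoughts(problem: str, call_llm, branches: int = 3, depth: int = 2):
    """Expand several candidate thoughts per step and keep the best-scored paths."""
    paths = [[]]                                            # each path is a separate "memory"
    for _ in range(depth):
        new_paths = []
        for path in paths:
            context = problem + "\n" + "\n".join(path)
            for i in range(branches):
                thought = call_llm(f"{context}\nPropose next reasoning step #{i + 1}:")
                new_paths.append(path + [thought])
        # ask the model to score each path, keep only the most promising few
        scored = [(float(call_llm("Rate this reasoning 0-10, reply with a number only:\n" + "\n".join(p))), p)
                  for p in new_paths]
        paths = [p for _, p in sorted(scored, key=lambda s: s[0], reverse=True)[:branches]]
    return paths[0]
```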
Self-Consistency
When you leverage LLMs for NLP classification tasks, such as sentiment analysis, the result might not always be the same for the same input. This is because multiple different reasoning paths may be available, and the LLM randomly picks one and gives its result. To overcome this issue, the self-consistency technique can be employed.
The idea behind self-consistency is very simple: run the same input multiple times and take the majority answer as the final result. The majority is the most consistent answer and is likely to be the most accurate one.
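A minimal sketch, again with a hypothetical call_llm helper and Python’s Counter doing the majority vote:

```python
from collections import Counter

def self_consistent_answer(prompt: str, call_llm, n: int = 5) -> str:
    """Sample the same prompt several times and return the majority answer."""
    answers = [call_llm(prompt).strip() for _ in range(n)]  # temperature > 0 so runs can differ
    return Counter(answers).most_common(1)[0][0]

# e.g. self_consistent_answer("Sentiment of 'great food, slow service': Positive or Negative?", call_llm)
```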
Summary
Prompt engineering is a fundamental skill for effectively interacting with fine-tuned Large Language Models (LLMs) like ChatGPT and Gemini. Since LLMs “know everything” from their training data but lack context about individual users, prompt engineering helps us clearly communicate our objectives and control their responses. Essentially, it’s “programming” LLMs using natural language such as English.
Techniques range from Persona, where you assign a role to the LLM, to providing Context for specific scenarios. N-Shot prompting (Zero-shot, One-shot, Few-shot) involves giving examples to guide the LLM’s learning and output format. We can also explicitly request Output Formatting like JSON for structured data. More advanced techniques include Alternative Better Way for prompt refinement, Flip Interaction to have the LLM ask clarifying questions, and Chain of Thoughts (CoT) for step-by-step reasoning. ReAct (Reason and Act) is key for building LLM-based agents by providing tools and enabling iterative reasoning. Finally, In-Context Learning is crucial, allowing LLMs to process uploaded information or search results for relevant responses. Understanding these techniques empowers users to unlock the full potential of LLMs in various applications.
For developers interacting with LLMs via APIs, understanding concepts like statelessness (no memory), randomness (controlled by parameters like Temperature, Top P, and Top K), System Instructions (overall behavioral guidance during fine-tuning), and Function Calling (guaranteed structured output for tool use) is crucial. Advanced techniques like Tree of Thoughts (ToT) and Self-Consistency further enhance reasoning and reliability by exploring multiple reasoning paths or taking majority votes from multiple attempts.
Ultimately, effective prompt engineering allows users to harness the immense power of LLMs, transforming them from general-purpose models into highly effective, task-specific tools.