
Building GenAI Apps Using AWS Bedrock: Application Components (1/5)

  • Writer: Archishman Bandyopadhyay
  • Jul 8, 2024
  • 9 min read

1. Typical components of a generative AI application


A foundation model (FM) sits at the center, surrounded by the other components: a frontend web application or mobile app, the FM interface, a machine learning (ML) environment, model training, enterprise datasets, a vector database, text and image embeddings, and a long-term memory store. Governance and security are integrated into all components.

Embedding is the process by which text, images, and audio are given numerical representation in a vector space.

2. Foundation model interface

At the heart of a generative AI application is the foundation model that powers it. Foundation models are models trained on broad data at scale that can be adapted to various downstream tasks.


These models ingest tremendous amounts of data covering diverse topics, subject matter, and modalities, and they acquire a nuanced understanding of language, audio, vision, and other fields. As a result of their training, you can deploy foundation models to perform a variety of tasks. This is where they differ from traditional machine learning models, which you can only use for the tasks they are trained on.


Their wide applicability makes foundation models very powerful. Foundation models provide the base on which you can build various generative AI applications. Large language models (LLMs) are a subset of foundation models that are trained on a large corpus of text data.


Interface and prompts

To use a foundation model, you need an interface that provides access to it. The interface is generally a managed API, or it can be self-hosted using an open source or proprietary model. Self-hosting often involves procuring access to a machine learning environment backed by purpose-built accelerated computing instances to host the model. Using the API call, you can pass prompts to the foundation model and receive inference responses back.
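As a concrete illustration, here is a minimal sketch of calling a hosted foundation model through the Amazon Bedrock runtime API with boto3. The model ID and request body follow the Amazon Titan Text format; treat them as an example rather than a prescription, because each model family on Bedrock defines its own body schema.

```python
import json
import boto3

# Client for the Bedrock runtime (inference) API
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Titan Text models take a JSON body with an "inputText" prompt field
body = json.dumps({
    "inputText": "Summarize the benefits of foundation models in two sentences."
})

response = bedrock.invoke_model(
    modelId="amazon.titan-text-express-v1",  # example model ID; swap in your model
    body=body,
    contentType="application/json",
    accept="application/json",
)

# The response body is a JSON stream; Titan returns completions under "results"
result = json.loads(response["body"].read())
print(result["results"][0]["outputText"])
```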


Inference parameters 

Along with the prompt itself, the inference parameters you supply can strongly influence the output of a foundation model. You pass the parameters along with the prompt to the foundation model interface APIs. LLMs operate on tokens, which can be words, letters, or partial word units.

One thousand tokens is equivalent to approximately 750 words

The LLM takes a sequence of input tokens and predicts the next token. The inference parameters help provide guidance to the LLM to produce the output or a sequence of tokens that are relevant for your use case. Next, you will review the most common inference parameters.

Some common inference parameters (a short invocation sketch follows the list):


  • Top P, or nucleus sampling: This technique restricts selection to the smallest set of tokens whose cumulative probability exceeds the Top P value. A higher value, such as 0.9, means the output is sampled from a larger pool of tokens, which increases diversity but can make the output incoherent. Lower values shrink the pool of candidate tokens, making the next token more predictable.

  • Top K: Whereas Top P works on cumulative probabilities, Top K limits sampling to the k most probable next tokens. Typical values range from 10 to 100. A value of 1 is called a greedy strategy because the most probable token is always chosen.


  • Temperature: Whereas Top P and Top K control which tokens are eligible for selection, the temperature parameter reshapes the probability distribution itself. A higher temperature flattens the distribution, making it closer to uniform across tokens, so the generated output is more creative and random. A lower temperature sharpens the distribution around the most probable tokens, making the output more deterministic.
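As a rough sketch of how these parameters are passed in practice, the request below supplies temperature, Top P, and Top K alongside the prompt. The field names follow the Anthropic message schema on Amazon Bedrock; other model families expose the same ideas under slightly different names, so check the schema for your model.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 200,
    "temperature": 0.5,  # lower = more deterministic, higher = more random
    "top_p": 0.9,        # nucleus sampling: sample from the smallest token set
                         # whose cumulative probability exceeds 0.9
    "top_k": 50,         # consider only the 50 most probable next tokens
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": "Write a one-line tagline for a hiking boot."}]}
    ],
})

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    body=body,
    contentType="application/json",
)
print(json.loads(response["body"].read())["content"][0]["text"])
```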


3. Working with Datasets and Embeddings


Enterprise datasets

Although foundation models can generate human-like text, images, audio, and more from your prompts, this might not be sufficient for enterprise use cases. To power customized enterprise applications, the foundation models need relevant data from enterprise datasets.

Enterprises accumulate huge volumes of internal data, such as documents, presentations, user manuals, reports, and transaction summaries, which the foundation model has never encountered. Ingesting and using enterprise data sources provide the foundation model with domain-specific knowledge to generate tailored, highly relevant outputs that align with the needs of the enterprise.


You can supply enterprise data to the foundation model as context along with the prompt, which helps the model return more accurate outputs. How do you determine which context to pass? You need a way to search the enterprise datasets using the prompt text that is passed.


This is where vector embeddings help.

Vector embeddings

Embedding is the process by which text, images, and audio are given numerical representation in a vector space. Embedding is usually performed by a machine learning model. The following diagram provides more details about embedding. 


Enterprise datasets, such as documents, images and audio, are passed to ML models as tokens and are vectorized. These vectors in an n-dimensional space, along with the metadata about them, are stored in purpose-built vector databases for faster retrieval.


For this example, consider only the text modality. The goal of generating embeddings is to capture semantic similarity so that text with similar meanings is mapped to nearby points in the vector space. Embeddings are typically high-dimensional vectors. They help when searching for semantically similar text to find relevant information based on the user's prompt.


Amazon Bedrock provides the Amazon Titan Embeddings G1 - Text model that can convert text into embeddings. These embeddings are stored in a vector database.
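Here is a minimal sketch of generating an embedding with the Amazon Titan Embeddings G1 - Text model through the Bedrock runtime API. The request and response field names are the Titan ones; the same pattern applies to other embedding models with their own schemas.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> list[float]:
    """Convert a piece of text into its vector embedding."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    # Titan Embeddings returns the vector under the "embedding" key
    return json.loads(response["body"].read())["embedding"]

vector = embed("Our premium plan includes 24/7 support.")
print(len(vector))  # dimensionality of the embedding (1536 for this model)
```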


Vector databases 

The core function of vector databases is to compactly store billions of high-dimensional vectors representing words and entities. Vector databases provide ultra-fast similarity search across these billions of vectors in real time. 

The most common algorithms used to perform the similarity search are k-nearest neighbors (k-NN) or cosine similarity.
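To make the similarity search concrete, here is a small self-contained sketch of cosine similarity and a brute-force k-NN lookup over in-memory vectors. A real vector database performs the same computation at much larger scale using approximate nearest-neighbor indexes.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means identical direction, 0.0 means unrelated, -1.0 means opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def k_nearest(query: np.ndarray, vectors: list[np.ndarray], k: int = 3) -> list[int]:
    # Brute-force k-NN: score every stored vector and return the indexes of the top k
    scores = [cosine_similarity(query, v) for v in vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```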


Amazon Web Services (AWS) offers the following as viable vector database options:

  • Amazon OpenSearch Service (provisioned)

  • Amazon OpenSearch Serverless

  • pgvector extension in Amazon Relational Database Service (Amazon RDS) for PostgreSQL

  • pgvector extension in Amazon Aurora PostgreSQL - Compatible Edition


AWS also offers Pinecone in the AWS Marketplace, and there are open source, in-memory options, like Facebook AI Similarity Search (FAISS), Chroma, and many more.
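As an illustration of an in-memory option, the sketch below builds a FAISS index over random stand-in vectors and runs a nearest-neighbor query; in a real application the vectors would come from your embeddings model.

```python
import numpy as np
import faiss  # Facebook AI Similarity Search

dim = 1536                                   # e.g. the Titan text embedding dimensionality
index = faiss.IndexFlatL2(dim)               # exact (brute-force) L2 distance index

doc_vectors = np.random.rand(1000, dim).astype("float32")  # stand-in for document embeddings
index.add(doc_vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)      # the 5 nearest stored vectors
print(ids[0])                                # row indexes of the matching documents
```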


Vectorized enterprise data

After enterprise data is vectorized, you can search a vector database using the given prompt. You can then supply the relevant chunks of information as context to improve the output of the generative AI model. This can reduce hallucinations, a phenomenon in which an LLM confidently generates plausible-sounding but false information. Vector databases and context are used in Retrieval Augmented Generation (RAG).


4. Additional Application Components


Prompt history store

A prompt history store is another essential component in a generative AI application, particularly applications used for conversational AI, like chatbots. A prompt history store helps with contextually aware conversations that are both relevant and coherent. Many foundation models have a limited context window, which means you can only pass so much data as input to them. Storing state information across a multi-turn conversation becomes difficult, which is why a prompt history store is needed. It can persist the state and make long-term memory of the conversation possible.



By storing the history of prompts and responses, you can look up prompts from a previous conversation and avoid repetitive requests to the foundation model. It also helps you respond to requests from your audit and compliance teams about adherence to company policy and regulations, and you can debug prompt requests and responses to diagnose errors and warnings from your applications.
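One common way to implement a prompt history store on AWS is a DynamoDB table keyed by conversation session. The sketch below assumes a hypothetical table named prompt-history with session_id as the partition key and timestamp as the sort key.

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
history_table = dynamodb.Table("prompt-history")  # hypothetical table name

def save_turn(session_id: str, prompt: str, completion: str) -> None:
    # Persist one conversational turn for later context building and auditing
    history_table.put_item(Item={
        "session_id": session_id,
        "timestamp": int(time.time() * 1000),
        "prompt": prompt,
        "completion": completion,
    })

def load_history(session_id: str, limit: int = 10) -> list[dict]:
    # Fetch the most recent turns for a session, returned oldest first
    response = history_table.query(
        KeyConditionExpression=Key("session_id").eq(session_id),
        ScanIndexForward=False,  # newest first
        Limit=limit,
    )
    return list(reversed(response["Items"]))
```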


Frontend web applications and mobile apps

Often, you need to build a frontend web application or mobile app that acts as an interface for your users to access generative AI capabilities from the foundation model. The application is responsible for constructing prompts and calling the foundation model API. The responses from the foundation model are sanitized and filtered by the application before the users see them on their screens. The application should also handle failures and other unintended consequences gracefully so the user experience is not affected.


5. Retrieval Augmented Generation (RAG)


RAG is a framework for building generative AI applications that can make use of enterprise data sources and vector databases to overcome knowledge limitations. RAG works by using a retriever module to find relevant information from an external data store in response to a user's prompt. This retrieved data is used as context, combined with the original prompt, to create an expanded prompt that is passed to the language model. The language model then generates a completion that incorporates the enterprise knowledge. 

With RAG, language models can go beyond their original training data to use up-to-date, real-world information. RAG addresses the challenge of frequent data changes because it retrieves updated and relevant information instead of relying on potentially outdated sets of data.



Step 1: Encode the input text using a language model like GPT-J or Amazon Titan Embeddings.

Step 2: Retrieve relevant examples that match the input from a knowledge base. These examples are encoded in the same way.

Step 3: Provide the enhanced prompt with the question and context to the foundation model to generate a response.

Step 4: The generated response is conditioned on both the input and the retrieved examples, incorporating information from multiple relevant examples into the response.


There are two distinct stages when using the RAG pattern. The lower portion of the diagram explains converting the existing knowledge documents into vector embeddings and storing them in a vector database. This phase is typically performed by a batch job. After it is complete, you can augment the user’s query with relevant information or documents using semantic search. You can then pass the user’s query and retrieved information into an LLM for completion.
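Putting the two stages together, the following is a minimal, self-contained RAG sketch. It assumes the Titan embeddings and Titan text models on Bedrock and uses a tiny in-memory list as a stand-in for a real vector database and a real document corpus.

```python
import json
import numpy as np
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
        contentType="application/json",
    )
    return np.array(json.loads(resp["body"].read())["embedding"])

# Stage 1 (batch): embed the knowledge documents and keep the vectors
documents = [
    "Refund policy: customers may return items within 30 days of delivery.",
    "Shipping: standard delivery takes 3 to 5 business days.",
]
doc_vectors = [embed(d) for d in documents]

# Stage 2 (query time): retrieve the most relevant chunk and augment the prompt
question = "How long do customers have to return an item?"
q = embed(question)
scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vectors]
context = documents[int(np.argmax(scores))]

augmented_prompt = f"Use the following context to answer.\nContext: {context}\nQuestion: {question}"
resp = bedrock.invoke_model(
    modelId="amazon.titan-text-express-v1",
    body=json.dumps({"inputText": augmented_prompt}),
    contentType="application/json",
)
print(json.loads(resp["body"].read())["results"][0]["outputText"])
```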


6. Model Fine-Tuning


Limitations of RAG and how fine-tuning can address them

RAG is useful for enterprise use cases, but relying solely on RAG has some limitations. For example, the retrieval is limited to the enterprise datasets that are embedded into the vector stores at the time of the retrieval. The model remains static. The retrieval can add latency, and, for some use cases, that latency can be a problem. Also, the retrieval is based on pattern matching instead of a complex understanding of the context.


Model fine-tuning can change the underlying foundation model as little or as much as you want. The model can learn the enterprise nomenclature, proprietary datasets, terminologies, and so on. Think of this as a permanent change to the underlying model. By comparison, RAG makes the model intelligent only temporarily by supplying context from relevant document chunks.


There are two broad categories of fine-tuning: prompt-based learning and domain adaptation.


  • Prompt-based learning: Fine-tuning the underlying foundation model for a specific task is accomplished through prompt-based learning. This involves pointing the model toward a labeled dataset of examples that you want the model to learn from. The labeled examples are formatted as prompt and response pairs and phrased as instructions (a sketch of preparing such examples follows this list). The prompt-based learning fine-tuning process modifies the weights of the model. It is usually lightweight and involves only a few epochs to tune. Because the fine-tuning is specific to one particular task, it can't be generalized across multiple tasks.

  • Domain adaptation: With domain adaptation fine-tuning, you can take pretrained foundation models and adapt them to multiple tasks using limited domain-specific data. You can point the foundation model to as little or as much unlabeled data as you want. This updates the model's weights, and depending on the amount of data used for fine-tuning, the model starts speaking the language of your enterprise, using industry jargon, technical terms, and so on. To perform domain adaptation, you need a machine learning environment that can handle the complete fine-tuning process, with access to the appropriate compute instances.
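As referenced above, here is a small sketch of preparing labeled examples for prompt-based fine-tuning. Bedrock model customization generally expects training data as JSON Lines of prompt and completion pairs; treat the exact field names as an assumption and check the requirements of the model you are customizing.

```python
import json

# Labeled examples phrased as instruction-style prompt/response pairs
examples = [
    {"prompt": "Classify the support ticket: 'My invoice total looks wrong.'",
     "completion": "billing"},
    {"prompt": "Classify the support ticket: 'The app crashes when I upload a file.'",
     "completion": "technical-support"},
]

# Write one JSON object per line (JSONL), the usual format for customization jobs
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```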


Both RAG and fine-tuning are suitable for customizing a foundation model to serve enterprise use cases. The choice ultimately depends on your requirements as you weigh factors such as complexity, cost, and so forth.


7. Securing Generative AI Applications


Consider the following points when building generative AI applications:

  • Manage and audit who can access each part of the generative AI application, such as the foundation models, API methods, and so on.

  • Monitor, log, and report on access to the underlying foundation model either directly or through customized approaches.

  • Log requests to and responses from the foundation model to stay compliant with regulations and to ensure explainability of your actions.

  • Periodically audit foundation models with test data and simulate prompt injection attacks to ensure that there are no unintended consequences.

  • Document the complete process of the various facets of the application and keep them up to date.


8. Generative AI Application Architecture


Phase 1

The focus of this phase is to convert the enterprise data used to augment input prompts into a format that supports a relevancy search. This is done using an embeddings machine learning model. A batch job calls the model to convert existing knowledge documents into their numerical representations and then stores the data in a vector database, following the three steps below. A brief sketch of such a batch job follows the steps.


Step 1: The batch job gets the documents in the data lake.

Step 2: The documents are tokenized, and the embeddings are generated by calling the embedding model. It is important to run this job as often as needed to keep the data current. 

Step 3: The embeddings are stored in a vector database.
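A rough sketch of such a batch job is shown below. The bucket name and fixed-size chunking are assumptions for illustration, and the in-memory list stands in for writing to a real vector database such as OpenSearch or pgvector.

```python
import json
import boto3

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")
vector_store = []  # stand-in for a real vector database

def embed(text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
        contentType="application/json",
    )
    return json.loads(resp["body"].read())["embedding"]

def run_batch(bucket: str = "enterprise-data-lake") -> None:  # hypothetical bucket name
    for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
        text = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read().decode("utf-8")
        # Naive fixed-size chunking; production pipelines usually split on semantic boundaries
        for chunk in (text[i:i + 1000] for i in range(0, len(text), 1000)):
            vector_store.append({
                "source": obj["Key"],
                "text": chunk,
                "embedding": embed(chunk),
            })
```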



Phase 2


Step 1: A user sends a question from the frontend application to the user interface or API.

Step 2: The interface or API sends the question and other relevant context to the orchestrator layer. The orchestrator layer is a central component in this pattern. It calls all the relevant components in a well-defined sequence of steps to generate responses to the user's question. A sketch of such an orchestrator follows the steps.

Step 3: In this step, the orchestrator calls the conversation history store to get more context information.

Step 4: The orchestrator calls the embeddings model to convert the question and contextual information into embeddings. This is the same embeddings model that was used to embed the knowledge documents.

Step 5: When the embeddings are generated, the orchestrator calls the vector store to retrieve relevant chunks from the documents that match the user’s query. This is done by performing a similarity search between the user's query embeddings and the embeddings of documents in the vector store. 

Step 6: The orchestrator now calls the generative model with the user’s question, context information, and relevant document chunks retrieved from the vector store. The generative model takes this information and generates the completions based on the prompts.

Step 7: The prompts and completions are stored in the conversational history store to preserve the context.
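The sequence above can be expressed as a single orchestrator function. The helpers below are deliberately left as placeholders; in a real application they would wrap the components sketched earlier in this post: the conversation history store, the embeddings model, the vector store, and the generative model.

```python
# Placeholder helpers; real implementations wrap the components described above
def load_history(session_id: str) -> list[dict]: return []
def save_turn(session_id: str, prompt: str, completion: str) -> None: pass
def embed(text: str) -> list[float]: return []
def search_vector_store(vector: list[float], k: int = 3) -> list[str]: return []
def generate(prompt: str) -> str: return ""

def handle_question(session_id: str, question: str) -> str:
    history = load_history(session_id)                  # step 3: conversation history
    query_vector = embed(question)                      # step 4: embed the question
    chunks = search_vector_store(query_vector, k=3)     # step 5: similarity search
    prompt = (
        "Conversation so far:\n"
        + "\n".join(f"User: {h['prompt']}\nAssistant: {h['completion']}" for h in history)
        + "\n\nContext:\n" + "\n".join(chunks)
        + f"\n\nQuestion: {question}"
    )
    completion = generate(prompt)                       # step 6: call the generative model
    save_turn(session_id, question, completion)         # step 7: persist the turn
    return completion
```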



