By Snehanshu Bhaisare, Partner Solutions Architect – AWS
By Nathan Douglas, Data Scientist – Arcanum AI
By Asa Cox, Chief Executive Officer – Arcanum AI

Arcanum AI

Arcanum AI is an AWS Partner and AWS Marketplace Seller that has a platform for artificial intelligence (AI) assistants. Its vision is to automate back-office operations for SMBs (small and medium-sized businesses) and remove all the mundane repetitive tasks, giving teams more time for impactful work. The AI assistants are a combination of generative AI, integration, and automation.

The democratization of generative AI with large language models (LLMs) has opened up new opportunities for businesses. However, many organizations have hesitated to adopt solutions due to data privacy and security concerns with transmitting sensitive information over the public internet.

Amazon Web Services (AWS) provides AI services like Amazon Bedrock and Amazon SageMaker that allow customers to benefit from the latest advancements in generative AI while maintaining control over their data.

Amazon Bedrock provides private access to API-based LLMs with additional security, access control, and network isolation. For even greater control, Amazon SageMaker JumpStart allows customers to provision LLMs directly into their own AWS accounts, enabling them to meet stringent security and privacy requirements.

Arcanum AI clients are SMBs that deal with sensitive data like finance, company IP, and personnel records. Amazon Bedrock offers a private network to connect the applications with LLMs and access control to tighten the permissions.

In this post, we will explore how Arcanum AI has migrated its generative AI workloads from OpenAI to AWS. We’ll cover the team’s model evaluation process, architecture design, and implementation details.

Migration Process – Model Evaluation

Arcanum AI followed a rigorous two-iteration model evaluation process to assess open-source models before migrating from OpenAI.


Figure 1 – Two-iteration model evaluation process.

Model evaluation is needed before migrating LLMs from OpenAI to LLMs hosted on AWS, either in Amazon SageMaker Jumpstart or Amazon Bedrock. Figure 1 shows the process Arcanum AI adopted to evaluate the open-source models available in AWS in two iterations.

In the first iteration, to limit the effort and facilitate the migration process, the team adopted the prompts from the existing application without any fine-tuning apart from tweaking the formats to meet different LLMs’ requirements.

Phase 1: Out-of-the-Box Iteration

Gather Sample Test Cases

Start by collecting examples that demonstrate the input and output expected from the LLM model or the application that integrates multiple LLMs. This step is crucial for setting benchmarks against which the models’ performances will be evaluated.

  • Example outcome: A collection of text input and expected output pairs that accurately represent the kind of tasks the model is expected to perform. For instance, a set of questions and the correct answers for a Q&A application.
  • Amazon Bedrock feature: Use the playground feature to quickly test initial inputs and outputs manually, getting a feel for the types of responses different models produce.

Investigate Public Performance Results

Examine the public performance benchmarks and results of various LLMs to identify those that align with specific use cases. This helps narrow down the potential models for further evaluation. Resources like the HELM (Holistic Evaluation of Language Models) framework can provide a comprehensive evaluation of LLM performance across various metrics:

  • Example outcome: A shortlist of models, such as AI21 Labs’ Jurassic, Amazon’s Titan, Meta’s Llama 2, and Anthropic’s Claude, deemed potentially suitable based on public performance metrics and relevance to the application’s needs.

Set Up Models in AWS

Configure each of the shortlisted LLMs in Amazon SageMaker JumpStart for detailed control and customization over the machine learning (ML) process, and in Amazon Bedrock for a quickly integrating advanced AI capabilities without much customization. Run the sample test cases through each model to generate experimental results.

  • Example outcome: Experiment results where each model’s output for the given inputs is recorded. For instance, the accuracy of answers provided by each LLM to the set of questions from step 1.
  • Amazon Bedrock feature: Bedrock offers a model evaluation feature to test and compare model performance on sample test cases. Going further, Amazon SageMaker’s model evaluation can compare models from SageMaker, Bedrock, and OpenAI, providing a more comprehensive assessment across platforms.

Score Experiment Results and Calculate Overall Performance

Have data scientists review the outputs produced by the LLMs for each sample test case, scoring them based on their accuracy and relevance. Aggregate these scores to determine the overall performance of each model.

Both Amazon Bedrock and SageMaker model evaluation features support this “human-in-the-loop” approach. While the automated evaluations provide quantitative metrics, human scoring allows for a more nuanced assessment of the model outputs.

  • Example outcome: A scored list of models where each model has been rated based on how well its outputs matched the expected results. For example, the Llama 2 model might score highest for language translation accuracy, while the Jurassic model excels in generating creative content.
  • The combined automated and human evaluation delivers a more comprehensive and reliable assessment of each model’s strengths and weaknesses.

Select Model for Migration Based on Performance Scores

Decide which LLM to migrate to based on the comprehensive performance evaluation. Begin the migration process using AWS tools and services for the model that achieved the best score in the initial round of evaluation.

  • Example outcome: The decision to proceed with migrating the Meta Llama 2 model to AWS for your application based on its superior performance across a range of test cases.

Migrate and Integrate the Chosen Model

Utilize Amazon Bedrock and other AWS services to facilitate the migration of the selected LLM into your production environment, ensuring it’s properly integrated with your application’s infrastructure.

Phase 2: Prompt Tuning Iteration

Customize Model Prompts

In this phase, the focus shifts to leveraging prompt engineering to further optimize the performance of each LLM evaluated in the initial phase. The key objective is to tailor the prompts specifically to the unique characteristics and capabilities of each model, aiming to extract the maximum potential from each one.

  • Example outcome: A set of customized prompts for each shortlisted LLM, designed to capitalize on the model’s individual strengths and address any gaps identified in the initial evaluation.
  • Tip: Use the prompt design feature to experiment with different prompt structures and phrasings to see their impact on model outputs. The playground feature can help quickly iterate on prompts and test their effects.

Run Test Cases with Customized Prompts

Execute the sample test cases, as defined in Phase 1, using the customized prompts for each language model. Capture the outputs generated by the models.

  • Example outcome: Experiment results showcasing how the model outputs change when using the optimized prompts, compared to the initial generic prompts.

Human Evaluation

Have subject matter experts review the model outputs generated with the customized prompts. They can score the responses based on criteria like accuracy, relevance, and coherence.

  • Example outcome: Qualitative assessments and scores for each model, providing deeper insights into their performance beyond just the quantitative metrics.

Calculate Scores

Aggregate the human evaluation scores to determine the overall performance of each LLM after prompt optimization. Compare these results to the initial baseline scores.

  • Example outcome: A ranked list of models, highlighting any changes in their relative performance compared to the first evaluation phase.

Select Optimized Model

Based on the comprehensive evaluation, including both the baseline and prompt-tuned results, choose the top-performing model to proceed with migration and deployment.

  • Example outcome: A final decision to proceed with migrating the Anthropic Claude model to AWS, as it exhibited the best performance across the range of test cases after the prompt tuning iteration.
  • Tip: Carefully consider the tradeoffs between model performance, accuracy, and cost when selecting the optimal LLM for deployment. The comprehensive evaluation process should provide the necessary insights to make an informed decision.

Migration Process – Architecture Design and Implementation

Figure 2 shows a high-level architecture design for applications in AWS that consume LLMs in AWS.

Amazon Route 53 is used to manage an end user’s internet connection, and requests from end users are passed to Amazon Virtual Private Cloud (VPC), which includes a public subnet and a private subnet.

Public-facing components, such as Application Load Balancers, are placed into the public subset. On the other hand, application clusters, such as the Amazon Elastic Container Service (Amazon ECS) cluster or Amazon Elastic Kubernetes Service (Amazon EKS) cluster, are provisioned inside a private subnet. The private subset can be accessed via the public subnet inside the VPC.


Figure 2 – AWS architecture diagram with Amazon Bedrock and Amazon SageMaker JumpStart.

In terms of LLMs, with Amazon SageMaker JumpStart the endpoint can be secured inside the private subset, which can only be connected from the application cluster. This restricts the access and secures the connection to the LLM endpoints. In addition, the LLMs provisioned by SageMaker JumpStart are exclusive LLMs that are only used inside a specific client’s account; the LLMs are not shared with any other AWS clients.

LLMs provisioned via Amazon Bedrock can be accessed on-demand through shared instances or on dedicated instances using provisioned throughput. AWS Identity and Access Management (AWS) policies can be used for fine-grained access control and Bedrock supports AWS PrivateLink for establishing private connectivity between the service and your VPC. Your data is not shared with model providers and is not used to improve the base models.

In regards to the implementation, AWS SDK for Python (Boto3) provides all of the functionalities needed to provision and consume LLMs via SageMaker JumpStart and Amazon Bedrock. There are also plenty of open-source libraries that extend Boto3 to support SageMaker JumpStart and Amazon Bedrock, such as LangChain which Arcanum AI uses in its product.

Most open-source libraries similar to LangChain are still under development. We find it useful to dive deep into Boto3 and look into the source code of open-source libraries under development because this increases our capacity for better use and debugging of applications that consume LLMs in SageMaker JumpStart and Amazon Bedrock.


By moving large language models to Amazon Bedrock and Amazon SageMaker JumpStart, Arcanum AI surpassed the performance levels it previously enjoyed with OpenAI models. This transition offers AWS users, especially those in the enterprise sector and SMBs handling sensitive information, enhanced network security and improved access control to language models with minimal extra work.

Moreover, the integration tools provided by AWS make incorporating Amazon Bedrock and SageMaker Jumpstart into current systems straightforward. This provides a strong incentive for AWS customers to either transition to or start their projects on these AWS platforms.

If you’re a large enterprise or software company exploring generative AI migration or adoption to AWS, reach out to Arcanum AI via [email protected]. For businesses seeking generative AI solutions built on AWS, learn about the Arcanum AI Assistant offering on AWS Marketplace or at


Arcanum – AWS Partner Spotlight

Arcanum AI is an AWS Partner that combines expert AI/ML development services with its Accelerate Platform to build tools for unstructured data and IoT.

Contact Arcanum AI | Partner Overview | AWS Marketplace