
The Prometeo AI Infrastructure Blog

February 5, 2025

4 min read

Deploying Saga: The 7B LLM Model for Prometeo 2025

Prometeo, the biggest techno-entrepreneurial fest of North-Western India, is an event that brings together innovators, tech enthusiasts, and entrepreneurs. As the head of technical events for Prometeo 2025, I had the opportunity to deploy Saga, a 7B-parameter Large Language Model (LLM), for real-time interaction with attendees. Our theme this year was Nordic Nights, a blend of Scandinavian culture and futuristic technology. To try Saga out, you can visit here.

[Image: Saga]

System Architecture

Deploying Saga for a live event required a robust infrastructure to handle real-time queries efficiently. Here’s a high-level breakdown of the system:

1. Model Selection & Setup

2. Infrastructure & Deployment
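The full serving stack isn't spelled out above, so as an illustration, here is a minimal sketch of how an attendee query could reach a Saga instance served behind an Ollama-style HTTP endpoint. The URL and the model name `saga-7b` are assumptions for the example, not the event's actual configuration.

```python
import json
import urllib.request

# Hypothetical endpoint; the host, port, and model name are assumptions.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "saga-7b") -> dict:
    """Build a non-streaming generate request for an Ollama-style server."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_saga(prompt: str) -> str:
    """POST the prompt to the server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

A thin wrapper like this is all the event frontend needs: it keeps the model behind one HTTP endpoint, so the GPU box can be swapped or scaled without touching client code.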

[Diagram: Saga system architecture]

Some statistics on SAGA

The following stats were measured over a total of 15 requests.

Prompt Eval Count:

The number of tokens in the prompt that the model evaluated before generating a response.

[Chart: Prompt Eval Count per request]

Prompt Eval Duration:

The time spent evaluating the prompt (the prefill phase), before the model starts generating the response.

[Chart: Prompt Eval Duration per request]

Note: Y axis represents ms (milliseconds)

Eval Count:

The number of tokens in the generated response.

[Chart: Eval Count per request]

Total Duration:

The overall time the model takes to process a prompt and generate a response, including loading the model, evaluating the prompt, and generating the text; essentially the complete wall-clock time of a single query.

[Chart: Total Duration per request]

Note: Y axis represents ms (milliseconds)

Load Duration:

The time spent loading the model into (and unloading it from) GPU memory.

[Chart: Load Duration per request]

Note: All experiments were run on an NVIDIA A5000.
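All of these metrics map onto the timing fields an Ollama-style server returns with each response, reported in nanoseconds. As a sketch, assuming that response format, here is a small helper that converts them to milliseconds and derives a tokens-per-second figure; the sample numbers are illustrative, not the event's actual measurements.

```python
def summarize_metrics(resp: dict) -> dict:
    """Convert Ollama-style nanosecond timings to milliseconds
    and derive a decode-throughput figure (tokens/sec)."""
    ms = lambda ns: ns / 1e6
    eval_s = resp["eval_duration"] / 1e9
    return {
        "load_ms": ms(resp["load_duration"]),
        "prompt_eval_ms": ms(resp["prompt_eval_duration"]),
        "eval_ms": ms(resp["eval_duration"]),
        "total_ms": ms(resp["total_duration"]),
        "tokens_per_sec": resp["eval_count"] / eval_s if eval_s else 0.0,
    }

# Illustrative numbers only, not measurements from the event.
sample = {
    "load_duration": 2_000_000_000,       # 2 s to load the model
    "prompt_eval_count": 26,              # prompt tokens
    "prompt_eval_duration": 150_000_000,  # 150 ms prefill
    "eval_count": 120,                    # response tokens
    "eval_duration": 4_000_000_000,       # 4 s decode
    "total_duration": 6_200_000_000,      # 6.2 s end to end
}
print(summarize_metrics(sample))
```

Logging these per request is how charts like the ones above are built, and the tokens/sec number is the single most useful figure for comparing inference setups.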

Technical Challenges & Solutions

1. Managing Latency for Live Queries

2. Handling High Query Volume

3. Context Retention in Conversations

4. Competitor Q/A

5. Huge Startup Time
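For the startup-time problem in particular, one common mitigation (assuming an Ollama-style server; the post doesn't state the exact serving stack used) is to pin the model in GPU memory with `keep_alive`, so queries arriving after an idle period don't pay the load cost again:

```python
def build_pinned_request(prompt: str, model: str = "saga-7b") -> dict:
    """keep_alive=-1 asks an Ollama-style server to keep the model
    resident in GPU memory instead of unloading it after idling,
    so subsequent queries skip the load_duration cost entirely."""
    return {
        "model": model,        # "saga-7b" is a placeholder name
        "prompt": prompt,
        "stream": False,
        "keep_alive": -1,      # -1 = keep loaded indefinitely
    }
```

The trade-off is that the GPU memory stays occupied even when the fest floor is quiet, which is usually acceptable for a single-model deployment like this one.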

How Did People Use Saga?

Saga was used in various ways throughout the event.

Overall Experience & Takeaways

Deploying an LLM for a live event was an exhilarating challenge. Seeing people interact with Saga, a model we fine-tuned and optimized in real time, was incredibly fulfilling. Some key learnings:

I want to try out speculative decoding and other methods for faster inference as well, maybe for Prometeo 2026 or some other task.
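In its greedy form, the speculative decoding idea is simple enough to sketch in a few lines: a small draft model cheaply proposes a window of tokens, and the large target model verifies them in one pass, keeping the longest agreeing prefix plus one token of its own. The toy "models" below are plain functions over integer tokens standing in for real LLMs; this is an illustration of the technique, not production inference code.

```python
def speculative_decode(target, draft, prefix, n_tokens, k=4):
    """Greedy speculative decoding: the draft proposes k tokens,
    the target keeps the longest prefix it agrees with, plus one
    token of its own at the first disagreement."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        # Draft cheaply proposes a window of k tokens.
        proposed = []
        for _ in range(k):
            proposed.append(draft(out + proposed))
        # Target verifies; accept until the first mismatch.
        accepted = []
        for tok in proposed:
            if target(out + accepted) == tok:
                accepted.append(tok)
            else:
                break
        # On disagreement (or full acceptance) take one target token,
        # so every round makes progress even if the draft is wrong.
        accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prefix) + n_tokens]

# Toy models: the target counts up by one; the draft agrees except
# when the last token is a multiple of 3, where it skips ahead.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + (2 if seq[-1] % 3 == 0 else 1)
print(speculative_decode(target, draft, [0], 6))  # → [0, 1, 2, 3, 4, 5, 6]
```

The output matches plain greedy decoding with the target alone; the speedup in a real system comes from the target verifying a whole draft window in one batched forward pass instead of one pass per token.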

IIT Jodhpur: A Leader in LLM Innovation

One of the standout aspects of this project is that IIT Jodhpur is one of the few institutes to host and deploy its own Large Language Model (LLM). This initiative showcases our commitment to AI research and deployment, setting us apart as a hub for cutting-edge technological advancements. The success of this deployment highlights IITJ's growing expertise in foundational models and AI-driven applications.



If you’re interested in LLM deployment, let’s connect! I’d love to discuss model serving, inference optimization, and real-world applications. 🚀 @github or @email