Welcome to TensorTunes 🎶

Hi, I’m Shahid, and I recently earned my Master’s degree in Artificial Intelligence from the University at Buffalo 🎓.

This blog is where I compile my notes on Large Language Models (LLMs) 🤖, drawn from papers, videos, and other resources.

Inspired by Albert Einstein’s words, “If you can’t explain it simply, you don’t understand it well enough,” I break down complex concepts to refine my understanding 🧠.

I’m also actively seeking full-time opportunities to apply these skills in real-world applications 💼.

LLM Reasoning [Draft]

Motivation for Reasoning Classical ML algorithms like Linear Regression, Support Vector Machines, and K-Nearest Neighbours have long served as powerful tools for narrow, well-structured problems. What sets today’s field of AI apart from the previous generation is the ability of current models to reason (or do something that resembles reasoning, quite convincingly I might add). The ideal of AGI (artificial general intelligence) is a computer that can reason and plan ahead, and current LLMs can be taught to do many things; this falls under the paradigm of prompt engineering....

November 9, 2024

Where Did All the Memory Go?

CUDA error: out of memory If you’ve ever tried to train a deep learning model, the dreaded CUDA error: out of memory is likely all too familiar. The usual quick fix is to decrease the batch size and move on without giving it much thought. But have you ever wondered how memory actually gets allocated during training? In this blog post, I want to demystify memory consumption during model training and offer practical methods to reduce the demands of memory-heavy models....
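
To make the teaser concrete, here is a back-of-the-envelope sketch of where the memory goes, assuming mixed-precision training with Adam and the common 16-bytes-per-parameter accounting (fp16 weights and gradients, fp32 master weights, fp32 Adam moments). Activations and framework overhead are deliberately ignored, and the breakdown is an illustrative assumption rather than a figure taken from the post.

```python
def training_memory_gib(n_params: float) -> dict:
    """Rough model-state memory for mixed-precision training with Adam.

    Assumed accounting: fp16 weights (2 B) + fp16 gradients (2 B)
    + fp32 master weights (4 B) + fp32 Adam moments m, v (8 B) = 16 B/param.
    Activations and CUDA/framework overhead are ignored.
    """
    bytes_per_param = {
        "fp16 weights": 2,
        "fp16 gradients": 2,
        "fp32 master weights": 4,
        "fp32 Adam moments (m, v)": 8,
    }
    gib = 1024 ** 3
    return {name: n_params * b / gib for name, b in bytes_per_param.items()}

# A 7B-parameter model already needs ~104 GiB of model states, before a single activation.
for name, size_gib in training_memory_gib(7e9).items():
    print(f"{name:>26}: {size_gib:6.1f} GiB")
```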

October 21, 2024

From Retrieval to RAG (Part - 2) [Draft]

The concept of retriever + generator end-to-end training, introduced as Retrieval-Augmented Generation (RAG) by Lewis et al. (2020), integrates retrieval and generation into a single, cohesive framework. Training both components together ensures that the retriever surfaces documents relevant to the query and the generator produces accurate, high-quality responses grounded in them. Let’s break down the details step by step: Components of RAG Retriever: The retriever is responsible for searching a large corpus to find documents relevant to the input query....
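
The post walks through these components in detail; as a quick illustration, here is a toy retrieve-then-generate sketch. The embed and generate functions below are hypothetical stand-ins (the actual RAG paper uses a dense DPR-style bi-encoder retriever and a seq2seq generator such as BART), so treat this as a shape-of-the-pipeline sketch rather than the real method.

```python
import numpy as np

corpus = [
    "RAG trains the retriever and generator end to end.",
    "The retriever scores documents against the query embedding.",
    "The generator conditions on the query plus retrieved passages.",
]

def embed(text: str) -> np.ndarray:
    # Stand-in encoder: a normalized bag-of-characters vector (real RAG uses a dense bi-encoder).
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Score every document by inner product with the query and keep the top k.
    scores = [float(embed(query) @ embed(doc)) for doc in corpus]
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

def generate(query: str, passages: list[str]) -> str:
    # Stand-in for the seq2seq generator: just shows what it would condition on.
    return f"Answer({query} | {' '.join(passages)})"

query = "How is RAG trained?"
print(generate(query, retrieve(query)))
```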

October 19, 2024

RLHF: PPO [Draft]

Reinforcement Learning from Human Feedback (RLHF): Aligning LLMs with Human Intent Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique in the advancement of large language models (LLMs), aiming to align their behavior more closely with human intentions and ethical standards. While starting with a large pretrained LLM—such as LLaMA 2 with its 7 billion parameters trained on a trillion tokens—provides a strong foundation, these models can still struggle with handling harmful or toxic queries effectively....

October 13, 2024

Multi-GPU Training [Draft]

In this post, I won’t be discussing code implementations; my goal is to cover the foundational concepts behind multi-GPU training of massive LLMs. As stated in my post on Quantization, you need a cluster of GPUs just to get up and running with the fine-tuning of even small LLMs like the LLaMA 7B models. The topics I would like to cover are as follows: DDP (Distributed Data Parallel), Tensor model parallelism, Pipeline model parallelism, and Memory-efficient pipeline parallelism. Let’s start. Multi-GPU Training...

October 11, 2024

Decoding From Language Models

A quick refresher on Autoregressive text generation Autoregressive language models generate text through a sequential process of predicting one token at a time. The model takes a sequence of tokens $\lbrace y \rbrace_{<t}$ as input and outputs a new token $\hat{y}_t$. This process repeats iteratively, with each newly generated token becoming part of the input for the subsequent prediction. At each time step $t$, the model computes a vector of scores $\mathbf{S} \in \mathbb{R}^V$, where $V$ is the size of the vocabulary....
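
As a quick illustration of this loop, here is a minimal greedy-decoding sketch in the notation above: at each step the model produces a score vector $\mathbf{S} \in \mathbb{R}^V$ over the vocabulary and the highest-probability token is appended. The score_next_token function is a hypothetical stand-in for a trained model’s forward pass, so this only shows the control flow, not real decoding quality.

```python
import numpy as np

V = 8    # toy vocabulary size
EOS = 0  # toy end-of-sequence token id

def score_next_token(prefix: list[int]) -> np.ndarray:
    # Deterministic toy scorer so the example runs without a trained model.
    rng = np.random.default_rng(sum(prefix) + len(prefix))
    return rng.normal(size=V)

def greedy_decode(prefix: list[int], max_new_tokens: int = 10) -> list[int]:
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        scores = score_next_token(tokens)      # S in R^V
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                   # softmax over the vocabulary
        y_hat = int(np.argmax(probs))          # greedy choice of the next token
        tokens.append(y_hat)
        if y_hat == EOS:
            break
    return tokens

print(greedy_decode([3, 5, 1]))
```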

September 11, 2024

From Retrieval to RAG (Part - 1)

ChatGPT can make mistakes. Check important info. This disclaimer appears beneath the input field in ChatGPT and is not unique to it—similar notices can be found across all major large language models (LLMs). This is because one of the most well-known issues with LLMs is their tendency to hallucinate, meaning they can generate information that isn’t accurate or grounded in reality. So, before submitting your history paper directly from ChatGPT, make sure to proofread it carefully....

September 11, 2024

Parameter Efficient Fine-tuning of LLMs (PEFT) [Draft]

Motivation for PEFT Consider a company like character.ai, which provides different personas for users. For example, you can talk to a chat bot that mimics Elon Musk and ask, “Why did you buy Twitter?” The model responds as Elon Musk would. Now there are primarily three approaches to solving this: Context-based approach: Take an LLM and provide it with extensive data about the persona (e.g., Elon Musk’s interviews and tweets) as context, and then tag on your question....
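
For a sense of what the context-based approach looks like in practice, here is a minimal sketch: persona material is placed in the prompt and the user’s question is appended at the end. PERSONA_DOCS holds placeholder strings and call_llm is a hypothetical stand-in, not a reference to character.ai’s actual system or any specific API.

```python
# Toy sketch of the context-based approach: persona documents as context + the question.
PERSONA_DOCS = [
    "<interview transcript excerpt 1>",
    "<interview transcript excerpt 2>",
    "<tweet 1>",
]

def build_persona_prompt(question: str) -> str:
    context = "\n".join(PERSONA_DOCS)
    return (
        "You are Elon Musk. Answer in his voice, grounding your answer "
        "only in the material below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def call_llm(prompt: str) -> str:
    return "<model response>"  # hypothetical model call

print(call_llm(build_persona_prompt("Why did you buy Twitter?")))
```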

September 11, 2024

Quantization in LLMs (Part 1): LLM.int8(), NF4

Introduction to Quantization Whether you’re an AI enthusiast looking to run large language models (LLMs) on your personal device, a startup aiming to serve state-of-the-art models efficiently, or a researcher fine-tuning models for specific tasks, quantization is a key technique to understand. Quantization can be broadly categorized into two main approaches: Quantization Aware Training (QAT): This involves training the model with reduced precision, allowing it to adjust during the training process to perform well under quantized conditions....
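
As a minimal taste of the post-training side, here is a round-to-nearest absmax int8 sketch. Schemes like LLM.int8() apply this kind of scaling per row/column (plus outlier handling), whereas this toy version quantizes a whole tensor with a single scale, so it is a simplified assumption rather than the actual algorithm.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # Map the largest magnitude in the tensor to 127 and round to the nearest integer.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float tensor.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```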

September 11, 2024

Quantization in LLMs (Part 2): GPTQ [Draft]

Introduction Quantization is a crucial technique in deep learning that reduces the memory footprint and computational requirements of neural networks by representing weights and activations with lower-precision numerical formats. This is particularly important when deploying large models on devices with limited resources. However, quantizing a neural network without significantly degrading its performance is challenging. GPTQ, a post-training quantization algorithm for generative pre-trained transformers, is designed to efficiently quantize large-scale neural networks, such as those used in natural language processing, while maintaining high accuracy....

August 1, 2024