Introduction to Large Language Models (LLMs)

What are Large Language Models?

Large Language Models (LLMs) are AI systems trained on massive amounts of text data that can understand and generate human-like text. They power tools like ChatGPT, Claude, and Google Gemini.

The "Large" in LLM

  • GPT-3: 175 billion parameters (learned weights)
  • GPT-4: estimated at ~1.76 trillion parameters
  • Training data: hundreds of billions of words from books, websites, and code

For comparison, the human brain has ~86 billion neurons!

What Can LLMs Do?

💬 Conversation

Natural dialogue, context awareness, follow-up questions

✍️ Writing

Essays, emails, stories, scripts, marketing copy

💻 Coding

Write, debug, explain, and optimize code

📊 Analysis

Summarize, extract insights, answer questions

🌍 Translation

Translate between 100+ languages

🎭 Creativity

Brainstorm ideas, write poetry, create characters

How LLMs Work (Simplified)

  • Training: Model reads billions of text examples
  • Learning Patterns: Discovers language patterns, facts, reasoning
  • Tokenization: Breaks text into pieces (tokens)
  • Prediction: Predicts next most likely token
  • Generation: Repeats to create full responses
A minimal pseudocode sketch of this loop (tokenize, model, and detokenize are placeholders, and "<|endoftext|>" stands in for the model's end-of-sequence token):

    # Simplified LLM prediction process (pseudocode)
    def generate_response(prompt):
        """
        LLMs predict one token (word piece) at a time
        """
        tokens = tokenize(prompt)  # "Hello world" → ["Hello", " world"]

        # Start with the user's prompt
        generated_tokens = tokens

        # Generate up to 50 new tokens
        for _ in range(50):
            # Predict the next token based on the full context so far
            next_token = model.predict_next_token(generated_tokens)
            generated_tokens.append(next_token)

            # Stop if the model emits its end-of-sequence token
            if next_token == "<|endoftext|>":
                break

        return detokenize(generated_tokens)

    # Example
    prompt = "The capital of France is"
    response = generate_response(prompt)
    print(response)
    # Output: "The capital of France is Paris, a beautiful city
    #          known for the Eiffel Tower..."

Popular LLMs

OpenAI GPT Series

  • GPT-3.5: Powers the free tier of ChatGPT
  • GPT-4: Most capable, multimodal
  • GPT-4 Turbo: Faster, cheaper

Best for: General tasks, coding, creative writing

Anthropic Claude

  • Claude 3 Opus: Most powerful
  • Claude 3 Sonnet: Balanced
  • Claude 3 Haiku: Fast & cheap

Best for: Long documents, analysis, safety

Google Gemini

  • Gemini Ultra: Top-tier
  • Gemini Pro: Standard
  • Gemini Nano: On-device

Best for: Google integration, multimodal

Open Source

  • Llama 2/3: Meta's models
  • Mistral: Efficient European model
  • Falcon: Strong open model

Best for: Self-hosting, fine-tuning, privacy (see the local-inference sketch below)
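
Curious what self-hosting looks like in practice? A minimal sketch using Hugging Face transformers to run an open-weight model locally (the model ID is just one example; larger models need a capable GPU):

    # pip install transformers torch
    from transformers import pipeline

    # Downloads the weights on first use; a 7B model wants a GPU with roughly 16 GB of memory
    generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

    result = generator("Explain tokenization in one sentence.", max_new_tokens=60)
    print(result[0]["generated_text"])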

Using LLMs in Python

1. OpenAI API

    # Install: pip install openai
    from openai import OpenAI
    
    client = OpenAI(api_key="your-api-key")  # better: load the key from an environment variable
    
    # Simple completion
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": "Write a Python function to calculate fibonacci"}
        ],
        temperature=0.7,  # Creativity (0-2)
        max_tokens=500    # Response length limit
    )
    
    print(response.choices[0].message.content)
    
    # Streaming response (like ChatGPT typing effect)
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Tell me a story"}],
        stream=True
    )
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
    

2. Hugging Face Transformers

    # Install: pip install transformers torch
    from transformers import pipeline
    
    # Load a pre-trained model
    generator = pipeline('text-generation', model='gpt2')
    
    # Generate text
    result = generator(
        "Once upon a time",
        max_length=100,
        num_return_sequences=1
    )
    
    print(result[0]['generated_text'])
    
    # Different tasks
    summarizer = pipeline("summarization")
    translator = pipeline("translation_en_to_fr")
    sentiment = pipeline("sentiment-analysis")
    
    # Use them
    summary = summarizer("Long article text here...", max_length=50)
    french = translator("Hello, how are you?")
    feeling = sentiment("I love this product!")

3. LangChain (Advanced)

    # Install: pip install langchain openai
    # (these are the classic LangChain 0.0.x imports; newer releases move
    #  ChatOpenAI into the separate langchain-openai package)
    from langchain.chat_models import ChatOpenAI
    from langchain.prompts import ChatPromptTemplate
    from langchain.chains import LLMChain
    
    # Create chat model
    llm = ChatOpenAI(model="gpt-4", temperature=0.7)
    
    # Create prompt template
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a {profession}. Be helpful and professional."),
        ("user", "{question}")
    ])
    
    # Create chain
    chain = LLMChain(llm=llm, prompt=prompt)
    
    # Use it
    response = chain.run(
        profession="Python expert",
        question="How do I read a CSV file?"
    )
    
    print(response)

Key Concepts

Tokens

LLMs process text in chunks called tokens: roughly 1 token ≈ 4 characters ≈ 0.75 English words.

    # Example tokenization (illustrative splits; exact tokens vary by tokenizer)
    "Hello world!" → ["Hello", " world", "!"]  # 3 tokens
    "Artificial Intelligence" → ["Art", "ificial", " Int", "elligence"]  # 4 tokens

    # Pricing is per token
    # GPT-4: ~$0.03 per 1K input tokens, $0.06 per 1K output tokens
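
To count tokens exactly, OpenAI's tiktoken library exposes the same tokenizers the GPT models use; a minimal sketch (pip install tiktoken):

    import tiktoken

    # Load the tokenizer that matches a given model
    enc = tiktoken.encoding_for_model("gpt-4")

    tokens = enc.encode("Artificial Intelligence")
    print(len(tokens))                        # number of tokens
    print([enc.decode([t]) for t in tokens])  # the individual token strings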

Context Window

The maximum number of tokens the model can process at once (input + output).

  • GPT-3.5: 4K or 16K tokens
  • GPT-4: 8K, 32K, or 128K tokens
  • Claude 3: 200K tokens (~150K words!)

A longer context window means the model can handle bigger documents and maintain longer conversations; when a chat outgrows it, older turns have to be trimmed, as sketched below.
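
A common pattern is dropping the oldest chat turns until the conversation fits a token budget; a rough sketch using tiktoken (the 4,000-token budget is an arbitrary example, and the count ignores per-message formatting overhead):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

    def count_tokens(messages):
        # Approximate: sums content tokens only
        return sum(len(enc.encode(m["content"])) for m in messages)

    def trim_history(messages, budget=4000):
        # Keep the system prompt; drop the oldest turns until we fit the budget
        system, rest = messages[:1], messages[1:]
        while rest and count_tokens(system + rest) > budget:
            rest.pop(0)
        return system + rest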

Temperature

Controls randomness/creativity (0-2 in the OpenAI API); a quick comparison follows the list.

  • 0: Near-deterministic, (almost) the same output each time (good for factual tasks)
  • 0.7: Balanced creativity (a common default)
  • 1.5+: Very creative, unpredictable (good for brainstorming)
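
A quick way to feel the difference: send the same prompt at two temperatures and compare (uses the OpenAI client shown earlier):

    import os
    from openai import OpenAI

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    for temp in (0.0, 1.5):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Name one color."}],
            temperature=temp,
        )
        print(f"temperature={temp}: {response.choices[0].message.content}")
    # temperature=0.0 tends to repeat the same answer across runs;
    # temperature=1.5 varies far more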

Top-p (Nucleus Sampling)

An alternative to temperature: the model samples only from the smallest set of tokens whose cumulative probability reaches p. A short example follows.

  • 0.1: Very focused, only the most likely tokens
  • 0.9: Balanced (a common default)
  • 1.0: Consider all tokens
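
top_p is passed the same way as temperature (the usual advice is to tune one or the other, not both):

    import os
    from openai import OpenAI

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Suggest a name for a coffee shop."}],
        top_p=0.9,  # sample only from the top 90% of probability mass
    )
    print(response.choices[0].message.content)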

Complete Example: Simple Chatbot

    # simple_chatbot.py
    from openai import OpenAI
    import os
    
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    
    def chat():
        """Simple chatbot with conversation history"""
        messages = [
            {"role": "system", "content": "You are a friendly AI assistant."}
        ]
        
        print("Chatbot ready! (Type 'quit' to exit)")
        print("-" * 50)
        
        while True:
            # Get user input
            user_input = input("You: ")
            
            if user_input.lower() == 'quit':
                print("Bot: Goodbye!")
                break
            
            # Add user message to history
            messages.append({"role": "user", "content": user_input})
            
            # Get AI response
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=messages,
                temperature=0.7,
                max_tokens=500
            )
            
            # Extract and display response
            bot_message = response.choices[0].message.content
            print(f"Bot: {bot_message}\n")
            
            # Add bot response to history (for context)
            messages.append({"role": "assistant", "content": bot_message})
            
            # Show token usage
            print(f"(Tokens used: {response.usage.total_tokens})")
    
    if __name__ == "__main__":
        chat()

LLM Limitations

⚠️ What LLMs Can't Do (Yet)

  • No Real-Time Knowledge: Training data has a cutoff date
  • Hallucinations: Can confidently state false information
  • No True Understanding: Pattern matching, not conscious thought
  • Math Struggles: Complex calculations are often wrong
  • Context Limits: Can't process unlimited text
  • Inconsistency: The same prompt can give different results

Mitigations: techniques like RAG (Retrieval-Augmented Generation), function calling, and output verification help; a minimal RAG sketch follows.
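
For hallucinations in particular, the RAG idea is: retrieve relevant text first, then instruct the model to answer only from it. A minimal sketch, where search_documents is a hypothetical stand-in for your own retriever (vector database, keyword search, etc.):

    import os
    from openai import OpenAI

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    def answer_with_rag(question):
        context = search_documents(question)  # hypothetical retriever, not a real library call
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Answer using ONLY the provided context. "
                            "If the answer isn't there, say you don't know."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
            temperature=0,  # factual task: minimize randomness
        )
        return response.choices[0].message.content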

Next Steps