MLE-Bench introduces a new benchmark to evaluate AI agents on real-world ML engineering tasks using Kaggle competitions. This post highlights key findings, including resource scaling effects, debugging challenges, and the performance of different agent frameworks.
This post explores a neural network model designed for the Causal Discovery Challenge organized by ADIA Lab. It highlights the use of a Transformer-based architecture with two layers of scaled dot-product attention and layer normalization, achieving a multiclass balanced accuracy of 47.986%.
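The scaled dot-product attention mentioned above is the standard Transformer building block. As a rough sketch of what one such layer computes (the function name and NumPy formulation here are illustrative, not the competition code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q, K, V: arrays of shape (seq_len, d_k). The sqrt(d_k) scaling
    keeps the dot products from growing with dimensionality, which
    would otherwise push the softmax into near-one-hot saturation.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: two positions, two feature dimensions.
Q = np.eye(2)
K = np.eye(2)
V = np.array([[1.0, 2.0], [3.0, 4.0]])
out = scaled_dot_product_attention(Q, K, V)
```

In the full architecture each layer's attention output is followed by layer normalization, as the post describes.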
This post presents my solution to the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition that aims to advance the understanding and development of methods for detecting hidden functionality in large language models (LLMs). The primary task is to reverse-engineer the trigger prompts that elicit a given target string.
The BLEU (Bilingual Evaluation Understudy) score is a metric used in Natural Language Processing (NLP) to evaluate the quality of machine-generated text, such as translations. It is a precision-based metric: it measures the overlap of n-grams between the generated text and one or more reference texts, combined with a brevity penalty that discourages overly short outputs.
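The n-gram overlap idea can be sketched in a few lines. This is a simplified, illustrative implementation (unigrams and bigrams only, a single reference); real BLEU typically averages clipped precisions up to 4-grams over multiple references:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. candidate and reference
    are token lists."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        # Clipped counts: each candidate n-gram is credited at most
        # as many times as it appears in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
score = bleu(cand, ref)  # high unigram overlap, partial bigram overlap
```

The clipping step is what makes BLEU robust to degenerate outputs like "the the the the", which would otherwise score perfect unigram precision.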