# Reading List: Instructional Safety & LLM Evaluation

This list is curated to help you understand **why evaluation matters**, **how safety is operationalized**, and **how rubrics fit into real LLM evaluation work**.

---

## Core Concepts

### 1. Anthropic — _Helpful, Honest, and Harmless (HHH)_

**Why read this:** Foundational framework for thinking about AI safety. Explains the tradeoff between helpfulness and safety, and why human judgment matters in resolving it.

**Read before:** Human scoring (Step 3)

- [Training a Helpful and Harmless Assistant](https://www.anthropic.com/research/training-a-helpful-and-harmless-assistant-with-reinforcement-learning-from-human-feedback)

---

### 2. OpenAI — _Evals (Concepts & Examples)_

**Why read this:** See how a major AI lab structures evaluations. Focus on the pattern: tasks + criteria + judgment (sketched below).

**Read before:** Reviewing tasks (Step 1)

- [openai/evals (GitHub)](https://github.com/openai/evals)
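
As a concrete anchor for that pattern, here is a minimal sketch of what a single eval record might look like. This is illustrative Python only, not the openai/evals schema; every field name is hypothetical.

```python
# Illustrative only: a minimal eval record in the tasks + criteria + judgment
# shape. This is NOT the openai/evals schema; all field names are hypothetical.
task = {
    "id": "fractions-01",
    "prompt": "A student asks: is 3/4 bigger than 2/3?",
    "ideal": "Yes: 3/4 = 9/12 and 2/3 = 8/12, so 3/4 is larger.",
    "criteria": [
        "The answer is factually correct",
        "The explanation suits the learner's level",
    ],
}

def judge(response: str, task: dict) -> dict:
    """Judgment step: one score per criterion.

    `response` is unused in this stub; in Steps 3-4 a human scorer or an
    LLM judge reads it and fills in the scores."""
    return {criterion: None for criterion in task["criteria"]}
```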

---

### 3. Microsoft — _LLM-Rubric_

**Why read this:** Closest precedent to this project. Shows how to define and apply qualitative scoring dimensions (see the sketch below).

**Read before:** Human scoring (Step 3)

- [microsoft/LLM-Rubric (GitHub)](https://github.com/microsoft/LLM-Rubric)
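
For a feel of what "scoring dimensions" means in code, here is a minimal sketch of a rubric as a data structure. The dimensions and level anchors are invented for illustration and are not taken from microsoft/LLM-Rubric.

```python
# Hypothetical rubric structure for illustration; not the microsoft/LLM-Rubric
# format. Dimension names and level anchors are invented.
RUBRIC = {
    "factual_accuracy": {
        1: "Contains errors a learner would likely absorb",
        3: "Mostly correct, with minor imprecision",
        5: "Fully correct and precisely stated",
    },
    "pedagogical_quality": {
        1: "Hands over the answer or confuses the learner",
        3: "Explains, but at the wrong level of detail",
        5: "Guides the learner toward their own understanding",
    },
}

def valid_score(dimension: str, level: int) -> bool:
    """Scores run 1-5; the anchors above describe levels 1, 3, and 5."""
    return dimension in RUBRIC and 1 <= level <= 5
```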

---

### 4. LLM-as-a-Judge

**Why read this:** Understand why automated scoring has limitations. When do LLM judges work? Where do they fail? Why does disagreement matter?

**Read before:** Running the LLM judge (Step 4)

- [LLM-as-a-Judge (Wikipedia overview)](https://en.wikipedia.org/wiki/LLM-as-a-Judge)
- [Awesome LLMs-as-Judges (GitHub, optional)](https://github.com/CSHaitao/Awesome-LLMs-as-Judges)
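
Before Step 4, it helps to see how small a judge actually is. Below is a minimal sketch using the OpenAI Python SDK (linked in the API section further down); the model name, rubric text, and 1-5 scale are placeholder assumptions, not taken from any of the readings.

```python
# Minimal LLM-judge sketch. Model name and rubric text are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a tutoring response against one rubric \
dimension: factual accuracy, on a 1-5 scale.

Response to grade:
{response}

Reply with only an integer from 1 to 5."""

def llm_judge(response_text: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(response=response_text)}],
    )
    # Fragile on purpose: a real judge must handle non-integer replies.
    return int(completion.choices[0].message.content.strip())
```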

---

### 5. Stanford HAI — _HELM Evaluation Framework_

**Why read this:** See how professionals present multi-dimensional evaluation results. Good model for your analysis write-up.

**Read before:** Analysis and write-up (Step 5)

- [Stanford HELM](https://crfm.stanford.edu/helm/)

---

## API Documentation

If you need help with the test harness code (a minimal harness sketch follows the SDK links):

### OpenAI Python SDK

- [Official documentation](https://platform.openai.com/docs/libraries/python-library)
- [API reference](https://platform.openai.com/docs/api-reference)

### Anthropic Python SDK

- [Official documentation](https://docs.anthropic.com/en/api/client-sdks)
- [Getting started guide](https://docs.anthropic.com/en/api/getting-started)

### Google Generative AI (Gemini)

- [Python quickstart](https://ai.google.dev/tutorials/python_quickstart)
- [API reference](https://ai.google.dev/api/python/google/generativeai)
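
If you just need a working starting point, the sketch below sends one probe to all three providers. It assumes the `openai`, `anthropic`, and `google-generativeai` packages are installed, that API keys are set as environment variables, and it uses example model names that may need updating.

```python
# Minimal multi-provider harness sketch; model names are examples only.
import os

import anthropic
import google.generativeai as genai
from openai import OpenAI

PROBE = "A student asks: why is the sky blue? Explain for a 10-year-old."

# OpenAI: reads OPENAI_API_KEY automatically.
openai_reply = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": PROBE}],
).choices[0].message.content

# Anthropic: reads ANTHROPIC_API_KEY automatically; max_tokens is required.
anthropic_reply = anthropic.Anthropic().messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{"role": "user", "content": PROBE}],
).content[0].text

# Gemini: configure with an explicit key, then generate.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini_reply = genai.GenerativeModel("gemini-1.5-flash").generate_content(PROBE).text

for name, text in [("openai", openai_reply), ("anthropic", anthropic_reply),
                   ("gemini", gemini_reply)]:
    print(f"--- {name} ---\n{text}\n")
```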

---

## Additional Reading (Optional)

For deeper background if you're interested:

- [Khan Academy's 7-Step Approach to Prompt Engineering for Khanmigo](https://blog.khanacademy.org/khan-academys-7-step-approach-to-prompt-engineering-for-khanmigo/) — How Khan Academy designed their AI tutor's system prompt (this influenced our educational system prompt)

- [Paper: An Evaluation of Khanmigo as a Computer-Assisted Language Learning App](https://files.eric.ed.gov/fulltext/EJ1435677.pdf) — Real-world evaluation of an instructional AI

- [Video: Stanford CME295 - LLM Evaluation (Lecture 8)](https://www.youtube.com/watch?v=8fNP4N46RRo) — Academic lecture on evaluation methods

- [UK AI Safety Institute - Inspect Framework](https://inspect.aisi.org.uk/) — How government safety researchers do evaluations (more advanced)

- [DeepEval Documentation](https://deepeval.com/docs/getting-started) — Popular Python evaluation framework (if you want to explore tooling later)

---

## Glossary

Quick definitions for terms you'll encounter:

| Term | Definition |
|------|------------|
| **Epistemic calibration** | How well a model's confidence matches its actual accuracy. A well-calibrated model says "I'm not sure" when it genuinely doesn't know. |
| **Hallucination** | When a model generates plausible-sounding but false information, often presented confidently. |
| **Pedagogical harm** | Teaching in ways that mislead, confuse, or endanger the learner—even if the model is trying to be helpful. |
| **LLM-as-Judge** | Using one language model to evaluate another's outputs. Useful for scale, but has known biases and limitations. |
| **Rubric** | A scoring guide that defines what "good" looks like on each dimension being evaluated. |
| **Failure mode** | A recurring pattern of mistakes. More useful to track than one-off errors because it reveals systematic issues. |
| **HHH (Helpful, Honest, Harmless)** | Anthropic's framework for AI assistant behavior. The three goals often create tradeoffs. |
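
Several of these terms meet when you compare human scores (Step 3) against LLM-judge scores (Step 4). Here is a minimal sketch of that comparison, assuming both use the same 1-5 integer scale; the score lists are invented example data.

```python
# Minimal sketch: quantify human/judge disagreement on a shared 1-5 scale.
# The score lists below are invented example data.
human = [5, 3, 4, 2, 5, 1]   # Step 3: human rubric scores
judge = [5, 4, 4, 2, 3, 1]   # Step 4: LLM-judge scores on the same items

exact_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
mean_abs_gap = sum(abs(h - j) for h, j in zip(human, judge)) / len(human)

print(f"exact agreement: {exact_agreement:.0%}")          # 67% on this data
print(f"mean |human - judge| gap: {mean_abs_gap:.2f}")    # 0.50 on this data
```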


---

## Backlinks

The following sources link to this document:

- [safety-evals.project.llm.md](/research/instructional-safety/safety-evals.project.llm.md)
