---
name: "candidate-sourcing-pipeline"
description: "Daily candidate sourcing pipeline using Exa API: source, rank, and output candidates as CSV + HTML review page."
status: proposal
version: "v1"
date: "2026-07-01T12:53:51.117Z"
---

# Candidate Sourcing Pipeline

## Overview

A three-stage candidate sourcing pipeline that uses Exa's people search API to find, rank, and surface candidates for any role. It writes a ranked CSV and, optionally, an HTML review page that can be served by a static file server.

## Setup

### Prerequisites

1. **Exa API key** — stored as `EXA_API_KEY` in the OpenClaw config under `env.vars` (or any environment injection method available on the target system).
2. **Python 3.10+** with `requests` and `pyyaml` installed.
3. **A static file server** (e.g. Caddy) serving a directory for HTML output. This is optional: the skill assumes a drafts preview URL is available for the review page, but the CSV works standalone.

### Installation

1. Copy the `scripts/` folder to a working directory on the target system.
2. Copy `config.yaml` and edit queries, scoring weights, and keyword groups for the target role.
3. Set `EXA_API_KEY` in the environment.
4. Test: `python3 pipeline.py --config config.yaml`
5. Schedule via OpenClaw cron (see Cron Setup below).

## Config

All queries, scoring weights, and keyword groups live in `config.yaml`. Edit freely without touching code.

```yaml
exa:
  endpoint: "https://api.exa.ai/search"
  category: "people"
  type: "auto"
  contents:
    summary: true
  num_results: 10

queries:
  - "your search query here"
  - "another query"

scoring:
  weights:
    saas_b2b: 0.30
    technical_depth: 0.30
    ai_familiarity: 0.25
    seniority: 0.15

  keyword_groups:
    saas_b2b:
      - "saas"
      - "b2b"
      # ...
    technical_depth:
      - "engineer"
      - "api"
      # ...
    ai_familiarity:
      - "ai"
      - "machine learning"
      # ...
    seniority:
      - "sales engineer"
      - "mid-level"
      # ...

output:
  dir: "output"
  filename_prefix: "candidates"
```
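Because the weights and keyword groups are edited by hand, a quick sanity check before a run can catch typos. The following is a minimal sketch, not part of the pipeline as described: the function name `validate_scoring` is hypothetical, and the rule that weights should sum to 1.0 is an assumption inferred from the 0–1 score range.

```python
import math


def validate_scoring(scoring: dict) -> list[str]:
    """Return a list of human-readable problems found in the `scoring` section."""
    problems = []
    weights = scoring.get("weights", {})
    groups = scoring.get("keyword_groups", {})

    # Assumption: weights are meant to sum to 1.0 so scores stay within 0-1.
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {total:.2f}, expected 1.00")

    # Every weighted group needs at least one keyword, and vice versa.
    for name in weights:
        if not groups.get(name):
            problems.append(f"weight '{name}' has no keywords")
    for name in groups:
        if name not in weights:
            problems.append(f"keyword group '{name}' has no weight")
    return problems
```

The `scoring` argument is the dictionary loaded from `config.yaml` (for example via `yaml.safe_load`); an empty list means no problems were found.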

### Negative keyword filtering

The `seniority` keyword group only rewards matching titles: a title that matches none of its keywords contributes 0 for that group but is not penalized. To penalize over-senior candidates explicitly, add a `negative_keywords` list under `scoring`; the pipeline then subtracts `negative_penalty` (0.15 here) from the score for each match:

```yaml
scoring:
  negative_keywords:
    - "director"
    - "vp"
    - "head of"
  negative_penalty: 0.15
```
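The penalty could be applied roughly as follows. This is a sketch under stated assumptions rather than the pipeline's actual code: the function name `apply_negative_penalty` is hypothetical, matching is plain substring matching on lowercased text, each keyword counts at most once, and the result is clamped at 0 (the source does not say whether scores may go negative).

```python
def apply_negative_penalty(score: float, text: str,
                           negative_keywords: list[str],
                           penalty: float = 0.15) -> float:
    """Subtract a flat penalty for every negative keyword found in the text."""
    lowered = text.lower()
    # Assumption: a keyword counts once if it appears anywhere as a substring.
    matches = [kw for kw in negative_keywords if kw.lower() in lowered]
    # Assumption: clamp at zero so the score stays in the 0-1 range.
    return max(0.0, score - penalty * len(matches))
```

Note that substring matching means a short keyword such as "vp" also matches inside longer words like "mvp"; matching on word boundaries would be a stricter variant.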

## Pipeline Stages

### Stage 1 — Source

Sends a POST request to `https://api.exa.ai/search` with `category: "people"`, `type: "auto"`, and `contents: {"summary": true}`. Runs every query from the config, pools the results, and deduplicates them by profile URL.
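The request and deduplication steps might look like the sketch below. Several details are assumptions, not confirmed by this document: the helper names `build_request` and `dedupe_by_url`, the `x-api-key` header, the camelCase `numResults` body field, the `url` key on each result, and the URL normalization (lowercasing, stripping a trailing slash). Results without a URL are dropped here, which is also a design choice of the sketch.

```python
import os


def build_request(exa_cfg: dict, query: str) -> dict:
    """Build keyword arguments for requests.post(**kwargs) for one query."""
    return {
        "url": exa_cfg["endpoint"],
        "headers": {
            # Assumption: Exa authenticates via an x-api-key header.
            "x-api-key": os.environ.get("EXA_API_KEY", ""),
            "Content-Type": "application/json",
        },
        "json": {
            "query": query,
            "category": exa_cfg["category"],
            "type": exa_cfg["type"],
            "contents": exa_cfg["contents"],
            # Assumption: the API expects camelCase for the result count.
            "numResults": exa_cfg["num_results"],
        },
        "timeout": 30,
    }


def dedupe_by_url(results: list[dict]) -> list[dict]:
    """Keep the first result seen for each normalized profile URL."""
    seen = set()
    unique = []
    for result in results:
        key = (result.get("url") or "").rstrip("/").lower()
        if not key or key in seen:
            continue  # skip duplicates and results with no URL
        seen.add(key)
        unique.append(result)
    return unique
```

With `requests` installed, one query would then be sent as `requests.post(**build_request(cfg["exa"], query))`, and the pooled results from all queries passed through `dedupe_by_url`.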

Extracts from each result:
- Name (from title or entities)
- Profile URL
- Location
- Current title/company (from entities or parsed from title)
- Work history (title + company per role from `entities[].properties.workHistory`)
- Profile description (from `contents.summary`)
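As an illustration of the work-history extraction, the sketch below flattens `entities[].properties.workHistory` into one string per role. The function name `work_history` is hypothetical, and the per-role keys `title` and `company` are assumptions based on the description above; the real response shape should be checked against an actual Exa result.

```python
def work_history(result: dict) -> list[str]:
    """Flatten a result's work history into 'title at company' strings."""
    roles = []
    for entity in result.get("entities") or []:
        props = entity.get("properties") or {}
        for role in props.get("workHistory") or []:
            # Assumption: each role is a dict with 'title' and 'company' keys.
            title = (role.get("title") or "").strip()
            company = (role.get("company") or "").strip()
            if title or company:
                roles.append(" at ".join(p for p in (title, company) if p))
    return roles
```

Missing or empty fields are tolerated so that one incomplete profile does not abort the run.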

### Stage 2 — Rank

Scores each candidate 0–1 across four weighted keyword groups. The text scored is the concatenation of title, company, work history, and description (lowercased).

For each keyword group, count the keyword hits in the text, normalize the count to 0–1 (three or more hits score 1.0), and multiply by the group weight. The candidate's score is the sum across groups.
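The weighted scoring can be sketched as follows. This is an illustration under assumptions, not the pipeline's actual code: the function name `score_candidate` is hypothetical, a "hit" is taken to mean a keyword that appears at least once as a substring, and normalization is assumed to be linear up to the three-hit cap.

```python
def score_candidate(text: str, weights: dict[str, float],
                    keyword_groups: dict[str, list[str]],
                    saturation: int = 3) -> float:
    """Score lowercased candidate text across weighted keyword groups."""
    lowered = text.lower()
    total = 0.0
    for group, weight in weights.items():
        keywords = keyword_groups.get(group, [])
        # Assumption: each keyword counts once if present as a substring.
        hits = sum(1 for kw in keywords if kw.lower() in lowered)
        # Assumption: linear normalization, capped at 1.0 from `saturation` hits.
        total += weight * min(hits / saturation, 1.0)
    return round(total, 4)
```

For example, with the weights from the config above and two keywords per group, the text "Sales Engineer at a B2B SaaS API company" hits two `saas_b2b` keywords, two `technical_depth` keywords, and one `seniority` keyword, giving 0.30 × 2/3 + 0.30 × 2/3 + 0.15 × 1/3 = 0.45. Substring matching is permissive: a short keyword such as "ai" also matches inside words like "email", so word-boundary matching may be preferable in practice.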

If `negative_keywords` is configured, each match subtracts the flat `negative_penalty` from that sum.

Sort descending by score.

### Stage 3 — Output

Writes a timestamped CSV with columns: `score, name, current_title, current_company, location, work_history, profile_description, profile_url`.
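A minimal sketch of the CSV step is shown below. The function name `write_csv`, the `%Y%m%d-%H%M%S` timestamp format, and joining work-history entries with `"; "` are assumptions; only the column list comes from the description above.

```python
import csv
from datetime import datetime
from pathlib import Path

COLUMNS = [
    "score", "name", "current_title", "current_company",
    "location", "work_history", "profile_description", "profile_url",
]


def write_csv(candidates: list[dict], out_dir: str = "output",
              prefix: str = "candidates") -> Path:
    """Write ranked candidates to a timestamped CSV and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Assumption: the timestamp format is illustrative only.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    path = out / f"{prefix}-{stamp}.csv"
    with path.open("w", newline="", encoding="utf-8") as fh:
        # Unknown keys are ignored; missing keys are written as empty cells.
        writer = csv.DictWriter(fh, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for candidate in candidates:
            row = dict(candidate)
            if isinstance(row.get("work_history"), list):
                row["work_history"] = "; ".join(row["work_history"])
            writer.writerow(row)
    return path
```

The defaults mirror `output.dir` and `output.filename_prefix` from the config.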

Optionally generates an HTML review page via `generate_html.py` with:
- Card grid layout (responsive)
- Score badges (color-coded: green ≥0.8, yellow ≥0.6, gray <0.6)
- Filter by name/title/company
- Sort by score or name
- Expandable work history
- LinkedIn profile links
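The badge thresholds listed above map to a small helper like the one sketched here. The function names and the CSS class names (`badge`, `badge-green`, and so on) are hypothetical; only the thresholds and colors come from this document.

```python
def badge_color(score: float) -> str:
    """Map a 0-1 score to the badge color used on the review page."""
    if score >= 0.8:
        return "green"
    if score >= 0.6:
        return "yellow"
    return "gray"


def render_badge(score: float) -> str:
    """Render a score badge as an HTML span (class names are illustrative)."""
    return f'<span class="badge badge-{badge_color(score)}">{score:.2f}</span>'
```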

## Cron Setup

Schedule as a daily OpenClaw isolated cron job:

```json
{
  "name": "Candidate sourcing daily",
  "schedule": { "kind": "cron", "expr": "0 8 * * *", "tz": "America/New_York" },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "message": "Run the candidate sourcing pipeline: cd <path> && python3 pipeline.py --config config.yaml && python3 generate_html.py. Then read the CSV and send a summary: top 5 candidates with score, name, title, company, and profile URL.",
    "timeoutSeconds": 300,
    "toolsAllow": ["exec", "read", "message"]
  },
  "delivery": { "mode": "announce" }
}
```

Adjust `expr` and `tz` for the target schedule and timezone. The cron job runs in an isolated session, produces the CSV and HTML, and messages the user with a summary.

## File Structure

```
candidate-sourcing/
├── config.yaml          # Editable queries, weights, keywords
├── pipeline.py          # Main pipeline (source → rank → CSV)
├── generate_html.py     # CSV → HTML review page
├── requirements.txt     # requests, pyyaml
└── output/              # Generated CSVs (timestamped)
```

## Customization

- **New role**: change queries and keyword groups in `config.yaml`
- **Geographic filter**: add location terms to queries (e.g. "United States", "Germany")
- **Seniority tuning**: swap seniority keywords for IC terms, or add `negative_keywords` to penalize senior titles
- **More results per query**: increase `exa.num_results`
- **HTML output path**: edit `OUT_PATH` in `generate_html.py` or symlink the output directory to your static file server

## Notes

- The profile URLs Exa returns are LinkedIn profiles. They are public links; the pipeline makes no authenticated LinkedIn API calls.
- The pipeline does not store or cache results between runs. Each run is independent.
- CSV files accumulate in `output/` — clean up periodically or add a retention policy.
- The HTML page is a single static file with no backend and contains no API keys. It does include candidate names and profile links, so decide deliberately whether to expose it publicly.