An HTTP API for running the ML Intern
agent. A request submits a task; the agent plans, writes code, and executes it —
including launching HF Jobs
on cloud hardware — under the namespace of the calling token. Progress is delivered
as a resumable server-sent-event stream; results and artifacts (jobs, trackio
dashboards, pushed repos) are also available by polling.
The surface follows the OpenAI Responses API where applicable
(POST /v1/responses, background, previous_response_id,
response object shape, error envelope) with documented extensions:
max_cost_usd, artifacts[], approval endpoints, and additional
SSE event types. The openai-python SDK works for create/retrieve/cancel via
base_url + extra_body; its typed streaming parser does not accept
the extended event names, so consume SSE directly for streaming.
BASE URL…
Agent runs are long-lived: a turn may take seconds (a question) or hours (training).
Design clients around background: true plus polling or stream resumption.
example turn — SSE
Replay of a representative turn. Event names and payload shapes are documented under /responses/{id}/events.
All /v1 endpoints require a Hugging Face user access token in the
Authorization header:
http
Authorization: Bearer hf_xxxxxxxxxxxxxxxx
Tokens are validated against huggingface.co/api/whoami-v2 (cached for 5 minutes).
Both classic and fine-grained user tokens are accepted; organization tokens are rejected.
There is no cookie or OAuth-redirect flow on this surface.
Required token permissions
Inference Providers — all agent reasoning runs through HF Inference Providers as the caller. A token without this permission fails before session creation with 403 inference_provider_permission_required.
Write access to repos — for pushing models/datasets/Spaces.
Jobs — for launching HF Jobs. Job billing requires credits on the target namespace; without them the job call returns a billing error to the agent.
All compute, inference, and storage initiated by a run is authenticated as — and billed
to — the account behind the token. The server holds the token in memory for the session
lifetime only.
Task: Research diffusion language models for text generation; explain recent changes and cite Hugging Face paper/model pages.
request
{
"model": "moonshotai/Kimi-K2.6:novita",
"input": "Research diffusion language models for text generation. In 5 concise bullets, explain what changed recently, why it matters, and cite 2 relevant Hugging Face paper pages or model pages if available. Keep under 300 words.",
"background": true,
"max_cost_usd": 3.0
}
Task: Compare three Hugging Face ASR choices for fast batch English transcription on one GPU and recommend an implementation path.
request
{
"model": "moonshotai/Kimi-K2.6:novita",
"instructions": "Keep this as a quick model-selection answer. Do not launch broad research sub-agents. Use at most three direct Hugging Face lookups, then answer.",
"input": "Compare these three choices for quickly transcribing batches of English audio on one GPU: openai/whisper-large-v3-turbo, distil-whisper/distil-large-v3, and faster-whisper with large-v3-turbo. Recommend one for speed/accuracy/ease of use and include a short usage snippet. Keep under 400 words.",
"background": true,
"max_cost_usd": 2.0
}
Task: Pick a production embedding and reranker stack for technical-doc RAG, balancing quality and latency.
request
{
"model": "moonshotai/Kimi-K2.6:novita",
"instructions": "Keep this as a quick model-selection answer. Do not launch broad research sub-agents. Use at most four direct Hugging Face lookups, then answer.",
"input": "For a 2026 production RAG system over technical docs, compare these Hugging Face options: Qwen/Qwen3-Embedding-8B, BAAI/bge-m3, jinaai/jina-embeddings-v4, and BAAI/bge-reranker-v2-m3. Recommend an embedding + reranker stack for quality vs latency. Include one short sentence-transformers or transformers usage snippet. Keep under 450 words.",
"background": true,
"max_cost_usd": 2.0
}
completed · 33 s + 11 s · multiturn · resp_2768fb94ff614a3a90a1c455548d767f → resp_29eb917b2e2c4a0fbecdba4aa8303a21
Task: First ask for a RAG embedding recommendation, then continue the same session and ask for code that uses the recommended model.
turn 1 request
{
"model": "moonshotai/Kimi-K2.6:novita",
"instructions": "This is turn 1 of a multiturn API example. Keep it concise. Do not launch jobs or broad research sub-agents. Use direct Hub/model knowledge or at most two direct Hub lookups.",
"input": "For technical-document RAG, compare BAAI/bge-m3 and Qwen/Qwen3-Embedding-8B. Recommend one default embedding model for a startup that cares about good quality but low latency. Keep under 250 words.",
"background": true,
"max_cost_usd": 2.0
}
turn 2 request
{
"model": "moonshotai/Kimi-K2.6:novita",
"previous_response_id": "resp_2768fb94ff614a3a90a1c455548d767f",
"instructions": "This is turn 2 of a multiturn API example. Reuse the prior recommendation; do not restate the comparison. Provide runnable minimal code only plus two setup notes. Do not launch jobs.",
"input": "Using your recommended embedding model from the previous turn, write a minimal Python script that indexes 100 local Markdown files and retrieves the top 5 chunks for a query. Keep it compact.",
"background": true,
"max_cost_usd": 2.0
}
Result: Turn 1 recommended BAAI/bge-m3.
Turn 2 reused that context via previous_response_id and returned a compact sentence-transformers + faiss indexing script without resending the comparison.
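The chaining step in this example can be sketched as a small helper that derives the turn 2 body from the turn 1 response object. This is an illustration only: follow_up_body is a hypothetical name, and the assumptions are that the response object exposes its id under "id" and that a pending approval appears as incomplete_details.reason = "approval_required".

```python
from typing import Any, Dict, Mapping, Optional


def follow_up_body(
    previous: Mapping[str, Any],
    task: str,
    *,
    instructions: Optional[str] = None,
    max_cost_usd: Optional[float] = None,
) -> Dict[str, Any]:
    """Build the request body for the next turn of an existing session."""
    status = previous.get("status")
    reason = (previous.get("incomplete_details") or {}).get("reason")
    # Chaining onto a session that is still processing, or paused on an
    # approval, is documented to return 409, so refuse early on the client.
    if status in ("queued", "in_progress") or reason == "approval_required":
        raise ValueError("previous turn is not finished; the server would answer 409")

    body: Dict[str, Any] = {
        # Assumption: the response object carries its id under "id".
        "previous_response_id": previous["id"],
        "input": task,
        "background": True,
    }
    # "model" is left out on purpose: it is documented as ignored when chaining.
    if instructions is not None:
        body["instructions"] = instructions
    if max_cost_usd is not None:
        # The most recent request's value replaces the session cap.
        body["max_cost_usd"] = max_cost_usd
    return body


turn_1 = {"id": "resp_2768fb94ff614a3a90a1c455548d767f", "status": "completed"}
turn_2_body = follow_up_body(turn_1, "Write a minimal indexing script.", max_cost_usd=2.0)
```

Passing max_cost_usd explicitly on each turn keeps the cap predictable, since the most recent value is the one that applies.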
Task: Summarize the current frontier for sparse autoencoders in mechanistic interpretability and cite recent HF Papers.
request
{
"model": "moonshotai/Kimi-K2.6:novita",
"input": "Research sparse autoencoders (SAEs) for mechanistic interpretability of language models. In 5 concise bullets, explain the current frontier, the main open problem, and cite 2 relevant Hugging Face paper pages if available. Keep under 350 words.",
"background": true,
"max_cost_usd": 3.0
}
Task: Inspect an instruction-tuning dataset and produce a practical one-hour LoRA SFT smoke-test plan.
request
{
"model": "moonshotai/Kimi-K2.6:novita",
"instructions": "Do a practical ML-engineering audit. Use dataset inspection and current HF/TRL knowledge as needed, but keep the final answer concise and do not launch training jobs.",
"input": "Inspect the HuggingFaceH4/ultrachat_200k dataset for supervised fine-tuning viability. Report the available splits, key columns/format, any risks for SFT, and propose a 1-hour LoRA SFT smoke-test plan for Qwen/Qwen3-0.6B using current TRL/Transformers conventions. Keep under 600 words.",
"background": true,
"max_cost_usd": 3.0
}
Result: The agent verified that HuggingFaceH4/ultrachat_200k
has train_sft/test_sft splits of conversational messages, flagged long-sequence and quality-variance risks,
and proposed a LoRA SFTTrainer smoke test for Qwen/Qwen3-0.6B.
Task: Launch a CPU HF Job that fine-tunes distilbert-base-uncased on a small IMDb subset, evaluates it, and pushes a model repo.
request excerpt
{
"model": "moonshotai/Kimi-K2.6:novita",
"instructions": "Launch exactly one CPU-only HF Job using the provided script as inline Python source. Use hardware=cpu-basic and timeout about 30 minutes. Set HUB_MODEL_ID to the requested repo id. Wait for the job to finish, then report the model URL, job URL, and eval metrics.",
"input": "Run this exact CPU-only fine-tuning script as one HF Job and publish the artifact to abidlabs/ml-intern-api-imdb-distilbert-20260613-020123. The script fine-tunes distilbert-base-uncased on a small IMDb subset and pushes the model.",
"background": true,
"max_cost_usd": 15.0
}
incomplete is a resumable pause, not a terminal state:
incomplete_details.reason is either approval_required
(resume via /approvals) or server_restart
(the server restarted mid-turn; previously created artifacts, including running HF
Jobs, remain listed). completed, cancelled, and
failed are terminal.
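A client can turn these status rules into a single decision function. The sketch below is one possible mapping, not part of the API: next_step and its return labels are hypothetical, and what to do after a server_restart pause is an assumption, since only the listing of artifacts is documented for that case.

```python
from typing import Any, Mapping

# Statuses documented as terminal.
TERMINAL = frozenset({"completed", "cancelled", "failed"})


def next_step(response: Mapping[str, Any]) -> str:
    """Map a response object to a client-side action label."""
    status = response.get("status")
    if status in TERMINAL:
        return "done"
    if status == "incomplete":
        reason = (response.get("incomplete_details") or {}).get("reason")
        if reason == "approval_required":
            return "approve"  # resume via the approvals endpoint
        if reason == "server_restart":
            # Assumption: inspect artifacts[] (running HF Jobs stay listed)
            # before deciding how to continue.
            return "inspect_artifacts"
        return "unknown_pause"
    # queued, in_progress, or anything unrecognized: keep polling.
    return "wait"
```

Treating unknown statuses as "wait" keeps the client tolerant of states this section does not list.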
Submits a task. Three execution modes, selected by background and stream:
| mode | flags | behavior |
| --- | --- | --- |
| background | background: true | Returns the response object immediately with status: "queued". The turn runs server-side; poll or attach to the event stream. |
| streaming | stream: true | Returns text/event-stream for this request, ending at a terminal event or pause. |
| synchronous | neither | Blocks up to wait_timeout_seconds, then returns the response object (possibly still in_progress; the run continues server-side). |
Request body
| field | type | description |
| --- | --- | --- |
| input (required) | string or message[] | The task. If a list of {role, content} messages, all but the last are inserted as context and the last is submitted. Max 100,000 chars per message. |
| model | string | Model id from the app's supported list (GET /api/config/model). Unknown ids → 400. Default follows the account plan. Ignored when chaining. |
| background | boolean = false | Run without holding the connection. |
| stream | boolean = false | Stream this turn as SSE. |
| previous_response_id | string | Continue the session of an earlier response. 409 if that session is still processing or paused on an approval. |
| max_cost_usd | number = 5.0 | Session-cumulative auto-approval cap, range (0, 500]. See Cost control. |
| instructions | string | Developer guidance, prefixed to the submitted task. Max 20,000 chars. |
| wait_timeout_seconds | number = 900 | Synchronous mode only; range [1, 3600]. |
| metadata | object | String key/value pairs, echoed back unmodified. |
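The documented limits can be checked on the client before a request is sent. The following is a minimal sketch of such a check: validate_body is a hypothetical helper, the server remains the authority, and message content is assumed to be a plain string.

```python
from typing import Any, List, Mapping

MAX_MESSAGE_CHARS = 100_000
MAX_INSTRUCTIONS_CHARS = 20_000


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_body(body: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems: List[str] = []

    task = body.get("input")
    if task is None:
        problems.append("input is required")
    else:
        # Assumption: each message's content is a plain string.
        contents = [task] if isinstance(task, str) else [m.get("content", "") for m in task]
        if any(len(c) > MAX_MESSAGE_CHARS for c in contents):
            problems.append("a message exceeds 100,000 chars")

    if len(body.get("instructions") or "") > MAX_INSTRUCTIONS_CHARS:
        problems.append("instructions exceed 20,000 chars")

    cap = body.get("max_cost_usd")
    if cap is not None and not (_is_number(cap) and 0 < cap <= 500):
        problems.append("max_cost_usd must be in (0, 500]")

    wait = body.get("wait_timeout_seconds")
    if wait is not None and not (_is_number(wait) and 1 <= wait <= 3600):
        problems.append("wait_timeout_seconds must be in [1, 3600]")

    metadata = body.get("metadata") or {}
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in metadata.items()):
        problems.append("metadata must map strings to strings")

    return problems
```

Returning a list rather than raising on the first problem lets a caller report every issue at once.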
Example
curl
curl -s -X POST …/v1/responses \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Fine-tune a small encoder on imdb as an HF job; push to my namespace",
    "background": true,
    "max_cost_usd": 5.0
  }'
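For callers who prefer Python, the same request can be assembled with the standard library. This is a sketch under stated assumptions: build_create_request is a hypothetical helper, and https://example.invalid is a placeholder, because the base URL is elided in this document.

```python
import json
import urllib.request
from typing import Any, Mapping


def build_create_request(base_url: str, token: str, body: Mapping[str, Any]) -> urllib.request.Request:
    """Assemble (but do not send) a POST to the responses endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/responses",
        data=json.dumps(dict(body)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_request(
    "https://example.invalid",  # placeholder; substitute the real base URL
    "hf_xxxxxxxxxxxxxxxx",
    {
        "input": "Fine-tune a small encoder on imdb as an HF job; push to my namespace",
        "background": True,
        "max_cost_usd": 5.0,
    },
)

# Sending is left commented out because it needs a real base URL and token:
# with urllib.request.urlopen(request) as http_response:
#     response = json.load(http_response)
```

Building the request separately from sending it makes the headers and body easy to inspect or log before any spend occurs.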
Returns the current response object. Status is derived from
the persisted event log: output[] is reconstructed from the turn's events,
artifacts[] aggregated, and usage attached when available.
This endpoint does not require a live runtime session — it works after idle eviction
and across server restarts (with persistence configured; see
Limits & persistence). Requests for responses owned by another
account return 404.
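Because this endpoint keeps working after eviction and restarts, a plain polling loop is a reasonable default for long runs. The sketch below assumes a caller-supplied fetch function that performs the GET and returns the decoded response object; poll_response, the backoff schedule, and the deadline are illustrative choices rather than documented behavior.

```python
import time
from typing import Any, Callable, Mapping


def poll_response(
    fetch: Callable[[str], Mapping[str, Any]],
    response_id: str,
    *,
    interval: float = 2.0,
    max_interval: float = 30.0,
    deadline: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Mapping[str, Any]:
    """Poll until the response leaves queued/in_progress, with exponential backoff."""
    started = clock()
    delay = interval
    while True:
        response = fetch(response_id)
        # Anything other than queued/in_progress is terminal or a pause
        # (incomplete), and both need a decision from the caller.
        if response.get("status") not in ("queued", "in_progress"):
            return response
        if clock() - started >= deadline:
            raise TimeoutError(f"{response_id} still running after {deadline} s")
        sleep(delay)
        delay = min(delay * 2, max_interval)
```

The sleep and clock parameters exist so the loop can be exercised in tests without real waiting.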
response.approval_required
Paused; payload includes the pending action and budget context. Stream ends.
response.completed / .failed / .cancelled
Terminal. Stream ends.
Unrecognized internal events are forwarded as response.<internal_name>
(e.g. response.llm_call telemetry); clients should ignore event names they
don't handle.
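Consuming the stream directly mostly comes down to parsing standard SSE framing and skipping event names without a handler. The sketch below follows the generic text/event-stream rules. The names iter_sse and consume are hypothetical, JSON-encoded data payloads are an assumption, and the way a client passes a resume position back to the server is not covered in this section, so the code only records the last seen id.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, NamedTuple, Optional


class SSEEvent(NamedTuple):
    event: str
    data: str
    id: Optional[str]


def iter_sse(lines: Iterable[str]) -> Iterator[SSEEvent]:
    """Parse text/event-stream lines into events (blank line = dispatch)."""
    event, data, last_id = "message", [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield SSEEvent(event, "\n".join(data), last_id)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, commonly used as keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
        elif field == "id":
            last_id = value


# Events after which the stream is documented to end.
STREAM_END = frozenset({
    "response.completed", "response.failed",
    "response.cancelled", "response.approval_required",
})


def consume(
    lines: Iterable[str],
    handlers: Dict[str, Callable[[Any], None]],
) -> Optional[SSEEvent]:
    """Dispatch known events, ignore the rest, return the stream-ending event."""
    for ev in iter_sse(lines):
        handler = handlers.get(ev.event)
        if handler is not None:
            handler(json.loads(ev.data))  # assumption: payloads are JSON
        if ev.event in STREAM_END:
            return ev
    return None  # connection dropped before a terminal event or pause
```

A None result signals that the connection ended early, which is the point at which a client would reattach or fall back to polling.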
Signals interruption and returns the current snapshot. Cancellation is asynchronous:
the returned object may still read in_progress; the status becomes
cancelled when the interrupt lands (observable via polling or the
response.cancelled event). Idempotent — cancelling a finished response
returns it unchanged.
Cancelling a turn does not kill HF Jobs that were already
launched; manage those at huggingface.co/jobs or via a follow-up task.
Resumes a response paused with incomplete_details.reason = "approval_required".
The same response id continues — pollers and event streams pick up where they left off.
Request body
| field | type | description |
| --- | --- | --- |
| approve (required) | boolean | Applied to the entire pending batch (headless callers approve or deny all pending actions at once). |
| new_max_cost_usd | number | Raises the session cap before resuming. Required in practice when the pause was the cap itself: approving without headroom re-pauses immediately. |
| feedback | string | Passed to the agent with the decision (most useful with approve: false). |
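An approval body built from these three fields might look like the sketch below. approval_body is a hypothetical helper, and applying the (0, 500] range documented for max_cost_usd to new_max_cost_usd is an assumption.

```python
from typing import Any, Dict, Optional


def approval_body(
    approve: bool,
    *,
    new_max_cost_usd: Optional[float] = None,
    feedback: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the JSON body for resuming a response paused on approval."""
    body: Dict[str, Any] = {"approve": bool(approve)}
    if new_max_cost_usd is not None:
        # Assumption: the raised cap obeys the same (0, 500] range as max_cost_usd.
        if not 0 < new_max_cost_usd <= 500:
            raise ValueError("new_max_cost_usd must be in (0, 500]")
        body["new_max_cost_usd"] = float(new_max_cost_usd)
    if feedback:
        body["feedback"] = feedback
    return body


# Approve the whole pending batch and raise the cap so the run does not re-pause.
approve_with_headroom = approval_body(True, new_max_cost_usd=10)

# Deny the batch and tell the agent what to do instead.
deny_with_feedback = approval_body(False, feedback="Use cpu-basic hardware instead.")
```

Because the decision applies to the entire pending batch, a denial with feedback is the only way to steer individual actions.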
Hub resources produced by a turn. Emitted incrementally as
response.artifact.created events and aggregated (deduplicated) on the response
object. Repos created inside HF Jobs produce no in-process events; they are
recovered at turn end from the session's Hub artifact collection.
| type | fields | notes |
| --- | --- | --- |
| hf_job | id, url | A launched HF Job under the caller's namespace. |
| trackio_dashboard | space_id, url, project? | Auto-seeded metrics dashboard Space; embeddable for live training curves. |
| model / dataset / space | repo_id, url | Hub repos created or written by the run. |
| collection | slug, url | The session's artifact collection (groups everything above). |
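A client that collects artifacts from both the event stream and the response object may want to merge and group them. The sketch below is one way to do that: group_artifacts is a hypothetical helper, and it assumes each artifact is a JSON object with a "type" key plus the fields listed for that type, with the first listed field serving as its identity.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping

# Identifying field per artifact type, taken from the fields listed above.
KEY_FIELD = {
    "hf_job": "id",
    "trackio_dashboard": "space_id",
    "model": "repo_id",
    "dataset": "repo_id",
    "space": "repo_id",
    "collection": "slug",
}


def group_artifacts(artifacts: Iterable[Mapping[str, Any]]) -> Dict[str, List[Mapping[str, Any]]]:
    """Group artifacts by type, dropping repeats of the same resource."""
    grouped: Dict[str, List[Mapping[str, Any]]] = defaultdict(list)
    seen = set()
    for artifact in artifacts:
        kind = artifact.get("type")
        # Unknown types fall back to the url as their identity.
        identity = (kind, artifact.get(KEY_FIELD.get(kind, "url")))
        if identity in seen:
            continue
        seen.add(identity)
        grouped[kind].append(artifact)
    return dict(grouped)
```

Deduplicating on the client mirrors the aggregation the server performs on the response object, so streamed and polled views stay consistent.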
Per-user (10 live sessions) or global capacity reached.
503
session_unavailable
Session runtime failed to start; retry.
Failures inside a run (model auth, job billing, tool errors) do not surface as
HTTP errors — the run ends with status: "failed" and a populated
error object, or the agent reports the problem in its output.
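Since a failed run arrives with an ordinary successful HTTP status, clients need to inspect the body themselves. The sketch below raises a Python exception for a failed response. RunFailed and raise_for_run_status are hypothetical names, and the code and message keys on the error object are an assumption based on the OpenAI-style error envelope.

```python
from typing import Any, Mapping


class RunFailed(RuntimeError):
    """Raised when a response object reports status "failed"."""


def raise_for_run_status(response: Mapping[str, Any]) -> Mapping[str, Any]:
    """Return the response unchanged, or raise if the run failed."""
    if response.get("status") == "failed":
        error = response.get("error") or {}
        # Assumption: the error object carries "code" and "message".
        code = error.get("code", "unknown")
        message = error.get("message", "")
        raise RunFailed(f"{code}: {message}")
    return response
```

This does not catch problems the agent only mentions in its output, so a completed status still warrants reading the result.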
API runs execute unattended, so tool calls that would normally require interactive
approval auto-approve under a budget:
max_cost_usd is enforced per session, cumulatively — estimated spend from inference, jobs, and sandboxes accrues against it across all chained responses. The most recent request's value replaces the cap.
When the next action's estimated cost exceeds remaining budget — or accrued spend reaches the cap — the run pauses: status: "incomplete", incomplete_details.reason: "approval_required", and a response.approval_required event with the pending action and budget context.
Resume via /approvals, typically raising the cap. Denial returns control to the agent with your feedback.
Costs are estimates made at approval time; the authoritative figures are those on the
HF account's billing page (settings/billing).
The response object's usage reports the session window's attributed spend.
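The pause rule can be made concrete with a small simulation. This sketch only models the rule as stated here (pause when the next estimate exceeds the remaining budget, or when accrued spend has reached the cap); must_pause and first_pause are hypothetical names, and the server's actual estimator is not described in this section.

```python
from typing import Optional, Sequence


def must_pause(cap_usd: float, accrued_usd: float, next_estimate_usd: float) -> bool:
    """Apply the documented pause rule to a single pending action."""
    remaining = cap_usd - accrued_usd
    return accrued_usd >= cap_usd or next_estimate_usd > remaining


def first_pause(
    cap_usd: float,
    estimates: Sequence[float],
    accrued_usd: float = 0.0,
) -> Optional[int]:
    """Return the index of the first action that would pause the run, if any."""
    for index, estimate in enumerate(estimates):
        if must_pause(cap_usd, accrued_usd, estimate):
            return index
        accrued_usd += estimate  # the action is auto-approved and its cost accrues
    return None


# With a 5.0 cap, actions estimated at 1.0 and 2.0 run; the 3.0 action pauses
# because only 2.0 of budget remains.
paused_at = first_pause(5.0, [1.0, 2.0, 3.0])

# Raising the cap to 10.0 lets the same sequence finish without a pause.
paused_at_higher_cap = first_pause(10.0, [1.0, 2.0, 3.0])
```

Because spend accrues across chained responses, accrued_usd would carry over from earlier turns in the same session rather than restart at zero.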
Concurrency: 10 live sessions per account; one turn at a time per session (concurrent submits → 409).
Idle eviction: sessions idle ≥ 15 min release runtime resources; they restore transparently on the next request to the same session_id.
Input size: 100,000 chars per message; instructions 20,000.
Tool output in output[]: truncated to 4 KB per item (full logs stream via response.tool_log).
Persistence: with a configured event store, events/status/artifacts are durable — streams resume and polling survives restarts. Without it, tracking is in-memory: live streaming works, but replay and restart recovery are unavailable.
Restart mid-turn: the response reports incomplete (server_restart); launched HF Jobs continue on HF infrastructure and remain listed in artifacts[].