ML Intern API

An HTTP API for running the ML Intern agent. A request submits a task; the agent plans, writes code, and executes it — including launching HF Jobs on cloud hardware — under the namespace of the calling token. Progress is delivered as a resumable server-sent-event stream; results and artifacts (jobs, trackio dashboards, pushed repos) are also available by polling.

The surface follows the OpenAI Responses API where applicable (POST /v1/responses, background, previous_response_id, response object shape, error envelope) with documented extensions: max_cost_usd, artifacts[], approval endpoints, and additional SSE event types. The openai-python SDK works for create/retrieve/cancel via base_url + extra_body; its typed streaming parser does not accept the extended event names, so consume SSE directly for streaming.

Agent runs are long-lived: a turn may take seconds (a question) or hours (training). Design clients around background: true plus polling or stream resumption.

Event names and payload shapes are documented under /responses/{id}/events.

Authentication

All /v1 endpoints require a Hugging Face user access token in the Authorization header:

http
Authorization: Bearer hf_xxxxxxxxxxxxxxxx

Tokens are validated against huggingface.co/api/whoami-v2 (cached for 5 minutes). Both classic and fine-grained user tokens are accepted; organization tokens are rejected. There is no cookie or OAuth-redirect flow on this surface.

Required token permissions

  • Inference Providers — all agent reasoning runs through HF Inference Providers as the caller. A token without this permission fails before session creation with 403 inference_provider_permission_required.
  • Write access to repos — for pushing models/datasets/Spaces.
  • Jobs — for launching HF Jobs. Job billing requires credits on the target namespace; without them the job call returns a billing error to the agent.

All compute, inference, and storage initiated by a run is authenticated as — and billed to — the account behind the token. The server holds the token in memory for the session lifetime only.

Examples

Verified against the public Space endpoint. Each example uses background: true, then polls GET /v1/responses/{id} until terminal status.

Research a cutting-edge concept

completed · 43 s · hf_papers · resp_bf64687c764f4d509a134188390a2236

Task: Research diffusion language models for text generation; explain recent changes and cite Hugging Face paper/model pages.

request
{
  "model": "moonshotai/Kimi-K2.6:novita",
  "input": "Research diffusion language models for text generation. In 5 concise bullets, explain what changed recently, why it matters, and cite 2 relevant Hugging Face paper pages or model pages if available. Keep under 300 words.",
  "background": true,
  "max_cost_usd": 3.0
}

Result: The agent found recent DLM work, identified few-step decoding and hybrid plan-and-fill architectures as key shifts, and cited google/diffusiongemma-26B-A4B-it plus T3D on HF Papers.

Find a fast transcription model

completed · 32 s · hub_repo_details · resp_7edc3515343d4eab81eb0fe7b274d316

Task: Compare three Hugging Face ASR choices for fast batch English transcription on one GPU and recommend an implementation path.

request
{
  "model": "moonshotai/Kimi-K2.6:novita",
  "instructions": "Keep this as a quick model-selection answer. Do not launch broad research sub-agents. Use at most three direct Hugging Face lookups, then answer.",
  "input": "Compare these three choices for quickly transcribing batches of English audio on one GPU: openai/whisper-large-v3-turbo, distil-whisper/distil-large-v3, and faster-whisper with large-v3-turbo. Recommend one for speed/accuracy/ease of use and include a short usage snippet. Keep under 400 words.",
  "background": true,
  "max_cost_usd": 2.0
}

Result: The agent recommended faster-whisper with large-v3-turbo, compared it against openai/whisper-large-v3-turbo and distil-whisper/distil-large-v3, and returned a short WhisperModel(...).transcribe(...) snippet.

Choose embedding and reranker models for RAG

completed · 43 s · hub_repo_details · resp_c32ba10ebac6446f83d6e18102f54b44

Task: Pick a production embedding and reranker stack for technical-doc RAG, balancing quality and latency.

request
{
  "model": "moonshotai/Kimi-K2.6:novita",
  "instructions": "Keep this as a quick model-selection answer. Do not launch broad research sub-agents. Use at most four direct Hugging Face lookups, then answer.",
  "input": "For a 2026 production RAG system over technical docs, compare these Hugging Face options: Qwen/Qwen3-Embedding-8B, BAAI/bge-m3, jinaai/jina-embeddings-v4, and BAAI/bge-reranker-v2-m3. Recommend an embedding + reranker stack for quality vs latency. Include one short sentence-transformers or transformers usage snippet. Keep under 450 words.",
  "background": true,
  "max_cost_usd": 2.0
}

Result: The agent recommended BAAI/bge-m3 plus BAAI/bge-reranker-v2-m3 for latency, and Qwen3-Embedding-8B plus the same reranker for maximum quality.

Continue a session with previous_response_id

completed · 33 s + 11 s · multiturn · resp_2768fb94ff614a3a90a1c455548d767f → resp_29eb917b2e2c4a0fbecdba4aa8303a21

Task: First ask for a RAG embedding recommendation, then continue the same session and ask for code that uses the recommended model.

turn 1 request
{
  "model": "moonshotai/Kimi-K2.6:novita",
  "instructions": "This is turn 1 of a multiturn API example. Keep it concise. Do not launch jobs or broad research sub-agents. Use direct Hub/model knowledge or at most two direct Hub lookups.",
  "input": "For technical-document RAG, compare BAAI/bge-m3 and Qwen/Qwen3-Embedding-8B. Recommend one default embedding model for a startup that cares about good quality but low latency. Keep under 250 words.",
  "background": true,
  "max_cost_usd": 2.0
}
turn 2 request
{
  "model": "moonshotai/Kimi-K2.6:novita",
  "previous_response_id": "resp_2768fb94ff614a3a90a1c455548d767f",
  "instructions": "This is turn 2 of a multiturn API example. Reuse the prior recommendation; do not restate the comparison. Provide runnable minimal code only plus two setup notes. Do not launch jobs.",
  "input": "Using your recommended embedding model from the previous turn, write a minimal Python script that indexes 100 local Markdown files and retrieves the top 5 chunks for a query. Keep it compact.",
  "background": true,
  "max_cost_usd": 2.0
}

Result: Turn 1 recommended BAAI/bge-m3. Turn 2 reused that context via previous_response_id and returned a compact sentence-transformers + faiss indexing script without resending the comparison.
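The two-turn flow above can be sketched as a small loop. This is a minimal sketch, not part of the API: `run_chained_turns`, `create`, and `wait` are hypothetical names, with `create` standing in for POST /v1/responses and `wait` for polling GET /v1/responses/{id} until the turn settles. Each turn waits for a completed status before chaining, because chaining onto a session that is still processing returns 409.

```python
from typing import Callable, Dict, List


def run_chained_turns(
    create: Callable[[Dict], Dict],
    wait: Callable[[str], Dict],
    inputs: List[str],
    max_cost_usd: float = 2.0,
) -> List[Dict]:
    """Submit each input as one turn, chaining turns via previous_response_id."""
    previous_id = None
    settled_turns = []
    for text in inputs:
        body = {"input": text, "background": True, "max_cost_usd": max_cost_usd}
        if previous_id is not None:
            # Continue the session of the previous turn.
            body["previous_response_id"] = previous_id
        created = create(body)
        settled = wait(created["id"])
        if settled["status"] != "completed":
            raise RuntimeError(f"turn {settled['id']} ended as {settled['status']}")
        settled_turns.append(settled)
        previous_id = settled["id"]
    return settled_turns


# Stubbed transport so the sketch runs offline (hypothetical helpers).
sent_bodies: List[Dict] = []


def fake_create(body: Dict) -> Dict:
    sent_bodies.append(body)
    return {"id": f"resp_demo{len(sent_bodies)}", "status": "queued"}


def fake_wait(response_id: str) -> Dict:
    return {"id": response_id, "status": "completed"}


turns = run_chained_turns(fake_create, fake_wait, ["first question", "follow-up"])
print([t["id"] for t in turns])
```

In a real client, `create` and `wait` would wrap authenticated HTTP calls; only the second and later bodies carry previous_response_id.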

Research sparse autoencoders for interpretability

completed · 32 s · hf_papers · resp_0a5e9ee6a94a43eda152c4310d7ddab2

Task: Summarize the current frontier for sparse autoencoders in mechanistic interpretability and cite recent HF Papers.

request
{
  "model": "moonshotai/Kimi-K2.6:novita",
  "input": "Research sparse autoencoders (SAEs) for mechanistic interpretability of language models. In 5 concise bullets, explain the current frontier, the main open problem, and cite 2 relevant Hugging Face paper pages if available. Keep under 350 words.",
  "background": true,
  "max_cost_usd": 3.0
}

Result: The agent summarized scaling SAEs to production LLMs, feature-steering fragility, and the interpretation-behavior gap, citing the SAE survey and Coffee/Coffins feature-steering analysis.

Audit a dataset and draft an SFT plan

completed · 143 s · hf_inspect_dataset · resp_c63732bf03fc49b19d7cd141fa5fbd54

Task: Inspect an instruction-tuning dataset and produce a practical one-hour LoRA SFT smoke-test plan.

request
{
  "model": "moonshotai/Kimi-K2.6:novita",
  "instructions": "Do a practical ML-engineering audit. Use dataset inspection and current HF/TRL knowledge as needed, but keep the final answer concise and do not launch training jobs.",
  "input": "Inspect the HuggingFaceH4/ultrachat_200k dataset for supervised fine-tuning viability. Report the available splits, key columns/format, any risks for SFT, and propose a 1-hour LoRA SFT smoke-test plan for Qwen/Qwen3-0.6B using current TRL/Transformers conventions. Keep under 600 words.",
  "background": true,
  "max_cost_usd": 3.0
}

Result: The agent verified HuggingFaceH4/ultrachat_200k has train_sft/test_sft conversational messages splits, flagged long-sequence and quality-variance risks, and proposed a LoRA SFTTrainer smoke test for Qwen/Qwen3-0.6B.

Fine-tune and publish a model artifact

completed · 14 min · hf_jobs · model artifact · resp_518beb5c9e7c4aeb94b67d763183fdae

Task: Launch a CPU HF Job that fine-tunes distilbert-base-uncased on a small IMDb subset, evaluates it, and pushes a model repo.

request excerpt
{
  "model": "moonshotai/Kimi-K2.6:novita",
  "instructions": "Launch exactly one CPU-only HF Job using the provided script as inline Python source. Use hardware=cpu-basic and timeout about 30 minutes. Set HUB_MODEL_ID to the requested repo id. Wait for the job to finish, then report the model URL, job URL, and eval metrics.",
  "input": "Run this exact CPU-only fine-tuning script as one HF Job and publish the artifact to abidlabs/ml-intern-api-imdb-distilbert-20260613-020123. The script fine-tunes distilbert-base-uncased on a small IMDb subset and pushes the model.",
  "background": true,
  "max_cost_usd": 15.0
}

Result: The job published abidlabs/ml-intern-api-imdb-distilbert-20260613-020123 from HF Job 6a2cba84871c005b5352ba24. The final eval accuracy on the 200-example subset was 0.815.

Conventions

  • Request and response bodies are JSON (Content-Type: application/json); streams are text/event-stream.
  • Errors use the envelope {"error": {"message", "type", "code"}}. See Errors.
  • One response corresponds to one agent turn. previous_response_id continues the same underlying session (shared context and budget).
  • Every emitted event has a monotonically increasing sequence number per session, used for stream resumption.
  • Identifiers: responses are resp_<hex>; sessions are UUIDs (exposed as session_id).

Response lifecycle

queued → in_progress → completed | incomplete | cancelled | failed

incomplete is a resumable pause, not a terminal state: incomplete_details.reason is either approval_required (resume via /approvals) or server_restart (the server restarted mid-turn; previously created artifacts, including running HF Jobs, remain listed). completed, cancelled, and failed are terminal.
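A polling client can treat the lifecycle above as two stop conditions: terminal states, and the resumable incomplete pause. The following is a minimal sketch under that reading; `poll_until_settled` and the injected `fetch` callable (standing in for GET /v1/responses/{id}) are hypothetical names, and the interval and poll limit are arbitrary defaults.

```python
import time
from typing import Callable, Dict

TERMINAL_STATUSES = {"completed", "cancelled", "failed"}
PAUSED_STATUSES = {"incomplete"}  # resumable pause, not terminal


def poll_until_settled(
    fetch: Callable[[str], Dict],
    response_id: str,
    interval_s: float = 5.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the response is terminal or paused; return the last snapshot."""
    for _ in range(max_polls):
        snapshot = fetch(response_id)
        status = snapshot["status"]
        if status in TERMINAL_STATUSES or status in PAUSED_STATUSES:
            return snapshot
        sleep(interval_s)
    raise TimeoutError(f"{response_id} did not settle after {max_polls} polls")


# Offline demonstration with a stubbed fetch.
_demo_states = iter(["queued", "in_progress", "in_progress", "completed"])


def fake_fetch(response_id: str) -> Dict:
    return {"id": response_id, "status": next(_demo_states)}


final = poll_until_settled(fake_fetch, "resp_demo", sleep=lambda seconds: None)
print(final["status"])
```

Returning on incomplete lets the caller inspect incomplete_details.reason and decide whether to resolve an approval or wait out a server restart.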

Create a response

POST /v1/responses

Submits a task. Three execution modes, selected by background and stream:

| mode | flags | behavior |
| --- | --- | --- |
| background | background: true | Returns the response object immediately with status: "queued". The turn runs server-side; poll or attach to the event stream. |
| streaming | stream: true | Returns text/event-stream for this request, ending at a terminal event or pause. |
| synchronous | neither | Blocks up to wait_timeout_seconds, then returns the response object (possibly still in_progress; the run continues server-side). |

Request body

| field | type | description |
| --- | --- | --- |
| input (required) | string or message[] | The task. If a list of {role, content} messages, all but the last are inserted as context and the last is submitted. Max 100,000 chars per message. |
| model | string | Model id from the app's supported list (GET /api/config/model). Unknown ids → 400. Default follows the account plan. Ignored when chaining. |
| background | boolean (default false) | Run without holding the connection. |
| stream | boolean (default false) | Stream this turn as SSE. |
| previous_response_id | string | Continue the session of an earlier response. 409 if that session is still processing or paused on an approval. |
| max_cost_usd | number (default 5.0) | Session-cumulative auto-approval cap, range (0, 500]. See Cost control. |
| instructions | string | Developer guidance, prefixed to the submitted task. Max 20,000 chars. |
| wait_timeout_seconds | number (default 900) | Synchronous mode only; range [1, 3600]. |
| metadata | object | String key/value pairs, echoed back unmodified. |
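The limits in the request body fields can be checked client-side before sending. This is a minimal sketch, not server code: `build_create_body` is a hypothetical helper, and it assumes every message's content is a plain string. The server remains the authority on validation.

```python
from typing import Any, Dict, List, Optional, Union

MAX_INPUT_CHARS = 100_000        # per message
MAX_INSTRUCTIONS_CHARS = 20_000
MAX_COST_CAP_USD = 500.0         # max_cost_usd range is (0, 500]


def build_create_body(
    task_input: Union[str, List[Dict[str, str]]],
    *,
    model: Optional[str] = None,
    background: bool = False,
    stream: bool = False,
    previous_response_id: Optional[str] = None,
    max_cost_usd: float = 5.0,
    instructions: Optional[str] = None,
    metadata: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build a POST /v1/responses body, enforcing the documented limits."""
    if isinstance(task_input, str):
        texts = [task_input]
    else:
        texts = [message["content"] for message in task_input]
    if not texts:
        raise ValueError("input must not be an empty message list")
    if any(len(text) > MAX_INPUT_CHARS for text in texts):
        raise ValueError("each message is limited to 100,000 chars")
    if not (0 < max_cost_usd <= MAX_COST_CAP_USD):
        raise ValueError("max_cost_usd must be in (0, 500]")
    if instructions is not None and len(instructions) > MAX_INSTRUCTIONS_CHARS:
        raise ValueError("instructions are limited to 20,000 chars")

    body: Dict[str, Any] = {
        "input": task_input,
        "background": background,
        "stream": stream,
        "max_cost_usd": max_cost_usd,
    }
    optional = {
        "model": model,
        "previous_response_id": previous_response_id,
        "instructions": instructions,
        "metadata": metadata,
    }
    # Omit unset optional fields so server defaults apply.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


example_body = build_create_body("Summarize this dataset", background=True, max_cost_usd=3.0)
print(example_body)
```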

Example

curl
curl -s -X POST /v1/responses \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Fine-tune a small encoder on imdb as an HF job; push to my namespace",
    "background": true,
    "max_cost_usd": 5.0
  }'
200 — application/json
{
  "id": "resp_820438d1de1a453da1d822409188b3e0",
  "object": "response",
  "status": "queued",
  "session_id": "6f9e1d1c-…",
  "max_cost_usd": 5.0,
  "output": [], "artifacts": [], "error": null, …
}

openai-python

python
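import os

from openai import OpenAI

# base_url is the API base URL followed by /v1 (host omitted here, as in the curl examples).
client = OpenAI(base_url="/v1", api_key=os.environ["HF_TOKEN"])

# Create a background response; non-standard fields go through extra_body.
resp = client.responses.create(
    input="fine-tune llama on my dataset",
    background=True,
    extra_body={"max_cost_usd": 20.0},
)

# Retrieve the current snapshot; extension fields such as artifacts live in model_extra.
resp = client.responses.retrieve(resp.id)
print(resp.status, resp.model_extra["artifacts"])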

Retrieve a response

GET /v1/responses/{id}

Returns the current response object. Status is derived from the persisted event log: output[] is reconstructed from the turn's events, artifacts[] aggregated, and usage attached when available.

This endpoint does not require a live runtime session — it works after idle eviction and across server restarts (with persistence configured; see Limits & persistence). Requests for responses owned by another account return 404.

curl
curl -s /v1/responses/$RESPONSE_ID \
  -H "Authorization: Bearer $HF_TOKEN" | jq '{status, artifacts, usage}'

Stream events

GET /v1/responses/{id}/events

Server-sent events for one turn. Each frame is:

text/event-stream
id: 47
event: response.output_text.delta
data: {"type": "response.output_text.delta", "response_id": "resp_…", "sequence_number": 47, "delta": "…"}

Resumption

  • ?starting_after=<seq> (or the standard Last-Event-ID header) replays persisted events after that sequence number, then continues live.
  • Comment frames (: keepalive) are sent every 15 s during quiet periods; parsers ignore them.
  • The stream closes at a terminal event, or at response.approval_required (re-attach after resolving the approval).
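Because the SDK's typed parser does not accept the extended event names, a client has to parse frames itself. The sketch below is one way to do that over an iterable of text lines; `iter_sse_events` and `resume_path` are hypothetical helper names, and the sample payload for response.completed is illustrative rather than a documented shape. It skips comment frames and tracks the last sequence number for resumption.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per SSE frame; comment frames (': keepalive') are skipped."""
    event_id, event_name, data_lines = None, None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current frame.
            if data_lines:
                yield {
                    "id": event_id,
                    "event": event_name,
                    "data": json.loads("\n".join(data_lines)),
                }
            event_id, event_name, data_lines = None, None, []
            continue
        if line.startswith(":"):
            continue  # comment frame
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "id":
            event_id = value
        elif field == "event":
            event_name = value
        elif field == "data":
            data_lines.append(value)


def resume_path(response_id: str, last_sequence_number: int) -> str:
    """Path that replays persisted events after the given sequence number."""
    return f"/v1/responses/{response_id}/events?starting_after={last_sequence_number}"


sample_lines: List[str] = [
    "id: 47\n",
    "event: response.output_text.delta\n",
    'data: {"type": "response.output_text.delta", "response_id": "resp_demo", '
    '"sequence_number": 47, "delta": "Hi"}\n',
    "\n",
    ": keepalive\n",
    "\n",
    "id: 48\n",
    "event: response.completed\n",
    'data: {"type": "response.completed", "response_id": "resp_demo", "sequence_number": 48}\n',
    "\n",
]

events = list(iter_sse_events(sample_lines))
last_seq = max(event["data"]["sequence_number"] for event in events)
print(resume_path("resp_demo", last_seq))
```

On reconnect, a client would request the path from `resume_path` (or send the same number in Last-Event-ID) and feed the new stream's lines back into `iter_sse_events`.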

Event types

| event | payload / semantics |
| --- | --- |
| response.created | Synthetic first frame on POST streams; carries the initial response object. |
| response.in_progress | Turn execution started. |
| response.output_text.delta | {delta}: incremental assistant text. |
| response.output_text.done | Current text segment finished. |
| response.output_item.added | {item}: tool call started (custom_tool_call: id, name, input). |
| response.output_item.done | {item}: tool call finished, with output (truncated to 4 KB). |
| response.tool_log | Incremental tool logs; HF Job logs stream here. |
| response.tool_state.changed | Tool runtime state, e.g. a job entering running with its jobUrl. |
| response.artifact.created | {artifact}; see Artifacts. |
| response.approval_required | Paused; payload includes the pending action and budget context. Stream ends. |
| response.completed / .failed / .cancelled | Terminal. Stream ends. |

Unrecognized internal events are forwarded as response.<internal_name> (e.g. response.llm_call telemetry); clients should ignore event names they don't handle.
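A consumer that follows this advice can dispatch on the event type and silently skip everything else. The sketch below assumes each event is the decoded data payload of a frame, with a type field as in the frame example; `consume_events` and the shape of the returned state are hypothetical.

```python
from typing import Dict, Iterable, List

TERMINAL_EVENTS = {"response.completed", "response.failed", "response.cancelled"}
PAUSE_EVENT = "response.approval_required"


def consume_events(events: Iterable[Dict]) -> Dict:
    """Accumulate text and artifacts until the stream ends; ignore unknown events."""
    state: Dict = {"text": "", "artifacts": [], "ended_by": None}
    for event in events:
        kind = event.get("type")
        if kind == "response.output_text.delta":
            state["text"] += event["delta"]
        elif kind == "response.artifact.created":
            state["artifacts"].append(event["artifact"])
        elif kind in TERMINAL_EVENTS or kind == PAUSE_EVENT:
            state["ended_by"] = kind
            break
        # Any other event name (e.g. response.llm_call) is ignored on purpose.
    return state


sample_events: List[Dict] = [
    {"type": "response.in_progress"},
    {"type": "response.llm_call"},
    {"type": "response.output_text.delta", "delta": "Hello, "},
    {"type": "response.output_text.delta", "delta": "world"},
    {"type": "response.artifact.created", "artifact": {"type": "hf_job", "id": "job1", "url": "u"}},
    {"type": "response.completed"},
]

result = consume_events(sample_events)
print(result["text"], result["ended_by"])
```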

Cancel a response

POST /v1/responses/{id}/cancel

Signals interruption and returns the current snapshot. Cancellation is asynchronous: the returned object may still read in_progress; the status becomes cancelled when the interrupt lands (observable via polling or the response.cancelled event). Idempotent — cancelling a finished response returns it unchanged.

Cancelling a turn does not kill HF Jobs that were already launched; manage those at huggingface.co/jobs or via a follow-up task.

Resolve an approval

POST /v1/responses/{id}/approvals

Resumes a response paused with incomplete_details.reason = "approval_required". The same response id continues — pollers and event streams pick up where they left off.

Request body

| field | type | description |
| --- | --- | --- |
| approve (required) | boolean | Applied to the entire pending batch (headless callers approve or deny all pending actions at once). |
| new_max_cost_usd | number | Raises the session cap before resuming. Required in practice when the pause was the cap itself: approving without headroom re-pauses immediately. |
| feedback | string | Passed to the agent with the decision (most useful with approve: false). |

curl
curl -s -X POST /v1/responses/$RESPONSE_ID/approvals \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"approve": true, "new_max_cost_usd": 15.0}'

Returns 409 no_pending_approval if nothing is pending.
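An approval body can be assembled with a guard against the re-pause case described above. This is a minimal sketch: `build_approval_body` is a hypothetical helper, and it assumes new_max_cost_usd shares the (0, 500] range documented for max_cost_usd, which the source does not state explicitly.

```python
from typing import Any, Dict, Optional


def build_approval_body(
    approve: bool,
    *,
    new_max_cost_usd: Optional[float] = None,
    current_cap_usd: Optional[float] = None,
    feedback: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a POST /v1/responses/{id}/approvals body."""
    body: Dict[str, Any] = {"approve": bool(approve)}
    if new_max_cost_usd is not None:
        # Assumption: same range as max_cost_usd on create.
        if not (0 < new_max_cost_usd <= 500):
            raise ValueError("new_max_cost_usd must be in (0, 500]")
        # A cap that is not higher than the current one adds no headroom.
        if current_cap_usd is not None and new_max_cost_usd <= current_cap_usd:
            raise ValueError("new_max_cost_usd must exceed the current cap")
        body["new_max_cost_usd"] = new_max_cost_usd
    if feedback:
        body["feedback"] = feedback
    return body


approve_body = build_approval_body(True, new_max_cost_usd=15.0, current_cap_usd=5.0)
deny_body = build_approval_body(False, feedback="Use cpu-basic instead of a GPU.")
print(approve_body, deny_body)
```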

The response object

| field | type | description |
| --- | --- | --- |
| id | string | resp_<hex> |
| object | string | Always "response". |
| status | string | See lifecycle. |
| output | item[] | Ordered turn output: message items (content[].type = "output_text") and custom_tool_call items (name, input, output, status). |
| artifacts | artifact[] | Extension. See Artifacts. |
| usage | object or null | Session-window usage: total_usd, inference_usd, hf_jobs_estimated_usd, token counts. Null if unavailable. |
| error | object or null | {code, message} when status = "failed". |
| incomplete_details | object or null | {reason: "approval_required", approval: {…}} or {reason: "server_restart"}. |
| session_id | string | Extension. Underlying session; shared across chained responses. |
| previous_response_id | string or null | Set when this turn chained an earlier response. |
| max_cost_usd | number | Effective session cap at creation (or as last raised). |
| model, background, instructions, metadata | | As supplied at creation. |
| created_at, completed_at | int or null | Unix seconds. |
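Reading the final answer out of output[] means walking the message items and joining their output_text parts. The sketch below assumes items carry a type field ("message" or "custom_tool_call") and that text parts store their content under a text key, as in the OpenAI Responses API; neither detail is spelled out in the table above, and `output_text` and `tool_call_summary` are hypothetical helpers.

```python
from typing import Dict, List, Tuple


def output_text(response: Dict) -> str:
    """Concatenate all output_text parts from message items, in order."""
    parts: List[str] = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


def tool_call_summary(response: Dict) -> List[Tuple[str, str]]:
    """Return (name, status) for each custom_tool_call item, in order."""
    return [
        (item.get("name", ""), item.get("status", ""))
        for item in response.get("output", [])
        if item.get("type") == "custom_tool_call"
    ]


sample_response = {
    "id": "resp_demo",
    "status": "completed",
    "output": [
        {"type": "custom_tool_call", "name": "hf_papers", "input": "{}", "output": "...",
         "status": "completed"},
        {"type": "message", "content": [
            {"type": "output_text", "text": "Use "},
            {"type": "output_text", "text": "BAAI/bge-m3."},
        ]},
    ],
}

print(output_text(sample_response))
print(tool_call_summary(sample_response))
```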

Artifacts

Hub resources produced by a turn. Emitted incrementally as response.artifact.created events and aggregated (deduplicated) on the response object. Repos created inside HF Jobs produce no in-process events; they are recovered at turn end from the session's Hub artifact collection.

| type | fields | notes |
| --- | --- | --- |
| hf_job | id, url | A launched HF Job under the caller's namespace. |
| trackio_dashboard | space_id, url, project? | Auto-seeded metrics dashboard Space; embeddable for live training curves. |
| model / dataset / space | repo_id, url | Hub repos created or written by the run. |
| collection | slug, url | The session's artifact collection (groups everything above). |

json
"artifacts": [
  { "type": "hf_job", "id": "6843a1…", "url": "https://huggingface.co/jobs/<user>/6843a1…" },
  { "type": "trackio_dashboard", "space_id": "<user>/trackio", "project": "imdb-finetune",
    "url": "https://huggingface.co/spaces/<user>/trackio" },
  { "type": "model", "repo_id": "<user>/distilbert-imdb",
    "url": "https://huggingface.co/<user>/distilbert-imdb" }
]
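Since every artifact type carries a url, a client can index the list by type without knowing each type's other fields. A minimal sketch, using the sample above with its placeholders left as-is; `artifact_urls_by_type` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List


def artifact_urls_by_type(artifacts: List[Dict]) -> Dict[str, List[str]]:
    """Group artifact URLs by their type, preserving first-seen order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for artifact in artifacts:
        grouped[artifact["type"]].append(artifact["url"])
    return dict(grouped)


sample_artifacts = [
    {"type": "hf_job", "id": "6843a1…", "url": "https://huggingface.co/jobs/<user>/6843a1…"},
    {"type": "trackio_dashboard", "space_id": "<user>/trackio", "project": "imdb-finetune",
     "url": "https://huggingface.co/spaces/<user>/trackio"},
    {"type": "model", "repo_id": "<user>/distilbert-imdb",
     "url": "https://huggingface.co/<user>/distilbert-imdb"},
]

urls = artifact_urls_by_type(sample_artifacts)
print(urls["model"])
```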

Errors

json
{ "error": { "message": "…", "type": "invalid_request_error", "code": "…" } }

| status | code | meaning |
| --- | --- | --- |
| 401 | invalid_api_key | Missing/invalid Bearer token, or an organization token. |
| 403 | inference_provider_permission_required | Bearer token is valid but cannot call HF Inference Providers through Router. |
| 400 | model_not_found | Unknown model id. |
| 400 | empty_input | input was an empty message list. |
| 404 | response_not_found | Unknown id, or owned by another account. |
| 409 | previous_response_still_running | Chained session is mid-turn; wait for terminal status. |
| 409 | approval_pending | Session paused; resolve via /approvals first. |
| 409 | no_pending_approval | Approval posted but nothing is pending. |
| 429 / 503 | capacity_exceeded | Per-user (10 live sessions) or global capacity reached. |
| 503 | session_unavailable | Session runtime failed to start; retry. |

Failures inside a run (model auth, job billing, tool errors) do not surface as HTTP errors — the run ends with status: "failed" and a populated error object, or the agent reports the problem in its output.
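For HTTP-level errors, a client can map the status and code to a next step. The mapping below is one possible reading of the table, not documented behavior: `classify_error` and its return labels are hypothetical, and treating capacity and session-startup errors as retryable is a client-side choice.

```python
from typing import Dict

RETRYABLE_CODES = {
    "previous_response_still_running",  # wait for the earlier turn to finish
    "capacity_exceeded",
    "session_unavailable",
}


def classify_error(http_status: int, body: Dict) -> str:
    """Map an error envelope to 'retry', 'resolve_approval', 'fix_token', or 'fail'."""
    error = body.get("error") or {}
    code = error.get("code")
    if code == "approval_pending":
        return "resolve_approval"
    if code in RETRYABLE_CODES or http_status in (429, 503):
        return "retry"
    if http_status in (401, 403):
        return "fix_token"
    return "fail"


def envelope(code: str) -> Dict:
    """Build a sample error envelope for demonstration."""
    return {"error": {"message": "example", "type": "invalid_request_error", "code": code}}


print(classify_error(409, envelope("previous_response_still_running")))
print(classify_error(401, envelope("invalid_api_key")))
```

Retries would normally use a backoff; failures inside a run still need to be read from the response object's status and error fields.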

Cost control

API runs execute unattended, so tool calls that would normally require interactive approval auto-approve under a budget:

  • max_cost_usd is enforced per session, cumulatively — estimated spend from inference, jobs, and sandboxes accrues against it across all chained responses. The most recent request's value replaces the cap.
  • When the next action's estimated cost exceeds remaining budget — or accrued spend reaches the cap — the run pauses: status: "incomplete", incomplete_details.reason: "approval_required", and a response.approval_required event with the pending action and budget context.
  • Resume via /approvals, typically raising the cap. Denial returns control to the agent with your feedback.

Costs are estimates at approval time; authoritative billing is the HF account's (settings/billing). The response object's usage reports the session window's attributed spend.
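The pause rule above can be modeled client-side to reason about how much headroom a cap leaves. This is a sketch of the stated rule only, with made-up dollar amounts; `should_pause` is a hypothetical function and the server's actual estimates may differ.

```python
def should_pause(accrued_usd: float, cap_usd: float, next_action_estimate_usd: float) -> bool:
    """True when accrued spend has reached the cap or the next action exceeds what remains."""
    remaining_usd = cap_usd - accrued_usd
    return accrued_usd >= cap_usd or next_action_estimate_usd > remaining_usd


# Session with a 5.00 USD cap and 3.20 USD already accrued across chained turns.
print(should_pause(3.2, 5.0, 1.5))   # fits in the remaining budget
print(should_pause(3.2, 5.0, 2.5))   # exceeds the remaining budget
print(should_pause(3.2, 15.0, 2.5))  # same action after raising the cap to 15.00
```

The third call mirrors approving with new_max_cost_usd: 15.0, which gives the pending action enough headroom to proceed.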

Limits & persistence

  • Concurrency: 10 live sessions per account; one turn at a time per session (concurrent submits → 409).
  • Idle eviction: sessions idle ≥ 15 min release runtime resources; they restore transparently on the next request to the same session_id.
  • Input size: 100,000 chars per message; instructions 20,000.
  • Tool output in output[]: truncated to 4 KB per item (full logs stream via response.tool_log).
  • Persistence: with a configured event store, events/status/artifacts are durable — streams resume and polling survives restarts. Without it, tracking is in-memory: live streaming works, but replay and restart recovery are unavailable.
  • Restart mid-turn: the response reports incomplete (server_restart); launched HF Jobs continue on HF infrastructure and remain listed in artifacts[].