mirror of
https://github.com/nexu-io/open-design.git
synced 2026-05-31 19:04:39 +07:00
14 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
3395d2c855
|
feat(daemon): implement fal.ai renderer for image + video generation (#1606)
* feat(daemon): implement fal.ai renderer for image + video generation Adds renderFalImage and renderFalVideo backed by the fal queue API (queue.fal.run). Any fal-ai/* model path can be used directly without a catalog entry, enabling the full fal model library without code changes. Catalogued shortcuts are mapped via FAL_ENDPOINTS to their fal-ai/* paths; OD_FAL_MAX_POLL_MS controls the poll ceiling. Expands the fal model catalog with flux-pro-ultra, flux-dev-fal, flux-schnell-fal, ideogram-v3-fal, recraft-v3-fal (images) and veo-3-fal, veo-2-fal, wan-2.1-t2v, wan-2.1-i2v, seedance-1-pro-fal, kling-2.1-t2v-fal (video). Marks fal provider as integrated: true in both daemon and web model registries. * fix(daemon): address fal renderer review comments - Correct Wan 2.1 endpoints: wan-video/v2.1/* → fal-ai/wan-t2v / fal-ai/wan-i2v - Correct Kling 2.1 t2v endpoint: .../pro/... → .../master/text-to-video - Add FAL_IMAGE_USES_ASPECT_RATIO: flux-pro-ultra sends aspect_ratio not image_size - Add FAL_VIDEO_NO_DURATION: Wan models reject the duration field - Add FAL_VIDEO_STRING_DURATION: Veo expects duration as "5s" not 5 - Fix falQueueBase() to use anchored regex replace, avoiding mangled custom base URLs - Do not wrap payload under input — raw fal queue HTTP API expects flat body; the input wrapper is an SDK abstraction only (confirmed by 422 validation error from fal showing prompt missing at body.prompt) * fix(daemon): correct fal queue protocol comment (flat body, no SDK input wrapper) * fix(daemon): clamp Veo duration to valid fal buckets (4s/6s/8s) * fix(daemon): report effective fal Veo duration in providerNote (with snap warning) * fix(daemon): reduce image generation latency from 4m37s to ~73s Five layered fixes targeting the overhead that padded a ~10s fal API call into a 4m37s user-facing wait: 1. Skip DISCOVERY_AND_PHILOSOPHY for media surfaces (image/video/audio). 
The ~3000-token HTML-artifact discovery layer is irrelevant for media generation and forced the agent to parse and override all its rules before dispatching. Removes it from the system prompt entirely for these surfaces; MEDIA_GENERATION_CONTRACT is the sole authority. 2. Broaden the wait-loop contract to cover ALL slow models, not just "Volcengine i2v / hyperframes-html". Any model whose generation exceeds 25s — including fal flux-pro-ultra, Veo, Sora — returns exit 2 from od media generate. The contract now makes this universal and provides a python3-based bash pattern (jq is not guaranteed to be installed on all agent runtimes). 3. Increase od media wait polling budget from 25s to 120s. od media generate keeps its 25s budget for fast feedback; od media wait is purpose-built to sit and poll, so it can safely use the full 2-minute bash-tool window. Reduces re-entries for a 3-minute generation from ~7 to ~2. 4. First fal poll is now immediate instead of always sleeping 3s before the first status check. Saves 3s for all fal jobs. 5. Project metadata no longer emits "(unknown — ask)" for imageModel and aspectRatio when unset. Emits the actual defaults (gpt-image-2, aspect-ratio scene heuristic) so the agent can dispatch without extended reasoning about model selection. Also adds dispatch-immediately defaults and a brief-reply rule (2–3 sentences max after generation). Measured end-to-end on the exact problem prompt before/after: Before: 4m37s (discovery form + 7x LLM re-entries + jq failure) After: ~73s (single bash loop, no question turn, image delivered) * feat(daemon): inject media dispatch hint for non-media project surfaces Agents running inside prototype, deck, and other non-image/video/audio projects previously had no knowledge of `od media generate`, so when asked to create an image with fal they would try to call provider REST APIs directly and ask the user for API keys — even though the daemon already holds credentials in .od/media-config.json. 
Add MEDIA_DISPATCH_HINT to composeSystemPrompt for all non-media surfaces. The hint tells the agent to always route media generation through the daemon dispatcher, and explicitly forbids prompting for API keys. Verified end-to-end: a prototype project generates a 952 KB image via flux-pro-ultra in ~52s with no key errors. * fix(daemon): prevent agent from converting bash env vars to PowerShell syntax MEDIA_DISPATCH_HINT now explicitly labels the shell as POSIX bash and shows the correct $VAR form side-by-side with a warning NOT to use PowerShell $env:VAR. Without this, claude-sonnet running on a Windows host converts the example to PowerShell syntax (`& $env:OD_NODE_BIN`) which then fails at the bash executor with 'syntax error near unexpected token &'. * fix(daemon): add generate→wait loop to MEDIA_DISPATCH_HINT for slow models MEDIA_DISPATCH_HINT previously showed only a bare call. flux-pro-ultra and other slow models always exit 2 after ~25s — without the wait loop the agent would treat exit 2 as a failure and report an error to the user. Replace the single-command example with the canonical generate→wait loop (matching media-contract.ts), add an explicit note that exit 2 means 'keep polling', and reinforce the POSIX bash / no-PowerShell rule directly inside the code block. * fix(daemon): allow fal-ai/* passthrough in media-agent contract The media-agent prompt instructed the agent to warn and substitute the default model for any ID not in the catalogue. This blocked the custom fal-ai/* passthrough path the daemon already supports, so users could not reach uncatalogued fal models from the normal chat flow. Carve out the fal-ai/* exception so the agent passes those IDs through directly instead of warning or substituting. * fix(daemon): align MEDIA_DISPATCH_HINT with exit-0 generate contract media generate now always exits 0 (handoff included). 
The non-media agent hint still checked ec==2 to decide whether to keep polling, so slow fal models (flux-pro-ultra, veo-3-fal) would stop after printing the handoff JSON instead of entering the wait loop. - generate error check: drop the ec!=2 exception (exits 0 always) - while loop: drive on taskId presence, not ec==2; stop on ec==0/5 - footer: remove --surface inference claim; CLI requires it explicitly * fix(guard): add test-fal-webui.ts to e2e scripts allowlist CI failed: guard flagged e2e/scripts/test-fal-webui.ts as an unapproved package-owned entrypoint. Add it to allowedE2eScripts. * fix(daemon): update prompt test expectations to match exit-0 handoff wording The two stale assertions checked for the old generate-exits-2 copy which no longer exists in the contract. Update them to match the current always-exits-0 wording. * fix(daemon): move skipDiscoveryBrief override before discovery block * chore(e2e): remove ad-hoc fal webui test script The script was a one-time developer helper used to manually validate fal image generation through the live UI. It relied on a real fal API key and hardcoded local port, so it cannot participate in the e2e package's fixture/reporting/CI conventions. Removing it per reviewer feedback. - Delete e2e/scripts/test-fal-webui.ts - Remove its guard.ts allowlist entry - Gitignore the file and its screenshots to prevent accidental re-addition * chore: remove accidental local scratch files from branch Remove bash.exe.stackdump (MSYS crash dump) and fix_loop.py (one-off local rewrite helper) — neither is a repo-owned source artifact. * fix(prompts): document fal-ai/* passthrough in non-media dispatch hint Prototype/deck agents now know arbitrary fal-ai/* model ids are valid --model values and should be forwarded as-is, mirroring the exception already present in media-contract.ts. Adds a prompt regression test. 
* fix(daemon): use renderMediaGenerationContract(mediaExecution) for media surfaces --------- Co-authored-by: mrcfps <mrc@powerformer.com> |
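The Veo duration handling described in this commit (duration sent as a string such as "5s", clamped to the 4s/6s/8s buckets, with the effective value reported back) can be sketched roughly as follows. This is a minimal illustration, not the daemon's code: the helper name `snapVeoDuration`, the result shape, and the rule that ties snap to the shorter bucket are all assumptions.

```typescript
// Hypothetical helper. Only the bucket values (4s/6s/8s) and the string wire
// format come from the commit message; everything else is illustrative.
const VEO_DURATION_BUCKETS = [4, 6, 8] as const;

interface SnappedDuration {
  wire: string;     // value sent to fal, e.g. "6s"
  seconds: number;  // effective duration after snapping
  snapped: boolean; // true when the requested value had to be adjusted
}

function snapVeoDuration(requestedSeconds: number): SnappedDuration {
  let best: number = VEO_DURATION_BUCKETS[0];
  for (const bucket of VEO_DURATION_BUCKETS) {
    // Strict "<" keeps the earlier (shorter) bucket on ties: an assumption.
    if (Math.abs(bucket - requestedSeconds) < Math.abs(best - requestedSeconds)) {
      best = bucket;
    }
  }
  return { wire: `${best}s`, seconds: best, snapped: best !== requestedSeconds };
}
```

A caller could use the `snapped` flag to decide whether to attach a warning to the provider note, which is the behaviour the "report effective fal Veo duration" fix describes.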
||
|
|
881571dea7
|
fix(media): route custom-image edits through images API (#3087)
* fix(media): route custom-image edits through images API
* fix(media): normalize custom-image endpoint suffixes
---------
Co-authored-by: Artist Ning <dingkuake@yeah.net>
Co-authored-by: Siri-Ray <2667192167@qq.com>
|
||
|
|
6c6ee44e07
|
docs(media): clarify custom providers and ComfyUI status (#479) (#2942)
Closes #479
Generated-By: looper 0.9.0 (runner=worker, agent=opencode)
|
||
|
|
210b94069a
|
feat(senseaudio): BYOK chat with image + video generation tools (#2065)
* feat(senseaudio): BYOK chat with image + video generation tools
Adds SenseAudio as a first-class BYOK chat protocol and wires the daemon's
chat proxy with a tool loop so BYOK users can generate images and videos
without dropping to a CLI agent.
- BYOK protocol: new senseaudio tab + /api/proxy/senseaudio/stream route +
connection-test + provider-models discovery (OpenAI-compatible wire)
- Tool loop: generate_image (synchronous /v1/image/sync) and generate_video
(async /v1/video/create + 5s polling /v1/video/status, 10-min ceiling,
periodic progress log every 30s)
- Settings dropdown + chat-composer dropdown for the BYOK image model
default; generate_image's model enum lets the LLM override per call
- Seed-on-success: a successful BYOK chat call idempotently mirrors the
key into media-config (preserves env-resolved + already-stored keys)
- Generated artifacts land in <projectsRoot>/<projectId>/ so FileViewer,
DesignFilesPanel, and project export pick them up automatically;
legacy /api/byok-image/:id route kept for old conversation links
- Markdown renderer learns `![alt](url)` image syntax with a scheme
allowlist (http(s) / data:image/ / blob: / relative paths)
- i18n key settings.byokImageModel across all 19 locales
- 3 SenseAudio image models registered (2.0, 1.0, doubao-seedream-5.0);
1 video model (doubao-seedance-2.0)
- Tests: byok-tools (29), media-senseaudio-image (8), media-config seed
(7), proxy-routes (47), markdown image rendering (8)
* fix(senseaudio): unblock image gen + design file preview switching
- SenseAudio /v1/image/sync rejected the previous size mapping with
`参数错误:size` (parameter error: size; 1664x936, 936x1664, 1280x960, 960x1280 are not in
the gateway's accepted set). Switched to standard HD / SD sizes that
every aspect bucket can hit: 1024×1024, 1280×720, 720×1280,
1024×768, 768×1024. Kept the byok-tools and media.ts tables in sync
so the BYOK chat tool and the CLI agent path both stop failing on
non-square aspects.
- DesignFilesPanel's <DfPreview> was missing a key prop, so React
reused the same iframe DOM node when the user picked a different
file — the src prop changed but the iframe never navigated. Added
key={previewFile.name} so the previous preview unmounts cleanly.
- Updated byok-tools + media-senseaudio-image tests for the new size
expectations.
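The size table this fix describes could look roughly like the sketch below. The five pixel sizes are taken from the commit message; the aspect-ratio keys are inferred from the dimensions, and the table name, the function name, the ASCII `x` wire format, and the square fallback are assumptions made for illustration.

```typescript
// Hypothetical table: sizes are from the commit message, keys are inferred
// from the width/height ratios, and the "WxH" string format is assumed.
const SENSEAUDIO_SIZES: Record<string, string> = {
  "1:1": "1024x1024",
  "16:9": "1280x720",
  "9:16": "720x1280",
  "4:3": "1024x768",
  "3:4": "768x1024",
};

// Falls back to the square size for unknown aspects (an assumption).
function sizeForAspect(aspect: string): string {
  return SENSEAUDIO_SIZES[aspect] ?? "1024x1024";
}
```

Keeping a single table like this shared between the chat tool and the CLI path is one way to avoid the drift the commit mentions between byok-tools and media.ts.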
* docs(senseaudio): clear stale provider hint + update README
- Settings → Media → SenseAudio: clear the auto-promoted
"Image · TTS · 70+ voices · clone" hint; the provider label alone is
enough now that the BYOK chat surface covers image + video tooling.
- README: list the new senseaudio (and missing ollama) proxy routes so
the BYOK section reflects what the daemon actually serves, and
mention the generate_image / generate_video chat tools that ship
with the SenseAudio path.
* fix(senseaudio): address PR #2065 review feedback
Three non-blocking review notes from @PerishCode on PR #2065:
1. Drop the dead /api/byok-image/:id route. The PR description claimed
it was "legacy fallback for old chat history" but that storage
layout never existed on main, so the route can only ever 400 or
404 — never 200. Removed the handler, the isSafeByokImageId
export, the unused createReadStream / stat / path / Request /
Response imports, and the two byok-image regression tests.
2. Add rejectProxyPluginContext guard to the senseaudio proxy
handler so it matches the invariant the other five proxy paths
already enforce (plugin runs must go through /api/runs for
snapshot pinning). Extended the existing "API fallback rejects
plugin runs" describe to also cover /api/proxy/senseaudio/stream
with the 409 PLUGIN_REQUIRES_DAEMON expectation.
3. Wrap the secondary image / video downloads (the URLs the
SenseAudio gateway hands back in /v1/image/sync .url and
/v1/video/status .video_url) in validateBaseUrlResolved so a
malicious gateway can't point us at 169.254.169.254 (AWS / Azure
metadata) or RFC1918 hosts via the response payload. Also passed
`redirect: 'error'` on both fetches to match the SSRF posture
the primary proxy fetch already uses. The new
assertExternalAssetUrl helper lives next to executeGenerateImage
so future tool downloads can reuse it.
Tests: 120/120 daemon tests pass; guard + typecheck green.
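The SSRF concern in item 3 comes down to refusing download URLs whose host is a cloud-metadata or private-network address. A minimal sketch of that classification is shown below. It is not the daemon's `validateBaseUrlResolved` or `assertExternalAssetUrl`: the function names are hypothetical, and the sketch only inspects literal IPv4 hosts, whereas a real guard would also resolve DNS names and handle IPv6.

```typescript
// Hypothetical helpers for illustration only.
// Parses a dotted-quad host into four octets, or returns null.
function parseIPv4(host: string): number[] | null {
  const parts = host.split(".");
  if (parts.length !== 4) return null;
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  return octets.every((o) => o >= 0 && o <= 255) ? octets : null;
}

// True for the ranges named in the commit message: link-local
// (169.254.0.0/16, which contains 169.254.169.254) and RFC 1918 space.
function isBlockedIPv4(host: string): boolean {
  const octets = parseIPv4(host);
  if (octets === null) return false; // not a literal IPv4 address
  const a = octets[0] ?? -1;
  const b = octets[1] ?? -1;
  if (a === 10) return true;                        // 10.0.0.0/8
  if (a === 172 && b >= 16 && b <= 31) return true; // 172.16.0.0/12
  if (a === 192 && b === 168) return true;          // 192.168.0.0/16
  if (a === 169 && b === 254) return true;          // link-local / metadata
  return false;
}
```

Checking the resolved address rather than the hostname matters because a gateway-supplied URL can point a harmless-looking name at one of these ranges.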
* fix(senseaudio): mirror SSRF guard onto renderSenseAudioImage CLI path
Follow-up to
|
||
|
|
56988e406c
|
feat: integrate xAI SuperGrok subscription as a credential source for Grok media + X search (#2134)
* feat(daemon): add xAI OAuth client with PKCE + token storage Wraps mcp-oauth.ts PKCE primitives for xAI's auth.x.ai OAuth server. xAI doesn't speak MCP and doesn't expose Dynamic Client Registration, so issuer / endpoints / client_id / scope / loopback :56121 are hardcoded constants. Adds xai-tokens.ts for persistent storage, mirroring mcp-tokens.ts: atomic write + chmod 0600 + per-dataDir in-memory mutex. Simplified for the single-token case (no per-server-id map). Reference: NousResearch/hermes-agent hermes_cli/auth.py:93-100. PoC reuses Hermes client_id (b1a00492-...); replace before stable release once Open Design has its own. Tests: 11 + 20, all green. tsc --noEmit clean. pnpm guard clean. * feat(daemon): expose xAI Grok models in Hermes runtime fallbackModels Lists grok-4.3, grok-4.20-reasoning, grok-4.20-non-reasoning, and grok-4.20-multi-agent-0309 as discoverable Hermes fallback models. A user who has not installed Hermes yet now sees these xAI options in the model picker, signalling that `hermes auth add xai-oauth` (SuperGrok subscription) or XAI_API_KEY unlocks Grok in Open Design without OD itself implementing OAuth-for-chat. `fetchModels` (which calls `hermes acp` to enumerate the user's actually-installed providers) is unchanged; this list only kicks in when probing fails (e.g. Hermes off PATH). Reference: xAI × Nous Research grok-hermes integration announcement, 2026-05-15. https://x.ai/news/grok-hermes * feat(media): route Grok Imagine through xAI OAuth credentials Adds resolveXAIBearer() — a refresh-aware helper on top of the xai-tokens.json store written by the daemon's OAuth client. Returns a fresh access_token, transparently refreshing in-place when the stored token enters the 120 s expiry skew window. 
Wires it into media-config.ts so the existing Grok provider gets the same OAuth-fallback treatment OD already gives the OpenAI provider: env keys win, then stored Settings keys, then OD-native xAI OAuth, then a borrowed Hermes-side xai-oauth token from ~/.hermes/auth.json. SuperGrok subscribers who already authorized Hermes get OD image / video generation routed through their subscription with zero extra setup. Updates the "no xAI API key" error in renderGrokImage / renderGrokVideo to point at the new OAuth path so users hitting it know they have a zero-cost option. Also exposes mediaConfigDir() so credential helpers next to media-config.json (like xai-tokens.json) reuse the same precedence: OD_MEDIA_CONFIG_DIR > OD_DATA_DIR > <projectRoot>/.od. Tests: 7 new xai-credentials cases (refresh on expiry, refresh failure, missing refresh_token, response without refresh_token) + 8 new media-config Grok OAuth fallback cases (OD-native, Hermes borrow, OD vs Hermes precedence, env precedence, stored precedence, unconfigured, expired-without-refresh). All green; tsc / guard clean. * feat(media): add xAI Grok TTS provider Registers grok-tts in the speech model catalog and wires up renderXAITTS to dispatch (provider=grok, surface=audio, kind=speech) to https://api.x.ai/v1/tts. xAI exposes a dedicated /tts endpoint that returns raw audio bytes — distinct from OpenAI's /audio/speech JSON shape — so TTS gets its own renderer rather than reusing renderOpenAISpeech. Credentials route through the same OAuth-aware path as Grok image and video (PR follow-up to media-config.ts), so a SuperGrok subscriber gets TTS for free once they have authorized once. Default request body matches the documented minimal shape (text / voice_id / language); sample_rate / bit_rate / codec are left unset so the server applies its mp3 / 24 kHz / 128 kbps defaults. Plumbing for explicit overrides is left for a later PR once the agent-facing contract grows the corresponding flags. 
Tests: 5 cases covering documented body shape, voice / language override, env-key fallback, server-error surfacing, and the no-credentials error. All green; tsc / guard clean. Reference: https://docs.x.ai/developers/model-capabilities/audio/text-to-speech * feat(daemon, web): expose xAI OAuth flow in Settings UI Closes the loop on the Grok integration: a SuperGrok subscriber can now authorize Open Design directly from Settings → Media Providers → Grok, with no API key and no Hermes install. After authorizing, image, video, and TTS routes pick up the bearer through the OAuth fallback chain added in 'route Grok Imagine through xAI OAuth credentials'. Daemon side - xai-oauth-server.ts opens a one-shot HTTP listener on 127.0.0.1:56121 to receive the OAuth callback. The redirect URI is hard-locked to that port because the PoC reuses the Hermes-issued client_id. Listener self-closes on first matching callback or after a 30 min timeout. - xai-routes.ts wires three endpoints onto the daemon's HTTP app: POST /api/xai/oauth/start — mint state, open listener, return authorize URL GET /api/xai/auth/status — has-token / expiry / in-flight POST /api/xai/oauth/disconnect — wipe stored token, stop listener - server.ts registers xai-routes alongside the existing mcp-routes. Web side - XaiOAuthControl.tsx renders a Sign in / Reconnect / Disconnect surface mirroring McpOAuthControl, but polls /api/xai/auth/status exclusively because the :56121 callback page lives in a separate process and can't postMessage back to the OD UI. SettingsDialog embeds it inside the Grok provider row. Tests: 9 listener cases (bind / state mismatch / replay / favicon / EADDRINUSE / timeout / explicit error param / one-shot consume / early stop) + 8 route cases (start mints PKCE URL, second start replaces in-flight listener, status reports listening + connected, callback ok stores token, callback error skips storage, disconnect wipes, cross-origin guard rejects all three endpoints). 
All 17 + the 74 from prior commits pass; tsc / web typecheck / pnpm guard clean. PoC client_id stays Hermes-issued; user-visible strings are hardcoded English pending an i18n pass before stable. * fix(daemon, web): xAI OAuth follow-up — paste-back, X search, UX polish PoC testing surfaced four real-world rough edges in the Sign in flow that were not obvious before getting an actual SuperGrok subscription in front of it. None alter the architecture in 'expose xAI OAuth flow in Settings UI'; they round it off so the path the user actually walks matches the one the design assumed. 1. Layout. XaiOAuthControl was a grid item inside .media-provider-body and got squeezed into the API-key column. Moves it out of the body so the row's flex-column layout gives it the full width — matches what every other Settings provider OAuth surface gets. 2. Paste-back. xAI's `auth.x.ai` page often shows a "cannot connect to your application" fallback that hands the user a code instead of redirecting back to 127.0.0.1:56121, even when the loopback listener is reachable (browser DOES quietly redirect in the background, but the page lies and shows the manual-paste UI anyway). Adds: - POST /api/xai/oauth/complete that takes {state, code} and runs completeXAIAuth + setXAIToken + stops the listener. - A paste-back input row in XaiOAuthControl that surfaces while the dance is in flight; submitting either via Enter or the button calls /complete and falls through to the same connected state the loopback path lands on. 3. X search. New POST /api/xai/search wraps Grok's native x_search tool through the Responses API, gated on the same OAuth-first credential chain as Grok image / video / TTS. Body accepts query (required), allowed_x_handles, excluded_x_handles, from_date, to_date, model. Returns { answer, citations[], model } parsed from the Responses payload via two newly exported helpers (extractAnswerText, extractUrlCitations). 4. State machine + warning banner. 
Three issues collapsed into one: - Polling that flipped busy → 'idle' the moment the loopback listener self-closed disabled the paste-back input even though the dance was still recoverable. Removed that branch; awaiting state now only ends on connected=true or explicit cancel. - paste-input `disabled` was over-eager (`busy !== 'awaiting' && busy !== 'refreshing'`); now it's only blocked while a submit is in flight (`busy === 'refreshing'`). - Added a heads-up banner inside the awaiting region explaining that xAI's "cannot connect" page is a UX bug on their side and the OD panel is the source of truth for sign-in success. The connected message picks up the cue too: "You can close any open xAI browser tabs now." Tests: +12 cases on top of the existing 17. The complete endpoint covers happy path, blank-field rejection, and unknown-state error. The search endpoint covers blank-query rejection, no-credentials 401, full bearer / x_search-options forwarding with response parsing, and upstream-error pass-through. Two helper functions get four direct parser cases. All 29 in the file pass; 225 across the daemon test suite pass; tsc / web tsc / pnpm guard all clean. * fix(daemon): satisfy tsconfig.tests.json strictness in xai test files The CI workspace typecheck step runs tsconfig.tests.json (which extends tsconfig.json's strict + exactOptionalPropertyTypes settings and adds the tests/ directory to the include set) — but the local `tsc -p tsconfig.json --noEmit` I ran while iterating only covered src/. That gap let two classes of strict-mode errors slip into the PR's CI: - `let outcome: CallbackOutcome | null = null` mutated from inside an async callback narrowed to `never` after `outcome?.kind` because TS doesn't track cross-function mutation. Switched the seven sites in xai-oauth-server.test.ts to a `{ current: CallbackOutcome | null }` ref object — TS does narrow .current correctly, so `kind` / `error` field access stops collapsing to `never`. 
- `await r.json()` returns `Promise<unknown>` in the lib.dom typings shipped with TS 5.x, so every `body.field` / `status.connected` access in xai-routes.test.ts tripped TS18046. Added a one-line `jsonOf<T = any>` helper at the top of the file and switched all call sites (both `await r.json()` and `.then((r) => r.json())`). - The cross-origin guard test iterated `for (const [method, path] of [...])` — under noUncheckedIndexedAccess that destructures to `string | undefined`, which RequestInit.method (a `string` under exactOptionalPropertyTypes) won't accept. Hoisted the cases to a typed `ReadonlyArray<readonly [string, string]>` so the elements stay non-optional. Behaviour is unchanged; vitest still reports 29/29 across these two files. tsc -p tsconfig.tests.json --noEmit now passes locally, matching what CI will run. * fix(xai-oauth): preserve refresh_token + release :56121 on cancel Two lifecycle issues Looper flagged on the prior commit: 1. resolveXAIBearer dropped the existing refresh_token whenever the refresh response omitted one. RFC 6749 §6 explicitly allows the server to skip refresh_token rotation and keep the old one valid; xAI's behaviour is currently to rotate, but a future change could silently break OD users. With the old code the first refresh succeeded but persisted a token with no refresh credential, so the next expiry forced the user back through Sign in even though their grant was still good. Carries the previous refresh_token forward when fresh.refresh_token is absent. Updates the matching xai-credentials test to assert the carried-forward value instead of the previous (incorrect) "drop it" assertion. 2. The Cancel button in XaiOAuthControl only cleared React-side pending state; the daemon's one-shot 127.0.0.1:56121 listener kept running for the full 30 min server timeout. /api/xai/auth/status would still report listening=true, and that singleton port could block the next Sign in (or a Hermes session on the same machine). 
Adds POST /api/xai/oauth/cancel that calls stopActiveListener() without touching the stored token (Disconnect is the destructive path; this is the narrow "release the port" affordance), wires the UI Cancel handler to fire it, and adds two route tests covering the listener-stopped-but-token-preserved invariant and the no-op behaviour when no listener is in flight. All 38 xai tests + tsconfig.tests.json typecheck + web typecheck + pnpm guard pass. * fix(xai-oauth): close two more lifecycle gaps Looper flagged Both are non-blocking but cheap and right. 1. window.open used 'noopener=no,noreferrer=no' (carried over from the sibling McpOAuthControl), which deliberately KEEPS the auth.x.ai tab's window.opener reference back to the Settings tab. Reverse tabnabbing risk if the auth page or any redirect target along the OAuth chain ever turns hostile, with no upside — the xAI flow doesn't use postMessage, the daemon receives the code through the :56121 listener (or paste-back), so opener access buys nothing. Switched to 'noopener,noreferrer'. 2. PendingAuthCache was constructed with its default 10 min TTL while the loopback listener self-closes at 30 min and the UI shows a pending state for the same 30 min. After 10 min, a user looking at a live paste-back input would hit `xAI OAuth state not found or expired` even though everything visible (and the daemon socket) still claimed the dance was live. Constructed the cache with 30 * 60 * 1000 so the PKCE state, the open :56121 socket, and the paste-back UI all expire together. The third inline comment (XaiOAuthControl.tsx:248 — "Cancel only clears React-side state") was a stale reference: the previous commit |
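The refresh behaviour described in this commit series (refresh inside a 120 s expiry skew window, and carry the old refresh_token forward when the refresh response omits one) can be sketched as below. This is an illustration under stated assumptions, not the code in xai-credentials: the type names, the `expires_at` field, and both function names are hypothetical; only `access_token`, `refresh_token`, and `expires_in` follow the standard OAuth 2.0 token response.

```typescript
// Hypothetical shapes for illustration.
interface StoredToken {
  access_token: string;
  refresh_token?: string;
  expires_at: number; // epoch milliseconds (assumed storage format)
}

interface RefreshResponse {
  access_token: string;
  refresh_token?: string; // may be omitted when the server does not rotate
  expires_in: number;     // seconds
}

const EXPIRY_SKEW_MS = 120_000; // the 120 s skew window from the commit message

function needsRefresh(token: StoredToken, now: number): boolean {
  return token.expires_at - now <= EXPIRY_SKEW_MS;
}

function mergeRefreshedToken(
  prev: StoredToken,
  fresh: RefreshResponse,
  now: number,
): StoredToken {
  const merged: StoredToken = {
    access_token: fresh.access_token,
    expires_at: now + fresh.expires_in * 1000,
  };
  // Keep the previous refresh_token when the response carries none.
  const refresh = fresh.refresh_token ?? prev.refresh_token;
  if (refresh !== undefined) merged.refresh_token = refresh;
  return merged;
}
```

Dropping the refresh credential here would make the first refresh succeed and the second one impossible, which is the failure mode the commit describes.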
||
|
|
13c8bc4193
|
feat(daemon): add OpenAI-compatible media providers (#1712)
* feat(daemon): add openai-compatible media providers
* fix(web): sync media registry with routed providers
|
||
|
|
f78b0d3a2a
|
feat: add Leonardo.ai image provider integration (#1123)
* feat: add Leonardo.ai image provider integration
Implements Leonardo.ai as a fully supported image provider with the following models:
- Phoenix (leonardo-phoenix) - versatile general-purpose model
- Kino XL (leonardo-kino-xl) - cinematic quality
- FLUX Dev (leonardo-flux-dev) - FLUX.1 [dev]
- FLUX Schnell (leonardo-flux-schnell) - fast generation
- Anime Pastel Dream (leonardo-anime-pastel) - anime style
Features:
- Async generation with polling (2-minute timeout)
- Support for standard aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4
- Bearer token authentication
- Automatic image format detection
Implementation:
- Frontend: apps/web/src/media/models.ts
- Backend: apps/daemon/src/media-models.ts
- Renderer: renderLeonardoImage() function (~130 lines)
- Dispatcher: integrated into media generation pipeline
API Integration:
- Submit: POST /generations
- Poll: GET /generations/{id}
- Response: generations_by_pk.generated_images[0].url
Addresses #984
* fix: Add leonardo to MediaProviderId type and register env vars
1. Add 'leonardo' to MediaProviderId union in apps/web/src/media/models.ts
to fix TypeScript build error
2. Register LEONARDO_API_KEY env vars in apps/daemon/src/media-config.ts
following the same pattern as other providers
This fixes the CI TypeScript build failure and enables proper env-based
credential lookup for Leonardo.ai provider.
---------
Co-authored-by: lefarcen <935902669@qq.com>
|
||
|
|
53148d52c8
|
feat(media): add SenseAudio TTS provider (#1633)
* feat(media): add SenseAudio TTS provider
Add SenseAudio (https://docs.senseaudio.cn) as a new TTS provider alongside ElevenLabs / MiniMax / FishAudio / Volcengine. Surfaced as the `senseaudio-tts` catalogue id, mapped on the wire to `senseaudio-tts-1.5-260319`, SenseAudio's flagship model with emotion / polyphonic-character / formula-reading / clone / text-generated voice support. Scope here is HTTP non-streaming (POST /v1/t2a_v2 with stream=false) only; SSE and WebSocket transports are intentionally out of scope.
- Mirror provider + model entries in apps/daemon and apps/web registries (catalogue drift check stays green).
- ENV_KEYS gets `OD_SENSEAUDIO_API_KEY` / `SENSEAUDIO_API_KEY` so the alias scheme matches every other integrated provider.
- `renderSenseAudioTTS` in media.ts mirrors renderMinimaxTTS: Bearer auth, voice_setting / audio_setting body, hex-decoded audio under `data.audio`, base_resp envelope split from HTTP-level failures.
- NewProjectPanel's audio supportedProviders allowlist now includes `senseaudio` so the picker actually surfaces the new entry.
- Audio shape (mp3 / 32kHz / 128kbps / stereo) and default voice (`female_0033_b`) hard-coded for parity with the other TTS paths; MediaContext is unchanged.
- New apps/daemon/tests/media-senseaudio.test.ts (8 specs) covers defaults, custom voice, default base URL fall-back, env-key path, missing-key error, base_resp failures, missing audio, and HTTP non-2xx; patterned on media-elevenlabs.test.ts.
* docs(media): drop Chinese from SenseAudio provider comment
Translate the model-capabilities line in the SenseAudio block comment (media.ts) into English. Keeping the source comments in a single language matches the rest of the daemon and avoids reviewer churn over mixed-locale prose.
* fix(web): unblock openai and volcengine speech models in audio picker Per review on #1633, supportedModels()'s audio allowlist in NewProjectPanel was still filtering out gpt-4o-mini-tts (openai) and doubao-tts (volcengine) even though both are marked `integrated: true` in the shared media-models catalogue. Add the two ids so the picker matches the registry and the PR body's "alongside doubao-tts" claim holds true. * style(media): normalize speech hints to bare provider names Strip the trailing descriptions on the speech catalogue hints so every entry shows just the provider name (matching FishAudio / ElevenLabs / SenseAudio): `gpt-4o-mini-tts` → "OpenAI", `minimax-tts` → "MiniMax", `doubao-tts` → "Volcengine". Also move `gpt-4o-mini-tts` to the end of the list so the OpenAI entry sits after the upstream-focused providers, matching the recent picker grouping discussion on #1633. Mirrored in both apps/daemon/src/media-models.ts and apps/web/src/media/ models.ts; catalogue drift check + daemon (1848) + web (1150) suites all green. |
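The SenseAudio TTS response handling mentioned in this commit (hex-encoded audio under `data.audio`, with a `base_resp` envelope reported separately from HTTP-level failures) might be sketched like this. The envelope field names `status_code` and `status_msg`, the type name, and both function names are assumptions for illustration; only `data.audio` and `base_resp` are named in the commit message.

```typescript
// Hypothetical response shape; inner base_resp fields are assumed.
interface TtsEnvelope {
  base_resp?: { status_code: number; status_msg?: string };
  data?: { audio?: string };
}

// Decodes a hex string into raw bytes, rejecting malformed input.
function hexToBytes(hex: string): Uint8Array {
  if (hex.length % 2 !== 0 || !/^[0-9a-fA-F]*$/.test(hex)) {
    throw new Error("invalid hex audio payload");
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

// Surfaces envelope-level failures first, then missing audio, then decodes.
function extractAudio(body: TtsEnvelope): Uint8Array {
  const resp = body.base_resp;
  if (resp !== undefined && resp.status_code !== 0) {
    throw new Error(`provider error ${resp.status_code}: ${resp.status_msg ?? "unknown"}`);
  }
  const audio = body.data?.audio;
  if (audio === undefined || audio.length === 0) {
    throw new Error("response contained no audio");
  }
  return hexToBytes(audio);
}
```

Treating the envelope check as its own step mirrors the split the commit describes, since a provider can return HTTP 200 while still reporting a failure in the body.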
||
|
|
4f76e836ae
|
feat(audio): add ElevenLabs audio support (#1384)
* docs: add ElevenLabs audio support design
* docs: add ElevenLabs audio implementation plan
* feat(daemon): add ElevenLabs speech renderer
* feat(daemon): add ElevenLabs sound effects renderer
* fix(daemon): preserve ElevenLabs sfx durations
* feat(web): expose ElevenLabs media providers
* feat(daemon): document ElevenLabs audio contract
* feat(audio): add ElevenLabs voice selection
* chore: ignore superpowers scratch docs
* fix(daemon): cache ElevenLabs voice options
* fix(audio): expand ElevenLabs voice and SFX selection
* fix(audio): align ElevenLabs SFX controls
* fix(audio): tighten ElevenLabs SFX prompt budget
* fix(audio): preflight ElevenLabs SFX prompt length
* fix(audio): surface ElevenLabs lookup failures
* fix(audio): sanitize ElevenLabs prompt errors
|
||
|
|
ef9ca7baff
|
fix(daemon): typecheck core server paths (#952) | ||
|
|
56bf6ee1b6
|
feat: agent-callable research command and /search (#615)
* feat: pre-generation research (Tavily) for grounded generation
Adds an optional pre-generation research step so the agent can produce
slides / prototypes / decks grounded in real sources instead of guessing.
User flow:
1. Settings -> Tavily Search -> paste API key (or set TAVILY_API_KEY).
2. Click the new Research button in the chat composer.
3. On send, the daemon runs a Tavily search, prepends the findings
as a <research_context> block ahead of the system prompt, and
spawns the agent. Research progress shows up as status pills in
the chat stream; the agent cites sources inline as [1]/[2]/...
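Step 3 of the flow above can be sketched roughly as follows. This is an illustration only: `ResearchSource`, `ResearchFindings`, and `prependResearchContext` are hypothetical names, and the exact layout of the `<research_context>` block in the daemon may differ.

```typescript
// Hypothetical shapes; the real DTOs live in the contracts package and may differ.
interface ResearchSource {
  title: string;
  url: string;
}

interface ResearchFindings {
  summary: string; // e.g. Tavily's `answer` field
  sources: ResearchSource[];
}

// Build a <research_context> block and place it ahead of the system prompt.
// Sources are numbered from 1 so the agent can cite them inline as [1], [2], ...
function prependResearchContext(
  systemPrompt: string,
  findings: ResearchFindings | null,
): string {
  // No findings (research disabled or failed): leave the prompt untouched.
  if (!findings || findings.sources.length === 0) return systemPrompt;

  const sourceLines = findings.sources.map(
    (s, i) => `[${i + 1}] ${s.title} - ${s.url}`,
  );
  const block = [
    "<research_context>",
    findings.summary,
    ...sourceLines,
    "</research_context>",
  ].join("\n");

  return `${block}\n\n${systemPrompt}`;
}
```

Returning the prompt unchanged when there are no findings keeps the no-research path identical to the pre-feature behaviour.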
Phase 1 surface:
- Single provider (Tavily), single depth ('shallow'), no LLM
synthesis pass (Tavily's `answer` is the summary).
- Composer toggle only; no popover / depth picker yet.
- Reuses the existing `status` SSE agent payload + StatusPill UI
so no new event variants or renderer code are needed.
Layers touched:
- contracts: ResearchOptions / Source / Findings DTOs;
ChatRequest.research; export from index.
- daemon: apps/daemon/src/research/{index,tavily}.ts orchestrator
+ provider; tavily added to MEDIA_PROVIDERS and ENV_KEYS; hook
in startChatRun before prompt assembly.
- web: ChatComposer toggle + ChatSendMeta; threaded through
ChatPane / ProjectView / streamViaDaemon into ChatRequest.
Side fix (required to land the feature, but useful on its own):
contracts internal relative imports lacked the `.js` suffix that
NodeNext module resolution requires. This was already breaking
`pnpm --filter @open-design/daemon typecheck` on main; without the
fix, none of the new research types were visible to the daemon.
All internal contracts imports now carry `.js`.
Spec: specs/current/research-feature.md (phases 2-4 outlined for
follow-up: composer popover, multi-provider, deep recursion, example
skills with research_recommends).
Verified:
- pnpm --filter @open-design/contracts typecheck/test
- pnpm --filter @open-design/daemon typecheck (the chokidar
project-watchers test is a pre-existing flake, unrelated)
- pnpm --filter @open-design/web typecheck
- node scripts/verify-media-models.mjs
* fix(daemon): clamp Tavily max_results to 20
Tavily's /search endpoint requires `max_results` in [0, 20]; sending a
larger value (e.g. when `research.depth: "deep"` resolves to 30) returns
400 and `runResearch` silently falls back to no-research. Clamp at the
provider boundary so Phase 2 depth tiers above 20 still produce results
instead of failing the request.
Generated-By: looper 0.6.1 (runner=fixer, agent=claude-code)
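A minimal sketch of clamping at the provider boundary, assuming the [0, 20] range stated above. `TAVILY_MAX_RESULTS` and `clampMaxResults` are hypothetical names, not necessarily what the daemon uses.

```typescript
// Upper bound taken from the commit message; the constant name is hypothetical.
const TAVILY_MAX_RESULTS = 20;

// Clamp a requested result count into the range Tavily accepts, [0, 20].
// Non-finite input (NaN, Infinity) falls back to the upper bound.
function clampMaxResults(requested: number): number {
  if (!Number.isFinite(requested)) return TAVILY_MAX_RESULTS;
  return Math.min(TAVILY_MAX_RESULTS, Math.max(0, Math.trunc(requested)));
}
```

With this in place, a depth tier that resolves to 30 is sent as 20 rather than triggering a 400 and a silent fallback to no research.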
* Remove stale research merge leftovers
* Add agent-callable research search
* Fix Indonesian locale typecheck
* Fix research command invocation edge cases
* Harden slash search prompt expansion
* Honor research source caps in command contract
* Require search reports in design files
* Add research data provider settings
* Wire web research provider fallback order
* Update research provider fallback wording
* Revert "Update research provider fallback wording"
This reverts commit
|
||
|
|
f3024fdc22
|
feat(media): add Nano Banana image provider (#631)
* feat(media): add Nano Banana image provider
* fix(media): support Gemini API key headers for Nano Banana
* refactor(media): move Nano Banana model override flag into provider metadata |
||
|
|
9e8177d80a
|
feat(media): integrate xAI Grok Imagine (image + video + native audio) (#276)
* feat(media): integrate xAI Grok Imagine (image + video + native audio)
Adds a real provider integration for xAI's Imagine API alongside OpenAI,
Volcengine and HyperFrames. The route surfaces as two model entries:
* grok-imagine-image — POST /v1/images/generations, synchronous, asks
for b64_json so the bytes arrive in one round-trip; sniffs the
returned magic bytes so JPEG payloads land with the right extension
* grok-imagine-video — POST /v1/videos/generations + GET /v1/videos/{id}
polling, with a 4s tick + onProgress heartbeat (mirrors the
Volcengine handler so the agent's bash watchdog doesn't kill long
polls). Native audio (AAC) ships in the same file as the H.264
video — that's the differentiator vs Seedance and Sora.
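The magic-byte sniffing mentioned for grok-imagine-image could look roughly like the sketch below. `sniffImageExtension` is a hypothetical helper, not the function in media.ts; the signatures checked are the standard PNG, JPEG, and WebP file headers.

```typescript
// Guess a file extension from the leading "magic" bytes of an image payload.
// Falls back to "bin" when no known signature matches.
function sniffImageExtension(
  bytes: Uint8Array,
): "png" | "jpg" | "webp" | "bin" {
  // True when `sig` appears in `bytes` starting at `offset`.
  const startsWith = (sig: number[], offset = 0): boolean =>
    sig.every((b, i) => bytes[offset + i] === b);

  // PNG: 89 50 4E 47 0D 0A 1A 0A
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  // JPEG: FF D8 FF
  if (startsWith([0xff, 0xd8, 0xff])) return "jpg";
  // WebP: "RIFF" at offset 0 and "WEBP" at offset 8
  if (
    startsWith([0x52, 0x49, 0x46, 0x46]) &&
    startsWith([0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return "webp";
  }
  return "bin";
}
```

Sniffing the decoded bytes rather than trusting a declared content type is what lets a JPEG payload be saved with the right extension even if the caller assumed PNG.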
Picker visibility is gated by NewProjectPanel's hardcoded surface→
provider allowlist; grok is added to image + video. Settings UI picks up
the provider automatically off MEDIA_PROVIDERS.
Credentials: XAI_API_KEY (canonical, matches the official SDK) or
OD_GROK_API_KEY override; both are honoured ahead of any value pasted
into Settings, so users who already export XAI_API_KEY don't have to
re-enter it in the UI.
Verified end-to-end with a real key:
* image: 1024×1024 JPEG, single round-trip
* video: 5s 16:9 H.264 + AAC, ~46s wall clock, pending→done
* fix(i18n): include pl in EXPECTED_LOCALES so locales.test passes
Drive-by unblock for CI on this branch — `pl.ts` and `LOCALES`/`Locale`
in types.ts already include Polish, but the test's hardcoded
EXPECTED_LOCALES didn't, so every PR has been red on
locales.test.ts since the locale was added.
* fix(media): grok review feedback — credential precedence comment + poll timeout diagnostics
Two small fixes from PR #276 review:
* media-config.ts — comment claimed XAI_API_KEY won, but readEnvKey
iterates the array in order so OD_GROK_API_KEY actually wins
(matches every other provider's OD_* override convention). Rewrite
the comment to describe what the code does instead of flipping the
array — the precedence is correct, the comment was wrong.
* media.ts renderGrokVideo — single throw at the bottom couldn't tell
"timed out after N seconds with status still pending" from "submit
returned neither inline video nor a request_id to poll." Split into
two branches so operators know whether to bump
OD_GROK_VIDEO_MAX_POLL_MS or file an upstream contract bug.
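Both fixes can be illustrated with a short sketch. The function bodies below are assumptions written for illustration: this `readEnvKey` only mirrors the "iterate in order, first match wins" behaviour described above, and `PollOutcome` / `describePollFailure` are hypothetical names, not the code in media-config.ts or media.ts.

```typescript
// Ordered env lookup: the first name with a non-empty value wins, so listing
// OD_GROK_API_KEY before XAI_API_KEY makes the OD_* override take precedence.
function readEnvKey(
  names: readonly string[],
  env: Record<string, string | undefined>,
): string | undefined {
  for (const name of names) {
    const value = env[name]?.trim();
    if (value) return value;
  }
  return undefined;
}

// Hypothetical summary of how a video render ended without a result.
interface PollOutcome {
  requestId?: string; // absent when submit returned nothing to poll
  lastStatus?: string; // last status seen while polling
  elapsedMs: number;
}

// Produce distinct messages for the two failure modes so operators can tell
// a slow job from a broken upstream contract.
function describePollFailure(outcome: PollOutcome): string {
  if (!outcome.requestId) {
    return "submit returned neither an inline video nor a request_id to poll";
  }
  const seconds = Math.round(outcome.elapsedMs / 1000);
  const status = outcome.lastStatus ?? "unknown";
  return `timed out after ${seconds}s with status still ${status}; consider raising OD_GROK_VIDEO_MAX_POLL_MS`;
}
```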
|
||
|
|
3f266103b0
|
feat(media): port generation workflow onto main (#12)
Co-authored-by: Elian <elian@EliandeMacBook-Pro.local> |