open-design/apps/daemon/tests/sanitize-name.test.ts
Xinmin Zeng 132adac3bb
fix(daemon): preserve non-ASCII filenames on multipart upload (#166)
* fix(daemon): preserve non-ASCII filenames on multipart upload

multer@1 hands callers latin1-decoded multipart filenames, and
sanitizeName() then collapses every non-ASCII character to '_'. A
Chinese DOCX uploaded as 测试文档.docx therefore landed on disk as
____.docx, while the response shipped a latin1-mangled originalName
back to the client. The chat composer compared that to the UTF-8
File.name from the picker, missed the match, and reported "some
files could not be stored" even though the bytes were already on
disk.
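
The mismatch described above can be reproduced without multer. This is a minimal sketch, not the daemon's code: `simulateLatin1Parse` is a hypothetical helper that assumes the parser decodes the raw UTF-8 filename bytes as latin1.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical helper: reproduces what a latin1-decoding parser would hand
// back for a filename whose bytes on the wire were UTF-8.
export function simulateLatin1Parse(utf8Name: string): string {
  return Buffer.from(utf8Name, 'utf8').toString('latin1');
}

// What File.name reports in the browser's file picker.
const pickerName = '测试文档.docx';
const parsedName = simulateLatin1Parse(pickerName);

// Each of the 4 CJK characters is 3 UTF-8 bytes, so they become 12 latin1
// characters; with '.docx' that is 17 characters in total.
console.log(parsedName.length); // 17
// The string comparison the client relies on therefore fails.
console.log(parsedName === pickerName); // false
```

Reinterpreting `parsedName` as latin1 bytes and decoding them as UTF-8 recovers the original name, which is the property the fix relies on.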

Add decodeMultipartFilename() that round-trip-checks latin1->utf8
so genuine latin1 names aren't corrupted, and switch sanitizeName()
to keep \p{L}\p{N} (Unicode letters/digits) while still stripping
path separators, control characters, and Windows-reserved
punctuation. Apply the decoder to all three multer storage filename
callbacks so /api/upload, /api/import/claude-design, and
/api/projects/:id/upload all return UTF-8 originalNames.
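
For illustration, a Unicode-aware sanitiser along these lines might look like the sketch below. It is an assumption-laden approximation, not the code in `projects.ts`: the name `sanitizeNameSketch`, the exact replacement order, and the `file-<timestamp>` fallback are inferred from the test expectations.

```typescript
// Anything that is not a Unicode letter, a Unicode digit, '.', '_' or '-'.
// Built with the RegExp constructor so the sketch also compiles under older
// TypeScript targets that reject the `u` flag on regex literals.
const DISALLOWED = new RegExp('[^\\p{L}\\p{N}._-]', 'gu');

// Hypothetical stand-in for sanitizeName(); behaviour inferred from the tests.
export function sanitizeNameSketch(raw: string): string {
  const cleaned = raw
    .replace(/\s+/g, '-')    // whitespace runs become a single dash
    .replace(DISALLOWED, '_') // separators, control chars, reserved punctuation
    .replace(/^\.+/, '_');    // leading dot runs, so no dotfiles land on disk
  // Assumed fallback shape: a generated name when nothing usable is left.
  return cleaned.length > 0 ? cleaned : `file-${Date.now()}`;
}

console.log(sanitizeNameSketch('测试文档-中文文件名.docx')); // 测试文档-中文文件名.docx
console.log(sanitizeNameSketch('a/b\\c.txt'));              // a_b_c.txt
```

Because `\p{L}` and `\p{N}` match by Unicode category rather than by code-point range, Chinese, Cyrillic, and accented names pass through while `/`, `\`, `:`, `*`, and `?` are still replaced.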

Closes #144

* fix(daemon): address review feedback on filename decoder

Three changes from review:

1. decodeMultipartFilename now bails out early when any code point in
   the input exceeds 0xFF. multer can hand us an already-decoded UTF-8
   string when the client uses the RFC 5987 `filename*` parameter, and
   the previous round-trip-only check would corrupt those names — for
   example, a properly decoded `测试文档.docx` could be re-mangled into
   `KՇc.docx` because the latin1-encoded bytes happened to form a
   valid UTF-8 sequence. Code points above 0xFF can never appear in a
   genuine latin1-decoding-of-utf8 input, so they're an unambiguous
   signal that no further decoding should run.

2. decodeMultipartFilename now handles null/undefined defensively (it
   was relying on multer always populating originalname; this just
   makes the helper safer to reuse from other call sites).

3. The inline sanitiser regex in the `upload` and `importUpload`
   multer instances is replaced with a direct `sanitizeName()` call,
   matching what `projectUpload` already did. This keeps a single
   source of truth for the allowed character set so future tweaks
   only have to land in one place.
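
The first two changes can be sketched as follows. This is a hedged approximation rather than the helper that ships in `projects.ts`: the name `decodeMultipartFilenameSketch` is hypothetical, and the round-trip check is written as a byte comparison, which is one way to implement it.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical stand-in for decodeMultipartFilename().
export function decodeMultipartFilenameSketch(
  name: string | null | undefined,
): string {
  // Defensive default for callers that pass nothing.
  if (name == null) return '';

  // A latin1 decoding can only produce code units 0x00..0xFF. Anything
  // higher means the string was already decoded, so leave it alone.
  for (let i = 0; i < name.length; i++) {
    if (name.charCodeAt(i) > 0xff) return name;
  }

  // Recover the original bytes and try to read them as UTF-8.
  const bytes = Buffer.from(name, 'latin1');
  const decoded = bytes.toString('utf8');

  // Invalid UTF-8 decodes to U+FFFD, which re-encodes to different bytes,
  // so a mismatch means the input was a genuine latin1 name.
  return Buffer.from(decoded, 'utf8').equals(bytes) ? decoded : name;
}

const mangled = Buffer.from('测试文档.docx', 'utf8').toString('latin1');
console.log(decodeMultipartFilenameSketch(mangled));         // 测试文档.docx
console.log(decodeMultipartFilenameSketch('测试文档.docx')); // 测试文档.docx
```

Without the early return, the second call would truncate each code unit to its low byte (0x4B, 0xD5, 0x87, 0x63), which happens to be valid UTF-8 and decodes to `KՇc`.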

Adds two more tests to the existing sanitize-name suite covering the
above-0xFF early return and the null/undefined inputs (13/13 in that
file, 58/58 across the daemon).
2026-04-30 19:49:43 +08:00

import { describe, expect, it } from 'vitest';
import { decodeMultipartFilename, sanitizeName } from '../src/projects.js';

describe('sanitizeName', () => {
  it('keeps ASCII letters, digits, dot, dash, underscore as-is', () => {
    expect(sanitizeName('Report_v2.final-1.pdf')).toBe('Report_v2.final-1.pdf');
  });

  it('collapses whitespace runs to a single dash', () => {
    expect(sanitizeName('Hello World page.html')).toBe('Hello-World-page.html');
  });

  it('preserves Unicode letters/digits (Chinese, Japanese, Cyrillic, accented)', () => {
    expect(sanitizeName('测试文档-中文文件名.docx')).toBe('测试文档-中文文件名.docx');
    expect(sanitizeName('資料.pdf')).toBe('資料.pdf');
    expect(sanitizeName('Cafe-naïveté.docx')).toBe('Cafe-naïveté.docx');
    expect(sanitizeName('документ.txt')).toBe('документ.txt');
  });

  it('replaces path separators with underscore', () => {
    expect(sanitizeName('a/b\\c.txt')).toBe('a_b_c.txt');
  });

  it('replaces reserved punctuation with underscore', () => {
    expect(sanitizeName('a:b*c?d.txt')).toBe('a_b_c_d.txt');
  });

  it('rewrites leading dot runs to underscore so dotfiles cannot land on disk', () => {
    expect(sanitizeName('..hidden.txt')).toBe('_hidden.txt');
  });

  it('falls back to a generated name when the input is empty after cleanup', () => {
    const out = sanitizeName('');
    expect(out).toMatch(/^file-\d+$/);
  });
});

describe('decodeMultipartFilename', () => {
  it('restores UTF-8 names that multer parsed as latin1', () => {
    // multer@1 hands callers the latin1 decoding of the multipart bytes.
    // Encoding the name as UTF-8 and reading those bytes back as latin1
    // simulates that exact input.
    const utf8 = '测试文档-中文文件名.docx';
    const latin1 = Buffer.from(utf8, 'utf8').toString('latin1');
    expect(decodeMultipartFilename(latin1)).toBe(utf8);
  });

  it('leaves genuine latin1 names untouched when bytes do not form valid UTF-8', () => {
    // 0xE9 alone is not valid UTF-8, so keep the raw latin1 representation.
    const latin1Only = Buffer.from([0x43, 0x61, 0x66, 0xe9]).toString('latin1');
    expect(decodeMultipartFilename(latin1Only)).toBe(latin1Only);
  });

  it('round-trips ASCII names without modification', () => {
    expect(decodeMultipartFilename('plain.txt')).toBe('plain.txt');
  });

  it('treats empty input as a no-op', () => {
    expect(decodeMultipartFilename('')).toBe('');
  });

  it('returns input untouched when any code point exceeds 0xff', () => {
    // Simulates multer receiving an RFC 5987 `filename*` parameter and
    // decoding it to UTF-8 itself. Re-decoding would corrupt the name.
    const alreadyDecoded = '测试文档.docx';
    expect(decodeMultipartFilename(alreadyDecoded)).toBe(alreadyDecoded);
  });

  it('handles null and undefined defensively', () => {
    expect(decodeMultipartFilename(null as unknown as string)).toBe('');
    expect(decodeMultipartFilename(undefined as unknown as string)).toBe('');
  });
});