Parsing an Excel file in Node.js is not hard to start and genuinely hard to finish. The first 80% — reading rows from a simple single-sheet workbook — takes 10 minutes with any of the available libraries. The remaining 20% — handling merged cells, multi-sheet workbooks, date serial numbers, formula cells, shared strings, and non-UTF-8 encoded legacy files — takes considerably longer if you haven't mapped out the problem first. This guide covers the full surface area: library selection, sheet enumeration, type coercion, date handling, validation, and the specific patterns that break naive parsers in production.
1. Library Options: xlsx, ExcelJS, and node-xlsx
Three libraries cover the majority of Node.js XLSX parsing use cases. Each makes different tradeoffs between API ergonomics, streaming support, and format fidelity.
`xlsx` (also called SheetJS) is the most widely used. It supports the full OOXML spec, handles .xlsx, .xls, .ods, and .csv, and works in both Node.js and browser environments. The API is low-level — you get raw cell objects with type annotations and raw values, which gives you control but requires more explicit handling. It does not stream; it loads the full workbook into memory.
`exceljs` has a more ergonomic row-iteration API and supports streaming reads via `ExcelJS.stream.xlsx.WorkbookReader`. It's a better choice when you need to process large files without loading the full workbook, or when you need to write XLSX files in addition to reading them. Its type coercion is more opinionated — dates are automatically converted to JavaScript Date objects, which saves work for the common case but can surprise you on edge cases.
`node-xlsx` is a thin wrapper around SheetJS that provides a simpler array-of-arrays interface. It's the fastest to get started with and the least flexible. Use it for quick scripts; use `xlsx` or `exceljs` for production integrations.
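To make the array-of-arrays tradeoff concrete, here is a sketch of working with node-xlsx output. The `require`/`parse` lines (commented out) assume node-xlsx is installed; the `aoaToObjects` helper is plain JavaScript that converts its array-of-arrays shape into keyed row objects.

```javascript
// node-xlsx returns [{ name, data }] where `data` is an array of arrays:
// const xlsx = require('node-xlsx');
// const sheets = xlsx.parse('./products.xlsx');
// const firstSheetData = sheets[0].data;

// Convert an array-of-arrays (first row = headers) into keyed row objects
function aoaToObjects(aoa) {
  const [headerRow, ...dataRows] = aoa;
  const headers = headerRow.map((h) => String(h ?? '').trim());
  return dataRows.map((cells) => {
    const obj = {};
    headers.forEach((header, i) => {
      obj[header] = cells[i] ?? null; // Pad short rows with null
    });
    return obj;
  });
}

const rows = aoaToObjects([
  ['sku', 'price'],
  ['KB-001', 49.99],
  ['KB-002'], // Short row: price comes back as null
]);
console.log(rows);
// → [ { sku: 'KB-001', price: 49.99 }, { sku: 'KB-002', price: null } ]
```

The padding step matters: array-of-arrays output drops trailing empty cells, so rows can have different lengths until you normalize them.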
2. Reading a Basic XLSX File with SheetJS
Install the package and read your first workbook:
const XLSX = require('xlsx');
// Read the workbook from disk
const workbook = XLSX.readFile('./products.xlsx');
// List available sheet names
console.log('Sheets:', workbook.SheetNames);
// → ['Products', 'Categories', 'Sheet3']
// Get the first sheet
const sheetName = workbook.SheetNames[0];
const worksheet = workbook.Sheets[sheetName];
// Convert to array of objects (uses first row as headers)
const rows = XLSX.utils.sheet_to_json(worksheet);
console.log(rows[0]);
// → { product_name: 'Wireless Keyboard', sku: 'KB-001', price: 49.99 }

`sheet_to_json` with no options gives you JavaScript objects keyed by the first row's values. This works well when your header row is the literal first row. When your file has metadata rows above the header, or when headers are in row 3 after two rows of formatting, you need the `range` option.
// Tell SheetJS to start reading from row 3 (0-indexed, so row index 2)
const rows = XLSX.utils.sheet_to_json(worksheet, {
range: 2, // Start from this row index (0 = row 1, 2 = row 3)
defval: null, // Use null for missing cells instead of omitting them
});
// Or specify a cell range explicitly
const rows2 = XLSX.utils.sheet_to_json(worksheet, {
range: 'A3:Z1000',
defval: null,
});

💡 Pro tip
Always pass `defval: null` to `sheet_to_json`. Without it, missing cells are omitted entirely from the row object, giving you rows with inconsistent keys. A row with five columns where the last two are empty comes back as a three-key object. With `defval: null`, every row has the same keys, which makes downstream validation straightforward.
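If you're handed rows that were already parsed without `defval: null`, a normalization pass can restore consistent keys after the fact. This is a plain-JavaScript sketch, independent of any parsing library:

```javascript
// Restore consistent keys on rows parsed without defval: null.
// Collects the union of keys across all rows, then fills gaps with null.
function normalizeKeys(rows) {
  const allKeys = new Set();
  for (const row of rows) {
    for (const key of Object.keys(row)) allKeys.add(key);
  }
  return rows.map((row) => {
    const normalized = {};
    for (const key of allKeys) {
      normalized[key] = key in row ? row[key] : null;
    }
    return normalized;
  });
}

const uneven = [
  { sku: 'KB-001', price: 49.99, color: 'black' },
  { sku: 'KB-002' }, // price and color cells were empty and got omitted
];
console.log(normalizeKeys(uneven)[1]);
// → { sku: 'KB-002', price: null, color: null }
```

Prefer `defval: null` at parse time; use this only when you don't control the parsing call.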
3. Handling Multi-Sheet Workbooks
Most production Excel files have multiple sheets. Your import UI needs to let users specify which sheet contains the data they want to import, or you need to apply heuristics to identify the right sheet automatically.
const workbook = XLSX.readFile(filePath);
// Get sheet metadata
const sheetInfo = workbook.SheetNames.map((name) => {
const ws = workbook.Sheets[name];
const ref = ws['!ref']; // Cell range, e.g. 'A1:H247'
if (!ref) return { name, rowCount: 0 };
const range = XLSX.utils.decode_range(ref);
const rowCount = range.e.r - range.s.r; // Inclusive range has e.r - s.r + 1 rows; minus 1 for the header row
return { name, rowCount };
});
console.log(sheetInfo);
// → [
// { name: 'Products', rowCount: 1243 },
// { name: 'Categories', rowCount: 48 },
// { name: 'Scratch', rowCount: 3 },
// ]
// Heuristic: pick the sheet with the most rows
const targetSheet = [...sheetInfo].sort((a, b) => b.rowCount - a.rowCount)[0].name; // Copy first: .sort() mutates

Exposing a sheet picker in your import UI is better than any heuristic. A dropdown populated from `workbook.SheetNames` with row counts lets the user confirm the right data source in one click. Build this into your import flow even if most users have single-sheet workbooks — the ones with multi-sheet files will thank you.
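Turning the `sheetInfo` array into picker options is a one-liner. The `{ value, label }` shape below is a common UI convention, not something any library requires:

```javascript
// Sketch: build dropdown options for a sheet picker from sheet metadata.
function toSheetOptions(sheetInfo) {
  return sheetInfo.map((s) => ({
    value: s.name,
    label: `${s.name} (${s.rowCount} rows)`,
  }));
}

const options = toSheetOptions([
  { name: 'Products', rowCount: 1243 },
  { name: 'Scratch', rowCount: 3 },
]);
console.log(options[0].label);
// → 'Products (1243 rows)'
```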
4. Date Cells: The Hardest XLSX Problem
Excel stores dates as floating-point serial numbers — the number of days since January 0, 1900 (with a deliberate off-by-one bug inherited from Lotus 1-2-3, which treated 1900 as a leap year). The serial number 45292 corresponds to January 1, 2024. When `sheet_to_json` returns a date cell with default options, it gives you the raw serial number — not a JavaScript Date, not an ISO string.
// Option 1: Use SheetJS's built-in date conversion
const rows = XLSX.utils.sheet_to_json(worksheet, {
raw: false, // Convert dates and numbers to their display representation
defval: null,
});
// Dates come back as formatted strings: '1/1/2024'
// Downside: number formatting is also applied, turning 49.99 into '49.99'
// Option 2: Detect date cells manually and convert
const rows2 = XLSX.utils.sheet_to_json(worksheet, { defval: null });
function serialToDate(serial) {
// 25569 is the Excel serial for 1970-01-01 (the Unix epoch); 86400 * 1000 is ms per day
// Serials before March 1, 1900 are off by one due to Excel's fake 1900 leap day
const date = new Date((serial - 25569) * 86400 * 1000);
return date.toISOString().split('T')[0]; // Returns 'YYYY-MM-DD' (UTC)
}
// Identify which columns are dates by checking cell type in the raw worksheet
function getDateColumns(worksheet) {
const range = XLSX.utils.decode_range(worksheet['!ref']);
const headerRow = range.s.r;
const dateCols = new Set();
// Check first data row's cell types
for (let col = range.s.c; col <= range.e.c; col++) {
const cellAddr = XLSX.utils.encode_cell({ r: headerRow + 1, c: col });
const cell = worksheet[cellAddr];
// Note: cell.z is only populated when the file was read with { cellNF: true }
if (cell && cell.t === 'n' && cell.z && cell.z.includes('yy')) {
dateCols.add(col);
}
}
return dateCols;
}
// Apply conversion
const dateCols = getDateColumns(worksheet);
const processedRows = rows2.map((row, rowIdx) => {
const newRow = { ...row };
dateCols.forEach((colIdx) => {
// Assumes data starts in column A, so key order matches column index
const headers = Object.keys(row);
const key = headers[colIdx];
if (key && typeof newRow[key] === 'number') {
newRow[key] = serialToDate(newRow[key]);
}
});
return newRow;
});

💡 Pro tip
The `cell.z` property on a SheetJS cell object contains the number format string — but only if the workbook was read with the `cellNF: true` option; by default SheetJS discards format strings. A format string containing 'yy', 'mm', or 'dd' patterns indicates a date cell. This is the most reliable way to identify date columns without requiring the user to annotate their schema.
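A bare `includes('yy')` check misses formats like 'd-mmm' and can false-positive on literal text embedded in a format. The sketch below is a slightly more robust heuristic for classifying a `cell.z` format string — still a heuristic, not a full parser of Excel format codes:

```javascript
// Heuristic: does an Excel number format string denote a date/time?
function isDateFormat(z) {
  if (!z) return false;
  const stripped = z
    .replace(/"[^"]*"/g, '') // Drop quoted literal text like "Year: "
    .replace(/\[[^\]]*\]/g, ''); // Drop bracket sections like [Red] or [$-409]
  return /[ymdh]/i.test(stripped); // Date/time tokens: y, m, d, h
}

console.log(isDateFormat('m/d/yyyy')); // → true
console.log(isDateFormat('0.00')); // → false
console.log(isDateFormat('[Red]0.00')); // → false
console.log(isDateFormat('"Year: "0')); // → false
```

Stripping quoted literals first matters: without it, the 'Y' in `"Year: "0` would be mistaken for a year token.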
5. Formula Cells
Formula cells are another source of silent data loss. By default, `sheet_to_json` returns the cached value of a formula — the result that Excel computed and stored when the file was last saved. This is usually what you want. But if the workbook was saved without recalculating formulas (which can happen with xlsm macros or very large files), the cached value may be stale.
// Check if a cell is a formula
const cellAddr = 'B2';
const cell = worksheet[cellAddr];
if (cell && cell.f) {
console.log('Formula:', cell.f); // e.g. 'SUM(A1:A10)'
console.log('Cached value:', cell.v); // The last computed result
console.log('Cell type:', cell.t); // 'n' for number, 's' for string, etc.
}
// For imports, use the cached value unless you have reason to distrust it
// XLSX.utils.sheet_to_json does this automatically

6. Merged Cells
Merged cells are common in human-readable spreadsheets. A header spanning multiple columns, a category label spanning multiple rows — these are all merged cells. SheetJS represents merges in the `worksheet['!merges']` array. When you call `sheet_to_json`, merged cells return their value only at the top-left position; all other positions in the merge return undefined or the `defval` default.
For row-spanning merges (common in 'category' columns where the category name appears once and spans 10 rows), you need to fill down: when a cell value is null/undefined in a column that is known to use merges, carry the previous row's value forward.
function fillDownMergedColumns(rows, mergedColumns) {
const carry = {};
return rows.map((row) => {
const filled = { ...row };
for (const col of mergedColumns) {
if (filled[col] !== null && filled[col] !== undefined && filled[col] !== '') {
carry[col] = filled[col];
} else if (carry[col] !== undefined) {
filled[col] = carry[col];
}
}
return filled;
});
}
// Example: 'category' uses row-spanning merges
const processedRows = fillDownMergedColumns(rawRows, ['category', 'department']);

7. Streaming Large XLSX Files with ExcelJS
SheetJS loads the full workbook into memory before you can access any row. For files with 100K+ rows, this can consume 500MB+ of RAM in a Node.js process. ExcelJS's streaming reader solves this by emitting rows one at a time as they're parsed from the file.
const ExcelJS = require('exceljs');
async function streamXLSX(filePath, onRow) {
const workbook = new ExcelJS.stream.xlsx.WorkbookReader(filePath, {
sharedStrings: 'cache', // Required for correct string handling
hyperlinks: 'ignore',
worksheets: 'emit',
});
for await (const worksheetReader of workbook) {
// Only process the first worksheet (or add sheet selection logic)
if (worksheetReader.name !== 'Products') continue;
let headers = null;
for await (const row of worksheetReader) {
const values = row.values.slice(1); // ExcelJS rows are 1-indexed; slice removes leading undefined
if (!headers) {
headers = values.map((v) => String(v ?? '').trim());
continue;
}
const rowObj = {};
headers.forEach((header, i) => {
rowObj[header] = values[i] ?? null;
});
await onRow(rowObj);
}
break; // Stop after first matching sheet
}
}
// Usage
await streamXLSX('./large-catalog.xlsx', async (row) => {
await validateAndInsert(row);
});

ExcelJS's streaming reader emits rows with constant memory overhead regardless of file size. The tradeoff: you cannot access merged cell metadata or perform two-pass operations (like detecting date columns from cell format strings) without reading the file twice.
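When the `onRow` callback writes to a database, one round-trip per row dominates import time. A common companion pattern is to buffer streamed rows and flush them in batches. The sketch below is plain JavaScript; `flush` stands in for whatever your storage layer provides, and the batch size of 500 is a starting point, not a recommendation from any library:

```javascript
// Buffer rows and flush them in batches to reduce database round-trips.
function createBatcher(flush, batchSize = 500) {
  let buffer = [];
  return {
    async add(row) {
      buffer.push(row);
      if (buffer.length >= batchSize) {
        const batch = buffer;
        buffer = []; // Reset before awaiting so new rows go to a fresh buffer
        await flush(batch);
      }
    },
    async end() {
      // Flush whatever remains after the stream finishes
      if (buffer.length > 0) {
        const batch = buffer;
        buffer = [];
        await flush(batch);
      }
    },
  };
}

// Usage with the streaming reader above (db.insertMany is hypothetical):
// const batcher = createBatcher((batch) => db.insertMany(batch), 500);
// await streamXLSX('./large-catalog.xlsx', (row) => batcher.add(row));
// await batcher.end();
```

Don't forget the `end()` call — without it, the final partial batch is silently dropped.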
8. Type Coercion and Schema Validation
After parsing, every cell value is one of: string, number, boolean, Date (if using ExcelJS), null, or undefined. Your validation layer needs to coerce these to your target schema types and reject rows where coercion fails.
function coerceRow(row, schema) {
const result = {};
const errors = [];
for (const field of schema) {
const raw = row[field.sourceKey ?? field.key];
// Required field check
if (field.required && (raw === null || raw === undefined || raw === '')) {
errors.push({ field: field.key, message: `${field.label} is required` });
continue;
}
if (raw === null || raw === undefined || raw === '') {
result[field.key] = null;
continue;
}
switch (field.type) {
case 'string':
result[field.key] = String(raw).trim();
break;
case 'number': {
const num = typeof raw === 'number' ? raw : parseFloat(String(raw).replace(/[^0-9.-]/g, ''));
if (isNaN(num)) {
errors.push({ field: field.key, message: `${field.label} must be a number` });
} else {
result[field.key] = num;
}
break;
}
case 'boolean': {
const boolStr = String(raw).toLowerCase().trim();
if (['true', 'yes', '1', 'y'].includes(boolStr)) result[field.key] = true;
else if (['false', 'no', '0', 'n'].includes(boolStr)) result[field.key] = false;
else errors.push({ field: field.key, message: `${field.label} must be true/false` });
break;
}
case 'date': {
// Handle both serial numbers (from SheetJS) and date strings
const dateVal = typeof raw === 'number'
? new Date((raw - 25569) * 86400 * 1000)
: new Date(raw);
if (isNaN(dateVal.getTime())) {
errors.push({ field: field.key, message: `${field.label} must be a valid date` });
} else {
result[field.key] = dateVal.toISOString().split('T')[0];
}
break;
}
default:
result[field.key] = raw;
}
}
return { result, errors };
}

9. Using Xlork Instead of Building Your Own Parser
Everything in this guide — sheet enumeration, date serial conversion, merged cell handling, streaming for large files, type coercion, schema validation, user-facing error reporting — is code you have to write, test, and maintain if you build the parsing layer yourself. For one-off scripts and internal tooling, this is often the right call. For a production import feature in a SaaS product, you're solving a generic problem that your users don't care about.
Xlork's importer handles XLSX parsing natively — including all the edge cases in this guide. You define your target schema with field types and validators. Your users upload XLSX files. The AI column mapper matches their headers to your fields, the parser handles date cells and merged headers, and your `onComplete` callback receives clean, typed data. No parser code in your codebase.
💡 Pro tip
Xlork supports XLSX, XLS, CSV, TSV, XML, JSON, and Google Sheets with a single SDK integration. See the format support documentation at xlork.com/docs/formats. The free tier lets you test against your real files before committing to any paid plan.
10. Summary
XLSX parsing in Node.js is manageable once you've mapped the problem space. Use `xlsx` (SheetJS) for general-purpose parsing, `exceljs` for streaming large files. Handle dates explicitly — do not trust `raw: false` to give you ISO strings. Pass `defval: null` to `sheet_to_json`. Implement fill-down logic for merged row spans. Validate and coerce types before database writes. And if you're building a user-facing import feature, weigh the maintenance cost of owning this parser layer against using a library like Xlork that handles it at the infrastructure level.