Determine the Size of a File
A question that looks trivial until symbolic links, sparse files, and streamed sources show up — three ways to ask “how big is this,” and what each answer actually means.
- Why this isn’t quite as simple as it sounds
- Asking the filesystem directly (the right default)
- Code in Python, Java, and C
- The seek-to-end trick, and why it’s worse
- Sparse files and “apparent size” versus actual disk usage
- Symbolic links and what size you actually get
- Sizing something that isn’t a regular file at all
01 Why this isn’t quite as simple as it sounds
Asking “how big is this file” sounds like it should have exactly one correct answer, and for the overwhelming majority of ordinary files, it does. But the question quietly assumes a few things that aren’t always true: that the path points to a regular file rather than a directory, a symbolic link, or a special device file; that the file’s reported size matches the disk space it actually consumes; and that the file isn’t currently being written to by another process, in which case “the size” is a number that could change between the moment you ask and the moment you act on the answer.
Most of the time none of this matters, and a single straightforward call answers the question correctly. But the edge cases are common enough — log files actively being appended to, symbolic links pointing at files in another directory, sparse files used by databases and virtual machine disk images — that it’s worth understanding what’s actually being measured.
02 Asking the filesystem directly (the right default)
Every mainstream operating system already tracks file size as metadata, alongside things like creation time and permissions, in a structure usually called an inode or its equivalent. The correct, efficient way to find a file’s size is almost always to ask the filesystem for that metadata directly, rather than reading through the file’s actual contents to measure it. This is a single, fast metadata lookup — no matter how large the file is, checking its size this way takes the same small, constant amount of time.
03 Code in Python, Java, and C
```python
import os

def file_size(path):
    return os.path.getsize(path)  # a single stat() call under the hood
```
Java’s java.nio.file package, the modern replacement for the older java.io.File class, exposes the same metadata-based lookup through its Files utility class.
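```java
import java.io.IOException;
import java.nio.file.*;

public final class FileSizes {
    public static long fileSize(String path) throws IOException {
        return Files.size(Paths.get(path)); // metadata lookup; follows symlinks
    }
}
```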
In C, the same metadata lives behind the stat() system call, which fills a structure with everything the filesystem knows about a path, including its size in bytes.
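```c
#include <sys/stat.h>

/* Returns the file's size in bytes, or -1 if stat() fails. */
long long file_size(const char* path) {
    struct stat st;
    if (stat(path, &st) != 0) {
        return -1; /* path doesn't exist or isn't accessible */
    }
    return (long long) st.st_size; /* off_t may be wider than long */
}
```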
All three of these, underneath, are doing essentially the same thing: a single call to the operating system’s filesystem layer, asking for metadata rather than reading the file’s actual content. That’s why this approach scales identically whether the file is a few bytes or several gigabytes.
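To make the "metadata, not content" point concrete, here is a minimal Python sketch. The helper name describe() and the keys of the dictionary it returns are illustrative choices, not part of any library; it simply shows that size, file type, and modification time all come back from one os.stat() call.

```python
import os
import stat

def describe(path):
    """Return size, file-type flag, and modification time from one os.stat() call."""
    st = os.stat(path)  # one metadata lookup; the file's contents are never read
    return {
        "size_bytes": st.st_size,
        "is_regular_file": stat.S_ISREG(st.st_mode),
        "modified": st.st_mtime,  # seconds since the epoch
    }
```

The size_bytes value here is the same number os.path.getsize() reports for the same path.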
04 The seek-to-end trick, and why it’s worse
An alternative, and historically common, approach opens the file, seeks all the way to the end, and reads the current position — since the position after seeking to the end is, by definition, the file’s length. It works, but it has real downsides compared to a direct metadata lookup.
```c
#include <stdio.h>

long file_size_via_seek(const char* path) {
    FILE* f = fopen(path, "rb");
    if (!f) return -1;
    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0) {
        size = ftell(f); /* ftell() itself returns -1 on failure */
    }
    fclose(f);
    return size;
}
```
- It requires actually opening the file, which can fail for permission reasons even when a plain metadata lookup would have succeeded, and which holds a file descriptor open briefly for no real reason.
- It doesn’t work reliably on non-seekable sources — pipes, certain sockets, and some special device files don’t support seeking at all.
- On very large files on some older systems, ftell()’s return type can be a source of overflow bugs that a 64-bit-aware stat() call avoids.
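The non-seekable case is easy to see from Python. The sketch below is one possible way to guard the seek-to-end approach; size_via_seek() is a hypothetical helper name, and it assumes the caller passes an already-open binary stream.

```python
import os

def size_via_seek(stream):
    """Return the length of an open binary stream by seeking, or None if it cannot seek."""
    if not stream.seekable():
        return None  # pipes, sockets, and some device files land here
    original = stream.tell()
    size = stream.seek(0, os.SEEK_END)  # seek() returns the new absolute position
    stream.seek(original)               # restore the caller's position
    return size
```

Returning None rather than raising is a design choice for this sketch; it forces the caller to decide what "size" should mean for a stream that has no fixed length.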
05 Sparse files and “apparent size” versus actual disk usage
A sparse file is one where large stretches of “empty” space — long runs of zero bytes — aren’t actually written to disk at all; the filesystem just remembers that those regions exist and are empty, without consuming physical storage for them. Virtual machine disk images and certain database files use this technique heavily.
The size returned by stat() or its equivalents (the st_size field on POSIX systems), often called the “apparent size,” reflects the logical length of the file as if every byte were really stored, which can be dramatically larger than the actual disk space consumed. On POSIX systems the allocated space is reported separately, in the st_blocks field of the same structure. This distinction matters enormously for anything that reports disk usage to a user, since reporting apparent size for a sparse file can wildly overstate how much actual storage it occupies.
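The two numbers can be compared directly in Python. This is a sketch under stated assumptions: it relies on st_blocks, which is available on POSIX systems but not on Windows, and both function names are made up for illustration. Whether the test file really ends up sparse depends on the filesystem.

```python
import os

def apparent_and_allocated(path):
    """Return (apparent size, bytes actually allocated) for a path.

    Assumes a POSIX system: st_blocks is not available on Windows.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512  # st_blocks counts 512-byte units


def make_sparse_file(path, length):
    """Create a file of the given logical length without writing any data."""
    with open(path, "wb") as f:
        f.truncate(length)  # extends the file; many Unix filesystems leave a hole
```

On a filesystem that supports sparse files, calling make_sparse_file() with a large length and then apparent_and_allocated() on the result will typically show an allocated figure far below the apparent size.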
06 Symbolic links and what size you actually get
| Function behavior | What you get for a symlink |
|---|---|
| Follows the link (default in most languages) | The size of the target file the link points to |
| Does not follow the link (a “lstat”-style call) | The size of the link itself — typically just the length of the path string it stores |
Most high-level language functions, including the Python, Java, and C examples above, follow symbolic links by default — asking for the size of a symlink usually returns the size of whatever it points to. This is almost always the behavior you actually want, but every major filesystem API also exposes a non-following variant for the rarer cases where the link’s own metadata is what’s actually needed.
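In Python the two behaviors in the table map onto os.stat() and os.lstat(). The wrapper names below are hypothetical and exist only to label which call does what.

```python
import os

def size_following_links(path):
    """Size of whatever the path ultimately points to (same as os.path.getsize)."""
    return os.stat(path).st_size   # follows symbolic links

def size_of_link_itself(path):
    """Size of the link's own entry; for a regular file this equals the file size."""
    return os.lstat(path).st_size  # does not follow symbolic links
```

For a path that is not a symbolic link, both functions return the same value, so the distinction only matters when links are actually present.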
07 Sizing something that isn’t a regular file at all
- Directories report a size too, but it reflects the storage used by the directory’s own metadata structure, not the combined size of everything inside it — summing a directory’s contents requires walking it recursively and adding up each entry.
- Pipes and sockets generally don’t have a meaningful “size” at all in the file sense — there’s no fixed length, only a stream of bytes that may or may not have more data coming.
- Streamed network resources, like an HTTP download, sometimes report an expected size in advance via a header, but that’s a claim from the far end rather than a filesystem fact.
- A file actively being written by another process can report a different size on two consecutive checks — there’s no way to “freeze” a size value as definitively final unless the writing process has actually finished and closed the file.
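The recursive walk mentioned for directories can be sketched in a few lines of Python. The function name tree_size() is illustrative, and two choices here are assumptions rather than the only reasonable behavior: it counts apparent sizes of regular files only, and it skips symbolic links rather than following them.

```python
import os

def tree_size(root):
    """Sum the sizes of regular files under root, without following symlinks."""
    total = 0
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                total += tree_size(entry.path)  # recurse into real subdirectories
            elif entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

Because files can be added, removed, or appended to while the walk is in progress, the result is a snapshot rather than a guaranteed total, for the same reason a single file's size can change between checks.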
