File Formats

Overview

File formats specify how content is encoded in a file and tell software applications how to display or otherwise handle the encoded bits. They are usually identified by a file extension appended to the file name. Sometimes, file types are denoted by metadata instead, typically stored as a header at the beginning of the file, or by a magic number (a pre-determined value that denotes the file type). A rudimentary example of this is the shebang at the beginning of a script (e.g. #!/usr/bin/perl as the first line of a Perl script).
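
To make this concrete, here’s a minimal Python sketch of identifying a file by its magic number. The signature table is illustrative rather than exhaustive, and the file name example.png is a placeholder.

```python
# Minimal magic-number sniffing: read the leading bytes of a file and
# match them against a small table of known signatures.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"PK\x03\x04": "ZIP archive (also the basis of EPUB, DOCX, ...)",
    b"#!": "script with a shebang",
}

def identify(path):
    """Return a best-guess file type based on the file's first bytes."""
    with open(path, "rb") as f:
        header = f.read(8)  # the longest signature above is 8 bytes
    for magic, kind in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return kind
    return "unknown"

print(identify("example.png"))
```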

Compression

Raw file data is often too large to store or transmit efficiently. To rectify this, a myriad of algorithms exist to reduce the size of the data a file encodes. Most modern file formats apply some form of compression.

There are two types of compression: lossy and lossless. Lossy compression trades quality for higher compression ratios: it discards data and makes approximations to achieve a smaller size. Lossless compression, in contrast, does not degrade the data in any way and guarantees that the original data can be fully recovered from the compressed output.

Lossy

The most common methods for lossy compression are a set of techniques called transform coders. These are most suitable for audio and images. In the case of images, the grid of pixels that comprises the image is treated as a matrix with numeric entries. A linear transform is applied to the matrix to generate a new matrix of coefficients. Because the human eye varies in its sensitivity to different frequencies, the coefficients for frequencies the eye is less sensitive to can be approximated more roughly without much noticeable degradation in quality. I’ll discuss the most commonly used transform coders.

The most broadly used transform coder is the Discrete Cosine Transform (DCT). DCTs are a Fourier-related transform, decomposing a function into cosine waves of varying amplitudes and frequencies. There are several different kinds of DCT, but the most commonly used one (the DCT-II) can be defined as follows:

We take a sequence of N real numbers x_0, …, x_{N−1}; the transform mapping it to the N coefficients X_k is an invertible linear map (orthogonal up to scaling), so no information is lost by the transform itself. The DCT is then

$$ X_k =\sum_{n=0}^{N-1} x_n \cos \left[\frac{\pi}{N} \left(n+\frac{1}{2}\right) k \right] \quad \quad k = 0, \dots, N-1 $$
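
As a sketch, the formula can be transcribed directly into Python. This naive version does O(N²) work; real codecs use fast O(N log N) algorithms, but the output is the same.

```python
import math

def dct_ii(x):
    """Naive DCT-II, computed term by term from the formula above."""
    N = len(x)
    return [
        sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
        for k in range(N)
    ]

# A constant signal concentrates all of its energy in the k = 0
# coefficient; the higher-frequency coefficients vanish.
print(dct_ii([1.0, 1.0, 1.0, 1.0]))  # [4.0, ~0, ~0, ~0]
```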

The other, more modern approach is the wavelet transform. For context, wavelets are oscillations that have an initial amplitude of zero, increase, then decrease back to zero again. Before we reach the wavelet transform itself, we start with wavelet series. These take square-integrable functions (those satisfying the following):

$$ \int_{-\infty}^\infty |f(x)|^2 \, \mathrm{d}x < \infty $$

and represent them in terms of what are called orthonormal wavelets. These are functions that form an orthonormal basis of the Hilbert space of square-integrable functions. If we take a wavelet series to be defined by

$$ f(x) = \sum_{j, k=-\infty}^\infty c_{jk} \psi_{jk}(x) $$

Then, the wavelet transform is the following integral transform:

$$ W(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^\infty \overline{\psi\left(\frac{x-b}{a}\right)} f(x) \, \mathrm{d}x $$

The wavelet transform is desirable since the wavelet coefficients can be computed in linear time (O(n)), as opposed to the linearithmic time (O(n log n)) of a fast DCT.
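
To see where the linear time comes from, here’s a minimal sketch using the Haar wavelet, the simplest one; normalization conventions vary between implementations, so treat the exact coefficient values as one choice among several.

```python
def haar_dwt(signal):
    """Full Haar wavelet transform of a signal whose length is a power
    of two. Each pass rewrites the front of the buffer as pairwise
    averages and differences; the passes touch n + n/2 + n/4 + ...
    < 2n entries in total, hence O(n)."""
    out = list(signal)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + diff
        n = half
    return out

# The first entry is the overall average; the rest are detail
# coefficients, which are near zero for smooth signals and thus
# cheap to store once quantized.
print(haar_dwt([4, 6, 10, 12, 8, 6, 5, 5]))
```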

For images, the transform step is followed by quantization and entropy encoding, which together produce the final compressed output.

Lossless

Lossless compression primarily uses a class of techniques called entropy encoding to reduce file size. The two main variants of entropy encoding are arithmetic coding and Huffman coding. Arithmetic coding takes a file as input and converts its symbols into a single floating point number between 0 and 1. The encoding process assigns the symbols to partitioned intervals of the real line between 0 and 1, using a probability model that tells the encoder how likely each symbol is to appear in the file. The size reduction comes from frequently occurring symbols being represented with fewer bits than those that appear less often. The decoder applies the inverse of this process.
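
Here’s a toy float-based encoder that shows the interval narrowing. The fixed probability model is made up for illustration, and production coders use integer arithmetic with renormalization to avoid the precision loss this version suffers on long inputs.

```python
def arithmetic_encode(message, probs):
    """Narrow the interval [0, 1) once per symbol; any number inside
    the final interval identifies the whole message."""
    low, high = 0.0, 1.0
    for symbol in message:
        span = high - low
        cumulative = 0.0
        for s, p in probs.items():
            if s == symbol:
                high = low + span * (cumulative + p)  # uses the old low
                low = low + span * cumulative
                break
            cumulative += p
    return (low + high) / 2

# 'a' is common under this model, so strings heavy in 'a' land in wide
# intervals that take fewer bits to pin down.
probs = {"a": 0.7, "b": 0.2, "c": 0.1}
print(arithmetic_encode("aab", probs))  # ~0.392
```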

Huffman coding works by generating a binary tree from an input stream; the tree for n distinct symbols has up to 2n − 1 nodes and can be stored compactly in an array. Encoding starts by calculating the frequency of each symbol in the input and sorting the symbols in ascending order of frequency. A leaf node is generated for each unique symbol. Then an internal node is created whose children are the two lowest-frequency nodes; its weight is the sum of its children’s weights. This process is iterated until only a single node, the root, remains. Decoding is simply a matter of traversing the binary tree to find the symbols. Huffman coding is useful for data with many recurring symbols; otherwise, arithmetic coding typically achieves a better compression ratio.
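
A compact sketch of the tree-building step, using Python’s heap to repeatedly pull out the two lightest nodes:

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman code table by repeatedly merging the two
    lowest-weight nodes until a single tree remains."""
    # Each heap entry: [total weight, [symbol, code], [symbol, code], ...]
    heap = [[freq, [sym, ""]] for sym, freq in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)  # the two lightest nodes become
        hi = heapq.heappop(heap)  # children of a new internal node
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heap[0][1:])

print(huffman_codes("abracadabra"))  # 'a' gets the shortest code
```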

Containers

Some file formats are meant for specific types of data and make use of compression and/or a codec that specifies how the data is encoded. Others are nothing more than encapsulated input streams bundled with metadata into a single file. These are container formats. The most common container formats are those used for archival purposes (think 7-Zip) and video.

A Look at Common Types

I won’t be going through specialized, proprietary file types designed for select programs. Instead, I’ll hit the major “groups”.

Images

The kinds of images that most people view digitally are bitmaps. These store an image as a grid of pixels in a dot matrix, where each pixel is an entry containing color information. Because the total number of bits needed to describe every pixel grows rapidly with the image’s dimensions and color depth, it’s necessary to apply image compression.
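
A quick back-of-the-envelope calculation shows why; the 1080p dimensions and 24-bit color depth here are just an example:

```python
# Uncompressed size of a 1920x1080 image at 3 bytes (24 bits) per pixel.
width, height, bytes_per_pixel = 1920, 1080, 3
raw_bytes = width * height * bytes_per_pixel
print(f"{raw_bytes / 1_000_000:.1f} MB")  # ~6.2 MB before any compression
```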

The image formats I’ll be looking at are

  • JPEG
  • PNG
  • GIF
  • WEBP
  • FLIF

JPEG is a compressed image format that uses the DCT for its transform step. Its successor, JPEG 2000, uses a Discrete Wavelet Transform (DWT) instead. Relative to PNG, JPEG is more suitable for photos, as it achieves a smaller file size on average for comparable-looking photos.

PNG is a lossless format that is highly portable and uses an open standard. PNG is often a better choice than JPEG for graphics, line art, and images with a lot of text.

GIF is the predecessor to PNG that most people know and love for its heavy use in animated media. It has fewer features than PNG for the most part: it is limited to a 256-color palette and lacks PNG’s full alpha-channel transparency.

WEBP is an image format made by Google that supports both lossy and lossless compression. It also has native support for animation and looks like a viable alternative to GIFs; both its lossy and lossless modes yield smaller file sizes when converting from a GIF.

FLIF is a free and open source, lossless image format that achieves superb compression ratios relative to other lossless image formats. It’s the newest out of all of these formats, so it doesn’t have as much support in image viewers/editors and other graphics software. It uses Meta-Adaptive Near-zero Integer Arithmetic Coding (MANIAC) for compression.

As for recommendations: FLIF is superior to the other image formats shown here in many respects, particularly quality and compression. WEBP is still a decent second choice.

Audio

The audio formats that are most relevant are

  • MP3
  • OGG
  • FLAC
  • WAV

MP3 is a ubiquitous lossy audio format. It compresses audio by first discarding or roughly approximating sounds outside the hearing range of the human ear, then splitting the frequency band into separate subband signals that are analyzed with a Fast Fourier Transform (FFT). The FFT is a fast algorithm for computing the discrete Fourier transform (DFT), given by

$$ X_k = \sum_{n=0}^{N-1} x_n e^{-i2\pi k n/N} \qquad k = 0,\ldots,N-1 $$
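
As a sketch, the sum can be evaluated directly; this naive version is O(N²), which is exactly the cost FFT algorithms avoid:

```python
import cmath
import math

def dft(x):
    """Naive DFT, straight from the formula above. FFT algorithms
    produce the same coefficients in O(N log N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

# A pure real tone at frequency 1 shows up in bin k = 1 and its
# mirror bin k = N - 1; everything else is ~0.
signal = [math.cos(2 * math.pi * n / 8) for n in range(8)]
print([round(abs(c), 6) for c in dft(signal)])
```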

MP3 can use 14 different bit rates, the most common being 128 kb/s, 192 kb/s, and 320 kb/s (the ceiling). Quality degrades considerably below 128 kb/s.

OGG is a free and open source container format, most notable for its association with the Vorbis codec. As an audio container, it works with a multitude of codecs. For lossy audio, it is usually paired with Vorbis, Opus, or Speex, all variable-bitrate codecs designed to handle general audio, music, or voice data. For lossless audio, it can serve as the container for FLAC.

FLAC is an open, lossless audio format with excellent compression of raw audio, often cutting file size roughly in half. FLAC offers several compression levels, each taking successively more time to encode. It is also fast to decode and well suited to streaming relative to other lossless formats.

WAV is an audio format that was made by Microsoft to store raw audio bitstreams. A lot of “raw” audio is stored as WAV files.

Out of these, I recommend FLAC for audiophiles who want crisp sounding music in their digital library and OGG Vorbis for those who aren’t as picky.

Video

Most video formats are containers. The ones I’ll be investigating are

  • MP4
  • Matroska (MKV)
  • WebM

MP4 is the container most people are familiar with; it stores audio, video, and subtitles. For video, the H.264 or H.265 (HEVC) codecs are used. Advanced Audio Coding (AAC) is typically used for audio; it is similar to MP3 but achieves higher quality at the same bit rate.

Matroska is an open container format that accepts many different audio and video codecs. It is the base for other formats like WebM. It uses Extensible Binary Meta Language (EBML) for data serialization.

WebM is a video container based on Matroska that was designed by Google for web playback. It supports VP8 & VP9 for video streams and Vorbis and Opus for audio streams.

I recommend Matroska, or WebM if you don’t mind having a constrained choice of codecs.

Digital Literature

The formats I’ll be looking at are

  • PDF
  • EPUB
  • DjVu

PDF is Adobe’s universal document format meant for presenting text, images, vector graphics, and other relevant document elements. It supports password protection and multimedia elements. Its downsides are a lack of reflowable text and a larger file size relative to some other document formats. It is best used for printable documents.

EPUB is an open ebook format. It supports features that PDF doesn’t, like reflowable text (the text adjusts to the display size), page bookmarking, changeable fonts and background colors, and optional DRM. Under the hood, it is an archive comprising several HTML files that contain text, images, and other supported elements. EPUB has a much smaller file size than PDF, and its only main disadvantage is being awful for printing.
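
Since an EPUB is at bottom a ZIP archive of those HTML files plus metadata, Python’s standard library can peek inside one; book.epub below is a placeholder path:

```python
import zipfile

# List the first few members of an EPUB to see its internal layout,
# e.g. mimetype, META-INF/container.xml, and the content documents.
with zipfile.ZipFile("book.epub") as epub:
    for name in epub.namelist()[:10]:
        print(name)
```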

DjVu is a format primarily intended for scanned documents. Its main claim to fame is having a smaller file size than PDF. Relative to EPUB and PDF, it has far less support.

File Archives

File archives are container formats that store a file or group of files, often in a compressed manner. The most common archive formats are

  • Zip/7-zip
  • TAR
  • LZ4

Zip is a file compression format that typically uses the lossless DEFLATE algorithm, which makes use of Huffman coding. 7-Zip is the archiver most of us know and love; it supports more features, like huge file sizes, higher compression ratios, and encryption of both the files within the archive and the headers. For compression, 7-Zip uses the Lempel–Ziv–Markov chain algorithm (LZMA).
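
For a sense of how little ceremony DEFLATE-based zipping involves, here’s a sketch using Python’s standard library; the file names are placeholders:

```python
import zipfile

# Create a ZIP archive whose members are compressed with DEFLATE.
with zipfile.ZipFile("docs.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("report.txt")
    zf.write("notes.txt")
```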

TAR is an archive utility that was made for Unix and Unix-like operating systems. On its own, it doesn’t compress files; it merely bundles them into a single archive file (called a tarball). Because of this, other utilities are needed to compress tar archives. The three most common compression tools used with tar are listed below, followed by a sketch of producing each kind of tarball.

  • gzip (uses DEFLATE)
  • bzip2 (uses block sorting compression)
  • xz (uses LZMA)
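
Python’s tarfile module wraps all three compressors, so the sketch below produces a gzip, bzip2, and xz tarball just by changing the mode string; the paths are placeholders:

```python
import tarfile

# One tarball per compressor: .tar.gz (DEFLATE), .tar.bz2 (block
# sorting), and .tar.xz (LZMA).
for mode, suffix in [("w:gz", ".tar.gz"), ("w:bz2", ".tar.bz2"), ("w:xz", ".tar.xz")]:
    with tarfile.open("backup" + suffix, mode) as tar:
        tar.add("project/")  # recursively adds the directory
```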

LZ4 is a data compression algorithm that sacrifices compression ratio in exchange for rapid compression and even faster decompression. It takes an input stream and represents it as multiple sequences structured like this:

[Figure: layout of an LZ4 sequence]

It compresses data by replacing repeated runs of bytes with references to where they previously appeared in the stream.
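
A quick round trip with the third-party lz4 package (pip install lz4) shows the trade-off in practice; the repetitive input is made up for illustration:

```python
import lz4.frame

# Highly repetitive data compresses well even though LZ4 favors speed
# over ratio; the assert confirms the round trip is lossless.
data = b"abcd" * 10_000
compressed = lz4.frame.compress(data)
print(len(data), "->", len(compressed), "bytes")
assert lz4.frame.decompress(compressed) == data
```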