𝐂𝐨𝐦𝐩𝐫𝐞𝐬𝐬𝐢𝐧𝐠 𝐝𝐚𝐭𝐚 𝐫𝐞𝐝𝐮𝐜𝐞𝐬 𝐬𝐭𝐨𝐫𝐚𝐠𝐞 (𝐜𝐨𝐬𝐭𝐬).
Two compression techniques that are used a lot are called Run Length Encoding (RLE) and Dictionary Encoding (DE).
These compression methods are used in the file format parquet but also by the Vertipaq engine in Power BI.
Knowing about these compression techniques helps you better understand why the file size has decreased when you change a certain file from e.g. CSV to parquet.
𝐑𝐮𝐧 𝐋𝐞𝐧𝐠𝐭𝐡 𝐄𝐧𝐜𝐨𝐝𝐢𝐧𝐠:
Original Column │ Run Length Encoded Column │ Frequency
───────────────┼───────────────────────────┼──────────
A │ A │ 1
B │ B │ 1
A │ A │ 2
A │ B │ 1
B │ C │ 2
C
C
𝐃𝐢𝐜𝐭𝐢𝐨𝐧𝐚𝐫𝐲 𝐄𝐧𝐜𝐨𝐝𝐢𝐧𝐠:
Original Column │ Dictionary Encoded Column │ Frequency
──────────────────┼───────────────────────────┼──────────
Membership sales │ Memb │ 2
Membership sales │ Car │ 1
Car sales │ Bike │ 1
Bike sales