Use the Power of Machine Learning to Compress

Why choose a compression method yourself when machine learning can predict the best one for your data and requirements?

How prediction works:
Input: titanic.csv

y,Pclass,sex,age
0,3,female,22
1,1,male,38
0,2,male,30
1,3,female,26
[x97]


{ "num_observations": 100, "num_columns": 4, "num_string_columns": 1, ... }
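The step from raw rows to a feature summary can be sketched as follows. This is a minimal illustration using only the standard library, not shrynk's own featurizer: the names `featurize` and `is_number` are hypothetical, and only the three features shown above are computed.

```python
import csv
import io


def is_number(value):
    """Return True if the cell parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def featurize(csv_text):
    """Summarize the shape and column types of a CSV, without keeping the data."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    # Assumption: a column counts as a string column if any value is non-numeric.
    string_columns = [
        name
        for i, name in enumerate(header)
        if any(not is_number(row[i].strip()) for row in body)
    ]
    return {
        "num_observations": len(body),
        "num_columns": len(header),
        "num_string_columns": len(string_columns),
    }


# The four sample rows shown in the input above.
sample = "y,Pclass,sex,age\n0,3,female,22\n1,1,male,38\n0,2,male,30\n1,3,female,26\n"
features = featurize(sample)
print(features)
```

Only these summary numbers, never the rows themselves, would be passed on to a model.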

Model Predictions

Smallest size: csv+bz2 ✓
Fastest write time: csv+gzip ✓
Fastest read time: csv+gzip ✓
Weighted (3, 1, 1): csv+bz2 ✓

$ pip install shrynk

Now in Python...

>>> from shrynk import save

You control how important size, write_time and read_time are.
Here, size is 3 times more important than write and read.

>>> save(df, "mydata", size=3, write=1, read=1)
"mydata.csv.bz2"

Or from the command line (this will predict and compress):

$ shrynk compress mydata.csv
$ shrynk decompress mydata.csv.gz
Contribute your data:
"Data & Model for the community, by the community"
1. Click below to upload a CSV file and let the compression be predicted.
2. It will also run all compression methods to check whether the prediction is correct.
3. In case the result is not in line with the ground truth, the features (not the data) will be added to the training data!
or run the example
The data was featurized, and a prediction was made. Then all the compressions were run for this file so we can see whether the prediction was correct (the ground truth).
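A ground-truth check of this kind can be sketched roughly as below. It is an illustration under stated assumptions rather than shrynk's actual benchmark: it uses only standard-library codecs, compresses in memory instead of writing files, and judges correctness on size alone. The names `CODECS`, `benchmark` and `prediction_is_correct` are hypothetical.

```python
import bz2
import gzip
import lzma
import time

# Hypothetical candidate set: codecs available in the standard library.
CODECS = {
    "csv+gzip": (gzip.compress, gzip.decompress),
    "csv+bz2": (bz2.compress, bz2.decompress),
    "csv+xz": (lzma.compress, lzma.decompress),
}


def benchmark(raw):
    """Return {name: (size_bytes, write_seconds, read_seconds)} for every codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(raw)
        write_time = time.perf_counter() - start

        start = time.perf_counter()
        restored = decompress(packed)
        read_time = time.perf_counter() - start

        assert restored == raw  # the round trip must be lossless
        results[name] = (len(packed), write_time, read_time)
    return results


def prediction_is_correct(predicted, results):
    """Compare the predicted codec with the one that measured smallest."""
    smallest = min(results, key=lambda name: results[name][0])
    return predicted == smallest


# Illustrative payload: a header plus one repeated row.
raw = ("y,Pclass,sex,age\n" + "0,3,female,22\n" * 100).encode("utf-8")
results = benchmark(raw)
for name, (size, write_time, read_time) in results.items():
    print(f"{name}: {size} bytes, write {write_time * 1000:.2f} ms, read {read_time * 1000:.2f} ms")
```

Timings vary between runs and machines, so a real comparison would likely repeat each measurement several times.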
Filename: titanic_example.csv
Features:
{
  "num_obs": 100,
  "num_cols": 4,
  "num_float_vars": 0,
  "num_str_vars": 1,
  "percent_float": 0.0,
  "percent_str": 0.25,
  "str_missing_proportion": 0.0,
  "float_missing_proportion": 0,
  "cardinality_quantile_25": 0.02,
  "cardinality_quantile_50": 0.025,
  "cardinality_quantile_75": 0.032,
  "float_equal_0_proportion": 0,
  "str_len_quantile_25": 6.0,
  "str_len_quantile_50": 6.0,
  "str_len_quantile_75": 6.0,
  "memory_usage": 3328
}

Prediction: engine='csv' compression='bz2'
Rank: 1st / 12

Ground truth

Here are the results for titanic_example.csv.
The prediction (which aims to minimize the score) is shown in color.
It is based on user-defined weights (here: size=3, write=1, read=1).
compression               score   size   write_time   read_time
csv+bz2                -5.34783   217b          1ms         3ms
csv+zip                -4.37816   420b          2ms         6ms
csv+None               -1.36716    1kb          1ms         4ms
fastparquet+LZ4        -0.56655    1kb          6ms        10ms
pyarrow+lz4            -0.48164    1kb          2ms         6ms
fastparquet+GZIP       -0.23642    1kb          7ms        10ms
pyarrow+snappy         -0.09988    1kb          3ms         8ms
pyarrow+zstd            0.23149    1kb          4ms         8ms
pyarrow+gzip            0.32534    1kb          3ms         9ms
pyarrow+brotli           0.6152    1kb          6ms        10ms
csv+xz                  2.61173   268b         30ms        34ms
fastparquet+UNCOMPR.    8.69387    4kb          9ms        12ms
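
The page does not spell out how the score is computed. One plausible reading, which is an assumption here, is that each metric is standardized across the candidates (z-scores) and then combined with the user weights, so that lower is better. The scores in the table sum to roughly zero, which is consistent with that reading. A minimal sketch, with the hypothetical helper `weighted_scores` and three rows taken from the table (size in bytes, times in milliseconds):

```python
from statistics import mean, pstdev


def weighted_scores(measurements, weights):
    """Combine per-metric z-scores into one weighted score per candidate."""
    names = list(measurements)
    # Transpose so that each tuple holds one metric across all candidates.
    columns = list(zip(*(measurements[name] for name in names)))
    scores = dict.fromkeys(names, 0.0)
    for column, weight in zip(columns, weights):
        mu, sigma = mean(column), pstdev(column)
        for name, value in zip(names, column):
            # Guard against a metric that is identical for every candidate.
            z = (value - mu) / sigma if sigma else 0.0
            scores[name] += weight * z
    return scores


# (size, write_time, read_time) for three of the twelve candidates above.
measurements = {
    "csv+bz2": (217, 1, 3),
    "csv+zip": (420, 2, 6),
    "csv+xz": (268, 30, 34),
}
scores = weighted_scores(measurements, weights=(3, 1, 1))
best = min(scores, key=scores.get)
print(best)
```

Because only three candidates are standardized here, the numeric scores will not match the table; the sketch only shows how weights shift the ranking.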