List Shrinking

I've been working on a conversion process where we take a large struct A and, depending on various values within it, convert it to either B, C, or D.

This data comes from an external source, and the documentation leaves a lot to be desired: missing enum variants, incorrectly typed fields, incorrect nullability, etc.

Due to the inaccurate documentation, we lean heavily on our test suite.

There are a large number of items (hundreds of millions) that we deal with in this conversion process. While we certainly could pull all those items and run them in our tests, I thought there might be a better way. One that would minimize the pain of running the tests while preserving the same level of confidence.

What if we could throw away all the items that are not uniquely interesting? Or are, if you squint your eyes, near-duplicates of other items?

More concretely, imagine the following struct

struct MyStruct {
    a: u8,
    b: String,
}

These two records are, of course, not equal.

let a = MyStruct { a: 1, b: "hello".to_string() };
let b = MyStruct { a: 2, b: "hello world".to_string() };

But, for the purposes of the conversion process, we want to treat them as equal. They both contain a positive, non-max integer and a non-empty, non-whitespace string. If one of them converts successfully, then the other will as well.

What I was really after was a way to extract just the meaningfully unique items. So what exactly is a meaningfully unique item?

// A positive int is one class
let a = MyStruct { a: 1, b: "hello".to_string() };

// 0 is another interesting one
let b = MyStruct { a: 0, b: "hello world".to_string() };

// Along with the max value
let c = MyStruct { a: u8::MAX, b: "hello world".to_string() };

// Or an empty string
let d = MyStruct { a: 1, b: "".to_string() };

// Or a string of just whitespace
let e = MyStruct { a: 1, b: "   ".to_string() };

That allows us to start to pencil in some basic rules and some enums to represent them.

Strings

Perhaps the simplest case. There are only a few interesting strings that we care about:

enum StringClassification {
    Empty,
    Whitespace,
    NonEmpty,
}

Integers

Integers have a few more options.

enum SignedIntClassification {
    Min,
    Negative,
    Zero,
    Positive,
    Max,
}

enum UnsignedIntClassification {
    Zero,
    Positive,
    Max,
}

Tying it together

Let's introduce a trait.

use std::hash::Hash;

trait Classify {
    type Output: Hash;

    fn classify(&self) -> Self::Output;
}

impl Classify for u8 {
    type Output = UnsignedIntClassification;

    fn classify(&self) -> Self::Output {
        match self {
            0 => UnsignedIntClassification::Zero,
            u8::MAX => UnsignedIntClassification::Max,
            _ => UnsignedIntClassification::Positive,
        }
    }
}
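
The signed version follows the same pattern; a minimal sketch for i8 (the arm order here is one reasonable choice):

impl Classify for i8 {
    type Output = SignedIntClassification;

    fn classify(&self) -> Self::Output {
        match *self {
            i8::MIN => SignedIntClassification::Min,
            i8::MAX => SignedIntClassification::Max,
            0 => SignedIntClassification::Zero,
            n if n < 0 => SignedIntClassification::Negative,
            _ => SignedIntClassification::Positive,
        }
    }
}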

impl Classify for String {
    type Output = StringClassification;

    fn classify(&self) -> Self::Output {
        match self.as_str() {
            "" => StringClassification::Empty,
            s if s.chars().all(char::is_whitespace) => StringClassification::Whitespace,
            _ => StringClassification::NonEmpty,
        }
    }
}

That lets us classify some primitives, which is a start. But how do we classify the structs we introduced earlier?

let a = MyStruct { a: 1, b: "hello".to_string() };
let b = MyStruct { a: 2, b: "hello world".to_string() };

While we certainly could return a tuple of

(UnsignedIntClassification, StringClassification)

that approach doesn't scale well. The large struct A has some 20 fields and, more importantly, nested fields, for a total of 60+ fields.

If we make sure our base classification enums implement Hash, we can hash the classifications of the fields together, which gives us some really nice compositional properties.

use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

impl Classify for MyStruct {
    type Output = u64;

    fn classify(&self) -> Self::Output {
        // Hash the classification of each field, rather than the
        // field values themselves.
        let mut hasher = DefaultHasher::new();
        self.a.classify().hash(&mut hasher);
        self.b.classify().hash(&mut hasher);
        hasher.finish()
    }
}
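
With that in place, the two records from earlier now classify identically:

let a = MyStruct { a: 1, b: "hello".to_string() };
let b = MyStruct { a: 2, b: "hello world".to_string() };
assert_eq!(a.classify(), b.classify());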

More complex structs

That gets things working for our simple example, but what about something with a bit more complexity? Perhaps a struct like

struct ComplexStruct {
    a: Option<u8>,
    b: Vec<String>,
}

Options

Options can be thought of as just adding a None variant on top of the inner type's classification.
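
A minimal sketch of that idea, using a wrapper enum (the OptionClassification name is just for illustration):

#[derive(Hash, PartialEq, Eq, PartialOrd, Ord)]
enum OptionClassification<C> {
    // The extra variant that Option contributes
    None,
    // Defers to the inner type's classification
    Some(C),
}

impl<T: Classify> Classify for Option<T> {
    type Output = OptionClassification<T::Output>;

    fn classify(&self) -> Self::Output {
        match self {
            None => OptionClassification::None,
            Some(inner) => OptionClassification::Some(inner.classify()),
        }
    }
}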

Vecs, arrays, and slices

This is where things get a bit more interesting. We want a classification that is order-independent and doesn't care about duplicates.

We can accomplish that by classifying each element, then removing any duplicates and sorting the remaining classifications.

This allows something like the following to be true:

let a = vec![1u8, 0];
let b = vec![0u8, 5, 6];
assert_eq!(a.classify(), b.classify());

A BTreeSet does the heavy lifting for us here.

use std::collections::BTreeSet;

impl<T: Classify> Classify for Vec<T>
where
    T::Output: Ord,
{
    type Output = u64;

    fn classify(&self) -> Self::Output {
        // BTreeSet gives us both deduplication and a stable order.
        let mut unique_classifications = BTreeSet::new();
        for item in self {
            unique_classifications.insert(item.classify());
        }

        let mut hasher = DefaultHasher::new();
        for classification in unique_classifications {
            classification.hash(&mut hasher);
        }
        hasher.finish()
    }
}

But it does fail to capture one important property of the vec. We really would like to know whether the vec:

  • is empty,
  • contains a single element, or
  • contains multiple elements

That remains an item for a different day.
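
If we do pick it up, the fix is probably a small cardinality enum folded into the hash. A rough sketch revising the impl above (not something we have actually shipped), assuming the imports from earlier:

#[derive(Hash)]
enum VecCardinality {
    Empty,
    Single,
    Multiple,
}

impl<T: Classify> Classify for Vec<T>
where
    T::Output: Ord,
{
    type Output = u64;

    fn classify(&self) -> Self::Output {
        let mut unique_classifications = BTreeSet::new();
        for item in self {
            unique_classifications.insert(item.classify());
        }

        let mut hasher = DefaultHasher::new();

        // The new piece: hash the vec's cardinality alongside the
        // unique element classifications.
        let cardinality = match self.len() {
            0 => VecCardinality::Empty,
            1 => VecCardinality::Single,
            _ => VecCardinality::Multiple,
        };
        cardinality.hash(&mut hasher);

        for classification in unique_classifications {
            classification.hash(&mut hasher);
        }
        hasher.finish()
    }
}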

Results

We have a few different datasets on which we employ this pattern, and the results have been very good. We are able to keep a gzipped version of the shrunken dataset in our repository for use in our tests.

Dataset   Meaningfully Unique Items   Total Items
First                           620        40,000
Second                          172        50,000
Third                           681       700,000
Fourth                        3,876   175,000,000
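
The shrinking pass itself is just a single scan that keeps the first item seen for each distinct classification. The shrink helper below is a minimal sketch for this post, not our production code:

use std::collections::HashSet;

fn shrink<T: Classify>(items: impl IntoIterator<Item = T>) -> Vec<T>
where
    T::Output: Eq,
{
    let mut seen = HashSet::new();
    items
        .into_iter()
        // `insert` returns false when the classification has already
        // been seen, so near-duplicates get filtered out.
        .filter(|item| seen.insert(item.classify()))
        .collect()
}

The tests then run the conversion over just the shrunken set.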