Methodology
We manually collected our initial test data across several New Zealand supermarkets. With a hypothesis in hand, we expanded to programmatic collection.
To get our initial list of wines, we queried a bunch of different filter combinations on the Vivino app, including details like wine type (Red, White, Rosé, etc.), price range ($0–$10,000+), and average rating (0–5 stars), and removed duplicates. For each wine, we collected the following: Vivino ID, name, year, winery, country, region, type, rating, number of ratings, price, currency, and a URL for the label image. All details were collected in March 2024 and may not reflect current prices or ratings, especially if those have changed dramatically.
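For reference, each collected wine maps onto a record shaped roughly like this (a TypeScript sketch; the field names mirror the list above, not any official Vivino schema):

```ts
// One collected wine record. A minimal sketch: field names mirror the
// details listed above rather than any official Vivino schema.
interface Wine {
  vivinoId: number;
  name: string;
  year: number | null; // null for non-vintage wines
  winery: string;
  country: string;
  region: string;
  type: "Dessert" | "Fortified" | "Red" | "Rosé" | "Sparkling" | "White";
  rating: number; // average rating, 0–5 stars
  numRatings: number;
  price: number;
  currency: string;
  labelImageUrl: string; // label image later fed to the vision model
}
```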
To avoid over-prioritizing wines whose brands included animals, we limited each brand to one wine per type (Dessert, Fortified, Red, Rosé, Sparkling, White), choosing the wine with the most ratings as a proxy for which wine would be best known.
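A sketch of that step, treating the winery field as the brand and keeping the most-rated wine per brand and type:

```ts
// Keep one wine per brand and type, choosing the wine with the most ratings.
// Sketch only; we treat the winery field as the brand here.
function dedupeByBrandAndType(wines: Wine[]): Wine[] {
  const best = new Map<string, Wine>();
  for (const wine of wines) {
    const key = `${wine.winery}::${wine.type}`;
    const current = best.get(key);
    if (!current || wine.numRatings > current.numRatings) {
      best.set(key, wine);
    }
  }
  return [...best.values()];
}
```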
We then fed the label images into the OpenAI API (the gpt-4-vision-preview model). We tested several prompts, including open-ended questions like "What's on this image?", but ultimately the prompt that worked best was "This image is a wine label. Do you see any animals or humans on it? On a scale of 0-1, with 0 being 'not certain at all' and 1 being 'very certain', how sure are you?"
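A minimal sketch of that call with OpenAI's Node SDK, passing the prompt above alongside the label image URL (rate limiting and error handling omitted):

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const PROMPT =
  "This image is a wine label. Do you see any animals or humans on it? " +
  "On a scale of 0-1, with 0 being 'not certain at all' and 1 being " +
  "'very certain', how sure are you?";

// Ask gpt-4-vision-preview about one label image and return the raw answer.
async function classifyLabel(labelImageUrl: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: PROMPT },
          { type: "image_url", image_url: { url: labelImageUrl } },
        ],
      },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```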
This provided us with responses that looked something like this: "I do not see any animals or humans on the wine label. The label contains text and a small emblem at the top, but there are no discernible figures of humans or animals. I am very certain about this assessment; so on the scale from 0 to 1, my confidence level is 1." Occasionally, ChatGPT would respond with something like "I'm sorry, I can't assist with these requests." In those cases we reran the prompt or manually reviewed the label. We ended up abandoning ChatGPT's certainty prediction because it swung wildly toward the poles: only 0.6% of the wines had a confidence score other than 0 or 1.
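That re-run-or-review step amounts to a small retry wrapper around the call above (a sketch; the refusal check is a text heuristic, not a guaranteed response format):

```ts
// Re-run refusals a few times before flagging a label for manual review.
// The refusal check is a heuristic on the response text.
const REFUSAL = /i['’]m sorry|can['’]t assist/i;

async function classifyWithRetry(
  labelImageUrl: string,
  maxAttempts = 3
): Promise<{ answer: string | null; needsManualReview: boolean }> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const answer = await classifyLabel(labelImageUrl); // from the sketch above
    if (!REFUSAL.test(answer)) {
      return { answer, needsManualReview: false };
    }
  }
  return { answer: null, needsManualReview: true };
}
```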
We then manually reviewed a subset of wines to check ChatGPT's accuracy. ChatGPT was particularly bad with pachyderms. It also struggled when animals weren't the focal point and with what counted as an "animal": sometimes anthropomorphic or mythical creatures and insects were included, sometimes not. Knowing these shortcomings, we paid extra attention to these types of labels during manual review. There may still be a few labels that slipped through the cracks, but overall this should be a solid sample.
We ended up identifying 1,488 animal wines (animal counts may exceed the total number of wines because a single label can feature multiple animals). We then manually grouped the animals into larger categories. Some rolled up, others did not. For example:
This opened up a lot of existential questions about what constitutes an animal, like “Is a duck a bird?” or “Is a zebra a horse?” We tried to strike a balance between common perception and formal binomial nomenclature (genus and species), often going back to the question “What would a kid call this animal?”
We limited our analysis to animal groupings with at least 20 wines, knocking out things like bats, monkeys, rodents, and marsupials. Our apologies to Yellow Tail, the lone kangaroo label in our dataset.
To calculate the percentage of wines in each animal group that were good deals, we found, for each group, the share of wines priced at or below the overall median price ($29.99) and rated at or above the overall median rating (4 stars).
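In code, that works out to something like the sketch below, where the medians come from all wines rather than from the group:

```ts
// Share of one animal group's wines that are "good deals": priced at or below
// the overall median price and rated at or above the overall median rating.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function goodDealShare(groupWines: Wine[], allWines: Wine[]): number {
  const medianPrice = median(allWines.map((w) => w.price)); // $29.99 in our data
  const medianRating = median(allWines.map((w) => w.rating)); // 4 stars in our data
  const goodDeals = groupWines.filter(
    (w) => w.price <= medianPrice && w.rating >= medianRating
  );
  return goodDeals.length / groupWines.length;
}
```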
For the trend line, we calculated the best-fit logarithmic regression and used Harry Stevens’ d3-regression package to draw the line.
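That fit can be sketched with d3-regression's regressionLog; putting price on the x-axis and rating on the y-axis is an assumption about the chart's orientation:

```ts
import { regressionLog } from "d3-regression";

declare const allWines: Wine[]; // the full dataset described above

// Fit rating = a + b * ln(price) and get back the points d3 draws as the
// trend line. Assumes price on the x-axis and rating on the y-axis.
const logFit = regressionLog()
  .x((d: Wine) => d.price)
  .y((d: Wine) => d.rating);

const trendLine = logFit(allWines); // points tracing the curve, with a, b, and rSquared attached
```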
We also bucketed the wines by price, rating, type, and country to compare each animal group’s distribution to that of all wines. We then calculated a z-score for each bucket to find animal groups with statistically more or fewer wines in a given bucket than wines overall.
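Roughly, that comparison can be set up as a one-sample z-score for proportions; the sketch below is one standard formulation, not necessarily the only way to set it up:

```ts
// z-score comparing an animal group's share of wines in one bucket (say, the
// $20–$30 price bucket) to the share among all wines. One standard
// one-sample-proportion formulation.
function bucketZScore(
  groupCountInBucket: number,
  groupTotal: number,
  allCountInBucket: number,
  allTotal: number
): number {
  const pGroup = groupCountInBucket / groupTotal; // observed share in the group
  const pAll = allCountInBucket / allTotal; // expected share from all wines
  const standardError = Math.sqrt((pAll * (1 - pAll)) / groupTotal);
  return (pGroup - pAll) / standardError;
}
// |z| around 2 or more flags a bucket with statistically more (positive z)
// or fewer (negative z) wines than the overall mix would suggest.
```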