AI in Action — A Hansard Analysis

What if we could leverage how AI interprets language to give us insights into what politicians are talking about in parliament?
Feel free to skip straight to the results if you're short on time, but if you're up for it, I promise to make the journey entertaining and enlightening!
First things first, two things (and an additional bit):
- This will be an objective exercise in applied data/AI, shedding light on how industry-standard techniques can be put to work. Findings here aren't a "political opinion"; hopefully you'll see as clearly as I do that, for the purposes of this article, politics is completely irrelevant in the face of how cool applied data is!
- Shoutout to Ed Donner for the inspiration to write this! Ed's course takes you through the entire domain of Large Language Models with highly stimulating explanations and applications. I truly recommend it to anyone serious about going down the AI engineering road.
Additional bit: I wrote this article a few weeks ago and forgot about it until I saw there was parliamentary activity again, so the Hansard I'm using is from March — I didn't want to just let the analysis go to waste.
Context
Talking Machines: from Data Science to Artificial Intelligence
Not so long ago, when people talked about AI, they mostly meant predictive systems. Data scientists, armed with sophisticated data-processing tools, built models to forecast outcomes, make recommendations, or assess risk.
And don't get me wrong — predictive modelling was (and still is) cutting-edge in most industries. But if we're talking nerd-o-meter levels, things didn't get truly sci-fi until machines started predicting the next word in a sentence at a very large scale.
To reiterate, because this is truly the breakthrough: a machine can now effectively predict what combination of words best suits the conversation it is currently in. This means we can unpack semantic interpretation mathematically, which is exactly what we'll do.
Organizing Data (slightly technical, but get down w/the lingo)
Just so we're aligned: a Large Language Model (LLM) transforms words and sentences into numbers positioned in an N-dimensional (vector) space. Humans are naturally limited to imagining just three dimensions, but mathematics isn't. What does this imply? Simply put, an LLM translates text into numerical representations (vectors), packing deep meaning into these numbers.
Now, you've likely heard the buzzword "Vector Database" (vDB) floating around. A vDB is essentially an engine built specifically to store these meaningful numerical representations (vectors). Here's the classic example to illustrate the concept: in a well-trained embedding space, computing "king - man + woman" lands you closest to "queen." That's exactly the magic we'll use to compare and analyze parliamentary speeches.
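To make that vector arithmetic concrete, here's a toy sketch. The 3-dimensional vectors below are invented purely for illustration (real embeddings from a model like all-MiniLM-L6-v2 have 384 dimensions learned from data, not hand-picked ones):

```python
import numpy as np

# Invented toy vectors; dimensions loosely read as (royalty, male, female).
words = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: how aligned two vectors are, ignoring length."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "king - man + woman" lands closest to "queen" in this toy space.
target = words["king"] - words["man"] + words["woman"]
nearest = max(words, key=lambda w: cosine(words[w], target))
print(nearest)  # → queen
```

This nearest-neighbour-by-cosine-similarity lookup is essentially what a vector database does at scale, just over millions of vectors with clever indexing.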
Lastly, remember how we humans can't visualize more than 3 dimensions? Well, 3 dimensions is still clunky — much easier to see it all in 2 dimensions! That's where t-SNE enters the party. All you really need to know about t-SNE is that we use it to squeeze high-dimensional numbers onto a 2D space while respecting "who is close to whom." Imagine it as flattening a complicated globe onto an easy-to-read map without losing the relationships between cities.
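The flattening step itself is a one-liner with scikit-learn. In this sketch the random vectors stand in for real speech embeddings (384 dimensions, matching all-MiniLM-L6-v2's output size; the cluster locations are made up):

```python
import numpy as np
from sklearn.manifold import TSNE

# 200 fake "speech chunk embeddings": two loose clusters in 384 dimensions.
rng = np.random.default_rng(0)
vecs = np.vstack([
    rng.normal(loc=0.5, scale=0.1, size=(100, 384)),
    rng.normal(loc=1.0, scale=0.1, size=(100, 384)),
])

# Squeeze 384 dimensions down to 2 while respecting "who is close to whom".
proj = TSNE(n_components=2, metric="cosine", perplexity=30,
            random_state=0).fit_transform(vecs)
print(proj.shape)  # → (200, 2)
```

Each row of `proj` is an (x, y) point ready to scatter-plot, which is exactly what the map below is built from.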
Hugging Face: a public zoo of free models
To quote Andrej Karpathy, software is evolving! We've gone from:
- 1.0 — computer code, stored on GitHub
- 2.0 — neural-net weights, learned from data
- 3.0 — LLMs, shared on Hugging Face
For reference, we'll be using a free model from Hugging Face to vectorize the Hansard: all-MiniLM-L6-v2.
Results
~30 lines of code
This is what the pipeline (overall) looks like:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from sklearn.manifold import TSNE
import numpy as np
# 1. read XML → grab every <talker> + <p> block
speeches = parse_hansard("House_2025_03_25.xml")
# 2. slice long speeches into 1k-character bites
splitter = RecursiveCharacterTextSplitter(chunk_size=1_000, chunk_overlap=100)
chunks = splitter.split_documents(speeches)
# 3. turn each chunk into a 384-number vector
emb = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = Chroma.from_documents(chunks, emb)
# 4. pull the vectors back out (via Chroma's underlying collection), run t-SNE
vecs = np.array(vectordb._collection.get(include=["embeddings"])["embeddings"])
proj = TSNE(n_components=2, metric="cosine").fit_transform(vecs)
# 5. colour by speaker's party, hover shows first 100 characters
# 6. gradio to chat with the vector database if we want to go a step further
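Step 5 is left as a comment above; a minimal sketch of it might look like the following. Note that the metadata records, the `party_colour` mapping, and the coordinates below are all invented for illustration; in the real pipeline they would come from the Chroma collection and the t-SNE projection:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; we just write a file
import matplotlib.pyplot as plt

# Hypothetical per-chunk metadata (speaker, party, text) pulled from the vDB.
metas = [
    {"speaker": "Member A", "party": "ALP", "text": "We rise today to discuss..."},
    {"speaker": "Member B", "party": "LP",  "text": "The honourable member claims..."},
    {"speaker": "Member C", "party": "LNP", "text": "In my electorate, infrastructure..."},
]
party_colour = {"ALP": "red", "LP": "blue", "LNP": "darkblue", "AG": "green"}

# Toy 2D coordinates; in practice these are the rows of the t-SNE projection.
proj = [(0.1, 2.3), (-1.4, 0.2), (0.8, -1.1)]

colours = [party_colour.get(m["party"], "grey") for m in metas]
hover = [m["text"][:100] for m in metas]  # first 100 characters for tooltips

xs, ys = zip(*proj)
plt.scatter(xs, ys, c=colours)
plt.savefig("hansard_map.png")
```

For actual hover tooltips you'd reach for an interactive library like Plotly, but the colour-by-party and truncate-to-100-characters logic is the same.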
The map
Looking at the semantic map, on this sitting day:
- The ALP has a distinct style of talking, positioned mostly on the top half of the space.
- The LP sits more toward the bottom left, though its style of talking meets the ALP's on a lot of topics in the upper left.
- The LNP's style of talking occupies the bottom centre, where it is met by all parties, even the AG.
Note: It's important to remember that you're looking at a vector space (a semantic plane), meaning that left, right, above, or below carry no meaning beyond how this particular embedding model arranges semantics.
Quick visual insights (this sitting day)
On this sitting day, 23,751 total chunks were analyzed. The topic distribution was:
- 54% — Climate, Industry and Infrastructure (the most debated)
- ~19% — Cost-of-living and jobs
- 1.8% — Migration (the least mentioned topic)
What's next? I can continue down the Hansard road
- Seasonality drift — did parties' language clouds move closer together or further apart as campaigns came and went?
- Budget week pulse — should I make it spicier, consolidate a year or two of debates and identify fallacies?
- What are you interested in? — Tell me in the comments.
I'm also contemplating abandoning the Hansard and maybe doing some analysis on Annual Reports for publicly listed companies.
Either way, thanks for riding it out with me! Happy to hear your thoughts 🙂