🔎 Scraping the Belgian parliament

Making Belgian parliamentary data more accessible and transparent

11 minute read

🗳️ The end result can be found at zijwerkenvooru.be (page is in Dutch)

Introduction

The Belgian Chamber of Representatives is part of the Belgian Federal Parliament. It's made up of 150 members who are directly elected every 5 years. They perform the following tasks:

  1. Ask questions to the relevant ministers
  2. Debate about legislative proposals
  3. Vote on legislative proposals

This project started with me wanting to find out the answer to a very simple question: How does member X vote on topic Y?

Given that Belgian citizens directly elect the members of parliament, it only makes sense that we should be able to easily find out how those members vote.

Surprisingly — or perhaps unsurprisingly, for Belgian readers — this turned out to be much harder than expected.

How does member X vote on topic Y?

The Chamber of Representatives has a website. So let's have a look.

Meeting reports

The members of the Chamber of Representatives meet weekly in the plenary meeting where they ask questions and debate/vote on legislative proposals. Every week, they publish a report of this meeting.

The reports of the plenary meetings.

Each report is available as a .pdf or .html file and contains a full transcript of the meeting (in both Dutch and French). A typical report is 40-80 pages long.

You can have a look at such a report here: https://www.dekamer.be/doc/PCRI/html/56/ip033x.html.

Voting results

Each report contains the different votes that occurred during the meeting. For each vote, there is a section that shows the result.

To find out which member voted what, we have to scroll all the way to the bottom of the report, where for each vote (indexed by number) there is a list of names grouped by yes/no/abstain. Not great!

A vote result in the report.

Additionally, the vote contains an ID such as 656/1-4, which links the vote to a dossier (656) and a subdocument (1-4). Neither of these can be found in the report itself, only on a separate page of the website of the Chamber of Representatives.

So the only way to see the vote results is this 50+ page report. There is no way to search or filter them by member or topic.

Let's improve this! The project I set out to build consists of the following 3 steps (a rough sketch in code follows the list):

  1. Download the plenary meeting reports (HTML files)
  2. Extract the data from the HTML files and write the data to .parquet files which are more easily queryable
  3. Read the data from the parquet files and generate HTML files from them to get a website
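
To make the shape of the pipeline concrete, here is a rough sketch; every name in it is hypothetical, not taken from the actual codebase:

// A minimal sketch of the three-step pipeline; all names are hypothetical.
async fn download_reports(_session_id: u32) -> Vec<String> {
    // Step 1: fetch the HTML report for each meeting (see the next section).
    vec![]
}

fn extract_to_parquet(_reports: &[String]) {
    // Step 2: parse the HTML and write questions.parquet, votes.parquet, ...
}

#[tokio::main]
async fn main() {
    let reports = download_reports(56).await;
    extract_to_parquet(&reports);
    // Step 3 runs separately: Eleventy + DuckDB read the parquet files
    // at build time and generate the site's pages.
}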

Extracting the data

First, we need to extract the data from the website of the Chamber of Representatives and get it into the right format.

Downloading the plenary meeting reports

This was the easiest step. Using Rust and the reqwest crate, I could simply download the HTML reports: by parametrizing the URL, I can loop through the meeting IDs and download each one.

use encoding_rs::WINDOWS_1252;

let url = format!(
    "https://www.dekamer.be/doc/PCRI/HTML/{}/ip{:03}x.HTML",
    session_id, meeting_id
);

// Fetch the report and decode the Windows-1252 bytes (more on that below).
let response = client.get(&url).send().await?;
let raw_bytes = response.bytes().await?;
let (decoded_str, _, _) = WINDOWS_1252.decode(&raw_bytes);
let content = decoded_str.to_string();

The only issue I encountered was that the HTML files contain the following meta tag, which tells us the file is encoded using the Windows-1252 character encoding.

<meta charset=windows-1252>

I used the encoding-rs crate, which handled these Windows-1252-encoded files perfectly.
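
Hardcoding WINDOWS_1252 works here because every report uses the same encoding, but encoding-rs can also look an encoding up from the charset label itself. A small sketch (it assumes the label has already been parsed out of the meta tag):

use encoding_rs::Encoding;

// Look up the encoding by its label; fall back to UTF-8 for unknown labels.
let label = b"windows-1252"; // assumed to come from the meta tag
let encoding = Encoding::for_label(label).unwrap_or(encoding_rs::UTF_8);
let (decoded_str, _, had_errors) = encoding.decode(&raw_bytes);
assert!(!had_errors);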

💡 Scraping Etiquette

When scraping data from a website with a bot/script like I was, make sure to check the following (a minimal sketch follows the list):

  • Read the Terms of Service to make sure you are allowed to scrape the website
  • Respect robots.txt and follow the instructions
  • Throttle requests to avoid too many requests at once
  • Identify your bot using an appropriate User-Agent header
  • Avoid unnecessary scraping (only scrape what's needed, and check if you already have the file before downloading it again)
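
For this project, that boiled down to something like the following sketch (the User-Agent string, delay, and file layout are illustrative):

use std::time::Duration;

// Identify the bot via the User-Agent header; the string is illustrative.
let client = reqwest::Client::builder()
    .user_agent("zijwerkenvooru-scraper (https://zijwerkenvooru.be)")
    .build()?;

for meeting_id in 1..=latest_meeting_id {
    let path = format!("reports/ip{:03}x.html", meeting_id);
    // Avoid unnecessary scraping: skip reports we already have on disk.
    if std::path::Path::new(&path).exists() {
        continue;
    }
    // ... download and save the report as shown earlier ...
    // Throttle: pause between requests instead of hammering the server.
    tokio::time::sleep(Duration::from_secs(2)).await;
}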

Extracting data from HTML into parquet

Now that we have the HTML files downloaded, we need to extract the voting data from them. For this, I used the scraper crate, which allows for parsing HTML and querying it with CSS selectors. For example, the following selector matches a tags that sit inside a tr and whose href attribute contains the text mailto:. I use this selector to find email addresses.

Selector::parse("tr a[href*='mailto:']")
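
To make that concrete, here's a minimal sketch of applying such a selector with the scraper crate (the HTML snippet and email address are made up):

use scraper::{Html, Selector};

let html = r#"<table><tr><td><a href="mailto:jan@dekamer.be">Jan</a></td></tr></table>"#;
let document = Html::parse_document(html);
let selector = Selector::parse("tr a[href*='mailto:']").unwrap();

for element in document.select(&selector) {
    // value() exposes the underlying element and its attributes.
    if let Some(href) = element.value().attr("href") {
        let email = href.trim_start_matches("mailto:");
        println!("{email}"); // prints: jan@dekamer.be
    }
}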

The extraction script goes through each downloaded HTML file and, using these kinds of selectors plus some additional logic, builds up a data structure that represents the report: the questions asked, the votes held, and so on.

This data then gets written to a .parquet file using the parquet crate.

use std::sync::Arc;
use arrow::{array::StringArray, datatypes::{DataType, Field, Schema}, record_batch::RecordBatch};
use parquet::arrow::ArrowWriter;

// Schema for the questions table (field names are illustrative).
let questions_schema = Arc::new(Schema::new(vec![
    Field::new("id", DataType::Utf8, false),
    Field::new("questioner", DataType::Utf8, false),
    Field::new("respondent", DataType::Utf8, false),
    Field::new("topic_nl", DataType::Utf8, false),
    // ...
]));

let questions_batch = RecordBatch::try_new(
    questions_schema.clone(),
    vec![
        Arc::new(StringArray::from(question_ids)),
        Arc::new(StringArray::from(question_questioners)),
        Arc::new(StringArray::from(question_respondents)),
        Arc::new(StringArray::from(question_topics_nl)),
        // ...
    ],
)?;

let mut questions = ArrowWriter::try_new(questions_file, questions_schema, None)?;
questions.write(&questions_batch)?;
questions.close()?;

We end up with a questions.parquet file that contains a row for each question. Nice!

Structured question data in a parquet file.

Generating the website

Now that we have generated all these parquet files, we can use them to generate a website that can visualize all of the data we extracted out of the reports.

Eleventy + DuckDB

For generating the website, I use Eleventy, a static site generator: at build time, all the pages of the site are generated. To query the parquet files, I use the DuckDB Node API. DuckDB is an in-process SQL database management system; in my case it queries and transforms the data from the parquet files into data structures that can be fed into the static site generator.

import { DuckDBInstance } from '@duckdb/node-api';

const questionsFilePath = 'src/data/questions.parquet';
const instance = await DuckDBInstance.create(':memory:');
const connection = await instance.connect();

// Read an entire parquet file into an array of rows.
const readParquet = async (filePath) => {
    const result = await connection.runAndReadAll(`SELECT * FROM read_parquet('${filePath}')`);
    return result.getRows();
};

const questions = await readParquet(questionsFilePath);

For example, we can query both members.parquet and questions.parquet to relate questions to members. Then, we can generate a page for each member, showing all the questions they asked in the plenary sessions.
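
The interesting part is the SQL itself. A sketch of such a join (the column names are hypothetical, and I'm using the duckdb Rust crate here purely for illustration; the site does this through the Node API shown above):

use duckdb::Connection;

let conn = Connection::open_in_memory()?;
// Join questions to members directly across the two parquet files.
let mut stmt = conn.prepare(
    "SELECT m.full_name, q.topic_nl
     FROM read_parquet('questions.parquet') q
     JOIN read_parquet('members.parquet') m ON q.questioner = m.full_name",
)?;
let rows = stmt.query_map([], |row| {
    Ok((row.get::<_, String>(0)?, row.get::<_, String>(1)?))
})?;
for row in rows {
    let (member, topic) = row?;
    println!("{member}: {topic}");
}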

The image below shows a list of questions that a member asked. Each question shows its title and the minister who answered it.

Questions asked by a member.

When clicking on a question, you go to its detail page, which shows the subsequent discussion in a chat/dialog style.

A discussion after a question.

The cool thing about using Eleventy + DuckDB is that all these transformations and page generations happen at build time. This means that no database calls or expensive queries happen when a user visits a page on the site.

Vote visualization

By linking members, parties and votes, we can display the vote results in an organized way. The vote detail page shows the title of the vote, the outcome, and the detailed vote results.

Votes in the report.
Votes on the website.

Each vote can be displayed in three different views.

Topic labeling

Each question, vote, or proposition gets assigned a list of topics, which are grouped under 8 main topics.

By assigning topics to each meeting item, I can then generate a chart for a given member that shows their main interests.

A member's favorite topics as a stellar chart.
Topic detail view.

We can also go the other way around: given a topic, see all meeting items (questions/votes/propositions) related to it, along with which members are most involved with it.

The mapping between a meeting item and a topic is done using a simple list of keywords against which the title of the question/vote/proposition is matched.

"smoking and vapes": ["vapes", "cigarette", "smoking"],
"games of chance": ["lotto", "gambling", "lottery"],

In the end, for each of the 8 main topics, I added 2-5 subtopics with some keywords each. This list will be extended and fine-tuned as more questions are ingested over time.

Title summarization

Often, related questions are bundled so they only need to be answered once by the responsible minister. Since these bundled titles often mean roughly the same thing, this seemed like a good use case for an LLM: summarizing them into a single title.

4 similar question topics.

By having a 'summarized version' toggle at the top of the page, the user can opt in to see the summarized version of each question. This makes the whole page a bit more digestible. When a user sees a topic they are interested in, they can then go to the detail page to see the full discussion.

1 summarized question topic.

I used Mistral to generate these summaries, with the following prompt:

The assistant will receive a comma-separated list of topics and generate a single, concise topic (no more than 20 words) that encompasses all the given topics.
- The result must match the style of the input topics.
- The result must be in Dutch.
- Do not add explanations, clarifications, or extra words such as 'including' or 'such as'.
- The output should fit naturally within the provided list.
- Only return the summarized topic without any additional text.
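
The call itself is then a single chat-completion request. A sketch with reqwest (the endpoint and model name follow Mistral's public chat completions API, but treat the details as assumptions; SUMMARIZE_PROMPT stands in for the prompt above):

use serde_json::json;

let body = json!({
    "model": "mistral-small-latest", // hypothetical model choice
    "messages": [
        { "role": "system", "content": SUMMARIZE_PROMPT },
        { "role": "user", "content": topics.join(", ") },
    ],
});

let response: serde_json::Value = client
    .post("https://api.mistral.ai/v1/chat/completions")
    .bearer_auth(api_key)
    .json(&body)
    .send()
    .await?
    .json()
    .await?;

// The summarized topic is in the first choice's message content.
let summary = response["choices"][0]["message"]["content"].as_str();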

Income tracking

In the process of building the site, I found another government site, public.regimand.be, which contains the declared incomes of Belgian politicians. By linking this data to the existing list of members, I could generate a small income chart plus a list of their functions for each year.

The yearly income of a member of parliament.

Dynamic charts

Since I had all this data scraped (members, parties, votes, questions, propositions, dossiers, incomes), I thought it would be fun to allow the user to generate their own dynamic charts.

For example, they can plot the number of questions against party to see which party asks the most questions per member.

Dynamic charts.

Conclusion

Given the starting point (a long report of each meeting), I think the end result is a great improvement. The current site zijwerkenvooru.be allows for viewing questions/votes/propositions from multiple points of view (by member, by party, by meeting, by topic), which a weekly static report could never offer.

The site is set up so that each week, it can download the new meeting report, scrape the data, and update the site.

The final stack of technologies/tools is as follows:

🗳️ Tech Stack

  • Rust (scraping & parsing)
  • Parquet (storage of data)
  • DuckDB + Node.js (querying + transformation of data)
  • Eleventy (static site generation)
  • Mistral (LLM summarization)
  • d3.js (charts)

Thanks for reading!
