Editing 125,000 Wikipedia Pages with GPT

Wikiflation is an AI bot that adds real-dollar figures to historical prices on Wikipedia, starting with every U.S. Economic History article.

An edit on "Economic History of the Civil War", made by my bot.

As of September 18, 2023, my bot Wikiflation has reviewed 15,400 articles and made 1,400 edits. That’s 12.2% of all U.S. Economic History articles.


The Challenge

I estimate that <2.5% of prices on Wikipedia are adjusted for inflation. That’s a shame because most people don’t have a feel for the value of a historical dollar. We need to see the price in real-dollars.

Wikipedia has a handy function for adjusting prices for inflation, and it alway stays up to date. But it was used in only 19,000 articles.

I set out to edit all 125,457 Wikipedia articles in the category U.S. Economic History to include inflation-adjusted prices where applicable. We’d need to automate this.


Automated Wikipedia Editing

There are thousands of bots operating on Wikipedia, mostly doing simple, rote tasks like reversing vandalism, detecting broken links, and adding links between articles.

But Wikipedia policy has restricted bots from making “context-sensitive changes”, edits that require human judgment.

This project is the first time that GPT has been used to automate human judgment in Wikipedia editing.


Wikipedia’s Inflation-Adjustment Template

Wikipedia’s inflation template requires a start date, the price, and the currency, and it will output the present day value of the historical price.

For US dollars, there are two inflation indices to choose from, one for large-scale spending such as capital expenses and one for smaller-scale spending such as consumer goods.

I use Wikipedia’s format price template to make the output look like “$1 million” instead of “$1000000”. And to indicate that the price is an approximation, I use a tilde as shorthand.


~${{ Format price| {{Inflation|index=INDEX|value=PRICE|start_year=YEAR}} }} in {{Inflation/year|INDEX}}


My Development Process

Wikipedia is a treasure, with 4.4 billion visits/month, so my top goal was to first do no harm. I had allowed myself zero tolerance for mistakes.

There are a lot of moving parts to consider when making an inflation-adjustment edit (I’ll mention some of them below) - too many for GPT to handle with a single prompt.

I broke down the editing thought process into steps. I used basic programming constructs as much as possible, and whenever human judgment was required, I handed that off to GPT-powered functions.

I tested each editing step extensively in isolation before deploying the bot into the wild. This involved manually reviewing hundreds of test edits.

Because of the scale of Wikipedia, the bot would eventually encounter every possible way a dollar sign could appear in an English sentence. There were many edge cases to discover.

I’ll now walk through the steps that the bot goes through to process a Wikipedia article and make an edit.


Step 1: Parse Wikipedia Article’s Source

It was surprisingly challenging to work with Wikipedia’s markdown page source. Page sources have tons of nested templates/functions, tables and infoboxes, files, headers, and references cluttering up the actual text. I needed to strip that all away to get only the text, which was easier for GPT to work with. But then I needed to insert markdown code at a location that GPT marked in the plain text. The solution I ended up writing identified those non-text elements, replaced them with placeholders, and added the non-text elements back in after the text was edited. I used mwparserfromheck to parse templates, and I used regex to identify the other elements.

Now that I have the text, I sentence-tokenize it. Working with multiple sentences gave GPT too much to be confused by and raised errors above the acceptable level.

Next, we need to identify all prices and years in the sentence. I use regex to identify all four-digit numbers starting with 1 or 2, and I use regex to select text starting with $. (I was able to use regex to eliminate various edge cases, such as to ignore prices inside quotations and to ignore years that are too distant or too recent).

Here, we need human judgment. Regex can find four digit numbers, but it doesn’t know if the number refers to a year or not. I used GPT to identify the following edge cases:

  • Four digit number is not a year (eg. ‘2000 soldiers’).
  • Four digit number is an imprecise year (eg. ‘by the mid 1970s’).
  • Price is face value (eg. ‘issued a new $5 coin’).
  • Price is symbolic (eg. ‘sold 100 acres for $1’).
  • Price is in proper noun (eg. ‘$5 Footlong’).

To handle years, I use a GPT prompt that takes as user input the article text and the list of years identified with regex. GPT categorizes the years as precise years, imprecise years, and non-years. This is the simplest GPT action and it works fine with GPT-3.5-turbo.

For prices, I use a GPT prompt that takes as user input the article text and one year. For that year, GPT gives a binary determination on the above edge cases.

For these prompts, using a forced function call was very helpful for giving more context without cluttering the prompt and for forcing a dict output.


Step 3: Handle different edit types

Once GPT has eliminated years / prices that fall into those corner cases, there are three categories of inflation edits we need to handle.

The simplest is when there is one price and one year. In that case, we simply ask GPT to confirm that the price and year are related, and decide which inflation index to use.

If there are multiple prices or years, the situation is more complex. Here, we ask GPT to match prices to their related years. If prices and years can be matched 1-1, they can be handled normally.

If there is an uneven number of prices and years, we need to determine why this is and handle each case appropriately. For example, if there is one year and a range of prices, we need to choose which price to adjust. As development continues on this project, more of these cases are being handled, but for now most are ignored to ensure the principle of no tolerance for mistakes.


Step 4: Insert the inflation template

Now all we need to do is choose where in the sentence to insert the inflation adjustment. Almost always, that will come right after the nominal price is mentioned.

This does require GPT input because sometimes it’s necessary to move a comma or something. This is actually the most finicky GPT function. GPT sometimes deletes or alters Wikipedia content, and I had to write logic to detect and prevent this.


Step 5: Submit the edit on Wikipedia

And finally, we put all the pieces of the Wikipedia article back together, giving us the full article text, which is completely unchanged except for the insertion of our inflation adjustments. Now, I use Selenium to open the Wikipedia article and replace the article’s source with our update sources.