Brandon Roots
October 6, 2020

Word Counting

broots ITP, Programming A to Z

This week for Programming A to Z I made an attempt at creating a word counting algorithm. Coming off of the first round of (Presidential)? debates I was feeling inspired to do some word counting to determine what words were most commonly said by each of the speakers.

The transcript I was able to locate online had a standard formatting that looked perfect to tackle with one of the tools we have been learning to use in Programming A to Z: Regular Expressions!

Looking over the transcript, there appeared to be a somewhat regular format: an optional timecode and a name in capital letters indicating who was speaking, followed by a line break and then what they said. Starting from that led me to this regular expression:

((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*|(JOE )?BIDEN*|(CHRIS )?WALLACE*)\n

What a mess! But it worked to select the timecode and names including some of the variations in the transcript.
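As a quick check of what that expression selects, here is a minimal sketch that runs it over a few lines laid out like the transcript. The sample text and the names `sample`, `speakerLine`, and `headers` are made up for illustration and are not taken from the real transcript.

```javascript
// Hypothetical sample in the transcript's layout (placeholder text, not real debate lines).
const sample = [
  'CHRIS WALLACE',
  'First placeholder line.',
  '02:15 PRESIDENT DONALD TRUMP',
  'Second placeholder line.',
  '1:02:15 BIDEN',
  'Third placeholder line.',
].join('\n') + '\n';

// The expression from above, with the g flag so match() returns every header.
const speakerLine = /((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*|(JOE )?BIDEN*|(CHRIS )?WALLACE*)\n/g;

// match() returns null when nothing matches, so fall back to an empty array.
const headers = (sample.match(speakerLine) || []).map((h) => h.trim());

console.log(headers);
// [ 'CHRIS WALLACE', '02:15 PRESIDENT DONALD TRUMP', '1:02:15 BIDEN' ]
```

One thing this makes visible: `TRUMP*` reads as "TRUM followed by zero or more P", so the trailing `*` is not what makes the plain name match.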

Next I added a lookbehind to the regular expression to invert the selection, so that it matched the line spoken after each name rather than the name itself.

Then I took out the OR operator to create three distinct regular expressions: one for the moderator and one for each of the candidates.

At this point I was left with the following three regular expressions:

//Trump only
(?<=((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*)\n\n).*
//Biden only
(?<=((\d:)?(\d)?\d:\d\d )?((JOE )?BIDEN*)\n\n).*
//Wallace only
(?<=((\d:)?(\d)?\d:\d\d )?((CHRIS )?WALLACE*)\n\n).*

This was working well but took much more effort than I anticipated!

My thought at this point was to take the regular expressions into JavaScript and run the text for each speaker through a word counting algorithm. After some stumbles with the code and lots of time spent troubleshooting, I realized that the lookbehind I was attempting to use was not playing well with the match() function in JavaScript. After some further digging I learned that lookbehinds have not historically been supported by JavaScript at all (they were only added to the language in ES2018)! Oh boy. I had been testing in Safari and hitting my head against a wall, but I found that Chrome did support lookbehinds.
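For browsers without lookbehind support, one possible workaround is to match the speaker line normally and pull the spoken line out of a capture group instead. This is only a sketch of that idea, not what I ended up using; `linesAfterHeader`, `bidenHeader`, and the sample text are hypothetical names and data made up for illustration.

```javascript
// Hypothetical sample in the transcript's layout (placeholder text, not real debate lines).
const sampleText =
  'JOE BIDEN\n\nPlaceholder words here.\n\n03:10 TRUMP\n\nOther placeholder words.\n';

// Collect the last capture group of every match of a global regex.
function linesAfterHeader(text, header) {
  const lines = [];
  let m;
  // exec() advances header.lastIndex on each call because of the g flag.
  while ((m = header.exec(text)) !== null) {
    lines.push(m[m.length - 1]);
  }
  return lines;
}

// Same header pattern as the Biden expression, but the spoken line is
// captured by the final (.*) group rather than matched after a lookbehind.
const bidenHeader = /((\d:)?(\d)?\d:\d\d )?((JOE )?BIDEN*)\n\n(.*)/g;

const bidenLines = linesAfterHeader(sampleText, bidenHeader);
console.log(bidenLines);
// [ 'Placeholder words here.' ]
```

The trade-off is that the header text is consumed by the match, so the result has to be read from the capture group rather than from the match itself.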

At this point I found myself with some functioning code and was able to roughly count the words by speaker using the code below. With more time I would like to create a visualization for this data!

var txt;
var trumpCounts = {};
var bidenCounts = {};
var wallaceCounts = {};
var trumpKeys = [];
var bidenKeys = [];
var wallaceKeys = [];

// Tally the words in an array of matched lines.
// counts maps word -> count; keys records each word the first time it is seen.
function countWords(lines, counts, keys) {
  // match() returns null when nothing matched
  if (!lines) {
    return;
  }

  var tokens = lines.join(' ').split(/\W+/);

  for (var i = 0; i < tokens.length; i++) {
    var word = tokens[i].toLowerCase();

    // split() can leave empty strings at the start or end
    if (word === '') {
      continue;
    }

    // hasOwnProperty avoids collisions with inherited names like "constructor"
    if (Object.prototype.hasOwnProperty.call(counts, word)) {
      counts[word]++;
    } else {
      counts[word] = 1;
      keys.push(word);
    }
  }
}

async function fetchURL(url) {
  try {
    const response = await fetch(url);

    txt = await response.text();

    // Trump only
    countWords(txt.match(/(?<=((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*)\n\n).*/g), trumpCounts, trumpKeys);

    // Biden only
    countWords(txt.match(/(?<=((\d:)?(\d)?\d:\d\d )?((JOE )?BIDEN*)\n\n).*/g), bidenCounts, bidenKeys);

    // Wallace only
    countWords(txt.match(/(?<=((\d:)?(\d)?\d:\d\d )?((CHRIS )?WALLACE*)\n\n).*/g), wallaceCounts, wallaceKeys);

    //console.log(trumpCounts);
    //console.log(bidenCounts);
    //console.log(wallaceCounts);

  } catch (err) {
    console.error(err);
  }
}

fetchURL('/itp/fall2020/debate/debate_transcript.txt');
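As a first step toward a visualization, the keys arrays could be sorted by count to list each speaker's most common words. This is a minimal sketch; `topWords`, `exampleCounts`, and `exampleKeys` are hypothetical names, and the numbers are made up rather than taken from the transcript.

```javascript
// Hypothetical counts in the same shape as trumpCounts / trumpKeys above.
const exampleCounts = { the: 12, people: 5, jobs: 7 };
const exampleKeys = ['the', 'people', 'jobs'];

// Return the n most frequent words as [word, count] pairs.
function topWords(counts, keys, n) {
  return keys
    .slice() // copy so the original key order is left untouched
    .sort((a, b) => counts[b] - counts[a]) // highest count first
    .slice(0, n)
    .map((word) => [word, counts[word]]);
}

const top = topWords(exampleCounts, exampleKeys, 2);
console.log(top);
// [ [ 'the', 12 ], [ 'jobs', 7 ] ]
```

Because the transcript is fetched asynchronously, a call like this would need to run after the counting has finished, for example at the end of the try block in fetchURL.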
