Brandon Roots
October 6, 2020

Word Counting

broots ITP, Programming A to Z

This week for Programming A to Z I made an attempt at creating a word counting algorithm. Coming off of the first round of (Presidential)? debates, I was feeling inspired to do some word counting to determine which words each of the speakers said most often.

The transcript I was able to locate online had a standard formatting that looked perfect to tackle with one of the tools we have been learning to use in Programming A to Z: Regular Expressions!

Looking over the transcript, there appeared to be a somewhat regular formatting: a timecode and a name in capital letters indicating who was speaking, followed by a line break and what they said. Starting here led me to this regular expression:

((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*|(JOE )?BIDEN*|(CHRIS )?WALLACE*)\n

What a mess! But it worked to select the timecode and names including some of the variations in the transcript.
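To see what the expression selects, here is a minimal sketch that runs it against a few made-up lines. The sample text is an assumption for illustration, not an excerpt from the real transcript, and the g flag is added so that every header is returned.

```javascript
// Speaker-header pattern from the post, with the global flag added.
const headerPattern = /((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*|(JOE )?BIDEN*|(CHRIS )?WALLACE*)\n/g;

// Hypothetical sample lines, not taken from the real transcript.
const sample = [
  "02:49 CHRIS WALLACE",
  "Sample question.",
  "PRESIDENT DONALD TRUMP",
  "Sample answer.",
  "1:02:05 BIDEN",
  "Sample reply.",
  ""
].join("\n");

// Each match includes the trailing line break, so trim it off for display.
const headers = sample.match(headerPattern).map((h) => h.trim());
console.log(headers);
// → ["02:49 CHRIS WALLACE", "PRESIDENT DONALD TRUMP", "1:02:05 BIDEN"]
```

One detail worth noting: in TRUMP* the * applies only to the final P, so the pattern would also accept TRUM or TRUMPP. That is harmless here, but it is not the same as making the whole name optional.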

Next I added a lookbehind to the regular expression to invert the selection, so that it matched the text following each name rather than the timecode and name themselves.

Then I took out the OR operator to create three distinct regular expressions: one for the moderator and one for each of the candidates.

At this point I was left with the following three regular expressions:

//Trump only
(?<=((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*)\n\n).*
//Biden only
(?<=((\d:)?(\d)?\d:\d\d )?((JOE )?BIDEN*)\n\n).*
//Wallace only
(?<=((\d:)?(\d)?\d:\d\d )?((CHRIS )?WALLACE*)\n\n).*

This was working well but took much more effort than I anticipated!
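As a rough illustration of what the lookbehind version selects, the sketch below applies the "Trump only" expression to a made-up sample. The sample text is hypothetical and assumes a blank line between each header and the speech, which is what the \n\n in the expression expects. It needs a JavaScript engine with lookbehind support.

```javascript
// "Trump only" pattern from the post, with the global flag added.
const trumpPattern = /(?<=((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*)\n\n).*/g;

// Hypothetical sample: header, blank line, then one line of speech.
const sampleText =
  "02:49 CHRIS WALLACE\n\nSample question.\n\n" +
  "03:10 PRESIDENT DONALD TRUMP\n\nFirst sample answer.\n\n" +
  "03:25 BIDEN\n\nSample reply.\n\n" +
  "TRUMP\n\nSecond sample answer.\n";

// Only the text after a matching header is returned, not the header itself.
const trumpLines = sampleText.match(trumpPattern);
console.log(trumpLines);
// → ["First sample answer.", "Second sample answer."]
```

Because . does not match a line break, .* stops at the end of the first line after each header. If a speaker's turn ran over several paragraphs in the transcript, the later paragraphs would not be selected by this pattern.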

My thought at this point was to take the regular expressions into JavaScript and run the text of each through a word counting algorithm. After some stumbles with the code and lots of time spent troubleshooting, I realized that the lookbehind I was attempting to use was not playing well with the match() function in JavaScript. After some further digging I learned that lookbehinds have not historically been supported by JavaScript at all; they were only added to the language in ES2018, and browser support was still uneven. Oh boy. While I had been testing in Safari and hitting my head against a wall, I found that Chrome did support lookbehinds.
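One way around the browser problem, sketched here as an alternative rather than what this post ended up using, is to match the header normally and pull out the following line with a capture group, which needs no lookbehind. The helper names supportsLookbehind and speakerLines and the sample text are hypothetical.

```javascript
// Returns true when the engine can compile a lookbehind assertion.
function supportsLookbehind() {
  try {
    new RegExp("(?<=a)b");
    return true;
  } catch (err) {
    return false;
  }
}

// Lookbehind-free alternative: match the optional timecode and the name,
// then capture the line that follows the blank line.
// namePattern is a regex source string such as "(?:JOE )?BIDEN*".
function speakerLines(text, namePattern) {
  const pattern = new RegExp(
    "(?:(?:\\d:)?\\d?\\d:\\d\\d )?(?:" + namePattern + ")\\n\\n(.*)",
    "g"
  );
  // matchAll yields one match object per header; group 1 is the speech line.
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

// Hypothetical sample, not taken from the real transcript.
const demoText =
  "03:10 PRESIDENT DONALD TRUMP\n\nFirst sample answer.\n\n" +
  "03:25 BIDEN\n\nSample reply.\n\n" +
  "TRUMP\n\nSecond sample answer.\n";

console.log(supportsLookbehind());
console.log(speakerLines(demoText, "(?:PRESIDENT DONALD )?TRUMP*"));
// → ["First sample answer.", "Second sample answer."]
```

The capture-group version selects the same lines as the lookbehind version on this sample, and String.prototype.matchAll has been available in Safari longer than lookbehind has.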

At this point I found myself with some functioning code and was able to roughly count the words by speaker using the code below. With more time I would like to create a visualization for this data!

var txt;
var trumpCounts = {};
var bidenCounts = {};
var wallaceCounts = {};
var trumpKeys = [];
var bidenKeys = [];
var wallaceKeys = [];

// Count the words in every line matched by `pattern`,
// updating `counts` and `keys` in place.
function countWords(text, pattern, counts, keys) {
  // match() returns null when nothing matches, so fall back to an empty array
  const lines = text.match(pattern) || [];
  // Split on non-word characters and drop the empty strings
  // that split() can leave at the start or end
  const tokens = lines.join(' ').split(/\W+/).filter(Boolean);

  for (const token of tokens) {
    const word = token.toLowerCase();
    if (counts[word]) {
      counts[word]++;
    } else {
      counts[word] = 1;
      keys.push(word);
    }
  }
}

async function fetchURL(url) {
  try {
    const response = await fetch(url);
    if (!response.ok) {
      throw new Error('Request failed with status ' + response.status);
    }

    txt = await response.text();

    // Trump only
    countWords(txt, /(?<=((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*)\n\n).*/g, trumpCounts, trumpKeys);

    // Biden only
    countWords(txt, /(?<=((\d:)?(\d)?\d:\d\d )?((JOE )?BIDEN*)\n\n).*/g, bidenCounts, bidenKeys);

    // Wallace only
    countWords(txt, /(?<=((\d:)?(\d)?\d:\d\d )?((CHRIS )?WALLACE*)\n\n).*/g, wallaceCounts, wallaceKeys);

    //console.log(trumpCounts);
    //console.log(bidenCounts);
    //console.log(wallaceCounts);
  } catch (err) {
    console.error(err);
  }
}

fetchURL('/itp/fall2020/debate/debate_transcript.txt');
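As a first step toward a visualization, the counts could be ranked so the most frequent words come first. This is only a sketch: topWords is a hypothetical helper, and the example counts are made up rather than taken from the debate.

```javascript
// topWords is a hypothetical helper. `counts` is an object shaped like
// trumpCounts above: each key is a word, each value is how often it occurred.
function topWords(counts, limit) {
  return Object.keys(counts)
    .sort((a, b) => counts[b] - counts[a]) // highest count first
    .slice(0, limit)
    .map((word) => [word, counts[word]]);
}

// Made-up counts for illustration only.
const exampleCounts = { the: 5, jobs: 2, people: 3 };
console.log(topWords(exampleCounts, 2));
// → [["the", 5], ["people", 3]]
```

With real data, very common words such as "the" would likely dominate the list, so filtering them out before ranking may give a more interesting picture.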
