Word Counting
This week for Programming A to Z I made an attempt at creating a word counting algorithm. Coming off of the first round of (Presidential)? debates, I was feeling inspired to do some word counting to determine which words were most commonly said by each of the speakers.
The transcript I was able to locate online had a standard formatting that looked perfect to tackle with one of the tools we have been learning to use in Programming A to Z: Regular Expressions!
Looking over the transcript, there appeared to be a somewhat regular format: a timecode and a name in capital letters indicating who was speaking, followed by a line break, and then what they said. Starting here led me to this regular expression:
((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*|(JOE )?BIDEN*|(CHRIS )?WALLACE*)\n
What a mess! But it worked to select the timecode and names including some of the variations in the transcript.
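As a quick check of what that expression selects, here is a minimal sketch that runs it over a few made-up lines. The sample text is an assumption for illustration only; the real transcript's exact layout may differ.

```javascript
// The header pattern from above, with the global flag so match() returns every hit.
const headerPattern = /((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*|(JOE )?BIDEN*|(CHRIS )?WALLACE*)\n/g;

// Hypothetical sample lines, not taken from the actual transcript.
const sample = [
  "1:02:15 PRESIDENT DONALD TRUMP",
  "",
  "Thank you very much.",
  "02:31 JOE BIDEN",
  "",
  "Here is the deal.",
  "WALLACE",
  "",
  "Gentlemen, please.",
  ""
].join("\n");

// Each match includes the trailing line break, so trim it off for display.
const headers = sample.match(headerPattern).map((h) => h.trim());
console.log(headers);
// [ '1:02:15 PRESIDENT DONALD TRUMP', '02:31 JOE BIDEN', 'WALLACE' ]
```

One thing to notice: the `*` applies only to the last letter, so `TRUMP*` means "TRUM followed by zero or more Ps". It still matches TRUMP, but it is looser than it looks.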
Next I added a lookbehind to the regular expression to invert the selection, so that it matched the line spoken after each name rather than the name itself. Then I took out the OR operator to create three distinct regular expressions: one for the moderator and one for each of the candidates.
At this point I was left with the following three regular expressions:
//Trump only
(?<=((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*)\n\n).*

//Biden only
(?<=((\d:)?(\d)?\d:\d\d )?((JOE )?BIDEN*)\n\n).*

//Wallace only
(?<=((\d:)?(\d)?\d:\d\d )?((CHRIS )?WALLACE*)\n\n).*
This was working well but took much more effort than I anticipated!
My thought at this point was to take the regular expressions into JavaScript and run the text for each speaker through a word counting algorithm. After some stumbles with the code and lots of time spent troubleshooting, I realized that the lookbehind I was attempting to use was not playing well with the match() function in JavaScript. After some further digging I learned that lookbehinds have not historically been supported by JavaScript at all (they only arrived in the language with ES2018, and browsers adopted them unevenly). Oh boy. I had been testing in Safari and hitting my head against a wall, and then found that Chrome did have experimental support for lookbehinds.
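For browsers without lookbehind support, one possible workaround is to match the header and the following line together and keep only a capture group. This is a minimal sketch, not what I ended up using: `linesFor` and its `namePattern` argument are hypothetical names, the sample text is made up, and it assumes the same "name, blank line, speech" layout the lookbehind versions rely on.

```javascript
// Collect the line spoken after each header for one speaker, without lookbehind.
// namePattern is a regex source fragment, e.g. "(?:JOE )?BIDEN" (an assumption).
function linesFor(text, namePattern) {
  const re = new RegExp(
    "(?:(?:\\d:)?\\d?\\d:\\d\\d )?(?:" + namePattern + ")\\n\\n(.*)",
    "g"
  );
  const lines = [];
  let m;
  // exec() with the g flag walks through the text one match at a time.
  while ((m = re.exec(text)) !== null) {
    lines.push(m[1]); // group 1 is the spoken line, the header is discarded
  }
  return lines;
}

// Hypothetical sample in the assumed transcript layout.
const demo =
  "0:15 CHRIS WALLACE\n\nGood evening.\n\n" +
  "0:42 JOE BIDEN\n\nHow you doing, man?\n\n" +
  "0:45 PRESIDENT DONALD TRUMP\n\nHow are you doing?\n\n" +
  "1:02 BIDEN\n\nI'm well.\n";

const bidenLines = linesFor(demo, "(?:JOE )?BIDEN");
const trumpLines = linesFor(demo, "(?:PRESIDENT DONALD )?TRUMP");
console.log(bidenLines); // [ 'How you doing, man?', "I'm well." ]
console.log(trumpLines); // [ 'How are you doing?' ]
```

Like the lookbehind versions, this only grabs the first line after each header, so a speaker turn that runs over several lines would need a wider pattern.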
At this point I found myself with some functioning code and was able to roughly count the words by speaker using the code below. With more time I would like to create a visualization for this data!
var txt;
var trumpCounts = {};
var bidenCounts = {};
var wallaceCounts = {};
var trumpKeys = [];
var bidenKeys = [];
var wallaceKeys = [];

async function fetchURL(url) {
  try {
    const response = await fetch(url);
    txt = await response.text();

    // Trump only
    let trump = txt.match(/(?<=((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*)\n\n).*/g);
    trump = trump.toString();
    let trumpTokens = trump.split(/\W+/);
    //console.log(trumpTokens);
    for (var i = 0; i < trumpTokens.length; i++) {
      var word = trumpTokens[i].toLowerCase();
      if (trumpCounts[word]) {
        trumpCounts[word]++;
      } else {
        trumpCounts[word] = 1;
        trumpKeys.push(word);
      }
    }

    // Biden only
    let biden = txt.match(/(?<=((\d:)?(\d)?\d:\d\d )?((JOE )?BIDEN*)\n\n).*/g);
    biden = biden.toString();
    let bidenTokens = biden.split(/\W+/);
    //console.log(bidenTokens);
    for (var i = 0; i < bidenTokens.length; i++) {
      var word = bidenTokens[i].toLowerCase();
      if (bidenCounts[word]) {
        bidenCounts[word]++;
      } else {
        bidenCounts[word] = 1;
        bidenKeys.push(word);
      }
    }

    // Wallace only
    let wallace = txt.match(/(?<=((\d:)?(\d)?\d:\d\d )?((CHRIS )?WALLACE*)\n\n).*/g);
    wallace = wallace.toString();
    let wallaceTokens = wallace.split(/\W+/);
    //console.log(wallaceTokens);
    for (var i = 0; i < wallaceTokens.length; i++) {
      var word = wallaceTokens[i].toLowerCase();
      if (wallaceCounts[word]) {
        wallaceCounts[word]++;
      } else {
        wallaceCounts[word] = 1;
        wallaceKeys.push(word);
      }
    }

    //console.log(trumpCounts);
    //console.log(bidenCounts);
    //console.log(wallaceCounts);
  } catch (err) {
    console.error(err);
  }
}

fetchURL('/itp/fall2020/debate/debate_transcript.txt');
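Since the three counting loops are identical apart from the variable names, they could probably be folded into one helper. The sketch below is one way to do that and to pull out the most common words. `countWords` and `topWords` are hypothetical names, and the sample sentence is made up rather than quoted from the transcript. It uses a Map instead of a plain object so that words like "constructor" cannot collide with built-in object properties.

```javascript
// Count how often each word appears in a string, using the same \W+ split.
function countWords(text) {
  const counts = new Map();
  for (const token of text.toLowerCase().split(/\W+/)) {
    if (token === "") continue; // split() can leave empty strings at the edges
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return counts;
}

// Return the n most frequent [word, count] pairs, highest count first.
function topWords(counts, n) {
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}

// Made-up sample sentence for illustration.
const sampleCounts = countWords("The people know. The people understand the facts.");
const top = topWords(sampleCounts, 2);
console.log(top); // [ [ 'the', 3 ], [ 'people', 2 ] ]
```

Note that splitting on `\W+` breaks contractions apart, so "I'm" is counted as "i" and "m". That is fine for a rough count, but it would need a different pattern for anything more careful.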