Word Counting
This week for Programming A to Z I made an attempt at creating a word counting algorithm. Coming off the first round of (Presidential)? debates, I was feeling inspired to do some word counting to determine which words each of the speakers said most often.
The transcript I was able to locate online had a standard formatting that looked perfect to tackle with one of the tools we have been learning to use in Programming A to Z: Regular Expressions!
Looking over the transcript, there appeared to be a fairly regular format: an optional timecode and a name in capital letters, indicating who was speaking, followed by a line break and what they said. Starting from there led me to this regular expression:
((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*|(JOE )?BIDEN*|(CHRIS )?WALLACE*)\n
What a mess! But it worked to select the timecode and names including some of the variations in the transcript.
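To make the pattern's behavior concrete, here is a minimal sketch that runs it against a made-up snippet. The sample text is hypothetical (it only imitates the transcript's layout), and the `g` flag is added so that `match()` returns every label rather than just the first.

```javascript
// Hypothetical sample in the transcript's layout; not the real transcript text.
const sample = [
  "CHRIS WALLACE",
  "Sample sentence from the moderator.",
  "01:20 PRESIDENT DONALD TRUMP",
  "Sample sentence one.",
  "02:49 BIDEN",
  "Sample sentence two.",
  ""
].join("\n");

// Same pattern as above, with the g flag so match() returns every label.
const labelPattern = /((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*|(JOE )?BIDEN*|(CHRIS )?WALLACE*)\n/g;

// Each match includes the trailing line break, so trim it off for display.
const labels = sample.match(labelPattern).map((label) => label.trim());
console.log(labels);
```

One detail worth knowing: a trailing `*` applies only to the character right before it, so `TRUMP*` means "TRUM followed by zero or more P" rather than "TRUMP followed by anything". It still matches the names as written in the transcript.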
Next I added a lookbehind to the regular expression to invert the selection, so that it matched the text following each name instead of the name itself.
Then I took out the OR operator to create three distinct regular expressions: one for the moderator and one for each of the candidates.
At this point I was left with the following three regular expressions:
//Trump only
(?<=((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*)\n\n).*
//Biden only
(?<=((\d:)?(\d)?\d:\d\d )?((JOE )?BIDEN*)\n\n).*
//Wallace only
(?<=((\d:)?(\d)?\d:\d\d )?((CHRIS )?WALLACE*)\n\n).*
This was working well but took much more effort than I anticipated!
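As a rough illustration of what the lookbehind does, the sketch below applies the Biden pattern to a made-up snippet. The sample text is hypothetical, and the example assumes an engine that supports lookbehind (for example a recent Chrome or Node.js).

```javascript
// Hypothetical sample: a speaker label, a blank line, then the spoken line.
const sample =
  "JOE BIDEN\n\nSample sentence one.\n\n" +
  "00:15 PRESIDENT DONALD TRUMP\n\nSample sentence two.\n\n" +
  "BIDEN\n\nSample sentence three.\n";

// The lookbehind checks that a Biden label and a blank line come right before
// the match, but the label itself is not part of the matched text.
const bidenPattern = /(?<=((\d:)?(\d)?\d:\d\d )?((JOE )?BIDEN*)\n\n).*/g;

const bidenLines = sample.match(bidenPattern);
console.log(bidenLines);
```

Because `.` does not match a line break, each match stops at the end of the first line after the label.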
My thought at this point was to take the regular expressions into JavaScript and run the text of each speaker through a word counting algorithm. After some stumbles with the code and lots of time spent troubleshooting, I realized that the lookbehind I was attempting to use was not playing well with the match() function in JavaScript. After some further digging I learned that lookbehinds have not historically been supported by JavaScript at all! They were only added to the language in ES2018, and Safari did not support them yet. Oh boy. While I had been testing in Safari and hitting my head against a wall, I found that Chrome did support lookbehinds.
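In hindsight, two small things might have saved some of that troubleshooting. The sketch below is one possible approach, not what I actually used: `supportsLookbehind` and `linesFor` are hypothetical helper names. The first checks whether the browser can compile a lookbehind at all, and the second gets the same lines with a capture group, which avoids lookbehind entirely.

```javascript
// Returns true when the engine can compile a lookbehind assertion.
function supportsLookbehind() {
  try {
    new RegExp("(?<=a)b");
    return true;
  } catch (err) {
    // Engines without lookbehind throw a SyntaxError when compiling it.
    return false;
  }
}

// Lookbehind-free alternative: match the label and blank line, and capture
// the line that follows. namePattern is regex source such as "(?:JOE )?BIDEN".
function linesFor(text, namePattern) {
  const pattern = new RegExp(
    "(?:(?:\\d:)?\\d?\\d:\\d\\d )?(?:" + namePattern + ")\\n\\n(.*)",
    "g"
  );
  // matchAll() yields one result per match; index 1 is the captured line.
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

// Hypothetical sample text in the transcript's layout.
const demo = "JOE BIDEN\n\nfirst line\n\nTRUMP\n\nsecond line\n";
console.log(supportsLookbehind());
console.log(linesFor(demo, "(?:JOE )?BIDEN"));
```

The capture-group version consumes the label as part of the match and then throws it away, so the result is the same list of spoken lines.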
At this point I found myself with some functioning code and was able to roughly count the words by speaker using the code below. With more time I would like to create a visualization for this data!
var txt;
var trumpCounts = {};
var bidenCounts = {};
var wallaceCounts = {};
var trumpKeys = [];
var bidenKeys = [];
var wallaceKeys = [];
// Count how often each word appears in the lines matched by pattern.
// Results go into counts (word -> count) and keys (words in first-seen order).
function countWords(text, pattern, counts, keys) {
  // match() returns null when nothing matches, so fall back to an empty array
  const lines = text.match(pattern) || [];
  const tokens = lines.join(' ').split(/\W+/);
  for (let i = 0; i < tokens.length; i++) {
    const word = tokens[i].toLowerCase();
    // split() can leave empty strings at the edges; skip them
    if (word === '') continue;
    // hasOwnProperty avoids clashes with inherited names like "constructor"
    if (Object.prototype.hasOwnProperty.call(counts, word)) {
      counts[word]++;
    } else {
      counts[word] = 1;
      keys.push(word);
    }
  }
}

async function fetchURL(url) {
  try {
    const response = await fetch(url);
    txt = await response.text();
    // Trump only
    countWords(txt, /(?<=((\d:)?(\d)?\d:\d\d )?((PRESIDENT DONALD )?TRUMP*)\n\n).*/g, trumpCounts, trumpKeys);
    // Biden only
    countWords(txt, /(?<=((\d:)?(\d)?\d:\d\d )?((JOE )?BIDEN*)\n\n).*/g, bidenCounts, bidenKeys);
    // Wallace only
    countWords(txt, /(?<=((\d:)?(\d)?\d:\d\d )?((CHRIS )?WALLACE*)\n\n).*/g, wallaceCounts, wallaceKeys);
    //console.log(trumpCounts);
    //console.log(bidenCounts);
    //console.log(wallaceCounts);
  } catch (err) {
    console.error(err);
  }
}
fetchURL('/itp/fall2020/debate/debate_transcript.txt');
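As a first step toward that visualization, the counts can be sorted to find each speaker's most common words. This is only a sketch: `topWords` is a hypothetical helper that is not part of the code above, and the example data is made up.

```javascript
// topWords is a hypothetical helper. counts is an object shaped like
// trumpCounts and keys is the matching array of words (like trumpKeys).
function topWords(counts, keys, n) {
  return keys
    .slice() // copy first so the original key order is left untouched
    .sort((a, b) => counts[b] - counts[a]) // highest count first
    .slice(0, n)
    .map((word) => [word, counts[word]]);
}

// Made-up data for illustration only.
const exampleCounts = { the: 3, people: 2, jobs: 1 };
const exampleKeys = ["the", "people", "jobs"];
console.log(topWords(exampleCounts, exampleKeys, 2));
```

Since fetchURL is an async function, the real counts are only filled in once its promise resolves, so a call such as topWords(trumpCounts, trumpKeys, 10) would need to run after that, for example inside a .then() on the fetchURL call.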