Perl Counting Spoken Words of a Silent Genius

In previous posts (CSV data parts 1, 2, 3 and 4) we had some movie quotes. The naive thinking to who has the longest quote, could be to find the length of the quote (i.e. the number of characters in the string that represents the quote).

But, that doesn’t always hold true when the longest quote is defined as which quote has the most words – and by words we mean any alpha and/or numeric characters including any accompanying commas, full-stops, exclamation marks etc. separated by a space or spaces.

As an example, here’s Arnie…

He'll be back: Arnold Schwarzenegger confirms 'Terminator ...
$ string="I'll be back"
$ echo ${#string}

And here’s Julie and Dick…

$ string="Supercalifragilisticexpialidocious"
$ echo ${#string}

So, in terms of words spoken, we can see Arnie has 3 words. Julie and Dick have 1 word. There isn’t a single space from start to finish of “Supercalifragilisticexpialidocious”. Therefore, if we define a quote as being longer than another quote by the amount of individual words, using the length function is the wrong option. We only needed to use the (Linux) command line to determine this.

Here’s a CSV file, and we’ve added one of the most powerful speeches from a movie to address humankind.

How would we go about creating a word count for each quote?

1,The Terminator,1984,5,"I'll be back"
2,Terminator 2: Judgement Day,1991,5,"I need your clothes, your boots and your motorcycle"
3,Predator,1987,4,"Get to the chopper!"
4,Kindergarten Cop,1990,3,"I'm a cop, you idiot! I'm Detective John Kimble!"
5,Wild at Heart,1990,5,"Did I ever tell ya that this here jacket represents a symbol of my individuality, and my belief in personal freedom?"
6,Con Air,1997,4,"Put... the bunny... back... in the box."
7,Face/Off,1997,1,"You'll be seeing a lot of changes around here. Papa's got a brand new bag."
8,Kick-Ass,2010,4,"Tool up, honey bunny. It's time to get bad guys."
9,Some randon film,2000,1,"Some random quote."
10,Another random film,2001,1,"Another random quote."
11,Spinal Tap,1984,5,"well,it's one louder, isn't it?"
12,Snake Eyes,1998,3,"I saw you and you saw me, don't pretend like you don't know who I am girly man"
13,Raging Bull,1980,5,"""I could've been a contender"""
14,Dracula,1958,4,"I am Dracula"
15,Mary Poppins,1964,4,"Supercalifragilisticexpialidocious"
16,The Great Dictator,1940,5,"I’m sorry, but I don’t want to be an emperor. That’s not my business. I don’t want to rule or conquer anyone. I should like to help everyone - if possible - Jew, Gentile - black man - white. We all want to help one another. Human beings are like that. We want to live by each other’s happiness - not by each other’s misery. We don’t want to hate and despise one another. In this world there is room for everyone. And the good earth is rich and can provide for everyone. The way of life can be free and beautiful, but we have lost the way.Greed has poisoned men’s souls, has barricaded the world with hate, has goose-stepped us into misery and bloodshed. We have developed speed, but we have shut ourselves in. Machinery that gives abundance has left us in want. Our knowledge has made us cynical. Our cleverness, hard and unkind. We think too much and feel too little. More than machinery we need humanity. More than cleverness we need kindness and gentleness. Without these qualities, life will be violent and all will be lost. The aeroplane and the radio have brought us closer together. The very nature of these inventions cries out for the goodness in men - cries out for universal brotherhood - for the unity of us all. Even now my voice is reaching millions throughout the world - millions of despairing men, women, and little children - victims of a system that makes men torture and imprison innocent people. To those who can hear me, I say - do not despair. The misery that is now upon us is but the passing of greed - the bitterness of men who fear the way of human progress. The hate of men will pass, and dictators die, and the power they took from the people will return to the people. And so long as men die, liberty will never perish. Soldiers! don’t give yourselves to brutes - men who despise you - enslave you - who regiment your lives - tell you what to do - what to think and what to feel! Who drill you - diet you - treat you like cattle, use you as cannon fodder. Don’t give yourselves to these unnatural men - machine men with machine minds and machine hearts! You are not machines! You are not cattle! You are men! You have the love of humanity in your hearts! You don’t hate! Only the unloved hate - the unloved and the unnatural! Soldiers! Don’t fight for slavery! Fight for liberty! In the 17th Chapter of St Luke it is written: “the Kingdom of God is within man” - not one man nor a group of men, but in all men! In you! You, the people have the power - the power to create machines. The power to create happiness! You, the people, have the power to make this life free and beautiful, to make this life a wonderful adventure. Then - in the name of democracy - let us use that power - let us all unite. Let us fight for a new world - a decent world that will give men a chance to work - that will give youth a future and old age a security. By the promise of these things, brutes have risen to power. But they lie! They do not fulfil that promise. They never will! Dictators free themselves but they enslave the people! Now let us fight to fulfil that promise! Let us fight to free the world - to do away with national barriers - to do away with greed, with hate and intolerance. Let us fight for a world of reason, a world where science and progress will lead to all men’s happiness. Soldiers! in the name of democracy, let us all unite!"

Perl can accomplish this quite easily.

# create an extra hash key value - word count from quote       #
# sort the order of words in quotes lowest to highest          #
# or ID if words in quotes are the same from different records #

use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(pp);

my $csv = Text::CSV->new(
   sep_char  => ',',
   binary    =>  1,
   auto_diag =>  1,
   eol       => $\

my $file = 'Arnie_Nic.csv' ;
open (my $fh,'<:encoding(utf8)',$file) or die "Cannot open file: $!";

my @fields = @{ $csv->getline($fh) } ;

# create the holding array that the hash is pushed onto

my @array ;
while (my $row = $csv->getline($fh)) {
	# create the hash
	my %data ;
	# create the hash slice
	# de-reference the array ref '$row'
	@data{@fields} = @{$row} ;
	# add another entry to the hash
	# key is Quotecount value is number of words in Quotes
	$data{Quotecount} = map { $data{Quotes} } split ( /\s+/,$data{Quotes} );
	# push the hash onto the holding array
	# to create an array of hashes ;
	push (@array,\%data);
close $fh;

# sort by amount of words in Quote
# if words from different quotes are equal
# sort by ID

sub Quote_sorter {
	$a->{Quotecount} <=> $b->{Quotecount} ||
	$a->{ID}         <=> $b->{ID}

my @sorted = sort Quote_sorter (@array);

printf "%-3s %-30s %-12s\n" , 'ID', 'Movie', 'WordsinQuote';
map { printf "%-3d %-30s %03d\n" , $_->{ID}, $_->{Movie}, $_->{Quotecount} }  @sorted;

Line 41 is doing all the heavy work. We’ve created an array of hashes from the file, and done it in such a way that we haven’t called column_names and getline_hr from the Text::CSV module.

We need’t have created this structure. We could have simply looped through the file and printed an added word count of quotes with the surrounding data from the CSV file.

But we’ve added some features on lines 54 – 62 to produce a nice report. Lines 54 – 57 are a subroutine to sort the array of hashes data structure. On line 59, we assign a new array (of hashes) which holds the sorted order of the data. See the comments on lines 50 – 52 for the theory to this sorting.

Line 61 prints a header. Line 62 prints the sorted array of hashes. We’ve opted for ID Movie and the newly created Quotecount, and omitted Year, Rating and Quotes. We’ve utilised printf to produce leading zeros for all but the word count for the quote from The Great Dictator.

ID  Movie                          WordsinQuote
15  Mary Poppins                   001
1   The Terminator                 003
9   Some randon film               003
10  Another random film            003
14  Dracula                        003
3   Predator                       004
11  Spinal Tap                     005
13  Raging Bull                    005
6   Con Air                        007
2   Terminator 2: Judgement Day    009
4   Kindergarten Cop               009
8   Kick-Ass                       010
7   Face/Off                       015
12  Snake Eyes                     018
5   Wild at Heart                  021
16  The Great Dictator             643

Until the next time.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: