Description
Introduction
We’v e just completed our review of the most common Python datatypes, and you’ve been exposed to
some simple operations, functions and methods for manipulating these datatypes. In this assignment,
we’re going to develop some code that relies greatly on the string datatype as well as all sorts of iteration.
First, a few general points.
(1) This is a challenging project, and you have been given two weeks to work on it. If you wait to
begin, you will almost surely fail to complete it. The best strategy for success is to work on the
project a little bit every day. To help incentivize you to do so, we will provide preliminary
feedback to a partial draft you will upload by the draft due date shown at the top of this page
(more on this below).
(2) The work you hand in should be only your own; you are not to work with or discuss your
work with any other student. Sharing your code or referring to code produced by others is a
violation of the student honor code and will be dealt with accordingly.
(3) Help is always available from the TAs or the instructor during their posted office hours. You
may also post general questions on the discussion board (although you should never post your
Python code). I hav e opened a discussion board topic specifically for HW1.
Background
In this assignment we will be processing text. With this handout, you will find a file containing the entire
text of The Wind in the Willows,achildren’s novel published in 1908. At some point during the course of
this assignment, I will provide you additional texts for you to test your code on; updated versions of this
handout may also be distributed as needed. You should think of this project as building tools to read in,
manipulate, and analyze these texts.
The rest of these instructions outline the functions that you should implement, describing their
input/output behaviors. As usual, you should start by completing the hawkid() function so that we may
properly credit you for your work. Test hawkid() to ensure it in fact returns your own hawkid as the
only element in a single element tuple. As you work on each function, test your work on the document
provided to make sure your code functions as expected. Feel free to upload versions of your code as you
go; we only grade the last version uploaded (although we do provide preliminary feedback on a draft
version; see below), so this practice allows you to “lock in” working partial solutions prior to the deadline.
Finally, some general guidance.
(1) You will be graded on both the correctness and the quality of your code, including the quality of
your comments!
(2) As usual, respect the function signatures provided.
(3) Be careful with iteration; always choose the most appropriate form of iteration
(comprehension, while, or for) as the function mandates. Poorly selected iterative forms may be
graded down, even if they work!
(4) Finally, to incentivize getting an early start, you should upload an initial version of your
homework by midnight Friday, September 22 (that’s one week from the start of the
assignment). We will use the autograder to provide feedback on the first two functions, getBook()
1
and cleanup(), only. We reserve the right to deduct points from the final homework grade for
students who do not meet this preliminary milestone.
def getBook(file):
This function should open the file named file, and return the contents of the file formatted as a single
string. During processing, you should (1) remove any blank lines and, (2) remove any lines consisting
entirely of CAPITALIZED WORDS. To understand why this is the case, inspect the wind. txt sample file
provided. Notice that the frontspiece (title, index and so on) consists of ALL CAPS, and each CHAPTER
TITLE also appears on a line in ALL CAPS.
def cleanup(text):
This function should take as input a string such as might be returned by getBook() and return a new string
with the following modifications to the input:
Remove possessives, i.e., “’s” at the end of a word;
Remove parenthesis, commas, colons, semicolons, hyphens and quotes (both single and double); and
Replace ’!’ and ’?’ with ’.’
A condition of this function is that it should be easy to change or extend the substitutions made. In other
words, a function that steps through each of these substitutions in an open-coded fashion will not get full
credit; write your function so that the substitutions can be modified or extended without having to
significantly alter the code. Here’sahint: if your code for this function is more than a few lines long,
you’re probably not doing it right.
def extractWords(text):
This function should take as input a string such as might be returned by cleanup() and return an ordered
list of words from the input string. The words returned should all be lowercase, and should contain only
characters, no punctuation.
def extractSentences(text):
This function returns a list of sentences, where each sentence consists of a string terminated by a ’.’.
def countSyllables(word):
This function takes as input a string representing a word (such as one of the words in the output from
extractWords(), and returns an integer representing the number of syllables in that word. One problem is
that the definition of syllable is unclear. As it turns out, syllables are amazingly difficult to define in
English!
For the purpose of this assignment, we will define a syllable as follows. First, we strip any trailing ’s’ or
’e’ from the word (the final ’e’ in English is often, but not always, silent). Next, we scan the word from
beginning to end, counting each transition between a consonant and a vowel, where vowels are defined as
the letters ’a’, ’e’, ’i’, ’o’ and ’u’. So, for example, if the word is “creeps,” we strip the trailing ’s’ to get
“creep” and count one leading vowel (the ’e’ following the ’r’), or a single syllable. Thus:
>>> c oun t Sy l l a bl e s( ’ cr e ep s ’ )
1
>>> c oun t Sy l l a bl e s( ’ d evo t i on’ )
3
>>> c oun t Sy l l a bl e s( ’ c r y’ )
1
2
The last example hints at the special status of the letter ’y’, which is considered a vowel when it follows a
non-vowel, but considered a non-vowel when it follows a vowel. So, for example:
>>> c oun t Sy l l a bl e s( ’ c oyo t e’ )
2
Here, the ’y is a non-vowel so the two ’o’s correspond to 2 transitions, or 2 syllables (don’t forget we
stripped the trailing ’e’). And while that’s not really right (’coyote’ has 3 syllables, because the final ’e’ is
not silent here), it does properly recognize that the ’y’ is acting as a consonant.
You will find this definition of syllable works pretty well for simple words, but fails for more complex
words; English is a complex language with many orthographic bloodlines, so it may be unreasonable to
expect a simple definition of syllable! Consider, for example:
>>> c oun t Sy l l a bl e s( ’ c ons ume s ’ )
3
>>> c oun t Sy l l a bl e s( ’ sp l a s he s ’ )
2
Here, it is tempting to treat the trailing -es as something else to strip, but that would cause ’splashes’ to
have only a single syllable. Clearly, our solution fails under some conditions; but I would argue it is close
enough for our intended use.
def ars(text):
Next, we turn our attention to computing a variety of readability indexes. Readability indexes hav e been
used since the early 1900’s to determine if the language used in a book or manual is too hard for a
particular audience. At that time, of course, most of the population didn’t hav e a high school degree, so
employers and the military were concerned that their instructions or manuals might be too difficult to
read. Today, these indexes are largely used to rate books by difficulty for younger readers.
The Automated Readability Score, or ARS, like all the indexes here, is based on a sample of the text
(we’ll be using the text in its entirety).
ht tp: / /www . r e adab i lit yf ormu l as . c om/ a utoma t e d- r e a dab i lit y- i nd ex.php
The ARS is based on two computed paramters; the average number of characters per word (cpw) and the
av erage number of words per sentence (wps). The formula is:
ARS = 4. 71 * cpw + 0. 5 * wps − 21. 43
were the weights are fixed as shown. Texts with longer words or sentences have a greater ARS; the value
of the ARS is supposed to approximate the US grade level. Thus a text with an ARS of 12 corresponds
roughly to high school senior reading level.
def fki(text):
The Flesch-Kincaid Index, or FKI, is also based on the average number of words per sentence (wps), but
instead of characters per word (cpw) like the ARS, it uses syllables per word (spw).
ht tp: / /www . r e adab i lit yf ormu l as . c om/ fles ch- gr ade – l eve l – r e a dab i lit y- f o rmu l a . php
The formula is:
FKI = 0. 39 * wps + 11. 8 * spw − 15. 59
As with the ARS, a greater value indicates a harder text. This is the scale used by the US military; like
with the ARS, the value should approximate the intended US grade level. Of course, as the FKI was
3
developed in the 1940’s, it was intended to be calculated by people who had no trouble counting syllables
without relying on an algorithm to do so.
def cli(text):
The Coleman-Liau Index, or CLI, also approximates the US grade level, but it is a more recent index,
developed to take advantage of computers.
ht tp: / /www . r e adab i lit yf ormu l as . c om/ c ol ema n -l i au – r e adab i lit y- f o rmu l a . php
The CLI thus uses average number of characters per 100 words (cphw) and average number of sentences
per 100 words (sphw), and thus avoids the difficulties encountered with counting syllables by computer.
CLI = 0. 0588 * cphw − 0. 296 * sphw − 15. 8
Testing Your Code
I hav e provided a function, evalBook(), that you can use to manage the process of evaluating a book. Feel
free to comment out readability indexes you haven’t yet tried to use.
I’ve also provided three texts for you to play with. The first, ’test.txt’, is a simple passage taken from the
readbility formulas website listed above. The output my solution produces is:
>>> eva lBo o k (’ t es t . t xt ’ )
Ev a l u at i ng TEST. TXT :
10 . 59 Au t oma t e d Re a dab i lit y Sco r e
10 . 17 Fl e s c h -Ki n ca i d I nd ex
7.28 Col ema n -L i au I nd ex
The second, ’wind.txt’, is the complete text to The Wind in the Willows by Kenneth Grahame. My output:
>>> eva lBo o k ( ’wi nd . t xt ’ )
Ev a l u at i ng WIND. TXT :
7.47 Automa t e d Re a dab i lit y Sco r e
7.63 F l e s c h -Ki n ca i d I nd ex
7.23 Col ema n -L i au I nd ex
as befits a book intended for young adults. Finally, ’iliad.txt’, is an English translation of Homer’s Iliad.
My output:
>>> eva lBo o k (’ i l i ad . txt ’ )
Ev a l u at i ng IL IAD . TXT:
12 . 36 Au t oma t e d Re a dab i lit y Sco r e
10 . 50 Fl e s c h -Ki n ca i d I nd ex
9.46 Col ema n -L i au I nd ex
which I think, correctly, establishes the relative complexity of the language used.
4
Sequences indexing
Base Types
Python 3 Cheat Sheet ©2012-2013 – Laurent Pointal
Licence Creative Commons Attribution 2
Official Python documentation on
http://docs.python.org/py3k
int 783 0 -192
float 9.23 0.0 -1.7e-6
bool True False
str “One\nTwo” ‘I\’m’
“””X\tY\tZ
1\t2\t3″””
10-6
tab char
new line
multiline
Container Types
list [1,5,9] [“x”,11,8.9] [“word”] []
tuple (1,5,9) 11,”y”,7.4 (“word”,) ()
dict
{1:”one”,3:”three”,2:”two”,3.14:”π”}
{“key”:”value”}
set
{}
{1,9,3,0}
◾ ordered sequence, fast index access, repeatable values
set()
◾ no a priori order, unique key, fast key access ; keys = base types or tuples
{“key1″,”key2″}
immutable
Variables assignment
x = 1.2+8+sin(0)
y,z,r = 9.2,-7.6,”bad”
value or computed expression
variable name (identifier)
a‥zA‥Z_ followed by a‥zA‥Z_0‥9
◽ diacritics allowed but should be avoided
◽ language keywords forbidden
◽ lower/UPPER case discrimination
expression with just comas
immutable,
ordered sequence of chars
dictionary
integer, float, boolean, string
container with several
values (here a tuple)
variables
names
Identifiers
☺ a toto x7 y_max BigOne
☹ 8y and
x+=3 x-=2
increment
decrement
Conversions
for lists, tuples, strings, …
int(“15”)
float(“-11.24e8”)
bool
str(78.3) repr(“Text”)
can specify integer number base in 2nd parameter
int(15.56) truncate decimal part (round(15.56) for rounded integer)
and for litteral representation
type(expression)
use comparators (with ==, !=, <, >, …), logical boolean result
see other side for string formating allowing finer control
“:”.join([‘toto’,’12’,’pswd’]) ‘toto:12:pswd’
joining string sequence of strings
“words with spaces”.split() [‘words’,’with’,’spaces’]
“1,4,8,2”.split(“,”) [‘1′,’4′,’8′,’2′]
splitting string
dict([(3,”three”),(1,”one”)]) {1:’one’,3:’three’}
list(“abc”) [‘a’,’b’,’c’] use each element
from sequence
set([“one”,”two”]) {‘one’,’two’}
use each element
from sequence
lst=[11, 67, “abc”, 3.14, 42, 1968] lst[1]→67
lst[-2]→42
0 1 2 3 4 5
-6 -5 -4 -3 -2 -1
positive index individual access to items via [index]
negative index
0 1 2 3 4 5 6
negative slice -6 -5 -4 -3 -2 -1
positive slice
access to sub-sequences via [start slice:end slice:step]
len(lst) 6
lst[1:3]→[67,”abc”]
lst[::2]→[11,”abc”,42]
lst[-3:-1]→[3.14,42]
lst[:3]→[11,67,”abc”]
lst[:-1]→[11,67,”abc”,3.14,42]
lst[4:]→[42,1968]
lst[1:-1]→[67,”abc”,3.14,42]
lst[:]→[11,67,”abc”,3.14,42,1968]
Missing slice indication → from start / up to end.
Conditional Statement
if x==42:
# block if logical expression x==42 is true
print(“real truth”)
elif x>0:
# else block if logical expression x>0 is true
print(“be positive”)
elif bFinished:
# else block if boolean variable bFinished is true
print(“how, finished”)
else:
# else block for other cases
print(“when it’s not”)
Boolean Logic Statements Blocks
parent statement:
statements block 1…
⁝
parent statement:
statements block 2…
⁝
next statement after block 1
indentation !
Comparators: < > <= >= == !=
≤ ≥ = ≠
a and b
a or b
not a
logical and
logical or
logical not
one or other or both
both simultaneously
if logical expression:
statements block
statements block executed
only if a condition is true
True
False
true constant value
false constant value
can go with several elif, elif… and only one final else,
example :
lst[-1]→1968
lst[0]→11
last one
first one
x=None « undefined » constant value
Maths
Operators: + – * / // % **
× ÷
integer ÷ ÷ remainder
a
b
from math import sin,pi…
‘ escaped
abs(-3.2)→3.2
round(3.57,1)→3.6
☝ floating point numbers… approximated values!
sin(pi/4)→0.707…
cos(2*pi/3)→-0.4999…
sqrt(81)→9.0 √
log(e**2)→2.0
angles in radians
acos(0.5)→1.0471…
etc. (cf doc)
(1+5.3)*2→12.6
for variables, functions,
modules, classes… names
Mémento v1.2.2
str as an ordered sequence of chars
On mutable sequences, usable to remove del lst[3:5] and to modify with assignment lst[1:4]=[‘hop’,9]
key/value associations
Display / Input
print(“v=”,3,”cm :”,x,”,”,y+4)
print options:
◽ sep=” ” (items separator, default space)
◽ end=”\n” (end of print, default new line)
◽ file=f (print to file, default standard output)
items to display: litteral values, variables, expressions
loop on dict/set = loop on sequence of keys
statements block executed as long Conditional loop statement
as condition is true
while logical expression:
statements block
s = 0
i = 1
while i <= 100: # statement executed as long as i ≤ 100 s = s + i**2 i = i + 1 print(“sum:”,s) initializations before the loop condition with at least one variable value (here i) s= ∑ i=1 i=100 i 2 ☝ make condition variable change statements block executed for each Iterative loop statement item of a container or iterator for variable in sequence: statements block s = “Some text” cnt = 0 for c in s: if c == “e”: cnt = cnt + 1 print(“found”,cnt,”‘e'”) Go over sequence’s values Count number of e in the string Go over sequence’s index ◽ modify item at index ◽ access items around index (before/after) lst = [11,18,9,12,23,4,17] lost = [] for idx in range(len(lst)): val = lst[idx] if val > 15:
lost.append(val)
lst[idx] = 15
print(“modif:”,lst,”-lost:”,lost)
Limit values greater
than 15, memorization
of lost values.
☝ be careful of inifinite loops !
initializations before the loop
loop variable, value managed by for statement
use slices to go over a subset of the sequence
computed result after the loop
Generator of int sequences
Files
s = input(“Instructions:”)
☝ input always returns a string, convert it to required type
(cf boxed Conversions on on ther side).
frequently used in
for iterative loops
range returns a « generator », converts it to list to see
the values, example:
print(list(range(4)))
range(5) 0 1 2 3 4
range(3,8) 3 4 5 6 7
range(2,12,3) 2 5 8 11
range([start,]stop [,step])
f = open(“fil.txt”,”w”,encoding=”utf8″)
storing data on disk, and reading it back
opening mode
◽ ‘r’ read
◽ ‘w’ write
◽ ‘a’ append…
encoding of
chars for text
files:
utf8 ascii
latin1 …
name of file
on disk
(+path…)
file variable
for operations
f.write(“hello”)
writing
☝ text file → read /write only
strings, convert from/to required
type.
reading
s = f.read(4)
for line in f :
# line processing block
cf functions in modules os and os.path
if char count not
specified, read
whole file
s = f.readline()
read next
line
f.close()☝ don’t forget to close file after use
Pythonic automatic close : with open(…) as f:
very common: iterative loop reading lines of a text file
Function definition
def fctname(p_x,p_y,p_z):
“””documentation”””
# statements block, res computation, etc.
return res
function name (identifier)
result value of the call.
if no computed result to
return: return None
☝ parameters and all of this bloc
only exist in the block and during
the function call (“black box”)
named parameters
Function call
r = fctname(3,i+2,2*i)
one argument per parameter
retrieve returned result (if necessary)
empty string if end of file
default 0 not included
“model {} {} {}”.format(x,y,r)
“{selection:formating!conversion}”
◽ Selection :
2
x
0.nom
4[key]
0[2]
str
Strings formating
formating directives values to format
“{:+2.3f}”.format(45.7273)
→’+45.727′
“{1:>10s}”.format(8,”toto”)
→’ toto’
“{!r}”.format(“I’m”)
→'”I\’m”‘
◽ Conversion : s (readable text) or r (litteral representation)
< > ^ = 0 at start for filling with 0
integer: b binary, c char, d decimal (default), o octal, x or X hexa…
float: e or E exponential, f or F fixed point, g or G appropriate (default),
% percent
string : s …
◽ Formating :
fillchar alignment sign minwidth.precision~maxwidth type
+ – space
Examples
len(c) Operations on containers
min(c) max(c) sum(c)
sorted(c)
reversed(c)
☝ modify original list
lst.append(item)
lst.pop(idx)
lst.sort() lst.reverse()
c.index(val) c.count(val)
→ items count
→ sorted copy
→ reverse iterator
→ position → events count
lst.extend(seq)
add item at end
add sequence of items at end
lst.insert(idx,val) insert item at index
lst.remove(val) remove first item with value
remove item at index and return its value
sort / reverse list in place
Operations on dictionaries Operations on sets
Operators:
| → union (vertical bar char)
& → intersection
– ^ → difference/symetric diff
< <= > >= → inclusion relations
d.update(d2) update/add
associations
Note: For dictionaries and set, these
operations use keys.
Special for sequence containeurs (lists, tuples, strings) :
val in c → boolean, membersihp operator in (absence not in)
Operations on lists
d[key]→value del d[clé]
d[key]=value
d.keys()
d.clear()
d.items()
d.values() views on keys, values
associations
d.pop(clé)
s.update(s2)
s.add(key) s.remove(key)
s.discard(key)
c*5 → duplicate c+c2 → concatenate
Loop control
break
continue
immediate exit
next iteration
Go simultaneously over sequence’s index and values:
for idx,val in enumerate(lst):
enumerate(c)→ iterator on (index,value)