# Python Refresher I: Exercise Solutions

The 36 part-of-speech tags used in the Penn Treebank Project ([overview](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)):

In [1]:
penn_pos_tags = "CC CD DT EX FW IN JJ JJR JJS " \
                "LS MD NN NNS NNP NNPS PDT POS PRP " \
                "PRP$ RB RBR RBS RP SYM TO UH VB " \
                "VBD VBG VBN VBP VBZ WDT WP WP$ WRB"

### 1) idx2tag

Transform the string into a list.

In [2]:
idx2tag = penn_pos_tags.split()

assert 36 == len(idx2tag)
assert 'TO' == idx2tag[24]
assert 24 == idx2tag.index('TO')

print(idx2tag)

['CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']


### 2) tag2idx

Create a dictionary that inverts the list, mapping each tag to its index.

In [3]:
tag2idx = {tag: idx for idx, tag in enumerate(idx2tag)}

assert all(idx == tag2idx[tag]
           for idx, tag in enumerate(idx2tag))

print(tag2idx)

{'IN': 5, 'EX': 3, 'JJ': 6, 'WDT': 32, 'NNS': 12, 'NN': 11, 'DT': 2, 'MD': 10, 'VBG': 28, 'RBR': 20, 'VB': 26, 'RP': 22, 'VBP': 30, 'JJR': 7, 'TO': 24, 'CD': 1, 'VBN': 29, 'UH': 25, 'WRB': 35, 'RB': 19, 'NNP': 13, 'JJS': 8, 'POS': 16, 'WP': 33, 'PRP$': 18, 'NNPS': 14, 'PRP': 17, 'PDT': 15, 'VBD': 27, 'WP$': 34, 'LS': 9, 'CC': 0, 'FW': 4, 'SYM': 23, 'RBS': 21, 'VBZ': 31}


### 3) sent

Encode the sentence below as a list of integers, based on tag2idx.

In [4]:
sent = "DT NN PRP MD VBG VBZ RB DT JJ NN , " \
       "CC PRP MD VB DT NN VBN IN NN ."
    
sent_int = []
for tag in sent.split():
    try:
        sent_int.append(tag2idx[tag])
    except KeyError:
        # The tag isn't in the list and dictionary yet, so let's add it:
        n_tags = len(idx2tag)
        idx2tag.append(tag)
        tag2idx[tag] = n_tags
        sent_int.append(n_tags)

# idx2tag and tag2idx should still correspond to one another.
assert all(idx == tag2idx[tag]
           for idx, tag in enumerate(idx2tag))

sent_int

[2, 11, 17, 10, 28, 31, 19, 2, 6, 11, 36, 0, 17, 10, 26, 2, 11, 29, 5, 11, 37]
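
As a quick sanity check on the encoding step, the integers can be mapped back to tags through `idx2tag`. The sketch below is self-contained and uses a toy three-tag vocabulary rather than the full Penn tag set; the helper name `encode` is an assumption introduced only for this illustration.

```python
# Toy vocabulary (an assumption for this sketch, not the full Penn tag set).
idx2tag = ["DT", "NN", "VBZ"]
tag2idx = {tag: idx for idx, tag in enumerate(idx2tag)}


def encode(tags):
    """Hypothetical helper: map tags to ints, growing both lookups for unseen tags."""
    encoded = []
    for tag in tags:
        if tag not in tag2idx:
            # Unseen tag: give it the next free index in both structures.
            tag2idx[tag] = len(idx2tag)
            idx2tag.append(tag)
        encoded.append(tag2idx[tag])
    return encoded


tags = "DT NN VBZ DT NN .".split()
sent_int = encode(tags)

# Decoding through idx2tag should give back the original tags.
decoded = [idx2tag[idx] for idx in sent_int]
assert decoded == tags

print(sent_int)  # [0, 1, 2, 0, 1, 3]
```

The round trip only works because every new tag is added to `idx2tag` and `tag2idx` in the same step, which is what the assertion after the loop above checks.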

### 4) one-hot encoding

Write a function that creates one-hot arrays.

In [5]:
def one_hot(idx, dim):
    """
    Returns a one-hot vector.

    Arguments:
    idx: An int giving the position of the `1`.
    dim: The length of the vector.

    Returns:
    A list of ints that is a one-hot vector.
    """
    # Use a name other than `one_hot` so the local doesn't shadow the function.
    vec = [0] * dim
    vec[idx] = 1
    return vec

# Let's try it out:
print(one_hot(2, 5))
print(one_hot(1, 6))
print(one_hot(5, 6))

[0, 0, 1, 0, 0]
[0, 1, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 1]
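
One detail worth knowing: because Python lists accept negative indices, a call like `one_hot(-1, 5)` succeeds and silently sets the last position. If that is not wanted, a stricter variant could look like the following sketch. The name `one_hot_checked` is hypothetical and not part of the exercise.

```python
def one_hot_checked(idx, dim):
    """Hypothetical stricter variant: rejects indices outside 0..dim-1."""
    if not 0 <= idx < dim:
        raise IndexError(f"idx {idx} out of range for dim {dim}")
    vec = [0] * dim
    vec[idx] = 1
    return vec


print(one_hot_checked(2, 5))  # [0, 0, 1, 0, 0]

try:
    one_hot_checked(-1, 5)
except IndexError as err:
    print(err)  # idx -1 out of range for dim 5
```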


### 5) matrix

Encode the sentence as a list of one-hot arrays.

In [6]:
dim = len(tag2idx)
matrix = [one_hot(idx, dim) for idx in sent_int]

assert len(matrix) == len(sent_int)
assert len(matrix[0]) == len(tag2idx)
assert matrix[0] == one_hot(tag2idx['DT'], len(tag2idx))
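
Since each row contains exactly one `1`, the matrix can be decoded back into the integer sequence. The sketch below shows this on toy data (the values of `sent_int` and `dim` here are made up for the illustration) and redefines `one_hot` so it runs on its own.

```python
def one_hot(idx, dim):
    vec = [0] * dim
    vec[idx] = 1
    return vec


# Toy data, chosen only for this sketch.
sent_int = [2, 0, 3, 2]
dim = 4
matrix = [one_hot(idx, dim) for idx in sent_int]

# Every row sums to 1, and the position of the 1 is the original index.
assert all(sum(row) == 1 for row in matrix)
recovered = [row.index(1) for row in matrix]
assert recovered == sent_int

print(recovered)  # [2, 0, 3, 2]
```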

### 6) writing and reading files

Save and load 2D matrices.

*Technically, stripping the square brackets around each list and the whitespace is **not** necessary if we only want to persist the matrix so we can load it again later with* `load_matrix`. *However, the exercise description specifies the CSV format, so I'm removing these extra characters so that other applications that read CSV input can also work with the file we write here.*

In [7]:
import ast


def save_matrix(matrix, file):
    with open(file, 'w', encoding='utf8') as f:
        for row in matrix:
            line = str(row)
            # Get rid of the square brackets [ ] and the blanks.
            line = line[1:-1].replace(' ', '')
            f.write(line + '\n')
    return file



def load_matrix(file):
    matrix = []
    with open(file, 'r', encoding='utf8') as f:
        for line in f:
            # Put the brackets back and parse the line as a list literal.
            # ast.literal_eval only accepts literals, so unlike eval it
            # cannot execute arbitrary code found in the file.
            row = ast.literal_eval('[' + line.strip() + ']')
            matrix.append(row)
    return matrix


path = 'matrix.csv'
assert matrix == load_matrix(save_matrix(matrix, path))
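
As an alternative to formatting the lines by hand, the standard library's `csv` module can write and read the same kind of file. The following is a minimal sketch, not the exercise's reference solution; the function names `save_matrix_csv` and `load_matrix_csv` and the toy matrix are assumptions made for the illustration.

```python
import csv
import os
import tempfile


def save_matrix_csv(matrix, file):
    # newline='' lets the csv module control line endings itself.
    with open(file, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows(matrix)
    return file


def load_matrix_csv(file):
    with open(file, 'r', encoding='utf8', newline='') as f:
        # csv.reader yields lists of strings, so convert each cell back to int.
        return [[int(cell) for cell in row] for row in csv.reader(f)]


toy = [[0, 1, 0], [1, 0, 0]]

# Write to a temporary directory so the sketch leaves no file behind.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'toy.csv')
    loaded = load_matrix_csv(save_matrix_csv(toy, toy_path))

assert loaded == toy
```

The explicit `int(cell)` conversion is the main difference from the version above: the `csv` module treats every cell as text and leaves type handling to the caller.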