Due at 11:59pm on 4/29/2015.
By the end of this lab, you should have submitted the lab with
python3 ok --submit. You may submit more than once before the
deadline; only the final submission will be graded.
To receive credit for this lab, you must complete the questions in lab13.sql.
This treebank consists of several sentences from technical documents that have been annotated with the syntactic category of each word (e.g., noun or verb), along with all of its syntactic dependents. For instance, in the sentence "I have a computer," The words "I" and "computer" are dependents of the verb "have", and "a" is a dependent of the word "computer". Each dependent has a type, such as subj for the subject of a verb. The exact definition of a syntactic dependent is debated in linguistics, but a reasonable and consistent convention was used to annotate this dataset.
lab13.py file generates a table from the
file, which describes a collection of annotated sentence. You need to run
to generate a single table called
deps that has five columns. Each row
describes one dependent of a word in a sentence. The columns are:
word- the base form of a word
category- the syntactic category of a word
dependent_word- the base form of the dependent word
dependent_category- the syntactic category of the dependent word
dependency_type- the type of dependency between the word and its dependent
To check that you have successfully created this table, run
python3 ok -q generate_table
To use the
deps table, first open the
spanish.db file from sqlite3. Keep the
sqlite interpreter running to run our examples throughout the lab.
$ sqlite3 sqlite> .open spanish.db sqlite> select word from deps where dependent_word = "caliente"; agua agua agua plato muestra
The articles "el" and "la" both mean "the", but the first is used with masculine nouns while the second is used with feminine nouns.
Create a table called
masculine_a full of masculine nouns that end in "a".
You can tell if a
word ends in "a" using the expression
substr(word, -1) =
"a". You can tell if a
word is a masculine noune if "el" is a
-- A table containing words that are masculine but end in "a" CREATE TABLE masculine_a as-- REPLACE THIS LINE select 'YOUR CODE HERE';select word from deps where dependent_word = "el" and substr(word, -1) = "a";
Note: Finding all words ending in "a" that have "el" as a dependent will not find every masculine noun ending in "a" in Spanish. Also, some nouns that start with "a", such as "agua", are feminine but nonetheless take "el" as an article.
Check your work before proceeding.
python3 ok -q masculine
Previously, we have been dealing with queries that process one row at a time.
When we join, we make pairwise combinations of all of the rows. When we use
WHERE, we filter out certain rows based on the condition. Alternatively,
applying an aggregate function
MAX(column) combines the values in multiple rows.
By default, all rows are combined together. What if we wanted to group together
the values in similar rows and perform the aggregation operations within those
groups? We use a
GROUP BY clause.
Here's an example. For every unique value in category, collect all the rows having that value into a group. Then, provide that category and count the number of rows that have that category. The word "poder" is both a noun meaning "power" and a verb meaning "be able to" do something. We can see that it is usually a verb.
sqlite> SELECT category, count(*) FROM deps where word="poder" GROUP BY category; NC|26 VMG|2 VMI|1362 VMN|7 VMP|16 VMS|6
Just like how we can filter out rows with
WHERE, we can also filter out
HAVING. Important: A
HAVING clause should use an aggregate
function. Suppose we want to see all categories with at least 100 rows:
sqlite> SELECT category, COUNT(*) FROM deps GROUP BY category HAVING COUNT(*) > 100; A|5310 CC|3955 CS|526 DA|138 F|166 NC|63980 NP|3076 PI|125 R|536 S|25693 VAI|512 VMG|183 VMI|10291 VMM|207 VMN|2274 VMP|2584 VMS|383 VSI|2978 VSN|178 Z|933
This query discards rare categories such as Demonstrative Determiner ("DD"), but keeps common ones such as Coordinating Conjunction ("CC").
Please edit lab13.sql for the following two questions.
Write a query that counts the number of appearances of a particular
word-category pair in the dataset. The new table
word_cat_count should have
How should you group the rows together?
Hint: To group based on information from multiple columns, not just one,
simply add commas between the columns:
GROUP BY school, major.
-- A table containing each word, category, and count of that group CREATE TABLE word_cat_count AS-- REPLACE THIS LINE select 'YOUR CODE HERE';SELECT word, category, COUNT(*) as count FROM deps GROUP BY word, category;
Use OK to test your solution before moving on:
python3 ok -q word_cat_count
What are all the rows in
word_cat_count involving the word "ser"?
$ sqlite3 sqlite> .read lab13.sql sqlite> SELECT * FROM word_cat_count WHERE word = "ser"; ser|NC|15 ser|VSG|8 ser|VSI|2978 ser|VSN|178 ser|VSP|71 ser|VSS|78
So, "ser" is a verb meaning "to be", e.g. "Soy de California" means "I am from
California." We expect to see a lot of verb (
V__) tags in this result.
But wait, there are also cases where "ser" is used as a noun (
¡Increíble! It turns out that "ser" can also mean "being", for instance
ser humano means "human being".
Write a query that finds all words that can appear as both nouns and verbs.
Hint: Use the
word_cat_count table you just wrote.
The resulting rows shoud look like the following:
Hint: No aggregation is required here because we need to compare individual rows, not aggregate statistics/concatenations of word group. This question is a review for joins.
select * to select all columns from a joined table.
Hint: Nouns are always
NC, but verbs can be many different categories,
such as "VMI" (main indicative verb) or "VAS" (auxiliary subjunctive verb). To
capture all the possinge strings, use the expression
verb LIKE "V__" instead
verb = "VMI" AND verb = "VMP" AND .... The underscores match any
LIKE is how SQL searches strings, instead of just looking for
-- A table containing all info of words that can be both nouns and verbs CREATE TABLE noun_verbs AS-- REPLACE THIS LINE select 'YOUR CODE HERE';SELECT * FROM word_cat_count as n, word_cat_count as v WHERE n.word = v.word AND n.category == "NC" AND v.category LIKE "V__";
Use OK to test your solution before moving on:
python3 ok -q noun_verbs
It turns out that
haber ("to have (aux)"), like
ha comido ("has eaten")
can also appear as a noun!
The following questions are for extra practice — they can be found in the lab13_extra.sql file. It is recommended that you complete these problems as well, but you do not need to turn them in for credit.
We are now ready to compute a big frequency table in lab13_extra.sql. The code for
frequencies is provided for
you. In particular, we group by the word, category, child dependency, and child
category. Why did we consider the child's category in our group? A modifier
can, for instance, be a preposition ("S") or an adjective ("A").
Let's run a few examples with the new
$ sqlite3 sqlite> .read lab13_extra.sql sqlite> -- Example: See what arguments certain verbs take in sqlite> SELECT * FROM frequencies WHERE word = "figurar"; figurar|VMI|MOD|S|1|0.0555555555555556 figurar|VMI|PP-LOC|S|8|0.444444444444444 figurar|VMI|SUBJ|NC|9|0.5 figurar|VMN|MOD|S|1|1.0 sqlite> -- Example: See what deps some words must have. "F" denotes punctuation sqlite> SELECT word, category, dependency_type, dependent_category, count ...> FROM frequencies ...> WHERE frequency = 1 AND count > 5 AND dependent_category != "F"; autorizar|VMN|DO|NC|6 auxiliar|VMN|MOD|S|15 capaz|A|COMP|S|6 en_función_de|S|COMP|NC|6 este|DD|SPEC|DI|11 ...
The next natural question to ask is: for a particular word, what kind of dependent should we expect to see? This question can actually help us identify certain Spanish constructions (if we're lucky) that are systematically different from English ones.
Write an appropriate select query for
likeliest_child, whose columns are
dependent_category, and the maximum frequency
for each word-category pair under the column name
Conditions: We want to ignore deps that are just punctuation, so make sure
dependent_category != "F" somewhere in the query.
We also want the maximum frequency found in each word-category pair to be > 0.5
and the total number of the word-category pair occurrences to be > 10.
Hint: To reiterate, we are finding the maximum frequency for each word-category
pair. How will we use
GROUP BY to carry out this specific query?
Hint2: Which condition applies to each row? Which condition applies to each
group? This then determines which condition(s) to put in
WHERE and which to
Use OK to test your solution:
python3 ok -q likeliest_child
We see a lot of common nouns ("NC") that have an article ("DA"), verbs ("V__") that take a direct object ("DO"), or prepositions ("S") that have noun complements. These examples are mostly similar to English. Are there others that surprise you?
In particular, I saw an interesting example:
It turns out that
relacionado ("related", "VMP" means participle) is used
frequently with an oblique object "OBLC", namely an object denoted by
Example: "La supervivencia de el hombre ha estado siempre estrechamente relacionada con el agua." Translation: "Human survival has always been closely related to water."
If you come up with other interesting queries, please feel free to post them on Piazza or discuss with your classmates/lab assistants/TAs!
Fin. (The end.)
I had a lot of fun investigating Spanish, and I hope you did too. Here are some other extensions you can explore over your summer break:
relacionado con, as you have seen above. We can then see if there are some bigrams that occur more frequently than expected.
(el) vacaciones (venir, future tense) pronto, the attributes of "vacationes" determine that the sentence should be
las vacaciones vendrán pronto.