Description
This assignment consists of composing ten XQuery queries over the MONDIAL III XML dataset.
The MONDIAL Database
The MONDIAL Database is curated and housed at the Institute for Informatics Georg-August-Universität Göttingen. It is the integration of various datasets regarding facts about countries — much from the CIA World Factbook — and is available in various formats, including XML.
Göttingen hosts the MONDIAL III XML dataset, mondial.xml
; a copy, mondial-2015.xml
, is hosted locally at York for our purposes. This is the latest version of the dataset from 2015. The XML document is 3.8MB in size. When you reference the document source in your queries, use the locally hosted one! (We do not want to pound Göttingen’s server.)
Explore the dataset to become familiar with it.
XQuery Resources
- The official description of XQuery is at W3C:
- BaseX catalogs a good list of tutorials and such:
- Altova offers online materials:
- W3Schools
- XQuery Tutorial (Meh…okay)
But the best way to learn XQuery? Just to start writing queries!
Bibliography Examples
Refer to the examples from class, XML & XQuery Examples.
These queries are over a small, toy bibliography XML document.
Every XPath expression is also a legal XQuery query.
Full XQuery syntax is more flexible and more expressive, of course.
Zorba
The Zorba, the NoSQL Query Processor, “is an open source query processor written in C++, implementing
- several W3C XQuery and XML specifications and
- the JSONiq language for processing JSON data.
Zorba is distributed under Apache License, Version 2.0. The project is mainly supported by the FLWOR Foundation, Oracle, and 28msec.” [Zorba (XQuery processor) @ Wikipedia]
Zorba is an easy installation. And Zorba version 3.1.0 is installed on the PRISM machines in EECS for your use.
Zorba is invoked by command line. The most general workflow is to write your Xquery query in a file, say myQuery.xq
, then call Zorba to execute it.
% zorba --indent -f myQuery.xq
To save the results, use shell redirect. E.g.,
% zorba --indent -f myQuery.xq > myQuery.xml
One can issue a query “in-line”. One accesses a “source” XML document in a query via the function doc, with its argument the URI to the document. E.g.,
% zorba --indent -q 'doc("http://www.eecs.yorku.ca/course/4415/assignment/xquery/dataset/mondial-2015.xml")/mondial/country/name[1]'
If you install Zorba on your machine, you might want to copy the dataset mondial-2015.xml
to your machine if you want to work offline. One accesses a local file as a source by giving the path to the file in doc(…)
(or use the file://
URI protocol).
Other XQuery Engines
Other leading XQuery / XML systems are
XQIB, “XQuery In the Browser”, is a Javascript library that implements XQuery for use over the DOM. Modern web browsers implement XPath, but not XQuery natively. Sadly, the XQIB project seems to be defunct.
IBM DB2 and Oracle support XQuery.
XML
One can view an XML file by opening it in a web browser such as Firefox or Chrome. They will show the document in an open/close-node tree fashion, which makes it easy to look over.
Of course, XML is plain-text (unicode, utf-8), so one can peruse it with any text editor. Whitespace is often left out of XML documents outside of the tags, which can make things ugly, unless one has a plug-in that shows XML in a structured way. The editor jedit is said to have a good such plugin. Otherwise, most tools to work with XML easily are proprietary, and come with a price tag.
The XML document can be formatted itself to have line-breaks after tag names, and indentation for node nesting. The mondial-2015.xml
is. And Zorba‘s flag --indent
“prettifies” the XML output as such.
Example Query
Here is the example query, religion.xq
, for the Mondial dataset:
(:
title: religion.xq
author: parke godfrey
creation: 2017/11/24
last version: 2021/03/23
dataset: Mondial III XML
--------------------------------------------------------------------------
List each religion; order by name. Within each religion node,
list the countries for which the leading religion by percentage
is that religion. Order the countries then by percentage
descending, country name (ascending).
Notes
* skip any country that does not report any religion
* all religion nodes in the doc have a percentage attribute
:)
declare variable $mondial as xs:string :=
"https://www.eecs.yorku.ca"
|| "/course_archive/2020-21/W/4415"
|| "/assignment/xquery"
|| "/dataset/mondial-2015.xml";
<religions>{
let $religions :=
<summary>{
for $country in doc($mondial)//country[religion]
let $high := max($country//religion/@percentage)
let $rel := $country//religion[@percentage=$high][1]
return
<country>
<name>{$country/name[1]/data()}</name>
{$rel}
</country>
}</summary>
for $rel in $religions//religion/data()
group by $rel
order by $rel
return
<religion name='{$rel}'>{
for $country in $religions/country[religion/data()=$rel]
let $percentage := xs:float($country/religion/@percentage/data())
let $cname := $country/name/data()
order by $percentage descending,
$cname ascending
return
<country name='{$cname}'
percentage='{$country/religion/@percentage/data()}'/>
}</religion>
}</religions>
What does it do?
- The first subquery (into
$religions
) is finding the top religion by percentage for each country. (Ties are broken by taking the first one.) - The main query then creates a distinct node list of the religions. Within each religion node is a list of the countries that have that as their “leading” religion, ordered by percentage.
Answer document: religion.xml
.
Some of the answer looks odd, like reporting China under Christian with 4%! These oddities are artifacts of the data. That is the highest percentage for a religion reported for China; basically, little information on religions is in the Mondial XML document for China. With further refinement of our query, we could clean up to exclude these “anomalies”.
Note that I did clean up this example query from what was here originally. (See religion-2017-11-24.xq
and religion-2017-11-24.xml
for the original query and answer document, respectively.) People pointed out that it seemed to have a flaw: it was listing all countries under each religion node, quite incorrectly if what was intended was my English description above! The query was missing a test in the inner for in the main query to walk across the countries for that religion:
for $country in $religions/country[religion/data()=$rel]
Write an XQuery query for each of the following, sourcing the MONDIAL III XML dataset.
A-buddhist
Question
Report the countries that have “Buddhist” reported as a religion practiced within the country.
Structure
<buddhist>
<country name='…' percentage='…'/>
⋮
</buddhist>
Instructions
- For a country’s name, report the country’s name, not its country code.
- For attribute percentage, report the percentage of the population of that country which is buddhist.
- Within root
<buddhist>
, present the<country>
list in document order.
Answer XML: A-buddhist.xml
B-straddle
Question
Report countries that straddle two (or more) continents. Include as content which continents the country occupies.
Structure
<straddle>
<country name='…'>
<continent name='…'/>
⋮
</country>
⋮
</straddle>
Instructions
- Within root
<straddle>
, present the<country>
list in document order. - Within node
<country>
, present the<continent>
node in document order (as they appear within that<country>
node within the document).
Answer XML: B-straddle.xml
Notes
- A node tagged
<encompassed>
contains the continent information in attribute continent.
C-woe
Question
Report countries that have more than 5% inflation and 10% unemployment.
Structure
<woe>
<country inflation='…' unemployment='…'>
<inflation>…</inflation>
<unemployment>…</unemployment>
</country>
⋮
</woe>
Instructions
- Within root
<woe>
, present the<country>
list in document order.
Answer XML: C-woe.xml
D-summary
Question
For each country, report its name, capital, population, and size. Report every country; if one of the requested pieces of information for the country is missing, just leave it out.
Structure
<summary>
<country>
<name>…</name>
<capital>…</capital>
<population year='…'>…</population>
<size>…</size>
<inception>…</inception>
</country>
⋮
</summary>
Instructions
- For name, report the country’s name, not its country code.
- For capital, report the city’s name, not the document’s string for it. E.g., France’s capital should be reported as “Paris”, not as “cty-France-Paris”. If there is more than one name for the city provided, choose the first.
- For population, report the population count of the most recent (latest) year reported. Show the year in the results as an attribute. (Only have that one attribute, no more!)
Note that all<population>
nodes in the dataset have a year attribute. You may assume this. - Size is reported as attribute
area
in the document. - For the inception data, use the value of the node
<indep_date>
under<country>
.
If a country has no indep_date reported, do _not include<inception>
for that country in the results. - Within root
<summary>
, sort the<country>
list by name.
Answer XML: D-summary.xml
E-alpha
Question
For each country, report the alpha city for that country; that is, the city in the country with the largest population.
Structure
<alpha>
<country name='…'>
<alpha name='…' population='…'/>
</country>
⋮
</alpha>
Instructions
- Use the country’s name, not country code.
- In many cases, several populations are reported for a city, as measured in different years; use the latest year’s population.
- If a city has no population reported, ignore it.
In the dataset, when a<population>
node is present for a city, it has a year attribute. You may assume this. - Within root
<cities>
, sort the<country>
list by country name. - In the extremely rare case a country has more than one largest city (that is, with the exact same reported population for each), report each, ordered by the cities’ names.
Answer XML: E-alpha.xml
F-rivers
Question
For each river mentioned, report it by name (as an attribute) and contain the list of the countries by name that the river runs through.
Structure
<rivers>
<river name='…'>
<country name='…'/>
⋮
</river>
⋮
</rivers>
Instructions
- For a country’s name, report the country’s name, not its country code.
- Do not repeat a
<river>
more than once; i.e., ensure the list of rivers is distinct.- Consider anything in node
<located_at>
with a value of “river” for attribute watertype as a river. Take the value of attribute river of such nodes in the document to be the river’s name. - Rename the river not to include “river-” or ”sea-” at the front. See function replace for this.
Note, do all XPath matching you need to do before this string replace!
- Consider anything in node
- Do not repeat a
<country>
more than once within a<river>
node; i.e., ensure the list of countries per river is distinct. - Within root
<rivers>
, order the<river>
list by the rivers’ names. - Within a
<river>
node, order the<country>
nodes by the countries’ names.
Answer XML: F-rivers.xml
Notes
- There is one value of attribute river in a matching
<located_in>
node that contains more than one ”value”; that is, the string contains two names separated by a space. (This is “river-Missouri_River river-Mississippi_River”.)
Ignore this one exception in the semantics and construction of your query. In the end, convert this exceptional value to “Missouri_River”. This can be accomplished at the time you do your replace to rid of the “river-” prefix; e.g.,replace($rName, '^[^-]*-([^-]*).*$', '$1')
.
G-bordering
Question
For each country, list the countries that border it by name. Place within the bordering <neighbour>
a node <length>
that contains the length of the shared border.
Structure
<countries>
<country name='…'>
<neighbour name='…'>
<length>…</length>
</neighbour>
⋮
</country>
⋮
</countries>
Instructions
- For countries and neighbours, use the country’s name, not country code.
- If a country’s node contains no border information, do not list it.
- Within root
<countries>
, sort the<country>
list by name. - Within each
<country>
node, sort the<neighbour>
nodes by name.
Answer XML: G-bordering.xml
H-languages
Question
Generate a document that reports for each language, the countries that have a reported population that speaks that language. Report in an attribute speakers
for <country>
an estimate of the number of speakers of that language (as country’s population times the percentage that speak that language).
Structure
<languages>
<language name='…'>
<country name='…' speakers='…'/>
⋮
</language>
⋮
</languages>
Instructions
- For calculating
speakers
, when more than one population number is reported for a country, use the one with the latest year. - If the language’s percentage for that country is missing, or if the country’s population is missing, leave out
speakers
. - Sort the
<language>
list within the<languages>
root by language name (ascending), and the<country>
list within each<language>
node by number of speakers descending. (Place any that have nospeakers
at the end sorted by the languages’ names.)
Answer XML: H-languages.xml
Notes
- Because “/” is so overloaded in XML & XQuery, it is not used for the divide operator. Instead, the keyword div is used instead.
- To cast a value to an integer is
xs:integer(…)
. - The function
round(…)
rounds its real-number argument to the nearest integer. - To order a for‘s results descending, use
order by … descending
. Ascending is the default.
To order by a “numeric” value, cast the variable to numeric. E.g.,order by xs:integer($speakers) descending
.
I-gdppc
Question
Report the aggregate gdp per capita (gdppc) for democracies versus non-democracies.
Structure
<gdp_per_capita>
<countries government='…' gdppc='…'/>
⋮
</gdp_per_capita>
Instructions
- Only consider countries with a reported gdp_total.
- Consider a country a democracy if it has “democracy”, “republic”, or “constitutional monarchy”, but not “dictator”, in the contents of its
<government>
node. In the answer XML, set the value of the attribute government for the report node<countries>
for it to be “democracy”. Consider all other countries (with a reported gdp_total) non-democracies. In the answer XML, set the value of the attribute government for the report node<countries>
for it to be “non-democracy”. - Compute the GDP per capita as the summed GDP’s of the countries divided by the sum of their populations. Find the population of each country, using the last reported year for its population.
- The GDP’s reported in the dataset seem to be in US dollars and in millions. Adjust to report your gross domestic product per capita in dollars; e.g.,
$11,507.48
.
Answer XML: I-gdppc.xml
Notes
- “||” is string concatenation in XQuery. E.g.,
'$' || '1000'
would result in$1000
. - The function
format-number(…, …)
can be used to format a number. the first argument is the number value, the second is the format directive. E.g.,format-number('12345.6789', '#,##0.00')
would result in12,345.68
.
J-continents
Question
Report for each continent
the land area of the continent as size and the number of countries on that continent.
Structure
<continents>
<continent name='…' size='…' countries='…'>
<country name='…' size='…'/>
⋮
</continent>
⋮
</continents>
Instructions
- As from an earlier query, some countries straddle several continents. The node
<encompassed>
reports the percentage of land area of the country within that continent. Pro-rate the country’s contribution to the continent’s size by the percentage.
Note that every<encompassed>
has a percentage attribute; the value is 100% for any country entirely within that continent.
For country count, count the country with each continent that it straddles. - Within root
<continents>
, order the<continent>
list by name. - Within a node
<continent>
, order the<country>
list by country name. For area in country, list the area of the country belonging to that continent. - For area of country reported, round to integer.
Answer XML: J-continents.xml
Submit
Use the “submit
” command on a PRISM machine to turn in your program. Create a directory named xquery/
for your assignment. Have the following 21 files in it:
- cover page
readme.txt
,
- the queries
A-buddhist.xq
B-straddle.xq
C-woe.xq
D-summary.xq
E-alpha.xq
F-rivers.xq
G-bordering.xq
H-languages.xq
I-gdppc.xq
J-continents.xq
- the answer XML documents
A-buddhist.xml
B-straddle.xml
C-woe.xml
D-summary.xml
E-alpha.xml
F-rivers.xml
G-bordering.xml
H-languages.xml
I-gdppc.xml
J-continents.xml
In the readme file, write in plain-text your name and student#, and include a small write-up of any issues or problems that you encountered that you would like me to consider, or deviations in your answers that you feel are justified. (The write-up can be empty, if you have nothing to highlight.)
The .xq
files are plain-text files with your XQuery
query implementations. The .xml
files are the corresponding answer XML files, as evaluated against mondial-2015.xml
.
% submit 4415 xquery xquery/
Due: Monday 22 March before midnight.
Should I reference to a local copy of mondial-2015.xml
or the copy at EECS through the URI for what I turn in?
Ideally, via the URI. But it is okay if you do it via a local reference.
When I reference the mondial-2015.xml
document via the URI, some of my answers have some text with messed up characters. This does not happen when I reference a local copy of mondial-2015.xml
. What is the difference?
It is an artifact of how the EECS’s Apache web server is delivering the document. The document’s character encoding is unicode. But that is lost on delivery.
There should not be a difference if all were working correctly. (I have tried to track down that Apache bug in our Apache configuration — I am assuming it is a bug — but haven’t found it yet.)
Don’t worry about the character encoding issue. I will ignore this when marking the assignment.
I cannot understand the results of the religion.xq
example query.
Indeed! It was not doing what I intended; though, it wasn’t stated what the query was meant to evaluate. Of course, the query itself is reasonably self-explanatory.
I added an explanation above and fixed the flaw.