The MONDIAL Database is curated and housed at the Institute for Informatics, Georg-August-Universität Göttingen. It integrates various datasets of facts about countries, much of it from the CIA World Factbook, and is available in various formats, including XML.
Göttingen hosts the MONDIAL III XML dataset, mondial.xml; a copy, mondial-2015.xml, is hosted locally at York for our purposes. This is the latest version of the dataset, from 2015. The XML document is 3.8MB in size. When you reference the document source in your queries, use the locally hosted one! (We do not want to pound Göttingen’s server.)
Explore the dataset to become familiar with it.
But the best way to learn XQuery? Just start writing queries!
Refer to the examples from class, XML & XQuery Examples.
These queries are over a small, toy bibliography XML document.
Every XPath expression is also a legal XQuery query.
Full XQuery syntax is more flexible and more expressive, of course.
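As a minimal sketch of the difference, assume a toy two-book bibliography built inline (the titles and years are made up; this is not the class example document). The first expression is plain XPath, which is already a complete query. The second is a FLWOR expression that makes the same selection but also reshapes the output.

```xquery
let $bib :=
  <bib>
    <book year="1994"><title>First Book</title></book>
    <book year="2000"><title>Second Book</title></book>
  </bib>
return (
  (: a plain XPath expression: select the titles of books after 1995 :)
  $bib/book[@year > 1995]/title,
  (: the same selection as a FLWOR expression, wrapping each result in a new node :)
  for $b in $bib/book
  where $b/@year > 1995
  return <recent>{$b/title/text()}</recent>
)
```

Both parts pick out only the second book; the FLWOR version returns it inside a <recent> node rather than the original <title> node.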
Zorba, the NoSQL Query Processor, “is an open source query processor written in C++, implementing […]. Zorba is distributed under Apache License, Version 2.0. The project is mainly supported by the FLWOR Foundation, Oracle, and 28msec.” [Zorba (XQuery processor) @ Wikipedia]
Zorba is easy to install, and Zorba version 3.1.0 is installed on the PRISM machines in EECS for your use.
Zorba is invoked from the command line. The most general workflow is to write your XQuery query in a file, say myQuery.xq, then call Zorba to execute it.
% zorba --indent -f myQuery.xq
To save the results, use shell redirection. E.g.,
% zorba --indent -f myQuery.xq > myQuery.xml
One can issue a query “in-line”. One accesses a “source” XML document in a query via the function doc, with its argument the URI to the document. E.g.,
% zorba --indent -q 'doc("http://www.eecs.yorku.ca/course/4415/assignment/xquery/dataset/mondial-2015.xml")/mondial/country/name[1]'
If you install Zorba on your machine, you might want to copy the dataset mondial-2015.xml to your machine so that you can work offline. One accesses a local file as a source by giving the path to the file in doc(…) (or by using the file:// URI protocol).
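A minimal sketch of sourcing a local copy, in the style of the variable declaration used in religion.xq. The file path is hypothetical; substitute wherever you saved your copy of the dataset.

```xquery
(: Assumption: the path below is made up for illustration;
   point it at your own copy of mondial-2015.xml. :)
declare variable $mondial as xs:string :=
    "file:///home/student/xquery/mondial-2015.xml";

(: count the country nodes, as a quick check that the document loads :)
count(doc($mondial)/mondial/country)
```

Keeping the location in one variable means that switching between the local copy and the URI is a one-line change.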
Other leading XQuery / XML systems are:
XQIB, “XQuery In the Browser”, is a JavaScript library that implements XQuery for use over the DOM. Modern web browsers implement XPath natively, but not XQuery. Sadly, the XQIB project seems to be defunct.
IBM DB2 and Oracle support XQuery.
One can view an XML file by opening it in a web browser such as Firefox or Chrome. They will show the document in an open/close-node tree fashion, which makes it easy to look over.
Of course, XML is plain text (Unicode, UTF-8), so one can peruse it with any text editor. Whitespace is often left out of XML documents outside of the tags, which can make things ugly, unless one has a plug-in that shows XML in a structured way. The editor jEdit is said to have a good plug-in for this. Otherwise, most tools for working easily with XML are proprietary, and come with a price tag.
The XML document itself can be formatted to have line breaks after tag names, and indentation for node nesting. The mondial-2015.xml document is. And Zorba’s flag --indent “prettifies” the XML output in the same way.
Here is the example query, religion.xq, for the Mondial dataset:
(:
title: religion.xq
author: parke godfrey
creation: 2017/11/24
last version: 2021/03/23
dataset: Mondial III XML
--------------------------------------------------------------------------
List each religion; order by name. Within each religion node,
list the countries for which the leading religion by percentage
is that religion. Order the countries then by percentage
descending, country name (ascending).
Notes
* skip any country that does not report any religion
* all religion nodes in the doc have a percentage attribute
:)
declare variable $mondial as xs:string :=
    "https://www.eecs.yorku.ca"
    || "/course_archive/2020-21/W/4415"
    || "/assignment/xquery"
    || "/dataset/mondial-2015.xml";
<religions>{
    let $religions :=
        <summary>{
            for $country in doc($mondial)//country[religion]
            let $high := max($country//religion/@percentage)
            let $rel := $country//religion[@percentage=$high][1]
            return
                <country>
                    <name>{$country/name[1]/data()}</name>
                    {$rel}
                </country>
        }</summary>
    for $rel in $religions//religion/data()
    group by $rel
    order by $rel
    return
        <religion name='{$rel}'>{
            for $country in $religions/country[religion/data()=$rel]
            let $percentage := xs:float($country/religion/@percentage/data())
            let $cname := $country/name/data()
            order by $percentage descending,
                     $cname ascending
            return
                <country name='{$cname}'
                         percentage='{$country/religion/@percentage/data()}'/>
        }</religion>
}</religions>
What does it do?
The first part (the let clause for $religions) finds the top religion by percentage for each country. (Ties are broken by taking the first one.) The main query then lists, under each religion, the countries for which that religion is the top one.
Answer document: religion.xml.
Some of the answer looks odd, like reporting China under Christian with 4%! These oddities are artifacts of the data. That is the highest percentage for a religion reported for China; basically, little information on religions is in the Mondial XML document for China. With further refinement of our query, we could exclude these “anomalies”.
Note that I did clean up this example query from what was here originally. (See religion-2017-11-24.xq and religion-2017-11-24.xml for the original query and answer document, respectively.) People pointed out that it seemed to have a flaw: it was listing all countries under each religion node, quite incorrectly if what was intended was my English description above! The query was missing a test in the inner for in the main query to walk across the countries for that religion:
for $country in $religions/country[religion/data()=$rel]
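To see what that test does, here is a small sketch over a made-up two-country summary (the country and religion names are hypothetical, standing in for what the let clause for $religions would produce).

```xquery
let $religions :=
  <summary>
    <country><name>Aland</name><religion percentage="60">X</religion></country>
    <country><name>Bland</name><religion percentage="80">Y</religion></country>
  </summary>
for $rel in $religions//religion/data()
group by $rel
order by $rel
return
  <religion name="{$rel}">{
    (: with the predicate, only countries whose top religion is $rel are listed;
       without [religion/data()=$rel], every country would appear under every religion :)
    for $country in $religions/country[religion/data()=$rel]
    return <country name="{$country/name/data()}"/>
  }</religion>
```

With the predicate in place, Aland appears only under X and Bland only under Y.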
Write an XQuery query for each of the following, sourcing the MONDIAL III XML dataset.
Question
Report the countries that have “Buddhist” reported as a religion practiced within the country.
Structure
<buddhist>
<country name='…' percentage='…'/>
⋮
</buddhist>
Instructions
Within <buddhist>, present the <country> list in document order.
Answer XML: A-buddhist.xml
Question
Report countries that straddle two (or more) continents. Include as content which continents the country occupies.
Structure
<straddle>
<country name='…'>
<continent name='…'/>
⋮
</country>
⋮
</straddle>
Instructions
Within <straddle>, present the <country> list in document order.
Within each <country>, present the <continent> nodes in document order (as they appear within that <country> node within the document).
Answer XML: B-straddle.xml
Notes
The <encompassed> node contains the continent information in attribute continent.
Question
Report countries that have more than 5% inflation and 10% unemployment.
Structure
<woe>
<country inflation='…' unemployment='…'>
<inflation>…</inflation>
<unemployment>…</unemployment>
</country>
⋮
</woe>
Instructions
Within <woe>, present the <country> list in document order.
Answer XML: C-woe.xml
Question
For each country, report its name, capital, population, and size. Report every country; if one of the requested pieces of information for the country is missing, just leave it out.
Structure
<summary>
<country>
<name>…</name>
<capital>…</capital>
<population year='…'>…</population>
<size>…</size>
<inception>…</inception>
</country>
⋮
</summary>
Instructions
All <population> nodes in the dataset have a year attribute. You may assume this.
… area in the document.
… <indep_date> under <country>.
… <inception> for that country in the results.
Within <summary>, sort the <country> list by name.
Answer XML: D-summary.xml
Question
For each country, report the alpha city for that country; that is, the city in the country with the largest population.
Structure
<alpha>
<country name='…'>
<alpha name='…' population='…'/>
</country>
⋮
</alpha>
Instructions
If a <population> node is present for a city, it has a year attribute. You may assume this.
Within <cities>, sort the <country> list by country name.
Answer XML: E-alpha.xml
Question
For each river mentioned, report it by name (as an attribute) and contain the list of the countries by name that the river runs through.
Structure
<rivers>
<river name='…'>
<country name='…'/>
⋮
</river>
⋮
</rivers>
Instructions
Do not report a <river> more than once; i.e., ensure the list of rivers is distinct.
Treat any <located_at> with a value of “river” for attribute watertype as a river. Take the value of attribute river of such nodes in the document to be the river’s name.
Do not report a <country> more than once within a <river> node; i.e., ensure the list of countries per river is distinct.
Within <rivers>, order the <river> list by the rivers’ names.
Within each <river> node, order the <country> nodes by the countries’ names.
Answer XML: F-rivers.xml
Notes
There is a <located_at> node that contains more than one “value”; that is, the string contains two names separated by a space. (This is “river-Missouri_River river-Mississippi_River”.)
The prefix can be stripped from a river’s name with replace($rName, '^[^-]*-([^-]*).*$', '$1').
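A small sketch of how that replace(…) call behaves on the two-name string quoted above. Splitting the string with tokenize(…) first is my assumption about one reasonable way to handle it, not a required approach.

```xquery
let $raw := "river-Missouri_River river-Mississippi_River"
(: split on the space so that each name is handled separately :)
for $rName in tokenize($raw, ' ')
(: drop everything up to and including the first hyphen :)
return replace($rName, '^[^-]*-([^-]*).*$', '$1')
(: returns the two strings "Missouri_River" and "Mississippi_River" :)
```

Applied to the unsplit string, the same replace(…) would instead return "Missouri_River river", since the captured group runs up to the second hyphen.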
Question
For each country, list the countries that border it by name. Place within each bordering <neighbour> a node <length> that contains the length of the shared border.
Structure
<countries>
<country name='…'>
<neighbour name='…'>
<length>…</length>
</neighbour>
⋮
</country>
⋮
</countries>
Instructions
Within <countries>, sort the <country> list by name.
Within each <country> node, sort the <neighbour> nodes by name.
Answer XML: G-bordering.xml
Question
Generate a document that reports, for each language, the countries that have a reported population that speaks that language. Report in an attribute speakers for <country> an estimate of the number of speakers of that language (as the country’s population times the percentage that speak that language).
Structure
<languages>
<language name='…'>
<country name='…' speakers='…'/>
⋮
</language>
⋮
</languages>
Instructions
For speakers, when more than one population number is reported for a country, use the one with the latest year.
… speakers.
Sort the <language> list within the <languages> root by language name (ascending), and the <country> list within each <language> node by number of speakers descending. (Place any that have no speakers at the end sorted by the languages’ names.)
Answer XML: H-languages.xml
Notes
A value can be cast to an integer with xs:integer(…).
The function round(…) rounds its real-number argument to the nearest integer.
To sort in descending order, use order by … descending. Ascending is the default.
To sort by the number of speakers numerically, use order by xs:integer($speakers) descending.
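A minimal sketch that combines these notes, over made-up numbers (the <toy> document, its attribute names, and its values are hypothetical and do not reflect how the dataset is laid out).

```xquery
let $toy :=
  <toy>
    <country name="A" population="1000" percentage="12.5"/>
    <country name="B" population="2000" percentage="50"/>
  </toy>
for $c in $toy/country
(: population times percentage, rounded and cast to an integer :)
let $speakers :=
  xs:integer(round(xs:double($c/@population) * xs:double($c/@percentage) div 100))
(: $speakers is an integer, so the sort is numeric rather than by string :)
order by $speakers descending
return <country name="{$c/@name}" speakers="{$speakers}"/>
```

This returns B with 1000 speakers first, then A with 125.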
Question
Report the aggregate gdp per capita (gdppc) for democracies versus non-democracies.
Structure
<gdp_per_capita>
<countries government='…' gdppc='…'/>
⋮
</gdp_per_capita>
Instructions
… <government> node. In the answer XML, set the value of the attribute government for the report node <countries> for it to be “democracy”. Consider all other countries (with a reported gdp_total) non-democracies. In the answer XML, set the value of the attribute government for the report node <countries> for it to be “non-democracy”.
… $11,507.48.
Answer XML: I-gdppc.xml
Notes
The operator || concatenates strings; e.g., '$' || '1000' would result in $1000.
The function format-number(…, …) can be used to format a number. The first argument is the number value; the second is the format directive. E.g., format-number(12345.6789, '#,##0.00') would result in 12,345.68.
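Putting the two notes together, a minimal sketch (the value bound to $gdppc is made up for illustration, not computed from the dataset):

```xquery
let $gdppc := 11507.4821
(: format to two decimal places with a thousands separator, then prefix the dollar sign :)
return '$' || format-number($gdppc, '#,##0.00')
(: returns the string "$11,507.48" :)
```

Note that the first argument to format-number(…, …) needs to be a number, not a quoted string.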
Question
Report for each continent the land area of the continent as size and the number of countries on that continent.
Structure
<continents>
<continent name='…' size='…' countries='…'>
<country name='…' size='…'/>
⋮
</continent>
⋮
</continents>
Instructions
The <encompassed> node reports the percentage of land area of the country within that continent. Pro-rate the country’s contribution to the continent’s size by the percentage.
Each <encompassed> has a percentage attribute; the value is 100% for any country entirely within that continent.
Within <continents>, order the <continent> list by name.
Within each <continent>, order the <country> list by country name. For area in country, list the area of the country belonging to that continent.
Answer XML: J-continents.xml
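The pro-rating arithmetic can be sketched over a made-up country (the name, area, and percentages are hypothetical; only the area, continent, and percentage attribute names come from the description above).

```xquery
let $country :=
  <country name="Toyland" area="1000">
    <encompassed continent="europe" percentage="25"/>
    <encompassed continent="asia" percentage="75"/>
  </country>
for $e in $country/encompassed
(: the share of the country's area that falls within this continent :)
let $share := xs:decimal($country/@area) * xs:decimal($e/@percentage) div 100
return <continent name="{$e/@continent}" size="{$share}"/>
```

Here Toyland contributes 250 to Europe and 750 to Asia; a country entirely within one continent contributes its whole area.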
Use the “submit” command on a PRISM machine to turn in your program. Create a directory named xquery/ for your assignment. Have the following 21 files in it:
readme.txt
A-buddhist.xq
B-straddle.xq
C-woe.xq
D-summary.xq
E-alpha.xq
F-rivers.xq
G-bordering.xq
H-languages.xq
I-gdppc.xq
J-continents.xq
A-buddhist.xml
B-straddle.xml
C-woe.xml
D-summary.xml
E-alpha.xml
F-rivers.xml
G-bordering.xml
H-languages.xml
I-gdppc.xml
J-continents.xml
In the readme file, write in plain text your name and student number, and include a small write-up of any issues or problems that you encountered that you would like me to consider, or deviations in your answers that you feel are justified. (The write-up can be empty, if you have nothing to highlight.)
The .xq files are plain-text files with your XQuery query implementations. The .xml files are the corresponding answer XML files, as evaluated against mondial-2015.xml.
% submit 4415 xquery xquery/
Due: Monday 22 March before midnight.
Should I reference a local copy of mondial-2015.xml or the copy at EECS through the URI for what I turn in?
Ideally, via the URI. But it is okay if you do it via a local reference.
When I reference the mondial-2015.xml document via the URI, some of my answers have some text with messed-up characters. This does not happen when I reference a local copy of mondial-2015.xml. What is the difference?
It is an artifact of how EECS’s Apache web server is delivering the document. The document’s character encoding is Unicode. But that is lost on delivery.
There should not be a difference if all were working correctly. (I have tried to track down that Apache bug in our Apache configuration — I am assuming it is a bug — but haven’t found it yet.)
Don’t worry about the character encoding issue. I will ignore this when marking the assignment.
I cannot understand the results of the religion.xq example query.
Indeed! It was not doing what I intended; though, it wasn’t stated what the query was meant to evaluate. Of course, the query itself is reasonably self-explanatory.
I added an explanation above and fixed the flaw.