EECS-4415M Big Data Systems Assignment #4 XQuerying the World

$30.00

Category: You will Instantly receive a download link for .zip solution file upon Payment

Description

5/5 - (6 votes)

This assignment consists of composing ten XQuery queries over the MONDIAL III XML dataset.

The MONDIAL Database

The MONDIAL Database is curated and housed at the Institute for Informatics Georg-August-Universität Göttingen. It is the integration of various datasets regarding facts about countries — much from the CIA World Factbook — and is available in various formats, including XML.

Göttingen hosts the MONDIAL III XML dataset, mondial.xml; a copy, mondial-2015.xml, is hosted locally at York for our purposes. This is the latest version of the dataset from 2015. The XML document is 3.8MB in size. When you reference the document source in your queries, use the locally hosted one! (We do not want to pound Göttingen’s server.)

Explore the dataset to become familiar with it.

XQuery, Zorba, XML, etc.

XQuery Resources

But the best way to learn XQuery? Just to start writing queries!

Zorba

The Zorbathe NoSQL Query Processor, “is an open source query processor written in C++, implementing

  • several W3C XQuery and XML specifications and
  • the JSONiq language for processing JSON data.

Zorba is distributed under Apache LicenseVersion 2.0. The project is mainly supported by the FLWOR FoundationOracle, and 28msec.” [Zorba (XQuery processor) @ Wikipedia]

Zorba is an easy installation. And Zorba version 3.1.0 is installed on the PRISM machines in EECS for your use.

Zorba is invoked by command line. The most general workflow is to write your Xquery query in a file, say myQuery.xq, then call Zorba to execute it.

% zorba --indent -f myQuery.xq

To save the results, use shell redirect. E.g.,

% zorba --indent -f myQuery.xq > myQuery.xml

One can issue a query “in-line”. One accesses a “source” XML document in a query via the function doc, with its argument the URI to the document. E.g.,

% zorba --indent -q 'doc("http://www.eecs.yorku.ca/course/4415/assignment/xquery/dataset/mondial-2015.xml")/mondial/country/name[1]'

If you install Zorba on your machine, you might want to copy the dataset mondial-2015.xml to your machine if you want to work offline. One accesses a local file as a source by giving the path to the file in doc(…) (or use the file:// URI protocol).

Other XQuery Engines

Other leading XQuery / XML systems are

XQIB, “XQuery In the Browser”, is a Javascript library that implements XQuery for use over the DOM. Modern web browsers implement XPath, but not XQuery natively. Sadly, the XQIB project seems to be defunct.

IBM DB2 and Oracle support XQuery.

XML

One can view an XML file by opening it in a web browser such as Firefox or Chrome. They will show the document in an open/close-node tree fashion, which makes it easy to look over.

Of course, XML is plain-text (unicode, utf-8), so one can peruse it with any text editor. Whitespace is often left out of XML documents outside of the tags, which can make things ugly, unless one has a plug-in that shows XML in a structured way. The editor jedit is said to have a good such plugin. Otherwise, most tools to work with XML easily are proprietary, and come with a price tag.

The XML document can be formatted itself to have line-breaks after tag names, and indentation for node nesting. The mondial-2015.xml is. And Zorba‘s flag --indent “prettifies” the XML output as such.

Example Query

Here is the example query, religion.xq, for the Mondial dataset:

(: 
    title:        religion.xq
    author:       parke godfrey 
    creation:     2017/11/24
    last version: 2021/03/23
    dataset:      Mondial III XML

    --------------------------------------------------------------------------

    List each religion; order by name.  Within each religion node,
    list the countries for which the leading religion by percentage
    is that religion.  Order the countries then by percentage
    descending, country name (ascending).

    Notes
      * skip any country that does not report any religion
      * all religion nodes in the doc have a percentage attribute
:)

declare variable $mondial as xs:string :=
    "https://www.eecs.yorku.ca"
    || "/course_archive/2020-21/W/4415"
    || "/assignment/xquery"
    || "/dataset/mondial-2015.xml";

<religions>{
let $religions :=
    <summary>{
        for $country in doc($mondial)//country[religion]
        let $high := max($country//religion/@percentage)
        let $rel  := $country//religion[@percentage=$high][1]
        return
            <country>
                <name>{$country/name[1]/data()}</name>
                {$rel}
            </country>
    }</summary>
for $rel in $religions//religion/data()
group by $rel
order by $rel
return
    <religion name='{$rel}'>{
        for $country in $religions/country[religion/data()=$rel]
        let $percentage := xs:float($country/religion/@percentage/data())
        let $cname      := $country/name/data()
        order by $percentage descending,
                 $cname      ascending
        return
            <country name='{$cname}'
                     percentage='{$country/religion/@percentage/data()}'/>
    }</religion>
}</religions>

What does it do?

  • The first subquery (into $religions) is finding the top religion by percentage for each country. (Ties are broken by taking the first one.)
  • The main query then creates a distinct node list of the religions. Within each religion node is a list of the countries that have that as their “leading” religion, ordered by percentage.

Answer document: religion.xml.

Some of the answer looks odd, like reporting China under Christian with 4%! These oddities are artifacts of the data. That is the highest percentage for a religion reported for China; basically, little information on religions is in the Mondial XML document for China. With further refinement of our query, we could clean up to exclude these “anomalies”.

Note that I did clean up this example query from what was here originally. (See religion-2017-11-24.xq and religion-2017-11-24.xml for the original query and answer document, respectively.) People pointed out that it seemed to have a flaw: it was listing all countries under each religion node, quite incorrectly if what was intended was my English description above! The query was missing a test in the inner for in the main query to walk across the countries for that religion:

for $country in $religions/country[religion/data()=$rel]
The Queries

Write an XQuery query for each of the following, sourcing the MONDIAL III XML dataset.

A-buddhist

Question

Report the countries that have “Buddhist” reported as a religion practiced within the country.

Structure

<buddhist>
    <country name='…' percentage='…'/>
        ⋮
</buddhist>

Instructions

  • For a country’s name, report the country’s name, not its country code.
  • For attribute percentage, report the percentage of the population of that country which is buddhist.
  • Within root <buddhist>, present the <country> list in document order.

Answer XMLA-buddhist.xml

B-straddle

Question

Report countries that straddle two (or more) continents. Include as content which continents the country occupies.

Structure

<straddle>
    <country name='…'>
        <continent name='…'/>
            ⋮
    </country>
        ⋮
</straddle>

Instructions

  • Within root <straddle>, present the <country> list in document order.
  • Within node <country>, present the <continent> node in document order (as they appear within that <country> node within the document).

Answer XMLB-straddle.xml

Notes

  • A node tagged <encompassed> contains the continent information in attribute continent.

C-woe

Question

Report countries that have more than 5% inflation and 10% unemployment.

Structure

<woe>
    <country inflation='…' unemployment='…'>
        <inflation>…</inflation>
        <unemployment>…</unemployment>
    </country>
        ⋮
</woe>

Instructions

  • Within root <woe>, present the <country> list in document order.

Answer XMLC-woe.xml

D-summary

Question

For each country, report its name, capital, population, and size. Report every country; if one of the requested pieces of information for the country is missing, just leave it out.

Structure

<summary>
    <country>
        <name>…</name>
        <capital>…</capital>
        <population year='…'>…</population>
        <size>…</size>
        <inception>…</inception>
    </country>
        ⋮
</summary>

Instructions

  • For name, report the country’s name, not its country code.
  • For capital, report the city’s name, not the document’s string for it. E.g., France’s capital should be reported as “Paris”, not as “cty-France-Paris”. If there is more than one name for the city provided, choose the first.
  • For population, report the population count of the most recent (latest) year reported. Show the year in the results as an attribute. (Only have that one attribute, no more!)
    Note that all <population> nodes in the dataset have a year attribute. You may assume this.
  • Size is reported as attribute area in the document.
  • For the inception data, use the value of the node <indep_date> under <country>.
    If a country has no indep_date reported, do _not include <inception> for that country in the results.
  • Within root <summary>, sort the <country> list by name.

Answer XMLD-summary.xml

E-alpha

Question

For each country, report the alpha city for that country; that is, the city in the country with the largest population.

Structure

<alpha>
    <country name='…'>
        <alpha name='…' population='…'/>
    </country>
        ⋮
</alpha>

Instructions

  • Use the country’s name, not country code.
  • In many cases, several populations are reported for a city, as measured in different years; use the latest year’s population.
  • If a city has no population reported, ignore it.
    In the dataset, when a <population> node is present for a city, it has a year attribute. You may assume this.
  • Within root <cities>, sort the <country> list by country name.
  • In the extremely rare case a country has more than one largest city (that is, with the exact same reported population for each), report each, ordered by the cities’ names.

Answer XMLE-alpha.xml

F-rivers

Question

For each river mentioned, report it by name (as an attribute) and contain the list of the countries by name that the river runs through.

Structure

<rivers>
    <river name='…'>
        <country name='…'/>
            ⋮
    </river>
        ⋮
</rivers>

Instructions

  • For a country’s name, report the country’s name, not its country code.
  • Do not repeat a <river> more than once; i.e., ensure the list of rivers is distinct.
    • Consider anything in node <located_at> with a value of “river” for attribute watertype as a river. Take the value of attribute river of such nodes in the document to be the river’s name.
    • Rename the river not to include “river-” or ”sea-” at the front. See function replace for this.
      Note, do all XPath matching you need to do before this string replace!
  • Do not repeat a <country> more than once within a <river> node; i.e., ensure the list of countries per river is distinct.
  • Within root <rivers>, order the <river> list by the rivers’ names.
  • Within a <river> node, order the <country> nodes by the countries’ names.

Answer XMLF-rivers.xml

Notes

  • There is one value of attribute river in a matching <located_in> node that contains more than one ”value”; that is, the string contains two names separated by a space. (This is “river-Missouri_River river-Mississippi_River”.)
    Ignore this one exception in the semantics and construction of your query. In the end, convert this exceptional value to “Missouri_River”. This can be accomplished at the time you do your replace to rid of the “river-” prefix; e.g., replace($rName, '^[^-]*-([^-]*).*$', '$1').

G-bordering

Question

For each country, list the countries that border it by name. Place within the bordering <neighbour> a node <length> that contains the length of the shared border.

Structure

<countries>
    <country name='…'>
        <neighbour name='…'>
            <length>…</length>
        </neighbour>
            ⋮
    </country>
        ⋮
</countries>

Instructions

  • For countries and neighbours, use the country’s name, not country code.
  • If a country’s node contains no border information, do not list it.
  • Within root <countries>, sort the <country> list by name.
  • Within each <country> node, sort the <neighbour> nodes by name.

Answer XMLG-bordering.xml

H-languages

Question

Generate a document that reports for each language, the countries that have a reported population that speaks that language. Report in an attribute speakers for <country> an estimate of the number of speakers of that language (as country’s population times the percentage that speak that language).

Structure

<languages>
    <language name='…'>
        <country name='…' speakers='…'/>
            ⋮
    </language>
        ⋮
</languages>

Instructions

  • For calculating speakers, when more than one population number is reported for a country, use the one with the latest year.
  • If the language’s percentage for that country is missing, or if the country’s population is missing, leave out speakers.
  • Sort the <language> list within the <languages> root by language name (ascending), and the <country> list within each <language> node by number of speakers descending. (Place any that have no speakers at the end sorted by the languages’ names.)

Answer XMLH-languages.xml

Notes

  • Because “/” is so overloaded in XML & XQuery, it is not used for the divide operator. Instead, the keyword div is used instead.
  • To cast a value to an integer is xs:integer(…).
  • The function round(…) rounds its real-number argument to the nearest integer.
  • To order a for‘s results descending, use order by … descendingAscending is the default.
    To order by a “numeric” value, cast the variable to numeric. E.g., order by xs:integer($speakers) descending.

I-gdppc

Question

Report the aggregate gdp per capita (gdppc) for democracies versus non-democracies.

Structure

<gdp_per_capita>
    <countries government='…' gdppc='…'/>
        ⋮
</gdp_per_capita>

Instructions

  • Only consider countries with a reported gdp_total.
  • Consider a country a democracy if it has “democracy”, “republic”, or “constitutional monarchy”, but not “dictator”, in the contents of its <government> node. In the answer XML, set the value of the attribute government for the report node <countries> for it to be “democracy”. Consider all other countries (with a reported gdp_totalnon-democracies. In the answer XML, set the value of the attribute government for the report node <countries> for it to be “non-democracy”.
  • Compute the GDP per capita as the summed GDP’s of the countries divided by the sum of their populations. Find the population of each country, using the last reported year for its population.
  • The GDP’s reported in the dataset seem to be in US dollars and in millions. Adjust to report your gross domestic product per capita in dollars; e.g., $11,507.48.

Answer XMLI-gdppc.xml

Notes

  • “||” is string concatenation in XQuery. E.g., '$' || '1000' would result in $1000.
  • The function format-number(…, …) can be used to format a number. the first argument is the number value, the second is the format directive. E.g., format-number('12345.6789', '#,##0.00') would result in 12,345.68.

J-continents

Question

Report for each continent the land area of the continent as size and the number of countries on that continent.

Structure

<continents>
    <continent name='…' size='…' countries='…'>
        <country name='…' size='…'/>
            ⋮
    </continent>
        ⋮
</continents>

Instructions

  • As from an earlier query, some countries straddle several continents. The node <encompassed> reports the percentage of land area of the country within that continent. Pro-rate the country’s contribution to the continent’s size by the percentage.
    Note that every <encompassed> has a percentage attribute; the value is 100% for any country entirely within that continent.
    For country count, count the country with each continent that it straddles.
  • Within root <continents>, order the <continent> list by name.
  • Within a node <continent>, order the <country> list by country name. For area in country, list the area of the country belonging to that continent.
  • For area of country reported, round to integer.

Answer XMLJ-continents.xml

Deliverables

Submit

Use the “submit” command on a PRISM machine to turn in your program. Create a directory named xquery/ for your assignment. Have the following 21 files in it:

  • cover page
    • readme.txt,
  • the queries
    • A-buddhist.xq
    • B-straddle.xq
    • C-woe.xq
    • D-summary.xq
    • E-alpha.xq
    • F-rivers.xq
    • G-bordering.xq
    • H-languages.xq
    • I-gdppc.xq
    • J-continents.xq
  • the answer XML documents
    • A-buddhist.xml
    • B-straddle.xml
    • C-woe.xml
    • D-summary.xml
    • E-alpha.xml
    • F-rivers.xml
    • G-bordering.xml
    • H-languages.xml
    • I-gdppc.xml
    • J-continents.xml

In the readme file, write in plain-text your name and student#, and include a small write-up of any issues or problems that you encountered that you would like me to consider, or deviations in your answers that you feel are justified. (The write-up can be empty, if you have nothing to highlight.)

The .xq files are plain-text files with your XQuery query implementations. The .xml files are the corresponding answer XML files, as evaluated against mondial-2015.xml.

% submit 4415 xquery xquery/

DueMonday 22 March before midnight.

FAQ

Should I reference to a local copy of mondial-2015.xml or the copy at EECS through the URI for what I turn in?

Ideally, via the URI. But it is okay if you do it via a local reference.

When I reference the mondial-2015.xml document via the URI, some of my answers have some text with messed up characters. This does not happen when I reference a local copy of mondial-2015.xml. What is the difference?

It is an artifact of how the EECS’s Apache web server is delivering the document. The document’s character encoding is unicode. But that is lost on delivery.

There should not be a difference if all were working correctly. (I have tried to track down that Apache bug in our Apache configuration — I am assuming it is a bug — but haven’t found it yet.)

Don’t worry about the character encoding issue. I will ignore this when marking the assignment.

I cannot understand the results of the religion.xq example query.

Indeed! It was not doing what I intended; though, it wasn’t stated what the query was meant to evaluate. Of course, the query itself is reasonably self-explanatory.

I added an explanation above and fixed the flaw.

PG hanko