Samples for taxonomy_xml

Distributed with the taxonomy_xml module is a collection of starter vocabularies intended both to illustrate the various formats and to provide a few useful topic sets.

The content of each of the demo vocabularies was the responsibility of the original publishers at the time it was imported. All imports were done in a semi-automated manner with no editorial input. I am not responsible for errors of fact or spelling.
Structural problems, character encoding problems and the occasional omission are probably my fault. Caveat lector.
Credit is given here to the institutions that made this data available. All data redistributed here has been carefully selected as being free of copyright restrictions on transformative re-use.
In some cases, tools or instructions will also be provided for you to import your own versions of vocabulary libraries for reasons of scale, timeliness or copyright. Where copyright applies, you should read and understand the terms of use of the respective data source. Usually the terms are "free for personal use but not redistribution", and the taxonomy_xml module can enable that use.

Dewey Decimal System

Subject area: Publishing, General Interest.

Taxonomy Format: CSV.

Although ownership of the Dewey Decimal system is claimed by OCLC (Online Computer Library Center), they don't actually provide any list (or offer access to a list) as a machine-readable download, so I was unable to use them as a source.
Instead I found a public library website that had released the Dewey lists into the public domain. (It has since gone away.)

As samples, the taxonomy_xml module contains both a 100-term and a 1000-term* version of the Dewey classification scheme, with the implied decimal hierarchy and the 'Dewey Number' supplied as a synonym.
As the Dewey system is extremely simple, it is provided as an example of the CSV format.

Geography & history (900)
 +  History of ancient world (930)
 +   +  History of ancient world China (931)
 +   +  History of ancient world Egypt (932)
 +   +  History of ancient world Europe north & west of Italy  (936)
 +   +  History of ancient world Greece (938)
* There aren't really 1000 terms in use at that level. There are, however, many more subsections on a truly decimal breakdown in some areas (not included).
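
The 'implied decimal hierarchy' can be illustrated with a short sketch. This is not the module's own code; it only assumes three-digit codes where the hundreds digit gives the top-level class and the tens digit the division, as in the listing above. The function names are made up for this example.

```python
def dewey_parent(code):
    """Return the implied parent of a three-digit Dewey code, or None for a top-level class."""
    n = int(code)
    if n % 100 == 0:
        return None                      # 000, 100, ..., 900 are top-level classes
    if n % 10 == 0:
        return f"{n - n % 100:03d}"      # e.g. 930 -> 900
    return f"{n - n % 10:03d}"           # e.g. 931 -> 930


def dewey_ancestry(code):
    """List the chain of codes from the top-level class down to the given code."""
    chain = [f"{int(code):03d}"]
    while (parent := dewey_parent(chain[0])) is not None:
        chain.insert(0, parent)
    return chain


print(dewey_ancestry("931"))  # ['900', '930', '931']
```

So 'History of ancient world China (931)' sits under 930, which in turn sits under 900, matching the tree shown above.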

International Press Telecommunications Council (IPTC) Topic Catalog

Subject area: Publishing, News Media.

Taxonomy Format: RDF.

From the International Press Telecommunications Council we have a 'TopicSet' of 1365 controlled vocabulary words and phrases (subjectCodes) useful for classifying news stories and tagging media releases.

The taxonomy is hierarchical, and contains a full-text description of each term and a UID number provided by the IPTC. It does not contain synonyms or related terms (although it probably should).

Subject areas include branches like:

unrest, conflicts and war
 +  act of terror
 +  armed conflict
 +  civil unrest
 +   +  political dissent
 +   +  rebellions
 +   +  religious conflict
 +   +  revolutions

This data was imported by way of an XSL transformation from the XML file topicset.iptc-subjectcode.xml, taken from the IPTC site in 2007. The IPTC also maintains several other useful vocabularies on their (hard to bookmark) Resource page. Visit them for more.

Services of New Zealand (SONZ) Suggested Vocabulary

Subject area: Government.

Taxonomy Format: CSV/Service.

The E-government Initiative from the New Zealand government has produced the NZGLS thesauri - including a list of 2364 keyword-type ratified terms to be used when classifying government services or interest areas. It is only lightly hierarchical, and exists mainly as a synonym collapser and list of 'preferred' consistent terminology.

It contains many 'related terms' as well as several weaker synonyms for many terms.

  (Related Terms: Pilots, Aviation) 
  (Synonyms: Light aircraft, Airships, Aeroplanes)
 +  Helicopters
 +  Microlite Aircraft
  (Related Terms: Aviation) 
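
The 'synonym collapser' role can be sketched as a simple lookup from any known synonym to its preferred term. This is only an illustration, not how taxonomy_xml stores terms: the function names are invented, and the preferred term "Aircraft" is an assumed placeholder, since the listing above does not show the name of the term that owns those synonyms.

```python
def build_synonym_index(terms):
    """Map every synonym (and each preferred name itself) to its preferred term, ignoring case."""
    index = {}
    for preferred, synonyms in terms.items():
        index[preferred.lower()] = preferred
        for synonym in synonyms:
            index[synonym.lower()] = preferred
    return index


def collapse(tag, index):
    """Return the preferred term for a tag, or the tag unchanged if it is not known."""
    return index.get(tag.strip().lower(), tag)


# "Aircraft" is an assumed placeholder; the synonyms are taken from the sample listing.
terms = {"Aircraft": ["Light aircraft", "Airships", "Aeroplanes"]}
index = build_synonym_index(terms)

print(collapse("aeroplanes", index))   # Aircraft
print(collapse("Submarines", index))   # Submarines (unknown tags pass through)
```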

This data is currently being retrieved directly from the website as a demonstration of the simplest kind of web service the taxonomy_xml module supports. The original file is provided as a CSV which is retrieved directly from the URL when the taxonomy_xml admin selects [Web Service][SONZ] as an import source.

This dataset is in fact the first test case, and the reason I started developing syntax readers for Drupal taxonomies.

Google Merchant "Product Type" taxonomy

Subject area: Commerce.

Taxonomy Format: CSV-ancestry.

This is a copy of a subset of the Google Merchant recommended product category labels. The full thing is documented and downloadable from the Google Merchant Centre Help Pages.

The distributed version contains only the top two levels (200 terms). The full thing - which you can download, convert to CSV and import yourself - can go to 5 levels deep and contain close to 4000 terms.

This is an alternate CSV format, placing each term on its own line with its ancestors repeated in the preceding columns.

Media, Books
Media, Books, Fiction
Media, Books, Non-fiction
Media, DVDs & Videos
Media, Magazines & Newspapers
Media, Music
Media, Sheet Music

...etc. It's very limited (and wordy), but also about as obvious as possible.
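
To show how little work the ancestry layout needs, here is a minimal sketch of a reader for it. It is not the module's importer; the function name and the path-to-parent mapping are choices made for this example, and it assumes the last column of each row is the term itself.

```python
import csv
import io


def parse_ancestry_csv(text):
    """Return {term_path: parent_path}; each path is a tuple of names from the root down."""
    parents = {}
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        path = tuple(cell.strip() for cell in row if cell.strip())
        if not path:
            continue  # skip blank lines
        # Register the term itself plus any ancestors that have not been seen yet.
        for depth in range(1, len(path) + 1):
            parents.setdefault(path[:depth], path[:depth - 1] or None)
    return parents


sample = """Media, Books
Media, Books, Fiction
Media, Books, Non-fiction
Media, DVDs & Videos
"""

tree = parse_ancestry_csv(sample)
for path in tree:
    print("  " * (len(path) - 1) + path[-1])
```

Note that "Media" never appears on a line of its own in the sample, so the sketch creates it from the first row that mentions it. A term whose name contains a comma would have to be quoted in the file for a reader like this to keep it in one column.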

This format was used by Google Base for its merchant product taxonomy, and represents the terms it wants to see in product descriptions. It could serve as a start for organizing an ecommerce store.

Top-level headings are:
Arts & Entertainment
Baby & Toddler
Business & Industrial
Cameras & Optics
Clothing & Accessories
Food, Beverages & Tobacco
Health & Beauty
Home & Garden
Office Supplies
Sporting Goods
Toys & Games
Vehicles & Parts