DAAO Search - background information

Scope

This page is primarily intended to help understand the operation of the two main DAAO searches available to the general public. These are the "quick search" and the "advanced search". Both are available from buttom links on the main DAAO page. The quick search also appears in a different guise at the top right of every page as button labelled "Search" next to a small text input box.

Data stored in the DAAO

DAAO is a system for authoring, controlling, maintaining and displaying data about Australian artists. It is stored in a standard relational database. The data is stored in several database tables. These tables are: artists, bios, events, footprints, projects, publications, reviewer_media, reviewer_regions, reviewers, supplements, topics, users. The bios and topics tables are the biggest and nearly all the information shown to the web user when they view an artist's biography comes from these 2 tables. The other tables are part of the infrastructure needed to support the authoring system and are not intended for public display. Nevertheless, there are "behind the scenes" searches done on this non-public data (eg for maintainers to find an appropriately qualified reviewer). However, such searches are out of the scope of this background information page.

The bios table contains the main information on an artist, namely the actual text, the prose constituting the biography. There is one entry per artist (technically, one row in the bios table per artist). Currently (mid 2007) there are over 7000 rows in this table.

Actually it is not QUITE true to say one bios row per artist. That is because the DAAO system allows biographies to be substantially revised to create a new version. This new version is assigned a status of 'Current' (aka 'Public') to distinguish it from the previous version which is given a status of 'Superseded' but which is still available on the system (and discoverable by standard search). Furthermore, the DAAO system allows, in special circumstances, multiple biographies to be in 'Current' status for a given artist - this is when a second biography approaches the artist from a very different slant. We call these "parallel" or "variant" biographies, and of course they are also discoverable by the standard searches available to the general public.

The bios table has links with the projects, users, supplements and topics tables. The topics table holds the associated structured metadata and is an integral part of the whole biography as displayed to the user. Topics are searchable by the public user and are discussed in greater detail below. The users table contains details associated with the userid of the member who created the DAAO biography entry (not be confused with the "biographer"!) However the userid is NOT shown in the standard biography and is not usable as a search term in the public searches. Similarly the projects table contains information needed to help track the biography through its authoring workflow but none of this is shown to the general public nor can it be used as a search term in the public searches. The supplements table may sometimes contain "supplementary notes" for an artist's biography. These are potentially ephemeral, unreviewed pieces of commentary, and while they are displayed in the public biography display page, they are not included as explicit or implicit search terms in the standard public searches.

The DAAO search engine

The DAAO is a fairly low budget site and we are working with the inbuilt facilities provided by the open source database we have chosen, namely "MySQL", rather than using a special separate search engine.

MySQL of course provides standard SQL searches through its SELECT statement. The advanced search uses this approach. This enables the advanced search to perform fine grained searching.

MySQL also provides a simple inbuilt search engine. The quick search uses this feature. This enables the quick search to have a simple interface and wide coverage and return results ranked by relevance.

If you are interested, an appendix at the bottom of this page shows the internal SQL generated from some example searches, showing the difference between quick search and advanced search.

The DAAO text indexes

Because search is important to the DAAO, we have set up several "hidden" index fields associated with each biography. These are derived from the display data that a users sees when they view are record but are processed (eg by converting to lower case and excluding html tags and any weird special symbols) so as to be more "search friendly" internally.

These hidden DAAO indexes are updated whenever a significant updating event happens in authoring. If we were to change the rules by which we build these indexes (eg the "dumb down" process described below), then we would have to run a special background job to rebuild them all overnight. The hidden DAAO indexes are listed below:

1. phonetic_key.
This is a special key calculated from the name (eg "ABT" is calculated from "Abbot, Mary"). Its purpose is to enable "fuzzy" searches which can often find similar names or wrongly spelled names. This phonetic index is NOT used in either the quick search or the advanced search but is used in a special search to attempt to find duplicates when an author begins a new biography.

2. content_index
This is an "invisible" part of the biography. It is calculated from the actual content by a "dumbdown" process. That is, any special symbols (like the trademark sign) are dropped, ant letters with diacritics (such as e acute or e grave) are replaced by their base letter, any markup (like bold or italic html tags) is removed, any short words (3 letters or less) are removed. This makes the content_index 75% of the size of the raw content. This index is used by both quick and advanced search.

3. artist_name_index
This is a dumbed down version of the artist name in the biography, as described for content_index but with one difference - the short words (eg "de" or "van") in the name are not removed. This is not used by the advanced search but is thrown into the mix for the quick search. The simple search, however (due to another layer of indexing) cannot find short name components.

4. meta_index
This is stored in the biography for efficiency, although it is derived from associated metadata topic (which are scatterd in the separate topics table). Any text value in the metadata (eg place, person. author note etc, but NOT date values) is lumped into this index and it is "dumbed down" as described for content_index. This is used by the quick search, not the advanced search.

5. val_index
This is derived from a metadata topic and stored with that topic (ie not with the biography). The text will be taken from the text part(s) of the particular topic (eg place, person. author note etc, but NOT date values). It is "dumbed down" as described for content_index. This is not used by the quick search which instead uses the overall lumped index, "meta_index" described above. This val_index is used by the advanced search which needs a topic by topic breakdown. Note that there is a pragmatic simplification here - eg for the Reference topic, the sub fileds "Published", "Title", "Author" and "Note" are concatenated and treated as one text field.

6. date_point and date_end_point
These are not text indexes, but are an additional alternative way the system can search dates. Dates are captured and stored in ISO 8601 standard format by DAAO (eg a date range may be given and stored as "1920-01-01/1993-05-30"). Such a range is also internally stored as 2 SQL dates for added flexibility. Even a point date (eg decimal currency day, 1966-02-14 is stored as a range from 1966-02-14 through to 1966-02-14). This is used by advanced search. See also "Appendix 2 - sample dates in the system", at the bottom of this document.

The search engine index

The Mysql search engine is able to have a "fulltext index" defined on any combination of columns in a table. We have chosen to define a fulltext index on the biography table, encompassing 3 of the hidden index fields described in the previous section. Namely artist_name_index plus content_index plus meta_index.

The search engine fulltext index is used by the quick search. At the moment, no other search uses it.

Certain parameters of the fulltext index generation can be changed. MySQL documentation recommends doing this with caution. We could probably change the stopword list and minimum word size word frequency threshold for example. This would involve modifying a configuration file, restarting the server and rebuilding the fulltext index. Changes like this would make the search slower, the database bigger, and the relevance ranking worse - a classic trade-off between recall and precision/performance.

More information on this fulltext index is available under the description of the quick search.

Quick search

How to recognise quick search

You can do an immediate quick search by clicking "Search:" at the box at the top of each page. If the page title is "Quick search" or "Quick search results" it will have a simple form on it for submitting a quick search. Orange navigation links on a few other pages (eg the main page) may link to the quick search - typically they will be labelled "simple full text of biographies".

What can I search with quick search

Pages will be returned regardless of the biography status, eg "Current", "Superseded" or even "New" [though perhaps the "New", ie being authored, should be automatically excluded!].

Quick search is designed to be simple. The user does not choose what is being searched. The search is done on a special index which is formed from columns in the bios row called "artist_name_index", "content_index" and "meta_index". Effectively this means the search will be against the artist name, the biography text, and all the text in associated metadata.

The 3 index columns mentioned above are derived from the name, biography text and metadata topics respectively, as you would imagine. Any special characters are either deleted or replaced by the nearest ascii equivalent (eg e acute, e grave, e circumflex, e umlaut etc are replaced by plain 'e'). All upper case is folded into lower (eg "Jane Citizen" becomes "jane citizen"). For the biography content and the metadata content (but not the name) words shorter than 4 letters are dropped. The metadata index comes from words found in any metadata topics for a biography - in the value, place, author, notes etc textual metadata subfields, but NOT any date subfields (Quick search is a text search1).

Quick search does not directly search the above three indexes however, but a FOURTH index built on top of the indexes described above. Quick search (unlike advanced search) uses a search engine facility built into the MySQL database. This underlying search engine facility creates its own index from the combination. This involes storing word frequencies, and ignoring stop words and extremely common words as is typical of efficient search engines. Consequently quick search can deliver results in relevance order. And users who inadvertantly or naively use words such as "the" in their searches will not be penalised by getting irrelevant results.

How is my quick search query treated?

In quick search, the query is treated as a list of words (separated by spaces). The words are given similar treatment to the index words they are matched against - ie letters with diacritics are replaced by ordinary letters, and they are made lower case.

The biographies actually retrieved are those which contain some or all of the query words to a greater extent than average. For example if the query was "Ballarat sculpture" then biographies (and associated metadata) which are "heavy" in both these words will be at the top of the relevance ranking (ie at the top of the result list). However it is possible that a high ranking document may not even mention "sculpture" if is very heavily mentions "ballarat".

Advanced search

How to recognise advanced search

You can navigate to the advanced search from the main page or from other search pages. The link will be labelled "advanced search of biographies". The actual search form will be seen on pages entitled either "Advanced search" or "Advanced search results".

How is advanced search different to quick search

As is plain to see, advanced search allows boolean (ie 'and', 'or' or 'not') combinations of metadata to be added to the mix.

But that is not the only difference. Even if the boolean options are not used, advanced search will return different results to quick search using the same query. This is because advanced search uses SQL wildcard select statements instead of the MySQL search engine facility. Below is a list of differences to be found between a quick search and a limited advanced search using ONLY the content box:

What can I search with advanced search

Pages will be returned regardless of the biography status, eg "Current", "Superseded" or even "New" [though perhaps the "New", ie being authored, should be automatically excluded!].

how are dates searched in advanced search

To understand this section on dates you may wish to familiarise yourself with how dates are stored in DAAO - see "Appendix 2 - sample dates in the system".

The advanced search is generous and flexible in allowing the user to specify dates. Either yyyy mm dd or even dd mm yyyy for a single date or as parts of a date range as well as plain yyyy as a single date or part of a date range. The month and day separator can be entered as a blank, '/' or '-'. The system also recognises a few oddities such as "1900's" which are unlikely to be actually used. Don't bother mixing date formats in a range or attempting "yyyy mm" without the "dd" - it won't be accepted.

Examples are:
1960-1965 = anywhere in range 1960 through 1965
2000 2010 = first decade of 21st century
1800/1850 = anywhere in range 1800 through 1850
14-02-1966 = decimal currency day
2001/9/11 = "9-11" USA twin towers bombing day
1999-12-24/2000-1-1 = Y2K christmas festivity period

Note that the query date value is internally representated as a time range. For example April fool's day 2008 is the range from 2008-04-01 (start of day) to 2008-04-01 (midnight). This raises the important question of how to match query dates with stored dates, given that both are internally represented as a time range. The answer is that DAAO advanced search considers it a match if the query range overlaps the stored range.

what if a metadata topic has date AND text in advanced search

A number of topics are like this, eg Birth has a place and a date. If you only enter a date, the search will not care what, if anything is indexed for the place. Conversely if you only enter a place then the search will not try to match the date, if any. However, if you fill both the text (keywords) and date boxes against (say the Birth" topic) then the search will only condider it a match if BOTH date and text match. [assuming you selected the default boolean 'and' condition].

What about null/empty values in advanced search

Usually this is impossible to find. A blank in a box (eg content or topic keywords) simply means that the search does not even attempt to match that value. The database does not store null values in any case. The only way to SOMETIMES do it, in special situations, for real experts only, is a "not" search. For example if you wanted to find out all the artist who have no gender code, you could try not male and not female (although that would include gender values of "both" and "unknown" which are theoretically allowed).

How do boolean operators 'and', .not' and 'or' combine

It is best to steer clear of these. Using a 'not' will often make the search slow, and using an 'or' will often produce too many results (and may be slow).

If you must use them you will need to know how they "bracket". If all the boolean operators are 'and' or 'or' then it works as expected (no need to care about the order of evaluation). Understanding 'not' is pretty simple too - just remember that it is read as 'and not' and the 'not' applies just to the topic on the same line (this is intuitive). The only slightly tricky case is when there is a single 'or' in the mix. In this case the 'or' is evaluated first. AS an example:
"birth=sydney AND medium=sculpture OR medium=paint"
is equivalent to:
"birth=sydney AND (medium=sculpture OR medium=paint)"
and will find all sydney artist as long as they used either sculpture medium or paint medium (or both).
whereas:
"birth=sydney OR medium=sculpture AND medium=paint"
is equivalent to:
"(birth=sydney OR medium=sculpture) AND medium=paint"
and will find all painters as long as they were either born in sydney or used the sculpture medium (or both).

What is the "raw" option

"raw" only applies to search actions relating to the main content box, not to the topic boxes. Normally the content is searched using the content_index column described elsewhere. However, if you need to search for words (not word fragments) of 3 letters or less, then you should use the "raw" option to search the visible content instead of the hidden content_index. The search is still done ignoring case distinctions, but includes "a", "to", "and" etc. There is a downside - namely that bold or italic markup might sometimes thwart a query match which crosses a markup boundary. That should be a rare case tough.

What is the "exact" option

"exact" only applies to search actions relating to the main content box, not to the topic boxes. "Exact" specifies that blanks in your query are treated as blanks (not as wild cards). Some search engines might call this a "phrase search". If you must include words less than 4 letters in length in the exact phrase, then you should try using "exact" in conjunction with "raw".

Other searches

Name searches

On the main page, there is an a-z name search and a name search starting with a name fragment. These do not use the simple search nor the advanced search under the hood - they simply select a range of biographies using the "artist_name_index" described above. This index also determines the order, results are in lexical order of artist name, ignoring case.

Specialty searches

The "on this day" link involves an underlying search involving metadata values. This is hard coded and does not concern us here. The same goes for the "random" bighraphy and "most recently changed biography which are not really searches.

Metadata explorer

If you click on "explore metadata topics" on the main page, you get a page entitled "Show topics" which summarises the metadata topics (ie gives definitions, subfields and the number of distinct values in the database for each topic). You can drill down by clicking. Eg click on "birth" to get a page with more detail on that topic. This page exposes a simple search form. eg select "Place", enter "sydney" and "sydneyz" for the range and click "scan" to get a list counts of artists born in sydney, then click say "Sydney, NSW" to get the list of artist described as being born in that place (as distinct from "Sydney, N.S.W" and "Sydney, New South Wales"!).

The scan function offers functionality not available in quick or advanced search because it uses the direct displayable metadata values rather than index values (so there is no lumping text together or short word droppping). However its search cannot be combined with a content search and it can be slow sometimes.

Non public searches

There are searches available only to privileged logged in users such as maintainers. (eg search of reviewer credentials). These are separate from the quick search and advanced search and are naturally specific searches, usually searching data NOT in the biography table.

Appendix 1 - search examples, showing internals

Example quick search showing internals

A searcher wants to find any biographies which have the word "enlightenment" in biography or metadata, so they just type that word (without quotes, of course) into the quick search box.

Generated search description:
Searched for Biography contains enlightenment

Generated SQL select statement: select id,version,status,artist_name where match (artist_name_index,content_index,meta_index) against(''enlightenment'') limit 20;

Example trivial advanced search showing internals

this first example shows a trivial use of advanced search - for comparison with quick search.

The same example as for quick search (above). A searcher wants to find any biographies which have the word "enlightenment" in biography content, so they just type that word (without quotes, of course) into the advanced search content box (ignoring all other boxes in the form).

Generated search description:
Searched for Biography index contains enlightenment

Generated SQL select statement: select distinct status, version, artist_name, bios.id where content_index like '%enlightenment%' order by artist_name_index limit 20 ;

Example complex advanced search showing internals

this second example shows a more complex example of advanced search use with multiple metadata constraints.

A searcher wants to find any biographies which have the phrase "women's weekly" in the biography content. They want to refine this to include only those artist who are recorded as painters (ie have medium of "painting" specified in maetadata). They want to further refine this to include only artists born in the early part of last century in Sydney. And lastly they require that the artist is not recorded in DAAO as being dead.

How search was entered:
type "women's weekly" content box;
check the exact box;
select "medium", type "paint" in its keywords;
select "birth", type "sydney" in its keyword, "1900 1920" in its date;
select "death", type "1900 2100" in its date;
select "not" in the "and/or/not" box to the left of "death" topic;
click search button.

Generated search description:
Searched for Biography index contains exact women's weekly AND Medium contains paint AND Birth contains sydney Birth has a date between 1900 and 1920-12-31 NOT Death has a date between 1900 and 2100-12-31

Generated SQL select statement: select distinct status, version, artist_name, bios.id left join topics as t1 on bios.id = t1.bio_id left join topics as t2 on bios.id = t2.bio_id left join topics as tNot3 on bios.id = tNot3.bio_id where content_index like '%women's weekly%' and bios.id = t1.bio_id and t1.rel_type = 'hasMedia' and t1.val_index like '%paint%' and bios.id = t2.bio_id and t2.rel_type = 'hasBirth' and t2.val_index like '%sydney%' and (t2.date_point <= '1920-12-31' and t2.date_end_point >= '1900') and not exists (select * from topics tNot3 where bios.id = tNot3.bio_id and tNot3.rel_type = 'hasDeath' and (tNot3.date_point <= '2100-12-31' and tNot3.date_end_point >= '1900') ) order by artist_name_index limit 20 ;

Appendix 2 - sample dates in the system

Mostly, authors specify dates and date ranges to year accuracy. That is, months and days are omitted. The table below shows some sample ISO standard dates/dateranges found in the DAAO and gives the internally generated start end dates corresponding. Note that a '?' extends the range depending on the precision given (ie years, months or days) with a special twist where dates with year precision ending in zero are extended even more.

+-----------------------+------------+----------------+
| daterange (ISO 8601)  | date_point | date_end_point |
+-----------------------+------------+----------------+
| 1850                  | 1850-01-01 | 1850-12-31     |
| 1853/1858?            | 1850-01-01 | 1861-12-31     |
| 1851/1854             | 1851-01-01 | 1854-12-31     |
| 1854?                 | 1851-01-01 | 1857-12-31     |
| 1854-07/1854?         | 1854-01-01 | 1857-12-31     |
| 1854-09-17/1854?      | 1854-09-03 | 1857-12-31     |
| 1858-01-08            | 1858-01-08 | 1858-01-08     |
| 1859-08               | 1859-08-01 | 1859-08-31     |
| 1871/1871-06-29?      | 1868-01-01 | 1871-07-13     |
| 1885-10-29/1885-10-29 | 1885-10-29 | 1885-10-29     |
| 1951-03?              | 1950-09-01 | 1951-09-30     |
| 1964-09-11/1964-09-19 | 1964-09-11 | 1964-09-19     |
| 1980?                 | 1974-01-01 | 1986-12-31     |
| 2000                  | 2000-01-01 | 2000-12-31     |
+-----------------------+------------+----------------+
Written August 2007 - JGE