Introduction to search options for Django

Search is a basic feature of every web application. Users expect to enter a keyword or phrase and be taken to accurate results in a timely manner, so sooner or later you'll need to deal with search in a Django project.

There are various ways to work with search in Django. One of the basic approaches is to execute Django queries on behalf of users to perform searches against Django model objects that are stored in a project's relational database. Another approach is to integrate a full-text search engine into a Django project to enable more sophisticated and performant searches. And yet another approach is to rely on a public search engine (e.g. Google, DuckDuckGo, Bing) to crawl a Django project's content and let new users discover & search content belonging to a Django project. Each approach has its advantages and disadvantages, as well as different implementation details that you'll learn in this chapter.

Django model search: Model queries & indexes

Because a Django project's data is stored primarily in a relational database, the most basic approach to Django search consists of using Django queries, supported by indexes, to extract results from the relational database connected to a project. For example, let's say you have a coffeehouse application with a Django model for Store objects and want to let users search for stores in a given city.

The first step in this process is to create an HTML form so users can enter a city, similar to the following snippet: <form>Search for stores in: <input type="text" name="city"/></form>. This form would then submit the provided city field value to a view method, which would execute a Django query like Store.objects.filter(city=user_provided_city). The results of the query would then be passed to a template to generate a display for the user who initiated the search.

One of the most important aspects that's often forgotten with this Django search approach is to add indexes on the model fields on which queries are performed. Indexes can drastically improve search times because they provide dedicated structures on which to perform queries, something that can be critical for models with either a lot of fields or a large number of objects (i.e. database table rows). Database indexes in general are a deep topic[1], but we'll take a closer look at the details of creating and using indexes with Django models.

The biggest issue you'll face with Django model searches -- such as those presented in the earlier chapter Django model queries & managers -- is that they can quickly break down in terms of relevance & performance for more demanding searches. Searches on low cardinality fields (e.g. store city names), meaning fields with a limited set of short, discrete values across all records, can be done efficiently because regular relational database indexes are designed for exact-match lookups on such values. But suppose you want to allow users to search for Item model objects whose more open-ended description field contains certain keywords (e.g. low calories, vegan).

Although the Django model API supports query operators such as contains and icontains that can solve this problem, these types of Django queries get converted into very inefficient SQL queries (e.g. Item.objects.filter(description__contains='vegan') gets translated into SELECT ... WHERE description LIKE '%vegan%';).

The underlying problem with queries that use a SQL LIKE pattern starting with a wildcard (e.g. '%vegan%') is that most relational databases can't use regular indexes for them. Since these types of search queries require inspecting the entire text contents of a column, the search is done directly on the column, row by row, and regular indexes become a moot point. So if you have thousands of Item objects, each with a description field that contains hundreds of words, and you're trying to find a single word across all descriptions, the query can be very time consuming.
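One way to see this behavior is to ask a database for its query plan. The sketch below uses Python's built-in sqlite3 module as a stand-in for a project's database; the item table, its indexes and the query_plan helper are hypothetical, and other database brands report their plans differently, so treat the output as illustrative rather than representative of every database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
    CREATE INDEX item_name_idx ON item (name);
    CREATE INDEX item_description_idx ON item (description);
    """
)


def query_plan(sql):
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan step.
    return "; ".join(row[-1] for row in rows)


exact_plan = query_plan("SELECT * FROM item WHERE name = 'Mocha'")
like_plan = query_plan("SELECT * FROM item WHERE description LIKE '%vegan%'")

print("exact match:", exact_plan)
print("LIKE search:", like_plan)
```

The exact-match plan should mention item_name_idx, while the LIKE plan should report a full scan of the table even though the description column has its own index.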

So just because you can get away with using Django queries on Django models to create more open-ended searches doesn't mean it's a good choice. For searches spanning more than a couple hundred records with open-ended text, a better technique is to use full-text search.

Django full-text search: Postgres contrib & Haystack (Solr, Elasticsearch, Whoosh and Xapian)

Full-text search on relational databases requires an entirely different approach than search on fields with a limited set of values, a.k.a. low cardinality fields (e.g. cities, sizes). The first thing you need to realize is that full-text search also uses indexes, but not the regular indexes that relational databases generally use and that Django models can create for you out of the box.

Full-text search requires full-text indexes. In very simple terms, building a full-text index consists of splitting open-ended text values (e.g. The quick brown fox jumps over the lazy dog) into keywords (e.g. quick, brown) and creating an index from those keywords to use when a search query is made on the text. Because a full-text index strips stop words (e.g. the) and is a dedicated structure containing the most relevant keywords, full-text searches are more efficient than searching the full text directly.
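To make the idea concrete, here is a toy keyword index written in plain Python. It is only a sketch of the concept, not how Postgres or any search engine platform implements full-text indexes; the stop word list, the sample descriptions and the tokenize and build_index helpers are all made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "of", "over", "the", "with"}


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


descriptions = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "A vegan brownie with low calories",
    3: "The lazy afternoon blend",
}
index = build_index(descriptions)

print(sorted(index["lazy"]))  # ids of the descriptions containing 'lazy'
print("the" in index)         # stop words never reach the index
```

A search for a keyword becomes a single dictionary lookup instead of a pass over every description, which is the efficiency gain the paragraph above refers to.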

In addition, full-text search often uses stemming, a process that consists of reducing related word forms (e.g. jumps, jumped, jumping) to a common stem (e.g. jump), so that a search for any one form also matches the others and the scope of results increases. On top of this, full-text search also often uses metrics like scores and ranks to classify the most relevant results for a given search term (e.g. assign more relevance to results with words appearing toward the start or more than once in the full text).
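The following sketch illustrates both ideas with deliberately crude logic: a suffix-stripping function standing in for a real stemmer, and a score that simply counts matching words. The naive_stem, score and rank helpers and the sample documents are hypothetical; production stemmers and ranking functions are considerably more sophisticated.

```python
import re


def naive_stem(word):
    """Strip a few common English suffixes; real stemmers are far more careful."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def score(text, term):
    """Count how many words in the text share a stem with the search term."""
    target = naive_stem(term.lower())
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if naive_stem(word) == target)


def rank(documents, term):
    """Return (doc_id, score) pairs for matching documents, best match first."""
    scored = [(doc_id, score(text, term)) for doc_id, text in documents.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


documents = {
    1: "The fox jumps over the dog",
    2: "It jumped once, then kept jumping",
    3: "A lazy dog sleeps",
}
print(rank(documents, "jump"))
```

Here the second document ranks first because two of its words reduce to the stem jump, while the third document is left out because none of its words match.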

In essence, full-text search can become very complex compared to Django model searches which perform basic database queries with regular indexes. Full-text search has grown in complexity to the point it varies depending on the relational database brand (e.g. Oracle, Postgres) and there are also completely separate products -- that work alongside relational databases -- known as 'search engine' platforms specifically designed to deal with full-text search.

Django in its out of box state only supports full-text search for Postgres databases through the django.contrib.postgres.search module, which means you can use full-text search features in Django without having to tweak or configure Postgres directly. Although Django supports other relational databases (e.g. MySQL, Oracle) that can do full-text search, Django in itself doesn't support full-text search for these brands, which means you need to take additional steps to use full-text search with Django and these other relational databases (e.g. create full-text indexes manually, create raw SQL statements to run full-text searches).

When it comes to Django support for search engine platforms designed to do full-text search, the leading choice is a Django package called Haystack[2]. Haystack in itself isn't a search engine platform, but rather standardizes access to search engine platforms in Django. Haystack supports four leading search engine platforms: Solr, Elasticsearch, Whoosh and Xapian.

Basically, Haystack is to full-text search in Django what Django's built-in model API is to relational databases: it shields Django full-text search logic from the details of each search engine platform, so the same logic operates across any platform supported by Haystack. This allows you to write full-text search logic that isn't search engine platform specific, and if you want your Django projects to use a different search engine platform in the future, Haystack allows you to easily make this change, just like the Django built-in model API allows you to easily change a Django project's relational database.

Django public search discovery: Sitemaps & robots.txt file

In addition to supporting search functionality for users of your Django applications, another important aspect related to search in Django is search discovery. Even if your Django applications offer the best search and full-text search for users, new users can only arrive if your applications are discovered in the first place, which is where public search engines (e.g. Google, DuckDuckGo, Bing) come into the picture.

Although it's sometimes only a question of time for public search engines to discover a Django application's content, the process can be made easier -- or restricted -- if you follow certain search engine guidelines. To make search discovery easier, search engines rely on a sitemap, which is a file (or files) that contains a web site's various URLs including their characteristics (e.g. how often content changes, relative weight to other URLs). To restrict search discovery, search engines rely on a robots.txt file, which is a file that contains instructions for search engine bots to not crawl certain or all sections of a web site.
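To give a sense of what these two files contain, the sketch below builds a small sitemap by hand and checks a robots.txt rule set using only Python's standard library. The coffeehouse.example domain, the /admin/ section and the build_sitemap helper are hypothetical, and this is not how Django generates sitemaps; it only illustrates the file formats that search engines read.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap XML string from (url, changefreq, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    ("https://coffeehouse.example/stores/1/", "weekly", "0.8"),
    ("https://coffeehouse.example/stores/2/", "weekly", "0.8"),
])
print(sitemap_xml)

# A robots.txt that keeps every crawler out of a hypothetical /admin/ section.
robots_txt = """\
User-agent: *
Disallow: /admin/
"""
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

stores_allowed = parser.can_fetch("*", "https://coffeehouse.example/stores/1/")
admin_allowed = parser.can_fetch("*", "https://coffeehouse.example/admin/")
print(stores_allowed, admin_allowed)
```

The changefreq and priority elements correspond to the URL characteristics mentioned above (how often content changes and relative weight to other URLs).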

Django offers the ability to create sitemaps through the django.contrib.sitemaps module, which in turn allows the exposure of site URLs based on Django urls & models (e.g. a Store model's URLs /stores/1/, /stores/2/). Since a robots.txt file isn't as data driven as a sitemap, you can just create a flat file and serve it as a Django static file under a site's main directory, as described in Set up static web page resources -- Images, CSS, JavaScript.

  1. https://en.wikipedia.org/wiki/Database_index

  2. https://django-haystack.readthedocs.io/en/master/