C:\tools\ruby21\bin> asciidoctor -r asciidoctor-pdf -d book -b pdf -a toc D:/elasticsearch-definitive-guide/book.asciidoc > book.txt
Failed to parse formatted text: This portion of the query is a range filter, which will find all ages older than 30—`gt` stands for greater than.
Failed to parse formatted text: To add data to Elasticsearch, we need an index—a place to store related data. In reality, an index is just a logical namespace that points to one or more physical shards.
Failed to parse formatted text: A cluster health of yellow means that all primary shards are up and running (the cluster is capable of serving any request successfully) but not all replica shards are active. In fact, all three of our replica shards are currently unassigned—they haven’t been allocated to a node. It doesn’t make sense to store copies of the same data on the same node. If we were to lose that node, we would lose all copies of our data.
Failed to parse formatted text: A document doesn’t consist only of its data. It also has metadata—information about the document. The three required metadata elements are as follows:
Failed to parse formatted text: Documents are indexed—stored and made searchable—by using the index API. But first, we need to decide where the document lives. As we just discussed, a document’s _index, _type, and _id uniquely identify the document. We can either provide our own _id value or let the index API generate one for us.
Failed to parse formatted text: Did you notice that the results from the preceding empty search contained documents of different types—`user` and `tweet`—from two different indices—`us` and `gb`?
Failed to parse formatted text: However, arrays are indexed—made searchable—as multivalue fields, which are unordered. At search time, you can’t refer to "the first element" or "the last element." Rather, think of an array as a bag of values.
Failed to parse formatted text: Search lite—a query-string search—is useful for ad hoc queries from the command line. To harness the full power of search, however, you should use the request body search API, so called because most parameters are passed in the HTTP request body instead of in the query string.
Failed to parse formatted text: Request body search—henceforth known as search—not only handles the query itself, but also allows you to return highlighted snippets from your results, aggregate analytics across all results or subsets of results, and return did-you-mean suggestions, which will help guide your users to the best results quickly.
Failed to parse formatted text: The truth is that RFC 7231—the RFC that deals with HTTP semantics and content—does not define what should happen to a GET request with a body! As a result, some HTTP servers allow it, and some—especially caching proxies—don’t.
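For context, the range filter flagged in the first message above renders like this as a request-body search. A minimal sketch modeled on the book's megacorp employee example; the index, type, and field names come from that example, not from this log:

    GET /megacorp/employee/_search
    {
        "query": {
            "filtered": {
                "query":  { "match": { "last_name": "smith" } },
                "filter": { "range": { "age": { "gt": 30 } } }
            }
        }
    }

The filtered query (the pre-2.0 syntax the book uses throughout) pairs a non-scoring range filter, here matching ages over 30, with a scoring full-text query.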
Failed to parse formatted text: The empty search—`{}`—is functionally equivalent to using the match_all query clause, which, as the name suggests, matches all documents:
Failed to parse formatted text: Containing the words quick, brown, and fox—the closer together they are, the more relevant the document
Failed to parse formatted text: Tagged with lucene, search, or java—the more tags, the more relevant the document
Failed to parse formatted text: By default, results are returned sorted by relevance—with the most relevant docs first. Later in this chapter, we explain what we mean by relevance and how it is calculated, but let’s start by looking at the sort parameter and how to use it.
Failed to parse formatted text: The first part is the summary of the calculation. It tells us that it has calculated the weight—the TF/IDF—of the term honeymoon in the field tweet, for document 0. (This is an internal document ID and, for our purposes, can be ignored.)
Failed to parse formatted text: Each shard executes the query locally and builds a sorted priority queue of length from + size—in other words, enough results to satisfy the global search request all by itself. It returns a lightweight list of results to the coordinating node, which contains just the doc IDs and any values required for sorting, such as the _score.
Failed to parse formatted text: Character filters are used to "tidy up" a string before it is tokenized. For instance, if our text is in HTML format, it will contain HTML tags like `<p>` or `<div>` that we don’t want to be indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities like &Aacute; into the corresponding Unicode character Á.
Failed to parse formatted text: A type in Elasticsearch represents a class of similar documents. A type consists of a name—such as user or blogpost—and a mapping. The mapping, like a database schema, describes the fields or properties that documents of that type may have, the datatype of each field—such as string, integer, or date—and how those fields should be indexed and stored by Lucene.
Failed to parse formatted text: We can avoid this problem either by naming the fields differently—for example, title_en and title_es—or by explicitly including the type name in the field name and querying each field separately:
Failed to parse formatted text: Lucene, the Java library on which Elasticsearch is based, introduced the concept of per-segment search. A segment is an inverted index in its own right, but now the word index in Lucene came to mean a collection of segments plus a commit point—a file that lists all known segments, as depicted in A Lucene index with a commit point and three segments. New documents are first added to an in-memory indexing buffer, as shown in A Lucene index with new documents in the in-memory buffer, ready to commit, before being written to an on-disk segment, as in After a commit, a new segment is added to the commit point and the buffer is cleared.
Failed to parse formatted text: The disk is fsync’ed—all writes waiting in the filesystem cache are flushed to disk, to ensure that they have been physically written.
Failed to parse formatted text: String ranges are fine on a field with low cardinality—a small number of unique terms. But the more unique terms you have, the slower the string range will be.
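A sketch of the empty-search equivalence and the sort parameter described above; the date field is an illustrative assumption:

    GET /_search
    {
        "query": { "match_all": {} },
        "sort":  { "date": { "order": "desc" } }
    }

Sending `{}` (or no body at all) produces the same matches as the explicit match_all clause; the sort clause then orders the results by the field value instead of by _score.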
Failed to parse formatted text: Most leaf filters—those dealing directly with fields like the term filter—are cached, while compound filters, like the bool filter, are not.
Failed to parse formatted text: Because the match query has to look for two terms—`["brown","dog"]`—internally it has to execute two `term` queries and combine their individual results into the overall result. To do this, it wraps the two term queries in a bool query, which we examine in detail in Combining Queries.
Failed to parse formatted text: The problem is that, these days, users expect to be able to type all of their search terms into a single field, and expect that the application will figure out how to give them the right results. It is ironic that the multifield search form is known as Advanced Search—it may appear advanced to the user, but it is much simpler to implement.
Failed to parse formatted text: Full-text search is a battle between recall—returning all the documents that are relevant—and precision—not returning irrelevant documents. The goal is to present the user with the most relevant documents on the first page of results.
Failed to parse formatted text: Whereas a phrase query simply excludes documents that don’t contain the exact query phrase, a proximity query—a phrase query where slop is greater than 0—incorporates the proximity of the query terms into the final relevance _score. By setting a high slop value like 50 or 100, you can exclude documents in which the words are really too far apart, but give a higher score to documents in which the words are closer together.
Failed to parse formatted text: Instead of using proximity matching as an absolute requirement, we can use it as a signal—as one of potentially many queries, each of which contributes to the overall score for each document (see Most Fields).
Failed to parse formatted text: search-as-you-type—displaying the most likely results before the user has finished typing the search terms
Failed to parse formatted text: This technique is used to increase recall—the number of relevant documents that a search returns. It is usually used in combination with other techniques, such as shingles (see Finding Associated Words) to improve precision and the relevance score of each document.
Failed to parse formatted text: Imagine that we have a query for "happy hippopotamus." A common word like happy will have a low weight, while an uncommon term like hippopotamus will have a high weight. Let’s assume that happy has a weight of 2 and hippopotamus has a weight of 5. We can plot this simple two-dimensional vector—`[2,5]`—as a line on a graph starting at point (0,0) and ending at point (2,5), as shown in A two-dimensional query vector for "happy hippopotamus" represented.
Failed to parse formatted text: We can create a similar vector for each document, consisting of the weight of each query term—`happy` and `hippopotamus`—that appears in the document, and plot these vectors on the same graph, as shown in Query and document vectors for "happy hippopotamus":
Failed to parse formatted text: Document 1: (happy,__)—`[2,0]`
Failed to parse formatted text: Document 2: ( _ ,hippopotamus)—`[0,5]`
Failed to parse formatted text: Document 3: (happy,hippopotamus)—`[2,5]`
Failed to parse formatted text: In practice, only two-dimensional vectors (queries with two terms) can be plotted easily on a graph. Fortunately, linear algebra—the branch of mathematics that deals with vectors—provides tools to compare the angle between multidimensional vectors, which means that we can apply the same principles explained above to queries that consist of many terms.
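The match-query rewrite in the `["brown","dog"]` message above can be spelled out by hand. A rough sketch of the equivalent bool query, assuming a hypothetical title field:

    GET /_search
    {
        "query": {
            "bool": {
                "should": [
                    { "term": { "title": "brown" } },
                    { "term": { "title": "dog"   } }
                ]
            }
        }
    }

Each term clause that matches contributes to the document's score, so documents containing both terms rank higher than documents containing only one.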
Failed to parse formatted text: Decay functions—`linear`, `exp`, `gauss`
Failed to parse formatted text: The three decay functions—called linear, exp, and gauss—operate on numeric fields, date fields, or lat/lon geo-points. All three take the same parameters:
Failed to parse formatted text: The curves shown in Decay function curves all have their origin—the central point—set to 40. The offset is 5, meaning that all values in the range 40 - 5 ≤ value ≤ 40 + 5 are treated as though they were at the origin—they all get the full score of 1.0.
Failed to parse formatted text: Full-text search is a battle between precision—returning as few irrelevant documents as possible—and recall—returning as many relevant documents as possible. While matching only the exact words that the user has queried would be precise, it is not enough. We would miss out on many documents that the user would consider to be relevant. Instead, we need to spread the net wider, to also search for words that are not exactly the same as the original but are related.
Failed to parse formatted text: Remove the distinction between singular and plural—`fox` versus foxes—or between tenses—`jumping` versus jumped versus jumps—by stemming each word to its root form. See Reducing Words to Their Root Form.
Failed to parse formatted text: Check for misspellings or alternate spellings, or match on homophones—words that sound the same, like their versus there, meat versus meet versus mete. See Typoes and Mispelings.
Failed to parse formatted text: A single predominant language per document requires a relatively simple setup. Documents from different languages can be stored in separate indices—`blogs-en`, `blogs-fr`, and so forth—that use the same type and the same fields for each index, just with different analyzers:
Failed to parse formatted text: Another difference between the standard tokenizer and the icu_tokenizer is that the latter will break a word containing characters written in different scripts (for example, βeta) into separate tokens—`β`, `eta`—while the former will emit the word as a single token: βeta.
Failed to parse formatted text: Character filters can be added to an analyzer to preprocess the text before it is passed to the tokenizer. In this case, we can use the html_strip character filter to remove HTML tags and to decode HTML entities such as &eacute; into the corresponding Unicode characters.
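The html_strip character filter from the last message above slots into a custom analyzer like this; a minimal sketch, with the index and analyzer names as assumptions:

    PUT /my_index
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_html_analyzer": {
                        "type":        "custom",
                        "char_filter": [ "html_strip" ],
                        "tokenizer":   "standard",
                        "filter":      [ "lowercase" ]
                    }
                }
            }
        }
    }

The character filter strips tags and decodes entities before the tokenizer ever sees the text.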
Failed to parse formatted text: Apostrophe (')—the original ASCII character
Failed to parse formatted text: Left single-quotation mark (‘)—opening quote when single-quoting
Failed to parse formatted text: Right single-quotation mark (’)—closing quote when single-quoting, but also the preferred character to use as an apostrophe
Failed to parse formatted text: Single high-reversed-9 quotation mark (‛)—same as U+2018 but differs in appearance
Failed to parse formatted text: Left single-quotation mark in ISO-8859-1—should not be used in Unicode
Failed to parse formatted text: Right single-quotation mark in ISO-8859-1—should not be used in Unicode
Failed to parse formatted text: Even when using the "acceptable" quotation marks, a word written with a single right quotation mark—You’re—is not the same as the word written with an apostrophe—You're—which means that a query for one variant will not find the other.
Failed to parse formatted text: English uses diacritics (like ´, ^, and ¨) only for imported words—like rôle, déjà, and däis—but usually they are optional. Other languages require diacritics in order to be correct. Of course, just because words are spelled correctly in your index doesn’t mean that the user will search for the correct spelling.
Failed to parse formatted text: For instance, what’s the difference between é and é? It depends on who you ask. According to Elasticsearch, the first one consists of the two bytes 0xC3 0xA9, and the second one consists of three bytes, 0x65 0xCC 0x81.
Failed to parse formatted text: The composed forms—`nfc` and `nfkc`—represent characters in the fewest bytes possible. So é is represented as the single letter é. The decomposed forms—`nfd` and `nfkd`—represent characters by their constituent parts, that is e + ´.
Failed to parse formatted text: The canonical forms—`nfc` and `nfd`—represent ligatures like ﬃ or œ as a single character, while the compatibility forms—`nfkc` and `nfkd`—break down these composed characters into a simpler multiletter equivalent: f + f + i or o + e.
Failed to parse formatted text: Lemmatization, like stemming, tries to group related words, but it goes one step further than stemming in that it tries to group words by their word sense, or meaning. The same word may represent two meanings—for example, wake can mean to wake up or a funeral. While lemmatization would try to distinguish these two word senses, stemming would incorrectly conflate them.
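The nfc/nfkc/nfd/nfkd forms discussed above can be applied at index time with the icu_normalizer token filter. A sketch that requires the ICU analysis plugin; the filter and analyzer names are assumptions:

    PUT /my_index
    {
        "settings": {
            "analysis": {
                "filter": {
                    "nfkc_normalizer": {
                        "type": "icu_normalizer",
                        "name": "nfkc"
                    }
                },
                "analyzer": {
                    "my_normalizer": {
                        "tokenizer": "icu_tokenizer",
                        "filter":    [ "nfkc_normalizer" ]
                    }
                }
            }
        }
    }

Normalizing both the indexed text and the query terms to the same form means that é stored as a single code point and é stored as e plus a combining accent will match each other.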
Failed to parse formatted text: First we will discuss the two classes of stemmers available in Elasticsearch—Algorithmic Stemmers and Dictionary Stemmers—and then look at how to choose the right stemmer for your needs in Choosing a Stemmer. Finally, we will discuss options for tailoring stemming in Controlling Stemming and Stemming in situ.
Failed to parse formatted text: Recognize the distinction between words that are similar but have different word senses—for example, organ and organization
Failed to parse formatted text: A Hunspell dictionary consists of two files with the same base name—such as en_US—but with one of two extensions:
Failed to parse formatted text: The problem is that the quick brown fox is really a query for the OR quick OR brown OR fox—any document that contains nothing more than the almost meaningless term the is included in the result set. What we need is a way of reducing the number of documents that need to be scored.
Failed to parse formatted text: The must clause means that at least one of the low-frequency terms—`quick` or dead—must be present for a document to be considered a match. All other documents are excluded. The should clause then looks for the high-frequency terms and and the, but only in the documents collected by the must clause. The sole job of the should clause is to score a document like "Quick and the dead" higher than "The quick but dead". This approach greatly reduces the number of documents that need to be examined and scored.
Failed to parse formatted text: An or query for high-frequency terms only—"To be, or not to be"—is the worst case for performance. It is pointless to score all the documents that contain only one of these terms in order to return just the top 10 matches. We are really interested only in documents in which the terms all occur together, so in the case where there are no low-frequency terms, the query is rewritten to make all high-frequency terms required:
Failed to parse formatted text: All terms are output as unigrams—`the`, `quick`, and so forth—but if a word is a common word or is followed by a common word, then it also outputs a bigram in the same position as the unigram—`the_quick`, `quick_and`, `and_brown`.
Failed to parse formatted text: Clearly, bieber is a long way from beaver—they are too far apart to be considered a simple misspelling. Damerau observed that 80% of human misspellings have an edit distance of 1. In other words, 80% of misspellings could be corrected with a single edit to the original string.
Failed to parse formatted text: The fuzzy query works by taking the original term and building a Levenshtein automaton—like a big graph representing all the strings that are within the specified edit distance of the original string.
Failed to parse formatted text: Multivalue buckets—the terms, histogram, and date_histogram—dynamically produce many buckets. How does Elasticsearch decide the order that these buckets are presented to the user?
Failed to parse formatted text: precision_threshold accepts a number from 0–40,000. Larger values are treated as equivalent to 40,000.
Failed to parse formatted text: On the opposite end of the spectrum, you have tiny merchants such as the corner drug store. These are commonly uncommon—only one or two customers have transactions from the merchant. We can rule these out as well. Since none of the compromised cards interacted with the merchant, we can be sure it was not to blame for the security breach.
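The precision_threshold setting mentioned above belongs to the cardinality aggregation. A minimal sketch; the merchant_id field echoes the merchant example but is an assumption, not a field from the book:

    GET /_search
    {
        "size": 0,
        "aggs": {
            "distinct_merchants": {
                "cardinality": {
                    "field":               "merchant_id",
                    "precision_threshold": 100
                }
            }
        }
    }

Counts below the threshold are expected to be close to exact; above it, the underlying HyperLogLog++ estimate trades accuracy for fixed memory.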
Failed to parse formatted text: By default, this setting is unbounded—Elasticsearch will never evict data from fielddata.
Failed to parse formatted text: Doc values are now only about 10–25% slower than in-memory fielddata, and come with two major advantages:
Failed to parse formatted text: But there is a problem. Remember that fielddata caches are per segment. If one segment contains only two statuses—`status_deleted` and `status_published`—then the resulting ordinals (0 and 1) will not be the same as the ordinals for a segment that contains all three statuses.
Failed to parse formatted text: Everybody gets caught at least once: string geo-points are "latitude,longitude", while array geo-points are [longitude,latitude]—the opposite order!
Failed to parse formatted text: In other words, the longer the geohash string, the more accurate it is. If two geohashes share a prefix— and gcpuuz—then it implies that they are near each other. The longer the shared prefix, the closer they are.
Failed to parse formatted text: Geo-points can index their associated geohashes automatically, but more important, they can also index all geohash prefixes. Indexing the location of the entrance to Buckingham Palace—latitude 51.501568 and longitude -0.141257—would index all of the geohashes listed in the following table, along with the approximate dimensions of each geohash cell:
Failed to parse formatted text: This filter translates the lat/lon point into a geohash of the appropriate length—in this example dr5rsk—and looks for all locations that contain that exact term.
Failed to parse formatted text: The aggregation is sparse—it returns only cells that contain documents. If your geohashes are too precise and too many buckets are generated, it will return, by default, the 10,000 most populous cells—those containing the most documents. However, it still needs to generate all the buckets in order to figure out which are the most populous 10,000. You need to control the number of buckets generated by doing the following:
Failed to parse formatted text: You can specify precisions by using distances—for example, 50m or 2km—but ultimately these distances are converted to the same levels as described in Geohashes.
Failed to parse formatted text: Shapes are represented using GeoJSON, a simple open standard for encoding two-dimensional shapes in JSON. Each shape definition contains the type of shape—`point`, line, polygon, envelope—and one or more arrays of longitude/latitude points.
Failed to parse formatted text: A nested filter behaves much like a nested query, except that it doesn’t accept the score_mode parameter. It can be used only in filter context—such as inside a filtered query—and it behaves like any other filter: it includes or excludes, but it doesn’t score.
Failed to parse formatted text: The has_child filter works in the same way as the has_child query, except that it doesn’t support the score_mode parameter. It can be used only in filter context—such as inside a filtered query—and behaves like any other filter: it includes or excludes, but doesn’t score.
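The has_child filter described in the last message above, used in filter context. A sketch assuming the book's branch/employee parent-child mapping; the field and value are illustrative:

    GET /company/branch/_search
    {
        "query": {
            "filtered": {
                "filter": {
                    "has_child": {
                        "type":   "employee",
                        "filter": { "term": { "hobby": "hiking" } }
                    }
                }
            }
        }
    }

Because it runs as a filter, it simply includes branches that have at least one matching employee; no score_mode applies.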
Failed to parse formatted text: The has_parent filter works in the same way as the has_parent query, except that it doesn’t support the score_mode parameter. It can be used only in filter context—such as inside a filtered query—and behaves like any other filter: it includes or excludes, but doesn’t score.
Failed to parse formatted text: The shard routing of the employee document would be decided by the parent ID—`london`—but the london document was routed to a shard by its own parent ID—`uk`. It is very likely that the grandchild would end up on a different shard from its parent and grandparent, which would prevent the same-shard parent-child mapping from functioning.
Failed to parse formatted text: Users often ask why Elasticsearch doesn’t support shard-splitting—the ability to split each shard into two or more pieces. The reason is that shard-splitting is a bad idea:
Failed to parse formatted text: One of the most common use cases for Elasticsearch is for logging, so common in fact that Elasticsearch provides an integrated logging platform called the ELK stack—Elasticsearch, Logstash, and Kibana—to make the process easy.
Failed to parse formatted text: A much better approach is to use nested objects, with one field for the parameter name—`referer`—and another field for its associated value—`count`:
Failed to parse formatted text: If you create an index, Elasticsearch must broadcast the change in cluster state to all nodes. Those nodes must initialize those new shards, and then respond to the master that the shards are Started. This process is fast, but because of network latency it may take 10–20ms.
Failed to parse formatted text: segments will tell you the number of Lucene segments this node currently serves. This can be an important number. Most indices should have around 50–150 segments, even if they are terabytes in size with billions of documents. Large numbers of segments can indicate a problem with merging (for example, merging is not keeping up with segment creation). Note that this statistic is the aggregate total of all indices on the node, so keep that in mind.
Failed to parse formatted text: The space where newly instantiated objects are allocated. The young generation space is often quite small, usually 100 MB–500 MB. The young-gen also contains two survivor spaces.
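The nested-objects approach to log parameters mentioned above would need a mapping along these lines. A sketch only; the logs index, requests type, and field names are assumptions:

    PUT /logs/_mapping/requests
    {
        "properties": {
            "params": {
                "type": "nested",
                "properties": {
                    "name":  { "type": "string", "index": "not_analyzed" },
                    "value": { "type": "string", "index": "not_analyzed" }
                }
            }
        }
    }

Each parameter becomes a name/value pair inside its own nested document, so new parameter names no longer add new fields to the mapping.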
Failed to parse formatted text: If the heap usage is consistently >=85%, you are in trouble. Heaps over 90–95% are at risk of horrible performance with long 10–30s GCs at best, and out-of-memory (OOM) exceptions at worst.
Failed to parse formatted text: Pause the import thread for 3–5 seconds.
Failed to parse formatted text: Similar to the NAS argument, everyone claims that their pipe between data centers is robust and low latency. This is true—until it isn’t (a network failure will happen eventually; you can count on it). From our experience, the hassle of managing cross–data center clusters is simply not worth the cost.
Failed to parse formatted text: Once you cross that magical 30.5 GB boundary, the pointers switch back to ordinary object pointers. The size of each pointer grows, more CPU-memory bandwidth is used, and you effectively lose memory. In fact, it takes until around 40–50 GB of allocated heap before you have the same effective memory of a 30.5 GB heap using compressed oops.
Failed to parse formatted text: The 30.5 GB line is fairly important. So what do you do when your machine has a lot of memory? It is becoming increasingly common to see super-servers with 512–768 GB of RAM.
Failed to parse formatted text: This should be fairly obvious, but use bulk indexing requests for optimal performance. Bulk sizing is dependent on your data, analysis, and cluster configuration, but a good starting point is 5–15 MB per bulk. Note that this is physical size. Document count is not a good metric for bulk size. For example, if you are indexing 1,000 documents per bulk, keep the following in mind:
Failed to parse formatted text: Start with a bulk size around 5–15 MB and slowly increase it until you do not see performance gains anymore. Then start increasing the concurrency of your bulk ingestion (multiple threads, and so forth).
Failed to parse formatted text: The default is 20 MB/s, which is a good setting for spinning disks. If you have SSDs, you might consider increasing this to 100–200 MB/s. Test to see what works for your system:
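The 20 MB/s default in the last message is the merge store throttle, which can be raised through the cluster settings API. A sketch; a transient setting would work equally well:

    PUT /_cluster/settings
    {
        "persistent": {
            "indices.store.throttle.max_bytes_per_sec": "100mb"
        }
    }

On SSDs, start around 100 MB/s and adjust based on observed merge behavior.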