- Deterministic shuffling: They can be used to randomly reorder elements in a predictable way.
- Similarity detection: Some hash functions, like Simhash, produce similar hash values for similar inputs. This property is useful for identifying near-duplicate items.
- Data integrity: Hash functions can generate unique identifiers for data, helping to verify its integrity.
- Efficient lookups: Hash values can be used as keys in hash tables for fast data retrieval.
halfMD5
Calculates a 64-bit hash value using a modified MD5 algorithm. The first 8 bytes of the resulting hash are interpreted as a UInt64 in big-endian byte order.
Arguments:
- arg1, ... (any data type): A variable number of arguments of any data type.
Returns:
- A UInt64 hash value.
This function is relatively slow (processing about 5 million short strings per second per processor core). Consider using the sipHash64 function instead for better performance.
Example: halfMD5 calculates a hash value for the combination of ‘taco’, ‘tuesday’, and the number 3.
For some data types, the calculated hash value may be the same for different input values if their string representations are identical (e.g., integers of different sizes, named and unnamed Tuples with the same data).
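As a rough illustration of reading part of an MD5 digest as a UInt64 in big-endian byte order, the Python sketch below hashes a single string and interprets the first 8 digest bytes as an unsigned 64-bit integer. The helper name `half_md5_single` is hypothetical, and the way multiple arguments are combined is not modeled, so the result is not guaranteed to match the database function.

```python
import hashlib


def half_md5_single(text: str) -> int:
    """Hypothetical helper: first 8 bytes of MD5(text) as a big-endian unsigned 64-bit integer."""
    digest = hashlib.md5(text.encode("utf-8")).digest()  # 16 raw bytes
    return int.from_bytes(digest[:8], byteorder="big")


value = half_md5_single("taco")
print(value)  # always fits in the UInt64 range
```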
MD4
Calculates the MD4 hash of a string and returns the result as a FixedString(16).
Arguments:
- string (String): The input string to hash.
Returns:
- A FixedString(16) containing the MD4 hash of the input.
MD4 is considered cryptographically weak. For secure hashing, consider using more robust algorithms like SHA-256.
MD5
Calculates the MD5 hash of a string and returns the result as a FixedString(16).
Arguments:
- string (String): The input string to hash.
Returns:
- A FixedString(16) containing the MD5 hash value.
If you need a 128-bit cryptographic hash function but don’t specifically require MD5, consider using the sipHash128 function instead. It’s generally faster and provides better hash distribution.
To get the same result as the md5sum utility, use lower(hex(MD5(s))).
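The relationship between the raw 16-byte digest and the lowercase hex string printed by md5sum can be checked outside the database. The sketch below uses Python's standard hashlib and only illustrates the two representations; it does not call the database function.

```python
import hashlib

raw = hashlib.md5(b"taco").digest()  # 16 raw bytes, comparable to a FixedString(16)
as_hex = raw.hex()                   # 32 lowercase hex characters, the format md5sum prints

print(len(raw), as_hex)
```

Note that md5sum hashes the exact bytes it is given, so a trailing newline in the input changes the digest.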
sipHash64
Produces a 64-bit SipHash hash value.
Arguments:
- A variable number of parameters of any of the supported data types.
Returns:
- A UInt64 data type hash value.
The hash values of the individual input parameters are combined as follows:
- The first and second hash values are concatenated to an array which is hashed.
- The previously calculated hash value and the hash of the third input parameter are hashed in a similar way.
- This calculation is repeated for all remaining hash values of the original input.
The calculated hash value may be the same for the same input values of different argument types, for example:
- Integers of different sizes
- Named and unnamed Tuples with the same data
- Map and the corresponding Array(Tuple(key, value)) type with the same data
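The combination steps listed above amount to a left fold over the per-argument hashes. The sketch below shows that shape in Python. It is an illustration only: `h64` is a stand-in built on BLAKE2b (Python's standard library has no public SipHash), and the names `combine` and `hash_many` are hypothetical.

```python
import hashlib
from functools import reduce


def h64(data: bytes) -> int:
    """Stand-in 64-bit hash (BLAKE2b truncated to 8 bytes); NOT SipHash."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def combine(left: int, right: int) -> int:
    """Hash the two-element array [left, right] into a single 64-bit value."""
    return h64(left.to_bytes(8, "little") + right.to_bytes(8, "little"))


def hash_many(*args: str) -> int:
    """Hash each argument, then fold the hashes pairwise from left to right."""
    hashes = [h64(arg.encode("utf-8")) for arg in args]
    return reduce(combine, hashes)


print(hash_many("carne", "asada", "taco"))
```

Because the fold is ordered, swapping two arguments generally changes the result.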
If you need a decent cryptographic 128-bit hash instead, consider using the sipHash128 function.
sipHash64Keyed
Calculates a 64-bit SipHash hash value using a specified key.
Arguments:
- (k0, k1) (Tuple(UInt64, UInt64)): A tuple of two UInt64 values representing the key.
- par1, ... (any): A variable number of parameters of any supported data type.
Returns:
- A UInt64 data type hash value.
For some data types, the calculated hash value may be the same for different argument types (e.g., integers of different sizes, named and unnamed Tuples with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
sipHash128
Produces a 128-bit SipHash hash value.
Arguments:
- par1, ... (any of the supported data types): A variable number of parameters.
Returns:
- A 128-bit SipHash hash value of type FixedString(16).
This 128-bit variant differs from the reference implementation and is weaker. It exists because, when it was written, there was no official 128-bit extension for SipHash. New projects should probably use sipHash128Reference instead.
sipHash128Keyed
Calculates a 128-bit SipHash hash value using an explicit key.
Arguments:
- (k0, k1) (tuple): A tuple of two UInt64 values representing the key.
- par1, ... (any): A variable number of parameters of any supported data type.
Returns:
- A 128-bit SipHash hash value of type FixedString(16).
This 128-bit variant differs from the reference implementation and is weaker. New projects should consider using sipHash128ReferenceKeyed instead.
sipHash128Reference
Produces a 128-bit SipHash hash value using the reference implementation from the original authors of SipHash.
Arguments:
- par1, ...: A variable number of parameters of any of the supported data types.
Returns:
- A 128-bit SipHash hash value as a FixedString(16).
Example:
- The sipHash128Reference function calculates the hash of three taco-related strings.
- The hex function is used to represent the result as a hex-encoded string for readability.
sipHash128ReferenceKeyed
Calculates a 128-bit SipHash hash value using an explicit key.
Arguments:
- (k0, k1) (Tuple(UInt64, UInt64)): A tuple of two UInt64 values representing the key.
- par1, ... (any data type): A variable number of parameters of any data type.
Returns:
- A 128-bit SipHash hash value of type FixedString(16).
This function provides better security compared to sipHash128Keyed, which uses a non-standard 128-bit extension.
cityHash64
Produces a 64-bit CityHash hash value.
Arguments:
- A variable number of input parameters of any of the supported data types.
Returns:
- A UInt64 data type hash value.
Notes:
- This is a fast non-cryptographic hash function. It uses the CityHash algorithm for string parameters and implementation-specific fast non-cryptographic hash functions for parameters with other data types.
- The function uses the CityHash combiner to get the final results.
- ClickHouse’s cityHash64 corresponds to CityHash v1.0.2.
- For some data types, the calculated value of the hash function may be the same for the same values even if types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
Example: cityHash64 can be used to compute a checksum of an entire table with accuracy up to the row order. Such a checksum over the taco_orders table can be useful for quickly comparing table contents or detecting changes.
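A checksum that is accurate "up to the row order" can be obtained by hashing every row and combining the row hashes with an order-independent operation such as XOR (in SQL, an aggregate like groupBitXor plays that role). The Python sketch below illustrates the idea under stated assumptions: `row_hash` is a BLAKE2b stand-in rather than CityHash, and the sample rows are invented.

```python
import hashlib
from functools import reduce
from operator import xor


def row_hash(row: tuple) -> int:
    """Stand-in 64-bit row hash (BLAKE2b, 8 bytes); the document's function is cityHash64."""
    encoded = repr(row).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(encoded, digest_size=8).digest(), "big")


def table_checksum(rows) -> int:
    """XOR of all row hashes; XOR is commutative, so row order does not affect the result."""
    return reduce(xor, (row_hash(row) for row in rows), 0)


# Invented sample data standing in for a taco_orders table.
orders = [(1, "carnitas", 2), (2, "al pastor", 1), (3, "barbacoa", 4)]
print(table_checksum(orders) == table_checksum(list(reversed(orders))))
```

One design caveat of XOR-combining: two identical rows cancel each other out, so exact duplicates are invisible to this kind of checksum.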
intHash32
Calculates a 32-bit hash code from any type of integer. This is a relatively fast non-cryptographic hash function of average quality for numbers.
Arguments:
- int (UInt8, UInt16, UInt32, UInt64, Int8, Int16, Int32, Int64): Integer to hash.
Returns:
- A 32-bit hash code (UInt32).
The intHash32 function is useful when you need a quick hash of integer values, such as for distributing data across shards or creating probabilistic data structures. However, it’s not suitable for cryptographic purposes due to its simplicity and speed.
intHash64
Calculates a 64-bit hash code from any type of integer. This is a relatively fast non-cryptographic hash function of average quality for numbers.
Arguments:
- int (UInt8, UInt16, UInt32, UInt64, Int8, Int16, Int32, Int64): Integer to hash.
Returns:
- A 64-bit hash code (UInt64).
The function produces a deterministic result, meaning the same input will always produce the same hash value. This makes it suitable for scenarios where reproducibility is important, such as generating consistent IDs for taco ingredients across different database instances.
SHA1
Calculates the SHA-1 hash of a string and returns the result as a FixedString(20).
Arguments:
- string (String): The input string to hash.
Returns:
- A FixedString(20) containing the SHA-1 hash value.
Use the hex function to display the result as a hexadecimal string.
The SHA-1 function works relatively slowly (processing about 5 million short strings per second per processor core). Consider using faster hash functions like sipHash64 if cryptographic properties are not required.
For security-critical applications, it’s recommended to use stronger hash functions like SHA-256 or SHA-512, as SHA-1 is no longer considered cryptographically secure against well-funded opponents.
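The FixedString widths quoted for the SHA functions correspond to digest sizes in bytes, and the hex form is twice as long. This can be confirmed with Python's standard hashlib; the sketch below only inspects the algorithms themselves and does not call the database.

```python
import hashlib

digest = hashlib.sha1(b"").digest()
print(len(digest), digest.hex())  # 20 raw bytes and their 40-character hex form

# Digest sizes in bytes for the SHA variants available in hashlib.
sizes = {name: hashlib.new(name).digest_size for name in ("sha1", "sha224", "sha256", "sha512")}
print(sizes)
```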
SHA224
Calculates the SHA-224 hash of a string and returns the result as a FixedString(28).
Arguments:
- string (String): The input string to hash.
Returns:
- A FixedString(28) containing the SHA-224 hash of the input string.
The SHA-224 function is a cryptographic hash function that produces a 224-bit (28-byte) hash value. It’s part of the SHA-2 family of hash functions and is considered cryptographically strong. However, for most applications, SHA-256 is more commonly used. Use SHA-224 only if you specifically need a 224-bit hash or if you’re working with a system that requires it.
SHA256
Calculates the SHA-256 hash of a string and returns the result as a FixedString(32).
Arguments:
- string (String): The input string to hash.
Returns:
- A FixedString(32) containing the SHA-256 hash value.
The SHA-256 function is a cryptographic hash function that generates a 256-bit (32-byte) hash value. It’s commonly used for integrity verification of data and in various security applications. However, for password hashing, it’s recommended to use specialized password hashing functions with salt and key stretching.
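To make "salt and key stretching" concrete, here is a minimal Python sketch using PBKDF2-HMAC-SHA256 from the standard library. The function names and the iteration count are illustrative assumptions, not a recommendation from this reference; a real deployment should follow current guidance for its chosen algorithm.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple


def hash_password(password: str, salt: Optional[bytes] = None,
                  iterations: int = 600_000) -> Tuple[bytes, bytes]:
    """Derive a 32-byte key from a password with a random salt (iteration count is an assumption)."""
    salt = os.urandom(16) if salt is None else salt
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, key


def verify_password(password: str, salt: bytes, expected_key: bytes,
                    iterations: int = 600_000) -> bool:
    """Recompute the key with the stored salt and compare in constant time."""
    _, key = hash_password(password, salt, iterations)
    return hmac.compare_digest(key, expected_key)
```

The salt makes identical passwords hash differently, and the iteration count makes each guess expensive, which a single SHA-256 call does not provide.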
SHA512
Calculates the SHA-512 hash from a string and returns the resulting set of bytes as a FixedString(64).
Arguments:
- string (String): The input string to hash.
Returns:
- A FixedString(64) containing the SHA-512 hash value.
The SHA-512 function works relatively slowly (about 5 million short strings per second per processor core). We recommend using this function only when necessary, such as for cryptographic purposes. For faster non-cryptographic hashing, consider using functions like sipHash64.
SHA512_256
Calculates the SHA-512/256 hash of a string and returns the result as a FixedString(32).
Arguments:
- string (String): The input string to hash.
Returns:
- A FixedString(32) containing the SHA-512/256 hash of the input string.
SHA-512/256 is a variant of SHA-512 that uses different initial values and truncates the output to 256 bits (32 bytes). It’s designed to provide the security of SHA-512 with a more compact output size.
BLAKE3
Calculates the BLAKE3 hash of a string and returns the result as a FixedString(32).
Arguments:
- s (String): Input string for BLAKE3 hash calculation.
Returns:
- BLAKE3 hash as a byte array (FixedString(32)).
Use the hex function to represent the result as a hex-encoded string.
URLHash
Calculates a hash value from a URL string using normalization.
Arguments:
- url (String): The URL string to hash.
- N (UInt8, optional): The number of URL hierarchy levels to include in the hash calculation.
Returns:
- A hash value of type UInt64.
URLHash(s) calculates a hash from the string after removing one trailing /, ?, or # symbol, if present.
URLHash(s, N) calculates a hash from the string up to the N-th level in the URL hierarchy, after removing one trailing /, ?, or # symbol, if present.
The levels in the URL hierarchy are the same as those used in the URLHierarchy function.
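The single-argument normalization described above (drop at most one trailing /, ? or #) can be sketched in a few lines of Python. The helper names are hypothetical, the hierarchy-level variant is not modeled, and the hash is a BLAKE2b stand-in, so the numeric values will not match the database function.

```python
import hashlib


def strip_one_trailing_symbol(url: str) -> str:
    """Remove a single trailing '/', '?' or '#', if present."""
    return url[:-1] if url.endswith(("/", "?", "#")) else url


def url_hash_sketch(url: str) -> int:
    """Stand-in 64-bit hash over the normalized URL (not the database's algorithm)."""
    normalized = strip_one_trailing_symbol(url).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(normalized, digest_size=8).digest(), "big")


print(url_hash_sketch("https://example.com/tacos/") == url_hash_sketch("https://example.com/tacos"))
```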
farmFingerprint64
Produces a 64-bit FarmHash Fingerprint value.
Arguments:
- par1, ... (any): A variable number of parameters that can be any of the supported data types.
Returns:
- A UInt64 data type hash value.
The farmFingerprint64 function is preferred for a stable and portable hash value. For some data types, the calculated value of the hash function may be the same for the same values even if types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
farmHash64
Produces a 64-bit FarmHash value.
Arguments:
- par1, ...: A variable number of parameters of any of the supported data types.
Returns:
- A UInt64 data type hash value.
For some data types, the calculated hash value may be the same for the same values even if the types of arguments differ (e.g., integers of different sizes, named and unnamed Tuples with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
javaHash
Calculates JavaHash from a string or numeric value.
Arguments:
- value (String, Byte, Short, Integer, or Long): The input value.
Returns:
- A hash value of type Int32.
Note that Java only supports calculating signed integer hashes. If you want to calculate unsigned integer hashes, you must cast the input to the proper signed ClickHouse type.
If you don’t specifically need Java compatibility, consider using cityHash64 or xxHash64.
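For reference, Java's own String.hashCode is the polynomial h = 31 * h + c over UTF-16 code units with signed 32-bit wraparound. The Python sketch below reproduces that Java definition; whether the database function agrees with it for non-ASCII input is not something this sketch asserts.

```python
def java_string_hash(text: str) -> int:
    """Java's String.hashCode: h = 31 * h + code_unit, wrapped to a signed 32-bit integer."""
    data = text.encode("utf-16-be")  # two bytes per UTF-16 code unit
    h = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        h = (31 * h + unit) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as signed, as Java's int does.
    return h - 0x100000000 if h >= 0x80000000 else h


print(java_string_hash("hello"))  # 99162322
```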
javaHashUTF16LE
Calculates JavaHash from a string, assuming it contains bytes representing a string in UTF-16LE encoding.
Arguments:
- stringUtf16le (String): A string in UTF-16LE encoding.
Returns:
- A hash value (Int32).
The javaHashUTF16LE function is designed to work with UTF-16LE encoded strings. Make sure your input is properly encoded to get correct results.
hiveHash
Calculates HiveHash from a string.
Arguments:
- string (String): Input string.
Returns:
- HiveHash value (Int32).
This hash function is neither fast nor of particularly good quality. Its primary use case is when you need to calculate exactly the same result as used in another system that implements HiveHash.
If you don’t specifically need HiveHash compatibility, consider using other hash functions like cityHash64 or xxHash64 for better performance and hash quality.
metroHash64
Produces a 64-bit MetroHash hash value.
Arguments:
- par1, ... (any): A variable number of parameters that can be any of the supported data types.
Returns:
- A UInt64 data type hash value.
For some data types, the calculated value of the hash function may be the same for the same values even if the types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
jumpConsistentHash
Calculates a JumpConsistentHash value for a given key and number of buckets.
Arguments:
- key (UInt64): The input key to hash.
- buckets (Integer): The number of buckets to distribute the hash values across.
Returns:
- An Int32 value representing the bucket number for the given key.
Example: the key 42 is hashed and assigned to bucket 6 out of 10 possible buckets.
JumpConsistentHash is an efficient algorithm for distributing items across a fixed number of buckets. It’s particularly useful for distributed systems and can be used for tasks like sharding data or distributing load across servers.
The function has the following properties:
- It’s fast and requires minimal memory.
- It’s consistent, meaning the same key will always map to the same bucket as long as the number of buckets doesn’t change.
- When the number of buckets increases, it minimizes the number of keys that need to be remapped.
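The properties listed above follow from how the jump consistent hash algorithm published by Lamping and Veach works: each key follows a fixed pseudo-random sequence of increasing bucket "jumps", and the result is the last jump below the bucket count. The Python sketch below is a port of that published algorithm; it is not taken from the database's source, so exact outputs may differ from the function documented here.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Port of the published jump consistent hash; returns a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step drives the pseudo-random jumps.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


# Growing from 10 to 11 buckets: a key either stays put or moves to the new bucket.
moved = sum(jump_consistent_hash(k, 10) != jump_consistent_hash(k, 11) for k in range(10_000))
print(moved)
```

Only a fraction of keys (roughly one in eleven here) is expected to move when a bucket is added, which is the minimal-remapping property.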
kostikConsistentHash
An O(1) time and space consistent hash algorithm by Konstantin ‘kostik’ Oblakov. Previously known as yandexConsistentHash.
Alias:
- yandexConsistentHash (left for backwards compatibility)
Arguments:
- input (UInt64): A UInt64-type key.
- n (UInt16): Number of buckets.
Returns:
- A UInt16 data type hash value.
The algorithm is only efficient for n <= 32768.
ripeMD160
Calculates the RIPEMD-160 hash of a string.
Arguments:
- input (String): The input string to hash.
Returns:
- A UInt256 value containing the 160-bit RIPEMD-160 hash in the first 20 bytes. The remaining 12 bytes are zero-padded.
The hex() function can be used to convert the binary hash to a readable format.
RIPEMD-160 is a cryptographic hash function designed as a replacement for MD4 and MD5. It produces a 160-bit (20-byte) hash value, which is then zero-padded to fit into a UInt256 in ClickHouse.
murmurHash2_32
Calculates a 32-bit MurmurHash2 hash value.
Arguments:
- expr1, expr2, … (any data type): A variable number of expressions of any data type.
Returns:
- A hash value of type UInt32.
The MurmurHash2 algorithm is designed for speed and simplicity rather than cryptographic security. For cryptographic purposes, consider using functions like SHA256 instead.
murmurHash2_64
Produces a 64-bit MurmurHash2 hash value.
Arguments:
- par1, ... (any of the supported data types): A variable number of parameters.
Returns:
- A UInt64 data type hash value.
For some data types, the calculated value of the hash function may be the same for the same values even if the types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
gccMurmurHash
Calculates a 64-bit MurmurHash2 hash value using the same hash seed as gcc. It is portable between Clang and GCC builds.
Arguments:
- par1, ...: A variable number of parameters that can be any of the supported data types.
Returns:
- A UInt64 hash value.
Example:
- res1 calculates the hash of a combination of taco-related strings and a number.
- res2 demonstrates hashing a complex nested structure containing taco ingredients and numbers.
kafkaMurmurHash
Calculates a 32-bit MurmurHash2 hash value using the same hash seed as Kafka and without the highest bit, to be compatible with the Default Partitioner.
Arguments:
- par1, ...: A variable number of parameters that can be any of the supported data types.
Returns:
- A calculated hash value of type UInt32.
Example:
- res1 calculates the hash for a single string ‘carne asada’.
- res2 demonstrates hashing multiple arguments of different types, including an array, a string, a number, and a datetime.
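"Without the highest bit" refers to clearing the sign bit of the 32-bit hash, which is what Kafka's default partitioner does before taking the remainder by the partition count. The Python sketch below shows only that masking step; the MurmurHash2 computation itself is not reproduced, and the input value is an arbitrary example.

```python
def to_positive(hash32: int) -> int:
    """Clear the highest bit of a 32-bit hash so the result is non-negative as a signed int."""
    return hash32 & 0x7FFFFFFF


def choose_partition(hash32: int, num_partitions: int) -> int:
    """Map a masked hash onto one of num_partitions partitions."""
    return to_positive(hash32) % num_partitions


print(choose_partition(0xDEADBEEF, 12))
```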
murmurHash3_32
Calculates a 32-bit MurmurHash3 hash value.
Arguments:
- expr1, expr2, … (any data type): A variable number of expressions of any data type.
Returns:
- A 32-bit hash value of type UInt32.
Notes:
- This function calculates a MurmurHash3 hash for the provided arguments.
- It can accept multiple arguments of various data types.
- For some data types, the calculated hash value may be the same for identical values even if the argument types differ (e.g., integers of different sizes, named and unnamed Tuples with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
The MurmurHash3 algorithm is designed to be fast and have good distribution properties, making it suitable for hash-based lookups and data structures. However, it is not cryptographically secure and should not be used for security-sensitive applications.
murmurHash3_64
Produces a 64-bit MurmurHash3 hash value.
Arguments:
- par1, ... (any of the supported data types): A variable number of parameters.
Returns:
- A UInt64 data type hash value.
For some data types, the calculated value of the hash function may be the same for the same values even if the types of arguments differ (e.g., integers of different sizes, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
murmurHash3_128
Produces a 128-bit MurmurHash3 hash value.
Arguments:
- expr (String): A list of expressions of any data type.
Returns:
- A 128-bit MurmurHash3 hash value of type FixedString(16).
The MurmurHash3_128 function is particularly useful when you need a high-quality 128-bit hash value for large amounts of data. It provides a good balance between speed and hash quality, making it suitable for various applications such as data fingerprinting, caching, and bloom filters in distributed systems.
xxh3
Produces a 64-bit xxh3 hash value.
Arguments:
- expr (any data type): A list of expressions of any data type.
Returns:
- A hash value of type UInt64.
Example: the xxh3 function calculates a hash value for the combination of ‘Crunchy Taco’ and ‘Supreme’.
The xxh3 function is a fast non-cryptographic hash function, suitable for general hash-based lookup. It provides a good balance of speed and distribution quality.
xxHash32
Calculates the 32-bit xxHash value for a given string.
Arguments:
- string (String): The input string to hash.
Returns:
- A 32-bit hash value (UInt32).
xxHash32 is a fast non-cryptographic hash algorithm. It’s suitable for tasks like checksumming or fingerprinting, but should not be used for cryptographic purposes.
xxHash64
Calculates a 64-bit xxHash hash value from a string.
Arguments:
- string (String): The input string to hash.
Returns:
- A UInt64 hash value.
xxHash64 is a fast non-cryptographic hash algorithm, suitable for hash tables, checksums, and other non-security applications. It provides a good balance between speed and hash quality.
ngramSimHash
Calculates the n-gram simhash of a string. This function is useful for detecting semi-duplicate strings.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): Size of the n-gram. Default value: 3. Possible values: any number from 1 to 25.
Returns:
- A 64-bit hash value (UInt64).
Use it with bitHammingDistance to detect similar strings. The smaller the Hamming distance between the simhashes of two strings, the more likely these strings are similar.
The ngramSimHash function is particularly useful in scenarios where you need to identify similar text entries, such as finding similar taco recipes or menu descriptions in a large database of food items.
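The general SimHash idea behind this family of functions can be sketched as follows: hash every n-gram, let each hash "vote" on each of the 64 bit positions, and keep the bits with a positive tally. The Python version below is a textbook-style illustration with a BLAKE2b stand-in hash and hypothetical function names; it will not reproduce the database's hash values.

```python
import hashlib


def h64(data: bytes) -> int:
    """Stand-in 64-bit hash (BLAKE2b, 8 bytes); not the hash used by the database."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def ngram_simhash_sketch(text: str, ngram_size: int = 3) -> int:
    """Each n-gram hash votes +1/-1 per bit; bits with a positive tally are set in the result."""
    votes = [0] * 64
    for i in range(len(text) - ngram_size + 1):
        gram_hash = h64(text[i:i + ngram_size].encode("utf-8"))
        for bit in range(64):
            votes[bit] += 1 if (gram_hash >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def bit_hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


a = ngram_simhash_sketch("Crunchy Taco Supreme")
b = ngram_simhash_sketch("Crunchy Taco Supreme!")
c = ngram_simhash_sketch("Quarterly revenue report")
print(bit_hamming_distance(a, b), bit_hamming_distance(a, c))
```

Strings sharing most n-grams cast mostly the same votes, so their simhashes tend to differ in fewer bits than those of unrelated strings.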
ngramSimHashCaseInsensitive
Splits an ASCII string into n-grams of ngramsize symbols and returns the n-gram simhash. This function is case-insensitive.
It can be used for detecting semi-duplicate strings when combined with the bitHammingDistance function. The smaller the Hamming distance between the calculated simhashes of two strings, the more likely these strings are similar.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
Returns:
- A hash value (UInt64).
The function is useful for fuzzy matching and finding similar strings, regardless of letter case. This can be particularly helpful when comparing user-generated content or searching for menu items with slight variations in capitalization.
ngramSimHashUTF8
Splits a UTF-8 string into n-grams of ngramsize symbols and returns the n-gram simhash. This function is case-sensitive.
It can be used for detecting semi-duplicate strings when combined with the bitHammingDistance function. The smaller the Hamming distance between the calculated simhashes of two strings, the more likely these strings are similar.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
Returns:
- A hash value (UInt64).
Example: ngramSimHashUTF8 calculates the simhash for the taco menu item ‘Crunchy Taco Supreme’. This hash can be used to find similar menu items or detect near-duplicate entries in a taco restaurant’s menu database.
The function is case-sensitive, so ‘Crunchy Taco Supreme’ and ‘crunchy taco supreme’ would produce different hash values.
ngramSimHashCaseInsensitiveUTF8
Splits a UTF-8 string into n-grams of ngramsize symbols and returns the n-gram simhash. This function is case-insensitive.
It can be used for detecting semi-duplicate strings when combined with the bitHammingDistance function. The smaller the Hamming distance between the calculated simhashes of two strings, the more likely these strings are similar.
Arguments:
- string (String): Input string to hash.
- ngramsize (UInt8, optional): Size of the n-gram. Possible values: any number from 1 to 25. Default value: 3.
Returns:
- A hash value (UInt64).
The function is particularly useful for fuzzy matching of text, such as finding similar taco recipes or menu items, even if they have slight variations in spelling or capitalization.
wordShingleSimHash
Splits a string into word shingles and calculates a SimHash value. This function is case-sensitive.
Arguments:
- string (String): Input string.
- shinglesize (UInt8, optional): Size of word shingles. Default value: 3. Possible values: any number from 1 to 25.
Returns:
- A 64-bit hash value (UInt64).
The SimHash algorithm is particularly useful for finding near-duplicate content, making it valuable for tasks like detecting similar taco recipes or menu descriptions in a large database of taco-related text.
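A word shingle is a run of consecutive words, so a string yields one shingle per sliding-window position. The Python sketch below shows the splitting step only; the function name is hypothetical, whitespace tokenization is an assumption, and how the database tokenizes words or treats inputs shorter than the shingle size is not specified here.

```python
def word_shingles(text: str, shingle_size: int = 3):
    """Return all runs of shingle_size consecutive whitespace-separated words."""
    words = text.split()
    return [tuple(words[i:i + shingle_size]) for i in range(len(words) - shingle_size + 1)]


for shingle in word_shingles("soft corn tortilla with salsa"):
    print(shingle)
```

Each shingle would then be hashed and fed into the SimHash voting step in place of character n-grams.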
wordShingleSimHashCaseInsensitive
Splits an ASCII string into parts (shingles) of shinglesize words and returns the word shingle simhash. This function is case-insensitive.
It can be used for detecting semi-duplicate strings when combined with the bitHammingDistance function. The smaller the Hamming distance between the calculated simhashes of two strings, the more likely these strings are similar.
Arguments:
- string (String): Input string.
- shinglesize (UInt8, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.
Returns:
- A hash value (UInt64).
The function is useful for finding similar text content, such as duplicate menu descriptions or customer reviews about tacos, even if they have slight variations in wording or capitalization.
wordShingleSimHashUTF8
Splits a UTF-8 string into word shingles and calculates a SimHash value. This function is case-sensitive.
Arguments:
- string (String): Input string to hash.
- shinglesize (UInt8, optional): Size of word shingles. Default value: 3. Possible values: any number from 1 to 25.
Returns:
- A 64-bit hash value (UInt64).
For case-insensitive hashing of UTF-8 strings, use the wordShingleSimHashCaseInsensitiveUTF8 function.
wordShingleSimHashCaseInsensitiveUTF8
Splits a UTF-8 string into parts (shingles) of shinglesize words and returns the word shingle simhash. This function is case-insensitive.
It can be used for detecting semi-duplicate strings when combined with the bitHammingDistance function. The smaller the Hamming distance between the calculated simhashes of two strings, the more likely these strings are similar.
Arguments:
- string (String): Input string.
- shinglesize (UInt8, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.
Returns:
- A hash value (UInt64).
This function is particularly useful for finding similar text content in large datasets, such as identifying near-duplicate taco recipes or reviews with slight variations.
wyHash64
Produces a 64-bit wyHash64 hash value.
Arguments:
- string (String): Input string to hash.
Returns:
- A 64-bit hash value (UInt64).
wyHash64 is a fast, high-quality hash function that can be useful for various purposes such as hash tables, checksums, or generating unique identifiers for strings.
ngramMinHash
Splits a string into n-grams and calculates hash values for each n-gram. Returns a tuple with minimum and maximum hashes.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): Size of each n-gram. Default value: 3. Range: 1 to 25.
- hashnum (UInt8, optional): Number of minimum and maximum hashes to calculate. Default value: 6. Range: 1 to 25.
Returns:
- A tuple with two hashes, the minimum and the maximum (Tuple(UInt64, UInt64)).
It can be used for detecting semi-duplicate strings when combined with tupleHammingDistance. If one of the returned hashes is the same for two strings, those strings are likely similar.
The function is case-sensitive. For case-insensitive hashing, use ngramMinHashCaseInsensitive.
ngramMinHashCaseInsensitive
Splits an ASCII string into n-grams of ngramsize symbols and calculates hash values for each n-gram. Uses hashnum minimum hashes to calculate the minimum hash and hashnum maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. This function is case-insensitive.
This function can be used for detecting semi-duplicate strings when combined with the tupleHammingDistance function. If one of the returned hashes is the same for two strings, those strings are considered similar.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): Size of each n-gram. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns:
- A tuple with two hashes, the minimum and the maximum (Tuple(UInt64, UInt64)).
The actual hash values may differ in your results due to the nature of hash functions.
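The min/max scheme described above can be sketched as: hash every n-gram, take the hashnum smallest and the hashnum largest hashes, and fold each group into one value. The Python sketch below follows that reading under stated assumptions: the hash is a BLAKE2b stand-in, the folding step is a guess at "uses hashnum hashes to calculate the hash", and all function names are hypothetical.

```python
import hashlib


def h64(data: bytes) -> int:
    """Stand-in 64-bit hash (BLAKE2b, 8 bytes); not the hash used by the database."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def fold(values) -> int:
    """Assumed combining step: hash the concatenation of the selected hashes."""
    return h64(b"".join(v.to_bytes(8, "big") for v in values))


def ngram_min_hash_sketch(text: str, ngram_size: int = 3, hash_num: int = 6,
                          case_insensitive: bool = False):
    """Return (hash of the hash_num smallest n-gram hashes, hash of the hash_num largest)."""
    if case_insensitive:
        text = text.lower()
    grams = {text[i:i + ngram_size] for i in range(len(text) - ngram_size + 1)}
    hashes = sorted(h64(g.encode("utf-8")) for g in grams)
    return fold(hashes[:hash_num]), fold(hashes[-hash_num:])


def tuple_hamming_distance(a, b) -> int:
    """Number of positions at which two equally sized tuples differ."""
    return sum(x != y for x, y in zip(a, b))


left = ngram_min_hash_sketch("Crunchy Taco Supreme", case_insensitive=True)
right = ngram_min_hash_sketch("crunchy taco supreme", case_insensitive=True)
print(tuple_hamming_distance(left, right))  # 0: the inputs differ only by case
```

A distance below 2 means at least one of the two hashes matched, which is the similarity signal the text describes.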
ngramMinHashUTF8
Splits a UTF-8 string into n-grams of ngramsize symbols and calculates hash values for each n-gram. Uses hashnum minimum hashes to calculate the minimum hash and hashnum maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. This function is case-sensitive.
This function can be used for detecting semi-duplicate strings when combined with the tupleHammingDistance function. If one of the returned hashes is the same for two strings, those strings are considered similar.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns:
- A tuple with two hashes, the minimum and the maximum (Tuple(UInt64, UInt64)).
Example: ngramMinHashUTF8 calculates the hash values for the taco name. The result can be used to find similar taco names or descriptions in a large dataset.
ngramMinHashCaseInsensitiveUTF8
Splits a UTF-8 string into n-grams of ngramsize symbols and calculates hash values for each n-gram. Uses hashnum minimum hashes to calculate the minimum hash and hashnum maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. This function is case-insensitive.
This function can be used for detecting semi-duplicate strings when combined with the tupleHammingDistance function. If one of the returned hashes is the same for two strings, those strings are considered similar.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns:
- A tuple with two hashes, the minimum and the maximum (Tuple(UInt64, UInt64)).
This function is particularly useful for finding similar strings in large datasets, such as menu items or product names, regardless of case differences.
ngramMinHashArg
Splits an ASCII string into n-grams of a specified size and returns the n-grams with minimum and maximum hashes, as calculated by the ngramMinHash function with the same input. This function is case-sensitive.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): Size of each n-gram. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns:
- A tuple containing two tuples, each with hashnum n-grams (Tuple(Tuple(String), Tuple(String))).
Example:
- The function splits ‘Tasty Taco Tuesday’ into 3-character n-grams.
- It returns two tuples, each containing two n-grams (as specified by hashnum = 2).
- The first tuple contains n-grams with minimum hashes, and the second tuple contains n-grams with maximum hashes.
ngramMinHashArgCaseInsensitive
Splits an ASCII string into n-grams of ngramsize symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashCaseInsensitive function with the same input. This function is case-insensitive.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns:
- A tuple with two tuples, each containing hashnum n-grams (Tuple(Tuple(String), Tuple(String))).
Example:
- The function splits ‘Crunchy Taco Supreme’ into 3-character n-grams.
- It returns 2 minimum and 2 maximum hashes (due to hashnum = 2).
- The result shows the n-grams corresponding to these hashes.
- Note that the case is ignored: ‘chy’ could come from ‘Chy’ in ‘Crunchy’.
ngramMinHashArgUTF8
Splits a UTF-8 string into n-grams of ngramsize symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashUTF8 function with the same input. This function is case-sensitive.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns:
- A tuple with two tuples, each containing hashnum n-grams (Tuple(Tuple(String), Tuple(String))).
Example:
- The function splits the input string into 1-gram words.
- It returns two tuples, each containing 3 words (as specified by hashnum = 3).
- The first tuple contains words with minimum hashes, and the second tuple contains words with maximum hashes.
ngramMinHashArgCaseInsensitiveUTF8
Splits a UTF-8 string into n-grams of ngramsize symbols and returns the n-grams with minimum and maximum hashes, calculated by the ngramMinHashCaseInsensitiveUTF8 function with the same input. This function is case-insensitive.
Arguments:
- string (String): Input string.
- ngramsize (UInt8, optional): The size of an n-gram. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returns:
- A tuple with two tuples, each containing hashnum n-grams (Tuple(Tuple(String), Tuple(String))).
Example:
- The function splits ‘Crunchy Taco Supreme’ into 2-grams (bigrams).
- It returns two tuples, each containing 3 bigrams.
- The first tuple contains the bigrams with the minimum hashes.
- The second tuple contains the bigrams with the maximum hashes.
- The function is case-insensitive, so ‘TA’ and ‘ta’ are treated the same.
wordShingleMinHash
Splits a string into word shingles and calculates hash values for each shingle. Returns a tuple with minimum and maximum hashes. This function is case-sensitive.
Syntax
Arguments
- string (String): Input string.
- shinglesize (UInt8, optional): Size of word shingles. Possible values: 1 to 25. Default value: 3.
- hashnum (UInt8, optional): Number of minimum and maximum hashes used. Possible values: 1 to 25. Default value: 6.
Returned value
- A tuple with two hashes: the minimum and the maximum. (Tuple(UInt64, UInt64))
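The word-shingle approach can be illustrated with a short Python sketch. It is not the database's algorithm: h64, word_shingles, and word_shingle_min_hash are hypothetical names, SHA-256 is a stand-in hash, and folding the hashnum smallest (or largest) hashes into one value by hashing them together is an assumption.

```python
import hashlib
import re


def h64(data):
    """64-bit stand-in hash (assumption: not the hash the database uses)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def word_shingles(text, shinglesize):
    """Overlapping runs of shinglesize consecutive words."""
    words = re.findall(r"\w+", text)
    return [" ".join(words[i:i + shinglesize])
            for i in range(len(words) - shinglesize + 1)]


def word_shingle_min_hash(text, shinglesize=3, hashnum=6):
    """Return (min_hash, max_hash) as a pair of 64-bit integers."""
    hashes = sorted({h64(s.encode("utf-8"))
                     for s in word_shingles(text, shinglesize)})
    smallest, largest = hashes[:hashnum], hashes[-hashnum:]

    def combine(values):
        # Assumption: fold the selected hashes into a single 64-bit value.
        return h64(b"".join(v.to_bytes(8, "big") for v in values))

    return combine(smallest), combine(largest)


print(word_shingles("soft taco with salsa", 2))
# ['soft taco', 'taco with', 'with salsa']
print(word_shingle_min_hash("crunchy taco with salsa verde and extra cheese", 3, 2))
```

Two texts that share most of their shingles are likely to select the same smallest or largest hashes, which is why the resulting pair works as a similarity fingerprint.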
The function is designed for ASCII strings. For UTF-8 strings, use wordShingleMinHashUTF8.
wordShingleMinHashCaseInsensitive
Splits an ASCII string into word shingles and calculates a case-insensitive hash value for each shingle. Uses a specified number of minimum and maximum hashes to compute the final result.
Syntax
Arguments
- string (String): Input string to process.
- shinglesize (UInt8, optional): Size of each word shingle. Default value: 3. Range: 1 to 25.
- hashnum (UInt8, optional): Number of minimum and maximum hashes to use. Default value: 6. Range: 1 to 25.
Returned value
- A tuple containing two hash values: the minimum and maximum. (Tuple(UInt64, UInt64))
Can be used to detect semi-duplicate strings with tupleHammingDistance. If one of the returned hashes is the same for two strings, those strings are likely similar.
The function is case-insensitive, meaning “Taco” and “taco” will be treated the same.
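The comparison step can be sketched as follows. tuple_hamming_distance here is a plain Python stand-in that counts the positions at which two equal-length tuples differ, and the hash pairs are made-up values for illustration, not real function output.

```python
def tuple_hamming_distance(left, right):
    """Count the positions at which two equal-length tuples differ."""
    if len(left) != len(right):
        raise ValueError("tuples must have the same length")
    return sum(a != b for a, b in zip(left, right))


# Hypothetical (min_hash, max_hash) pairs, made up for illustration.
order_a = (1111, 9001)
order_b = (1111, 9002)  # shares the minimum hash with order_a
order_c = (2222, 9003)  # shares nothing with order_a

print(tuple_hamming_distance(order_a, order_b))  # 1
print(tuple_hamming_distance(order_a, order_c))  # 2
```

A distance below 2 means at least one of the two hashes matches, which is the signal that the strings are likely near-duplicates.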
Example
For UTF-8 strings, use the wordShingleMinHashCaseInsensitiveUTF8 function instead.
wordShingleMinHashUTF8
Splits a UTF-8 string into word shingles and calculates hash values for each shingle. Uses the minimum and maximum hashes to produce a tuple of hash values. This function is case-sensitive.
Syntax
Arguments
- string (String): Input string to be hashed.
- shinglesize (UInt8, optional): Size of each word shingle. Default value: 3. Range: 1 to 25.
- hashnum (UInt8, optional): Number of minimum and maximum hashes to calculate. Default value: 6. Range: 1 to 25.
Returned value
- A tuple containing two hash values: the minimum and the maximum. (Tuple(UInt64, UInt64))
This function is particularly useful for text similarity analysis, especially when dealing with multilingual content or texts containing special characters, as it properly handles UTF-8 encoded strings.
wordShingleMinHashCaseInsensitiveUTF8
Splits a UTF-8 string into parts (shingles) of shinglesize words and calculates hash values for each word shingle. Uses hashnum minimum hashes to calculate the minimum hash and hashnum maximum hashes to calculate the maximum hash. Returns a tuple with these hashes. This function is case-insensitive. Can be used to detect semi-duplicate strings with tupleHammingDistance. For two strings, if one of the returned hashes is the same for both, the strings are considered the same.
Syntax
Arguments
- string (String): Input string.
- shinglesize (UInt8, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returned value
- A tuple with two hashes: the minimum and the maximum. (Tuple(UInt64, UInt64))
wordShingleMinHashArg
Splits an ASCII string into parts (shingles) of a specified word size and returns the shingles with minimum and maximum word hashes. This function is case-sensitive.
Syntax
Arguments
- string (String): Input string.
- shinglesize (UInt8, optional): The size of a word shingle. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): The number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returned value
- A tuple with two tuples, each containing hashnum word shingles. (Tuple(Tuple(String), Tuple(String)))
- The function splits the input string into single words (shinglesize = 1).
- It returns two tuples with 3 words each (hashnum = 3).
- The first tuple contains words with minimum hashes, and the second tuple contains words with maximum hashes.
The results may vary depending on the input string and the hash function used internally.
wordShingleMinHashArgCaseInsensitive
Splits an ASCII string into word shingles and returns the shingles with minimum and maximum word hashes. This function is case-insensitive.
Syntax
Arguments
- string (String): Input string.
- shinglesize (UInt8, optional): Size of each word shingle. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returned value
- A tuple containing two tuples, each with hashnum word shingles. (Tuple(Tuple(String), Tuple(String)))
Can be used to detect semi-duplicate strings with tupleHammingDistance. The smaller the Hamming distance between the calculated hashes of two strings, the more likely these strings are similar.
This function is the case-insensitive version of wordShingleMinHashArg. It’s particularly useful when you want to compare strings regardless of their letter casing.
wordShingleMinHashArgUTF8
Splits a UTF-8 string into word shingles and returns the shingles with minimum and maximum word hashes. This function is case-sensitive.
Syntax
Arguments
- string (String): Input string.
- shinglesize (UInt8, optional): Size of each word shingle. Possible values: 1 to 25. Default value: 3.
- hashnum (UInt8, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: 1 to 25. Default value: 6.
Returned value
- A tuple containing two tuples, each with hashnum word shingles. (Tuple(Tuple(String), Tuple(String)))
- The function splits the input string into individual words.
- It returns two tuples: one with the three words that have the minimum hashes, and another with the three words that have the maximum hashes.
- The shingle size is set to 1 (individual words) and the number of hashes is set to 3.
wordShingleMinHashArgCaseInsensitiveUTF8
Splits a UTF-8 string into word shingles and returns the shingles with minimum and maximum word hashes. This function is case-insensitive.
Syntax
Arguments
- string (String): Input string.
- shinglesize (UInt8, optional): Size of each word shingle. Possible values: any number from 1 to 25. Default value: 3.
- hashnum (UInt8, optional): Number of minimum and maximum hashes used to calculate the result. Possible values: any number from 1 to 25. Default value: 6.
Returned value
- A tuple containing two tuples, each with hashnum word shingles. (Tuple(Tuple(String), Tuple(String)))
- The function splits the taco order description into individual words.
- It returns two tuples: one with the shingles that produced the minimum hashes, and another with those that produced the maximum hashes.
- The shingle size is set to 1 (individual words) and we’re getting 3 shingles for each tuple.
sqidEncode
Encodes numbers as a Sqid, which is a YouTube-like ID string. The output alphabet is abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. Do not use this function for hashing: the generated IDs can be decoded back into the original numbers.
Syntax:
- Alias: sqid
Arguments
- A variable number of UInt8, UInt16, UInt32 or UInt64 numbers.
Returned value
- A Sqid. (String)
Example:
sqidEncode generates a unique ID for a taco order using the numbers 1, 2, 3, 4, and 5. This could be useful for creating short, readable order IDs in a taco ordering system.
The sqidEncode function is particularly useful when you need to generate short, URL-friendly IDs from a series of numbers. It’s ideal for creating readable identifiers for orders, products, or any other entities in your database that need a compact, reversible representation.
sqidDecode
Decodes a Sqid back into its original numbers.
Syntax:
Arguments
- sqid (String): A Sqid string to decode.
Returned value
- An array of numbers decoded from the Sqid. Returns an empty array if the input string is not a valid Sqid. (Array(UInt64))
The sqidDecode function decodes a Sqid back into its original numeric components, which could represent various aspects of the order such as order number, number of tacos, sauce type, etc.
This function is the inverse of sqidEncode. It’s useful for decoding short, URL-friendly IDs back into their original numeric form. Because the encoding is reversible, Sqids should not be used to conceal sensitive values.
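The reversibility point can be made concrete with a toy encoder. The sketch below is explicitly not the Sqids algorithm and will not produce the same strings as sqidEncode: toy_encode and toy_decode are hypothetical functions that write each number in base 61 and use the last alphabet character as a separator. They only demonstrate that an ID built this way round-trips, and that invalid input decodes to an empty list.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
DIGITS, SEP = ALPHABET[:-1], ALPHABET[-1]  # 61 digit symbols, '9' separates
BASE = len(DIGITS)


def toy_encode(numbers):
    """Toy reversible encoding (assumption: NOT the real Sqids scheme)."""
    parts = []
    for n in numbers:
        if n < 0:
            raise ValueError("only non-negative integers are supported")
        digits = DIGITS[n % BASE]
        n //= BASE
        while n:
            digits = DIGITS[n % BASE] + digits
            n //= BASE
        parts.append(digits)
    return SEP.join(parts)


def toy_decode(text):
    """Inverse of toy_encode; returns [] when the input is not valid."""
    parts = text.split(SEP)
    if any(not p or any(c not in DIGITS for c in p) for p in parts):
        return []
    numbers = []
    for part in parts:
        value = 0
        for char in part:
            value = value * BASE + DIGITS.index(char)
        numbers.append(value)
    return numbers


encoded = toy_encode([1, 2, 3, 4, 5])
print(encoded)                    # b9c9d9e9f
print(toy_decode(encoded))        # [1, 2, 3, 4, 5]
print(toy_decode("not valid!"))   # []
```

Anyone who knows the scheme can recover the numbers from the ID, which is exactly why such IDs are identifiers rather than hashes.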