Post by account_disabled on Mar 11, 2024 5:57:58 GMT
Embeddings and GPT-4 to cluster product reviews

First of all, a little refresher. In statistics, clustering refers to a set of data exploration methods that aim to identify and group similar elements within a dataset. Grouping strings through ChatGPT or the OpenAI APIs (with the GPT-3, gpt-3.5-turbo, or gpt-4 models) is relatively simple. The table below shows an example of clustering some Amazon product reviews, obtained from ChatGPT through GPT-4.

[Image: an example of a cluster table obtained from ChatGPT through GPT-4]

The prompt? The following is exactly the input I used:

Cluster the following reviews by topic and sentiment.
Create a table with the following columns: "Cluster Number", "Cluster Name", "Review (example)". Create 5 clusters. In the "Review (example)" column, indicate an example review.

REVIEWS
<review 1>
<review 2>
< ... >

Clearly this is just an example: the prompt could be optimized and its instructions made much more detailed. However, even though the "32k" variant of GPT-4 (gpt-4-32k) allows us to process a very large input, proceeding this way would not be advisable if we had to process tens of thousands or hundreds of thousands of reviews. The better solution involves embeddings.

What are embeddings

Embeddings are, in essence, a vectorization of strings. To put it in simpler terms, they are numerical representations (sequences of numbers) of one or more words that make it easier for machines to understand the relationships between the concepts expressed.
[Image: the transformation of a string into an embedding]

Embeddings are useful for working with natural language and code, because they can easily be used and compared by other machine learning models and algorithms to do things like:

search, where results are ranked by relevance to a search query;
clustering, where text strings are grouped by similarity (we will see an example of this in this post);
recommendations, to suggest related items based on their content;
similarity, where similarity distributions are analyzed;
classification, where text strings are labeled based on their characteristics.

The distance between two vectors (the numerical sequences) measures how closely the concepts they represent are related.
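To make that last point concrete, here is a minimal, self-contained sketch of cosine similarity, the comparison most often used between embeddings. The three-dimensional vectors and the review sentences in the comments are made up for illustration; real OpenAI embeddings have over a thousand dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the magnitudes.
    # Values near 1.0 mean the vectors point the same way (similar meaning);
    # values near 0.0 mean they are unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (invented values, for illustration only).
great_battery = [0.90, 0.10, 0.20]  # e.g. "Battery life is great"
long_battery = [0.85, 0.15, 0.25]   # e.g. "The battery lasts very long"
bad_shipping = [0.10, 0.90, 0.30]   # e.g. "Shipping was slow"

print(cosine_similarity(great_battery, long_battery))  # high: same topic
print(cosine_similarity(great_battery, bad_shipping))  # lower: different topic
```

The two battery reviews end up much closer to each other than either is to the shipping complaint, which is exactly the property clustering algorithms exploit.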
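As a sketch of the embeddings-based pipeline this points toward: embed each review once, then cluster the vectors locally with k-means instead of pasting every review into a prompt. The snippet below is a minimal sketch using scikit-learn; the random vectors stand in for real embeddings so it runs offline, and the API call shown in the comment (model name and the 1536-dimension size are assumptions, not taken from this post) is roughly how you would obtain real ones.

```python
import numpy as np
from sklearn.cluster import KMeans

# In a real pipeline each review would be embedded via the OpenAI API, e.g.:
#   from openai import OpenAI
#   client = OpenAI()
#   vec = client.embeddings.create(model="text-embedding-ada-002",
#                                  input=review_text).data[0].embedding
# Here random vectors stand in for the embeddings so the sketch runs offline.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 1536))  # 100 reviews, 1536-dim vectors

# Group the reviews into 5 clusters, mirroring the "Create 5 clusters" prompt.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

# labels[i] is the cluster assigned to review i; pick one example per cluster,
# as in the "Review (example)" column of the table above.
for cluster in range(5):
    example_index = int(np.where(labels == cluster)[0][0])
    print(f"Cluster {cluster}: example review index {example_index}")
```

With this approach the language model is only needed once per review (to produce the embedding), so the cost grows linearly with the number of reviews instead of requiring them all to fit in a single context window.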