
Chunk Size #2438

Closed · Fixed by #4375
2 tasks done

MrBrabus75 opened this issue Feb 12, 2024 · 9 comments
Labels
💪 enhancement New feature or request 👻 feat:rag Embedding related issue, like qdrant, weaviate, milvus, vector database. no-issue-activity Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed

Comments

@MrBrabus75

Self Checks

Dify version

0.5.2

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

How to reproduce?

Use an Embedding with <1000 dimensions.

✔️ Expected Behavior

It should work.

❌ Actual Behavior

Error: chunk size must be between 0 and 1000.
The problem is the chunk size limit.

How can I raise the chunk size limit from 0-1000 to 0-4096 or higher?

@MrBrabus75 MrBrabus75 added the 🐞 bug Something isn't working label Feb 12, 2024

dosubot bot commented Feb 12, 2024

🤖

Hey there, @MrBrabus75! Nice to see you diving into dify's depths again. How's everything going on your end?

Based on the information provided, it seems like you're encountering an issue with the chunk size when using an Embedding with less than 1000 dimensions. However, the current limit for the chunk size in the dify codebase is not explicitly mentioned in the provided context.

The codebase does have limits set for file size, as seen in the file_service.py file. Here, the limit for image files is defined by the configuration value "UPLOAD_IMAGE_FILE_SIZE_LIMIT", and for other files, it is defined by "UPLOAD_FILE_SIZE_LIMIT". Both of these values are multiplied by 1024 twice, suggesting that the limit is set in megabytes. If the size of the uploaded file exceeds the limit, an error is raised.
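The size check described above can be sketched as follows. This is an illustrative reconstruction rather than Dify's actual `file_service.py` code, and the helper name `check_upload_size` is hypothetical; only the config names and the "multiplied by 1024 twice" behavior come from the description above.

```python
# Illustrative sketch only (not Dify's actual code): how a megabyte-denominated
# limit such as UPLOAD_FILE_SIZE_LIMIT is typically enforced.

def check_upload_size(file_bytes: bytes, limit_mb: int, kind: str = "file") -> None:
    """Raise if the payload exceeds the configured limit, given in megabytes."""
    limit_bytes = limit_mb * 1024 * 1024  # "multiplied by 1024 twice": MB to bytes
    if len(file_bytes) > limit_bytes:
        raise ValueError(f"{kind} exceeds the {limit_mb} MB upload limit")

check_upload_size(b"x" * 1024, limit_mb=15)  # a 1 KB payload passes a 15 MB limit
```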

As for the specific error message you're receiving when trying to use an Embedding with less than 1000 dimensions, I wasn't able to find an answer within the repository.

Regarding the part of the dify codebase that handles the chunk size for Embeddings, the embed_documents method in the ZhipuAITextEmbeddingModel class processes a list of texts to generate embeddings. While the method itself does not explicitly mention "chunk size," it iterates over each text in the texts list and calls the embeddings.create method on the ZhipuAI client for each text. This suggests that the chunk size, in this case, is implicitly determined by the size of each individual text in the texts list.
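The per-text loop described above can be sketched like this. The `embed_one` callable stands in for the real ZhipuAI client's `embeddings.create` call; everything here is an illustrative assumption rather than Dify's actual implementation.

```python
# Illustrative sketch of the per-text embedding loop; `embed_one` stands in
# for the real ZhipuAI client's embeddings.create call (an assumption here).

from typing import Callable, List

def embed_documents(texts: List[str], embed_one: Callable[[str], List[float]]) -> List[List[float]]:
    # Each text is embedded on its own, so the effective "chunk size" is
    # simply the length of each individual text.
    return [embed_one(text) for text in texts]

# Toy embedder: a 2-dim vector of (character count, word count).
vectors = embed_documents(
    ["hello world", "bonjour"],
    lambda t: [float(len(t)), float(len(t.split()))],
)
```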

Unfortunately, I wasn't able to find a configuration setting in the dify codebase that allows the user to adjust the chunk size limit or a way to increase the chunk size limit in the dify codebase.

Could you please provide more information about the error you're encountering? Specifically, could you share the exact error message and stack trace, and any additional steps or code that leads to the error? This will help in understanding the issue better and finding a suitable solution.


@dosubot dosubot bot added the 💪 enhancement New feature or request label Feb 12, 2024
@takatost
Collaborator

Usually we don't recommend setting such a large chunk size; below 1000 is an appropriate size for semantic search. I guess you might want the chunk content used as LLM context to be as complete as possible. Maybe you can try the QA splitting mode, and if you're not satisfied, continue modifying the chunk content: use Q as the semantic search content and A as the context for the LLM.
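The "retrieve by Q, return A" idea can be illustrated with a minimal sketch. The token-overlap score below is a toy stand-in for real embedding similarity, and all names and data are hypothetical:

```python
# Toy sketch of Q/A retrieval: only the short question (Q) is matched against
# the query, but the paired answer (A) is returned as LLM context. The
# token-overlap score below is a stand-in for real embedding similarity.

def retrieve_answer(query: str, qa_chunks: list) -> str:
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    best_q, best_a = max(qa_chunks, key=lambda qa: overlap(query, qa[0]))
    return best_a  # A becomes the LLM context, even though Q drove the match

qa = [
    ("How do I install the agent?", "Step 1: download the package. Step 2: run the installer."),
    ("How do I reset my password?", "Open Settings > Security and click Reset."),
]
answer = retrieve_answer("How do I install the agent?", qa)
```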

@crazywoola crazywoola removed the 🐞 bug Something isn't working label Feb 13, 2024
@MrBrabus75
Author

@takatost The problem is that the Q/A model is available only in Chinese and English, and not at all in French, even though my model is multilingual, which is a shame… (That feature would be nice, haha).

Then, I thought it would be good to increase the chunk limit for embeddings with large dimensions like Jina, Voyager, SFR Mistral mteb.

And finally, if I embed 50 documents (installation guides), and each document is divided into 50 chunks, the LLM struggles a bit to provide me with a detailed installation guide.

Thank you for your responses; I hope you will help me find a solution for these points.

@takatost
Collaborator


QA mode does not currently support French, and its results are also not as good as expected. What I mean is that we can indirectly create a knowledge base in QA mode by generating QAs, then add or modify QA chunks on top of that. The chunk limit of 1000 or less means the search content (query) and the smaller chunk (Q) have a higher semantic match, while the LLM's context can be extended to a larger chunk.

In the future, within a month perhaps, we will provide the Parent Document Index mode to support the indexing mode mentioned above. For example, a sentence can be used as a chunk, and the paragraph or the content of the current page where this sentence is located as the Parent can be used as context for LLM.
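The Parent Document Index idea can be sketched as follows. The data structures and names are hypothetical, illustrating only the pattern described above: a small chunk for retrieval, its larger parent for LLM context.

```python
# Illustrative sketch of a Parent Document Index: a small chunk (sentence) is
# what retrieval matches, but its parent (paragraph/page) is what the LLM sees.

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str       # small unit used for semantic matching
    parent_id: int  # points to the larger surrounding context

parents = {0: "Full installation paragraph: download, verify checksum, run installer, reboot."}
chunks = [Chunk("run installer", 0), Chunk("verify checksum", 0)]

def context_for(matched: Chunk) -> str:
    # Hand the LLM the parent document, not the tiny matched chunk.
    return parents[matched.parent_id]
```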

If this answer doesn't meet your expectations, could you please provide some more detailed scenarios for us to refer to? Thank you very much! 😊

@MrBrabus75
Author


Thank you for your time and your very detailed response.

However, I have a few questions:

    1. Is it possible, either by me or by you, to modify the language of the questions generated during Q/A? Are they generated by the LLM?
    2. Would you recommend that I wait for the PDI (Parent Document Index) mode, which will arrive in about a month?
    3. I am embedding about 50 documents or more; they are quite long (one or two pages) and are chunked into several pieces. Should I stick with a limit between 512 and 1000? Won't this limit the quality of the LLM's responses?

Thank you for everything you do.

@takatost
Collaborator

  1. Yeah, they are generated by LLM. It's not a perfect solution, but it effectively allows for separate editing of Q&A content in each chunk, achieving the desired result of retrieving Q and returning A. Currently, language setting is not supported, but we plan to redesign this feature in the future. For now, we'll stick with the current design.
  2. To enhance the experience, we suggest waiting for about a month. Of course, you can also try the solution mentioned in 1 for now, although it might be a bit cumbersome.
  3. As mentioned in 1, the QA mode separates the retrieval and content given to LLM. Regardless of how it's divided, this is just a general approach. Based on the principles of embedding, manual intervention or LLM involvement is needed to split/summarize and create chunks for retrieval, while the content serving as LLM context could be its parent (e.g., the corresponding page or the entire document).

Hope this can help you! 🤗

@takatost takatost added the 👻 feat:rag Embedding related issue, like qdrant, weaviate, milvus, vector database. label Feb 14, 2024
@takatost takatost self-assigned this Feb 14, 2024
@takatost
Collaborator

cc @VincePotato @JohnJyong

@MrBrabus75
Author

Okay, I see. Thank you for everything.


github-actions bot commented Mar 1, 2024

Closing because this issue is no longer active. If you have any questions, you can reopen it.

@github-actions github-actions bot added the no-issue-activity Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 1, 2024
@github-actions github-actions bot closed this as completed Mar 4, 2024