
Chunk Size #2438

Closed · Fixed by #4375
2 tasks done

MrBrabus75 opened this issue Feb 12, 2024 · 9 comments
Labels
💪 enhancement New feature or request 👻 feat:rag Embedding related issue, like qdrant, weaviate, milvus, vector database. no-issue-activity Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed

Comments

@MrBrabus75

Self Checks

Dify version

0.5.2

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

How to reproduce?

Use an Embedding with <1000 dimensions.

✔️ Expected Behavior

It should work.

❌ Actual Behavior

Error: chunk size must be between 0 and 1000.
The problem is the chunk size limit.

How can I raise the chunk size limit from 0-1000 to 0-4096 or higher?

@MrBrabus75 MrBrabus75 added the 🐞 bug Something isn't working label Feb 12, 2024

dosubot bot commented Feb 12, 2024

🤖

Hey there, @MrBrabus75! Nice to see you diving into dify's depths again. How's everything going on your end?

Based on the information provided, it seems like you're encountering an issue with the chunk size when using an Embedding with less than 1000 dimensions. However, the current limit for the chunk size in the dify codebase is not explicitly mentioned in the provided context.

The codebase does have limits set for file size, as seen in the file_service.py file. Here, the limit for image files is defined by the configuration value "UPLOAD_IMAGE_FILE_SIZE_LIMIT", and for other files, it is defined by "UPLOAD_FILE_SIZE_LIMIT". Both of these values are multiplied by 1024 twice, suggesting that the limit is set in megabytes. If the size of the uploaded file exceeds the limit, an error is raised.
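The size check described above can be sketched as follows. This is an illustrative reconstruction rather than Dify's actual `file_service.py` code, and the helper name `check_upload_size` is hypothetical; only the config names and the "multiplied by 1024 twice" behavior come from the description above.

```python
# Illustrative sketch only (not Dify's actual code): how a megabyte-denominated
# limit such as UPLOAD_FILE_SIZE_LIMIT is typically enforced.

def check_upload_size(file_bytes: bytes, limit_mb: int, kind: str = "file") -> None:
    """Raise if the payload exceeds the configured limit, given in megabytes."""
    limit_bytes = limit_mb * 1024 * 1024  # "multiplied by 1024 twice": MB to bytes
    if len(file_bytes) > limit_bytes:
        raise ValueError(f"{kind} exceeds the {limit_mb} MB upload limit")

check_upload_size(b"x" * 1024, limit_mb=15)  # a 1 KB payload passes a 15 MB limit
```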

As for the specific error message you're receiving when trying to use an Embedding with less than 1000 dimensions, I wasn't able to find an answer within the repository.

Regarding the part of the dify codebase that handles the chunk size for Embeddings, the embed_documents method in the ZhipuAITextEmbeddingModel class processes a list of texts to generate embeddings. While the method itself does not explicitly mention "chunk size," it iterates over each text in the texts list and calls the embeddings.create method on the ZhipuAI client for each text. This suggests that the chunk size, in this case, is implicitly determined by the size of each individual text in the texts list.
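The per-text loop described above can be sketched like this. The `embed_one` callable stands in for the real ZhipuAI client's `embeddings.create` call; everything here is an illustrative assumption rather than Dify's actual implementation.

```python
# Illustrative sketch of the per-text embedding loop; `embed_one` stands in
# for the real ZhipuAI client's embeddings.create call (an assumption here).

from typing import Callable, List

def embed_documents(texts: List[str], embed_one: Callable[[str], List[float]]) -> List[List[float]]:
    # Each text is embedded on its own, so the effective "chunk size" is
    # simply the length of each individual text.
    return [embed_one(text) for text in texts]

# Toy embedder: a 2-dim vector of (character count, word count).
vectors = embed_documents(
    ["hello world", "bonjour"],
    lambda t: [float(len(t)), float(len(t.split()))],
)
```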

Unfortunately, I wasn't able to find a configuration setting in the dify codebase that allows the user to adjust the chunk size limit or a way to increase the chunk size limit in the dify codebase.

Could you please provide more information about the error you're encountering? Specifically, could you share the exact error message and stack trace, and any additional steps or code that leads to the error? This will help in understanding the issue better and finding a suitable solution.


@dosubot dosubot bot added the 💪 enhancement New feature or request label Feb 12, 2024
@takatost
Collaborator

Usually we don't recommend setting such a large chunk size; below 1000 is an appropriate size for semantic search. I guess you might want the chunk content used as LLM context to be as complete as possible. Maybe you can try the QA splitting mode, and if you're not satisfied, continue modifying the chunk content: use Q as the semantic search content and A as the context for the LLM.
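The "retrieve by Q, return A" idea can be illustrated with a minimal sketch. The token-overlap score below is a toy stand-in for real embedding similarity, and all names and data are hypothetical:

```python
# Toy sketch of Q/A retrieval: only the short question (Q) is matched against
# the query, but the paired answer (A) is returned as LLM context. The
# token-overlap score below is a stand-in for real embedding similarity.

def retrieve_answer(query: str, qa_chunks: list) -> str:
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    best_q, best_a = max(qa_chunks, key=lambda qa: overlap(query, qa[0]))
    return best_a  # A becomes the LLM context, even though Q drove the match

qa = [
    ("How do I install the agent?", "Step 1: download the package. Step 2: run the installer."),
    ("How do I reset my password?", "Open Settings > Security and click Reset."),
]
answer = retrieve_answer("How do I install the agent?", qa)
```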

@crazywoola crazywoola removed the 🐞 bug Something isn't working label Feb 13, 2024
@MrBrabus75
Author

@takatost The problem is that the Q/A model is available only in Chinese and English, and not at all in French, even though my model is multilingual, which is a shame… (That feature would be nice, haha).

Then, I thought it would be good to increase the chunk limit for embeddings with large dimensions like Jina, Voyager, SFR Mistral mteb.

And finally, if I embed 50 documents (installation guides), and each document is divided into 50 chunks, the LLM struggles a bit to provide me with a detailed installation guide.

Thank you for your responses; I hope you will help me find a solution for these points.

@takatost
Collaborator


QA mode does not currently support French, and its results are also not as good as expected. What I mean is that we can indirectly create a knowledge base in QA mode by generating QAs, then add or modify QA chunks on top of that. The chunk limit of 1000 or less means the search content (query) and the smaller chunk (Q) have a higher semantic match, while the LLM's context can be extended to a larger chunk.

In the future, within a month perhaps, we will provide the Parent Document Index mode to support the indexing mode mentioned above. For example, a sentence can be used as a chunk, and the paragraph or the content of the current page where this sentence is located as the Parent can be used as context for LLM.
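The Parent Document Index idea can be sketched as follows. The data structures and names are hypothetical, illustrating only the pattern described above: a small chunk for retrieval, its larger parent for LLM context.

```python
# Illustrative sketch of a Parent Document Index: a small chunk (sentence) is
# what retrieval matches, but its parent (paragraph/page) is what the LLM sees.

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str       # small unit used for semantic matching
    parent_id: int  # points to the larger surrounding context

parents = {0: "Full installation paragraph: download, verify checksum, run installer, reboot."}
chunks = [Chunk("run installer", 0), Chunk("verify checksum", 0)]

def context_for(matched: Chunk) -> str:
    # Hand the LLM the parent document, not the tiny matched chunk.
    return parents[matched.parent_id]
```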

If this answer doesn't meet your expectations, could you please provide some more detailed scenarios for us to refer to? Thank you very much! 😊

@MrBrabus75
Author


Thank you for your time and your very detailed response.

However, I have a few questions:

    1. Is it possible, either by me or by you, to modify the language of the questions generated during Q/A? Are they generated by the LLM?
    2. Would you recommend that I wait for the PDI (Parent Document Index) mode, which will arrive in about a month?
    3. I am embedding about 50 documents or more; they are quite long (one or two pages) and are chunked into several pieces. Should I stick with a limit between 512 and 1000? Won't this limit the quality of the LLM's responses?

Thank you for everything you do.

@takatost
Collaborator

  1. Yeah, they are generated by LLM. It's not a perfect solution, but it effectively allows for separate editing of Q&A content in each chunk, achieving the desired result of retrieving Q and returning A. Currently, language setting is not supported, but we plan to redesign this feature in the future. For now, we'll stick with the current design.
  2. To enhance the experience, we suggest waiting for about a month. Of course, you can also try the solution mentioned in 1 for now, although it might be a bit cumbersome.
  3. As mentioned in 1, the QA mode separates the retrieval and content given to LLM. Regardless of how it's divided, this is just a general approach. Based on the principles of embedding, manual intervention or LLM involvement is needed to split/summarize and create chunks for retrieval, while the content serving as LLM context could be its parent (e.g., the corresponding page or the entire document).

Hope this can help you! 🤗

@takatost takatost added the 👻 feat:rag Embedding related issue, like qdrant, weaviate, milvus, vector database. label Feb 14, 2024
@takatost takatost self-assigned this Feb 14, 2024
@takatost
Collaborator

cc @VincePotato @JohnJyong

@MrBrabus75
Author

Okay, I see. Thank you for everything.


github-actions bot commented Mar 1, 2024

Closing because this issue is no longer active. If you have any questions, you can reopen it.

@github-actions github-actions bot added the no-issue-activity Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Mar 1, 2024
@github-actions github-actions bot closed this as completed Mar 4, 2024