If you have ever worked on a Spring application using Hibernate 5, you probably encountered the challenge of dealing with duplicated entities.
It is an odd scenario where your collection contains multiple entries referencing the same entity. This is a quite known issue, and there are ways to address it whether using the DISTINCT
keyword, Sets, or JPA EntityGraph.
However, I happened to have quite disturbing behavior using EntityGraph returning duplicated entities! We will be diving a bit under the hood of Hibernate 5 to understand what happened.
To illustrate Hibernate 5 behavior, let’s start with a basic Spring Boot project connected to a database:
The database schema is a simple model as follows:
The Hibernate entities are defined as follows:
Our focus will be on the queries generated by this JPA repository:
@Repository
public interface AuthorRepository extends JpaRepository<Author, Integer> {
@Query("SELECT a FROM Author a LEFT JOIN FETCH a.books")
List<Author> findAuthorsLeftJoinFetchNoEntityGraph();
@EntityGraph(attributePaths = "books")
@Query("FROM Author a LEFT JOIN a.books")
List<Author> findAuthorsLeftJoinWithEntityGraph();
@Query("FROM Author a LEFT JOIN a.books")
List<Author> findAuthorsLeftJoinWithoutEntityGraph();
@EntityGraph(attributePaths = "books")
@Query("FROM Author a INNER JOIN a.agent WHERE a.agent.name like '%John%'")
List<Author> findAuthorsWithJoinNotFetchedAndEntityGraph();
@EntityGraph(attributePaths = "books")
@Query("SELECT a FROM Author a INNER JOIN a.agent WHERE a.agent.name like '%John%'")
List<Author> findAuthorsWithJoinNotFetchedAndEntityGraphAndSelect();
}
To understand the occurrence of entity duplication, let’s consider a use case where the objective is to retrieve all authors and their respective books. In order to avoid the N+1 query issue I include in my JPQL query the keyword FETCH
.
It enables the retrieval of both authors and their associated books in a single SQL query, eliminating the necessity to individually fetch books for each author.
@Query("SELECT a FROM Author a LEFT JOIN FETCH a.books")
List<Author> findAuthorsLeftJoinFetchNoEntityGraph();
Invoking this method will result in the compilation of an SQL query equivalent to the following:
SELECT A, B FROM AUTHOR A LEFT JOIN BOOK B ON A.ID = B.BOOK_AUTHOR_ID;
In a case where you have 2 authors that have respectively 2 and 3 books, the result of the SQL query will yield the following table:
Depending on the type of join used, the association between the Author
and Book
tables are established based on the provided condition.
This will produce for each Book
that matched an Author
, a new row containing the columns of the Book
table but also the columns of the Author
table.
Hibernate, upon retrieving this result set, intelligently resolves the entity without duplication. In our case, we end up with only two instances of the Author
entity. However, it handles resolution on a row-by-row basis and does not inherently deduplicate the rows by default.
This explains why the List
, returned by calling findAuthorsLeftJoinFetchNoEntityGraph
, returns the following collection:
[Author@5640, Author@5640, Author@5641, Author@5641, Author@5641]
Author@5640
being the instance representing J.R.R Tolkien Author
entity and Author@5641
representing the Author
Isaac Asimov.
This duplication is mostly observed on @OneToMany
and @ManyToMany
relationships since joining with these types of relations might match multiple rows of the associated table.
On the contrary, @OneToOne
and @ManyToOne
relationships match at most only 1 entry on the associated table and will not induce duplication.
Although this duplication issue is quite common and often occurs during the development of applications, there are ways to address and resolve this issue.
Set
A straightforward solution involves using the returning type Set<Author>
instead of List<Author>
:
@Query("FROM Author a LEFT JOIN FETCH a.books")
Set<Author> findAuthorsLeftJoinFetchNoEntityGraph();
When the results are collected, duplicated entities will be filtered through the Set
.
DISTINCT
keyword
Another approach is to use the DISTINCT
keyword:
@Query("SELECT DISTINCT a FROM Author a LEFT JOIN FETCH a.books")
List<Author> findAuthorsLeftJoinFetchNoEntityGraph();
It should be noted that using DISTINCT
as two semantic meanings:
When the DISTINCT
keyword is used in a JPQL query, it leads to the generation of an SQL query featuring the DISTINCT
keyword but also instructing Hibernate to filter duplicated entities.
However, using DISTINCT
in SQL induces an overhead when executed in the database, and in some cases, the goal is to only filter duplicate entities induced by the join. For such cases, the Hibernate hint HINT_PASS_DISTINCT_THROUGH
can be employed to avoid the generation of DISTINCT
in the SQL query:
@QueryHints(@QueryHint(name = HINT_PASS_DISTINCT_THROUGH, value = "false"))
@Query("SELECT DISTINCT a FROM Author a LEFT JOIN FETCH a.books")
Set<Author> findDistinctAuthorsLeftJoinFetchNoEntityGraph();
EntityGraph
In JPA 2.1, EntityGraph was introduced as a powerful feature allowing developers to define the loading strategies of entities. In our case, we can just use the @EntityGraph
annotation to load the books along with the Author
, as illustrated below:
@EntityGraph(attributePaths = "books")
@Query("FROM Author a LEFT JOIN a.books WHERE a.books.size > 2")
List<Author> findAuthorsLeftJoinWithEntityGraph();
This JPQL query is designed to retrieve all authors, along with their respective books, who possess more than 2 books.
Even without explicitly including the FETCH
keyword in the SQL query, the EntityGraph ensures the books are loaded and the parent entity Author
remains non-duplicated.
Note: This is a simple example of using EntityGraph; a more complete and sophisticated way of using it exists, but this is not the scope of this section.
While it is established that EntityGraph handles entity deduplication, let's look into a more complicated scenario.
Consider the Author
entity, which has a @ManyToOne
relation with Agent
. Suppose the objective is to retrieve all authors and their books where the author's agent has a name containing "John", yet we have no intention of loading the Agent
entity for an arbitrary reason. In this case, we can accomplish this by the following code:
@EntityGraph(attributePaths = "books")
@Query("FROM Author a INNER JOIN a.agent WHERE a.agent.name like '%John%'")
List<Author> findAuthorsWithJoinNotFetchedAndEntityGraph();
Since we are using EntityGraph, we expect to avoid duplicated Author
, but upon executing the method, the outcome is unexpected:
Duplicated entities are retrieved despite the use of EntityGraph. To understand the underlying reasons for this behavior, let’s dive a bit under the hood of Hibernate!
Hibernate is a quite complete and complex ORM, it can be hard to grasp entirely its fine details, but in our case, since we want to understand deduplication mechanisms, we will be focusing on the method
For the sake of clarity, some portions of the code have been omitted in order to highlight the most crucial aspects.
Upon invoking a method from the AuthorRepository
, the execution will eventually reach this code snippet. Here is a breakdown of its functionality:
It computes a boolean called needsDistincting
, and as the name implies, it identifies whether the result set needs to be deduplicated.
We can observe when we do use aDISTINCT
keyword, EntityGraph, or limit the size of retrieval, AND there is a FETCH
operation in the JPQL query; the distinction needs to be applied.
In our issue, it can be observed that whenever EntityGraph is used, the distinction will be applied since EntityGraph will edit the source query to “FETCH
” child entities given in the attribute paths.
This method invocation executes the query in the database, retrieves the result set, and resolves the entities. This list might contain duplicated entities.
This is the most interesting part here; when it does need to distinguish entities, Hibernate uses an IdentitySet
to filter out the duplicated entity.
At this point, it is established that using EntityGraph should filter duplicated entities, but what happened to cause the unexpected outcome previously discussed?
In order to have a better understanding of the situation, let’s compare the results of queryLoader.list( session, queryParametersToUse )
(2) when invoking the following methods:
@EntityGraph(attributePaths = "books")
@Query("FROM Author a LEFT JOIN a.books")
List<Author> findAuthorsLeftJoinWithEntityGraph();
and
@EntityGraph(attributePaths = "books")
@Query("FROM Author a INNER JOIN a.agent WHERE a.agent.name like '%John%'")
List<Author> findAuthorsWithJoinNotFetchedAndEntityGraph();
Upon calling findAuthorsLeftJoinWithEntityGraph
, the returned results contains:
Given the EntityGraph is used, the duplication will be resolved through the IdentitySet
, and at the end, the method will return a List<Author>
consisting of 3 authors/instances.
However, upon callingfindAuthorsWithJoinNotFetchedAndEntityGraph
, the returned results contains the following:
In this scenario, the result is not a List of Author
instances but rather a List
of distinct array instances of Objects; each encapsulating duplicated Author
and Agent
instances. The reason behind the presence of duplicate entities becomes evident.
Each array (
Object[2]@13551
, Object[2]@135512
, ..., Object[2]@13555
), being a unique instance, is not filtered during the distinction process with the IdentitySet
.
Consequently, all arrays are considered as distinct instances, leading to the observed duplication.
In order to understand the difference between obtaining a List of Author
instances in the former case and a list of arrays containing Author
and Agent
instances in the latter case, we need to explore deeper; let’s take a look at the class
This initialization code has interesting parts, particularly the one highlighted in red. It creates an array of boolean includeInSelect
where it checks which entities should be selected and be returned.
Let’s take for example:
@Query("SELECT a, b FROM Author a LEFT JOIN FETCH a.books b")
The initialize method checks theSELECT
content one by one which is a
and b
; if the selected entity originates from a FETCH
in the FROM
clause, then it will be omitted since FETCH
is used to load children entities (but not in the case of the parent entity). Otherwise, the entity should be selected.
In this example, the array includeInSelect
contains [true (Author), false (Book)]
.
In our case, of entity duplication with EntityGraph:
@EntityGraph(attributePaths = "books")
@Query("FROM Author a INNER JOIN a.agent WHERE a.agent.name like 'John'")
List<Author> findAuthorsWithJoinNotFetchedAndEntityGraph();
We did not specify the SELECT
clause, so by default, it will select all the entities from all the tables specified in the FROM
clause.
The includeInSelect
logic returns an array [true (Author), false (Book), true (Agent)]
where Author
and Agent
(2 elements) should be selected because only Book
is subject to a FETCH
operation due to the use of EntityGraph.
This explains the difference between findAuthorsLeftJoinWithEntityGraph
and findAuthorsWithJoinNotFetchedAndEntityGraph
.
The former returns only one value per row because the SELECT
clause contained only Author
and Book
entities (the select of Book
was generated by EntityGraph), however, since Book
was subject to a FETCH
, it was not taken into account in the includeInSelect
.
The latter returns an array per row because Author
and Agent
were in the SELECT
clause and are not subject to a FETCH
operation unlike Book
. Thus, returning an array consisting of 2 elements per row.
In summary, Hibernate consistently returns an array, and the content of this array is determined by the entities specified in the SELECT
clause that are not associated with FETCH
operations.
However, if we follow the preceding logic, findAuthorsLeftJoinWithEntityGraph
should also return an array but an array of a single element. This is accurate; it does. However, upon examining the implementation at **
To ensure correct duplicate filtering when applying a distinction, the result set should be a list of entities rather than a list of entity arrays. This behavior is determined by the entities selected.
A straightforward solution is to include the desired entities in the SELECT
clause:
@EntityGraph(attributePaths = "books")
@Query("SELECT a FROM Author a INNER JOIN a.agent WHERE a.agent.name like '%John%'")
List<Author> findAuthorsWithJoinNotFetchedAndEntityGraph();
The SELECT
clause will only contain Author
and Book
(the Book
select being generated by the EntityGraph):
Since Book
is a FETCH
operation, the includeInSelect
will only select one element (Author
), and instead of returning an array, it will return directly the Author
instance which then will pass through the IdentitySet
to be deduplicated.
Entity duplication occurs when tables are joined, and multiple rows are matched. There are several methods to address this issue, including using Set
, the DISTINCT
keyword, or EntityGraph to deduplicate the result set.
In the last two cases, deduplication is achieved through a Hibernate flag that accumulates results into an IdentitySet
.
However, it may not work in certain scenarios due to the fact that, depending on your SELECT
clause, it could return newly instantiated arrays of your selected entities for each row. This could undermine deduplication through the IdentitySet
.
Therefore, it is advisable, when writing JPQL queries, to always specify the SELECT
clause. If not supplied, all entities in the FROM
clause that are not subject to a FETCH
operation will be selected by default.
The issue of duplicated parent entities is addressed in Hibernate 6 (shipped in Spring Boot 3), as it has introduced automatic handling of duplicated entities. However, considering that the release of Hibernate 6 is relatively recent, many applications still run on Hibernate 5 or earlier versions, and migration from Hibernate 5 to 6 might be challenging.
Thus, understanding how Hibernate 5 handles these situations remains relevant for the considerable number of applications still using it.
Also published here