Currently I'm trying to store some news from a web API with the help of JPA.
I have three entities I need to store: the webpage, the NewsPost, and the Query that returned the news post, with one table for each of the three. My simplified JPA entities look like the following:
```java
@Entity
@Data
@Table(name = "NewsPosts", schema = "data")
@EqualsAndHashCode
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class NewsPost {
    @Id
    @Column(name = "id")
    private long id;

    @Basic
    @Column(name = "subject")
    private String subject;

    @Basic
    @Column(name = "post_text")
    private String postText;

    @ManyToOne(fetch = FetchType.LAZY, cascade = CascadeType.MERGE)
    @JoinColumn(name = "newsSite")
    private NewsSite site;

    @ManyToMany(fetch = FetchType.EAGER, cascade = CascadeType.MERGE)
    @JoinTable(name = "query_news_post", joinColumns = @JoinColumn(name = "newsid"), inverseJoinColumns = @JoinColumn(name = "queryid"))
    private Set<QueryEntity> queries;
}
```
```java
@Entity
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
@Table(name = "queries", schema = "data")
@EqualsAndHashCode
public class QueryEntity {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Column(name = "id")
    private int id;

    @EqualsAndHashCode.Exclude
    @Basic
    @Column(name = "query")
    private String query;

    // needs to be excluded, otherwise we can create a stack overflow because of circular references...
    @EqualsAndHashCode.Exclude
    @ToString.Exclude
    @ManyToMany(mappedBy = "queries", fetch = FetchType.LAZY, cascade = CascadeType.MERGE)
    Set<NewsPost> posts;
}
```
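The comment about the stack overflow can be reproduced in plain Java without Lombok. This is a minimal sketch (the classes `Post` and `Query` here are hypothetical stand-ins, not the entities above) showing why the back-reference must be excluded: a generated `hashCode()` on both sides of a bidirectional relation calls into the other side and recurses forever.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical pair of classes with hashCode() implemented the way Lombok
// would generate it WITHOUT @EqualsAndHashCode.Exclude on the back-reference.
class Post {
    Set<Query> queries = new HashSet<>();
    @Override public int hashCode() { return queries.hashCode(); }
}

class Query {
    Set<Post> posts = new HashSet<>();
    @Override public int hashCode() { return posts.hashCode(); }
}

public class CircularHashCodeDemo {
    public static void main(String[] args) {
        Post post = new Post();
        Query query = new Query();
        post.queries.add(query); // fine: query's posts set is still empty here
        query.posts.add(post);   // now each side's hashCode() reaches the other's
        try {
            post.hashCode();     // post -> queries -> query -> posts -> post -> ...
            System.out.println("no overflow");
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError from circular hashCode()");
        }
    }
}
```

Running this prints the `StackOverflowError` branch, which is exactly the failure mode the `@EqualsAndHashCode.Exclude` annotation prevents.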
```java
@Entity
@Data
@Table(name = "sites", schema = "data")
@EqualsAndHashCode
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class NewsSite {
    @Id
    @Column(name = "SiteId")
    private long id;

    @Basic
    @Column(name = "SiteName")
    private String site;
}
```
Currently I'm doing the following: I create the query and retrieve the ID of the query. Then I start crawling.
I get the objects back from the web API in paginated fashion with a page size of 100 news posts, and I use an object mapper to map the JSON response to my entity classes.
Afterwards I tried two different things:

1. I added the query ID as a Set to the NewsPost and wrote it back to the DB with the EntityManager's merge option. This worked quite well until I reached the point where I got a NewsPost again for another query: the new query was then overwritten by the old one. To solve this I tried option 2.
2. I check whether the NewsPost already exists; if it does, I retrieve the post, add the new query to the existing ones, and merge it back to the database as before. This works quite well and I get the expected result for the first batches, but then the application suddenly starts to consume more and more memory from the third batch on. I attached a screenshot from Java VisualVM. Does somebody have an idea why this happens?
Edit:
As some questions were raised in the comments, I would like to provide the answers here.
I think the crawling itself works fine. The web API returns JSON. I'm using the Jackson mapper to map this to a POJO, and afterwards I'm using the Dozer mapper to convert the POJO to the entity. (Yes, I need the step to the POJO first for other purposes in the application; this part is working fine.)
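For context, the Jackson step looks roughly like this. The DTO name and JSON field names below are my assumptions for illustration, not the actual API shape:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class MappingSketch {
    // Hypothetical DTO; field names are assumptions about the API's JSON.
    public static class NewsPostDto {
        public long id;
        public String subject;
        public String postText;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String json = "{\"id\":1,\"subject\":\"hello\",\"postText\":\"world\"}";
        // Jackson binds the JSON fields to the public fields of the DTO.
        NewsPostDto dto = mapper.readValue(json, NewsPostDto.class);
        System.out.println(dto.id + " " + dto.subject);
    }
}
```

The Dozer step then copies the DTO fields onto the entity; that part is independent of the memory problem described below.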
Regarding the writing with the EntityManager, I'm not sure if I'm doing that correctly.
At first I created a JPA repo for checking whether a post already exists (to get the old query IDs and avoid the overwriting issue in the queryid/postid table). My JPA repo looks as follows:
```java
@Repository
public interface PostRepo extends JpaRepository<NewsPost, Long> {
    NewsPost getById(long id);
}
```
To update the posts I'm doing the following:

```java
private void updatePosts(List<NewsPost> posts) {
    posts.forEach(post -> {
        NewsPost foundPost = postRepo.getById(post.getId());
        if (foundPost != null) {
            post.getQueries().addAll(foundPost.getQueries());
        }
    });
}
```
I'm currently writing my entities as follows: I have a list of entities that also contains the updated posts, and I have an autowired EntityManagerFactory in the class that handles the writing.
```java
EntityManager em = entityManagerFactory.createEntityManager();
try {
    EntityTransaction transaction = em.getTransaction();
    transaction.begin();
    entities.forEach(entity -> em.merge(entity));
    em.flush();
    transaction.commit();
} finally {
    em.clear();
    em.close();
}
```
I'm pretty sure it is the writing process. If I keep the logic of my software the same but skip the merge, or just print or dump the entities to a file, everything works fast and no error appears, so it seems to be an issue with the merge call.
Regarding the question whether my program dies because of the memory consumption: it depends. If I run it on my Mac, it consumes up to 8+ gigabytes of RAM, but macOS handles this and swaps the RAM to disk. If I run it as a Docker container on CentOS, the process is killed due to too little memory.
Don't know if this is relevant, but I'm using OpenJDK 11, Spring Boot 2.2.6, and a MySQL 8 database.
I configured JPA as follows in my application.yml:

```yaml
spring:
  main:
    allow-bean-definition-overriding: true
  datasource:
    url: "jdbc:mysql://db"
    username: user
    password: secret
    driver-class-name: com.mysql.cj.jdbc.Driver
    test-while-idle: true
    validation-query: Select 1
  jpa:
    database-platform: org.hibernate.dialect.MySQL8Dialect
    hibernate:
      ddl-auto: none
    properties:
      hibernate:
        event:
          merge:
            entity_copy_observer: allow
```
3 Answers
Solved it on my own by trying things out. I created an entity for the many-to-many relation. Afterwards I created CRUD repositories for each entity and used `saveAll` from the CRUD repository. This is working fine, also memory-wise: the GC now produces the expected sawtooth pattern in the memory visualisation. But I still have no clue why the many-to-many relation I created before, with the join table in the annotation, caused the memory issues. Could somebody explain why this solves my problem? Is ManyToMany creating circular dependencies? As far as I know, the GC also finds circular dependencies.

An EAGER relation in a ManyToMany brings up many objects. With a LAZY relation, make sure to fetch the elements, because if you don't, walking the complete object to convert it to JSON or a POJO will issue a query for each object that has not been initialized with a fetch, which is dangerous. If you don't need all of them, you can use the @JsonIgnore annotation.
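A minimal sketch of what such an explicit join entity could look like. The class names `QueryNewsPost` and `QueryNewsPostId` are my assumptions for illustration; only the table and column names come from the original mapping:

```java
import javax.persistence.*;
import java.io.Serializable;
import java.util.Objects;

// Composite key for the join table: a plain Serializable value object.
@Embeddable
class QueryNewsPostId implements Serializable {
    @Column(name = "newsid")
    long newsId;
    @Column(name = "queryid")
    int queryId;

    QueryNewsPostId() {}
    QueryNewsPostId(long newsId, int queryId) {
        this.newsId = newsId;
        this.queryId = queryId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof QueryNewsPostId)) return false;
        QueryNewsPostId other = (QueryNewsPostId) o;
        return newsId == other.newsId && queryId == other.queryId;
    }

    @Override
    public int hashCode() {
        return Objects.hash(newsId, queryId);
    }
}

// Explicit entity replacing the @JoinTable mapping: a new post/query link
// can now be persisted without loading the existing relation sets at all.
@Entity
@Table(name = "query_news_post", schema = "data")
class QueryNewsPost {
    @EmbeddedId
    QueryNewsPostId id;

    QueryNewsPost() {}
    QueryNewsPost(long newsId, int queryId) {
        this.id = new QueryNewsPostId(newsId, queryId);
    }
}
```

With a `JpaRepository<QueryNewsPost, QueryNewsPostId>`, new post/query pairs can then be written in one `saveAll` call, and duplicate links simply collapse onto the same primary key.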
If the merge process is the problem, a quick fix to keep memory consumption low in the EntityManager could be to add `em.flush();` and `em.clear();` after every merge.

However, I think you should change your model. Loading all the existing queries of every post just to add new ones is very inefficient. You could model the N-M relation as a new entity and just persist the new relations.
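Applied to the merge loop from the question, that quick fix could look roughly like this. This is a sketch, not tested against the original code, and the batch size of 50 is an arbitrary assumption:

```java
EntityManager em = entityManagerFactory.createEntityManager();
EntityTransaction transaction = em.getTransaction();
try {
    transaction.begin();
    int count = 0;
    for (NewsPost entity : entities) {
        em.merge(entity);
        if (++count % 50 == 0) { // assumed batch size, tune as needed
            em.flush();  // push the pending statements to the database
            em.clear();  // detach the managed copies so the GC can reclaim them
        }
    }
    transaction.commit();
} catch (RuntimeException e) {
    if (transaction.isActive()) transaction.rollback();
    throw e;
} finally {
    em.close();
}
```

Clearing the persistence context this way keeps the first-level cache from accumulating a managed copy of every merged entity, which is one plausible source of the growing heap.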