Listen to this story
William Guo, WhaleOps CEO, Apache Software Foundation Member
Due to the need to promote the upgrade of Apache DolphinScheduler at work, preliminary research that I have done shows significant improvements from version 1.3.4 to 3.1.2, with substantial enhancements in performance and functionality. So, an upgrade is recommended.
The official upgrade documentation provides the upgrade scripts. For minor version updates, executing the script suffices, but upgrading across multiple major versions can still lead to various issues. The following is a summary of these issues.
Old Version: 1.3.4
New Version: 3.1.2
After the upgrade, using the resource center results in an error: IllegalArgumentException: Failed to specify server's Kerberos principal name
.
The resource center uses HDFS with Kerberos authentication enabled.
Solution:
Edit dolphinscheduler/api-server/conf/hdfs-site.xml
and add the following content:
<property>
<name>dfs.namenode.kerberos.principal.pattern</name>
<value>*</value>
</property>
2. Task Instance Log Loss
After the upgrade, viewing task instance logs results in an error indicating that the logs cannot be found. By checking the error message and the new version’s directory structure and log paths, it is found that the log path has changed.
Before upgrade: log path is under /logs/
After the upgrade: the log path is under /worker-server/logs/
Solution: Execute SQL to modify the log path
update t_ds_task_instance set log_path=replace(log_path,'/logs/','/worker-server/logs/');
Then copy the original log files to the new log path:
cp -r {old_dolphinscheduler_directory}/logs/[1-9]* {new_dolphinscheduler_directory}/worker-server/logs/*
3. Error Creating Workflow After Upgrade
The error message indicates that the initial values of the primary keys in t_ds_process_definition_log
and t_ds_process_definition
are inconsistent. Synchronizing these values resolves the issue.
Solution: Execute the following SQL
-- Get the auto-increment value of the primary key
select AUTO_INCREMENT FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'dolphinscheduler' AND TABLE_NAME = 't_ds_process_definition' limit 1;
-- Use the result of the above SQL to execute the following
alter table dolphinscheduler_bak1.t_ds_process_definition_log auto_increment = {max_id};
4. Empty Task Instance List After Upgrade
By checking the SQL query in dolphinscheduler-dao/src/main/resources/org/apache/dolphinscheduler/dao/mapper/TaskInstanceMapper.xml
, it is found that the SQL for querying task instance lists references the t_ds_task_definition_log
table, but the join condition define.code=instance.task_code
does not match.
Considering the condition define.project_code = #{projectCode}
, the join with t_ds_task_definition_log
is mainly for filtering by projectCode
. Modify the SQL as follows:
Solution:
select
<include refid="baseSqlV2">
<property name="alias" value="instance"/>
</include>
,
process.name as process_instance_name
from t_ds_task_instance instance
-- left join t_ds_task_definition_log define
-- on define.code = instance.task_code and
-- define.version = instance.task_definition_version
join t_ds_process_instance process
on process.id = instance.process_instance_id
join t_ds_process_definition define
on define.code = process.process_definition_code
where define.project_code = #{projectCode}
<if test="startTime != null">
and instance.start_time <![CDATA[ >= ]] > #{startTime}
</if>
By directly joining with t_ds_process_definition
, which also contains the project_code
field for filtering, the query will return the correct data.
5. Null Pointer Exception During Upgrade Script Execution
(1) Analyze the log and locate the issue at line 517 in UpgradeDao.java
513 if (TASK_TYPE_SUB_PROCESS.equals(taskType)) {
514 JsonNode jsonNodeDefinitionId = param.get("processDefinitionId");
515 if (jsonNodeDefinitionId != null) {
516 param.put("processDefinitionCode",
517 processDefinitionMap.get(jsonNodeDefinitionId.asInt()).getCode());
518 param.remove("processDefinitionId");
519 }
520 }
The issue is that processDefinitionMap.get(jsonNodeDefinitionId.asInt())
returns null. Add a null check, and if null, log the relevant information for later review.
Solution:
if (jsonNodeDefinitionId != null) {
if (processDefinitionMap.get(jsonNodeDefinitionId.asInt()) != null) {
param.put("processDefinitionCode",processDefinitionMap.get(jsonNodeDefinitionId.asInt()).getCode());
param.remove("processDefinitionId");
} else {
logger.error("*******************error");
logger.error("*******************param:" + param);
logger.error("*******************jsonNodeDefinitionId:" + jsonNodeDefinitionId);
}
}
(2) Analyze the log and locate the issue at line 675 in UpgradeDao.java
Original Code:
669 if (mapEntry.isPresent()) {
670 Map.Entry<Long, Map<String, Long>> processCodeTaskNameCodeEntry = mapEntry.get();
671 dependItem.put("definitionCode", processCodeTaskNameCodeEntry.getKey());
672 String depTasks = dependItem.get("depTasks").asText();
673 long taskCode =
674 "ALL".equals(depTasks) || processCodeTaskNameCodeEntry.getValue() == null ? 0L
675 : processCodeTaskNameCodeEntry.getValue().get(depTasks);
676 dependItem.put("depTaskCode", taskCode);
677 }
The issue is that processCodeTaskNameCodeEntry.getValue().get(depTasks)
returns null. Modify the logic to check for null before assigning the value and log the relevant information.
Solution:
long taskCode =0;
if (processCodeTaskNameCodeEntry.getValue() != null
&&processCodeTaskNameCodeEntry.getValue().get(depTasks)!=null){
taskCode =processCodeTaskNameCodeEntry.getValue().get(depTasks);
}else{
logger.error("******************** depTasks:"+depTasks);
logger.error("******************** taskCode not in "+JSONUtils.toJsonString(processCodeTaskNameCodeEntry));
}
dependItem.put("depTaskCode", taskCode);
6. Login Failure After LDAP Integration, Uncertain Email Field Name
Configure LDAP integration in api-server/conf/application.yaml
security:
authentication:
# Authentication types (supported types: PASSWORD,LDAP)
type: LDAP
# IF you set type `LDAP`, below config will be effective
ldap:
# ldap server config
urls: xxx
base-dn: xxx
username: xxx
password: xxx
user:
# admin userId when you use LDAP login
admin: xxx
identity-attribute: xxx
email-attribute: xxx
# action when ldap user is not exist (supported types: CREATE,DENY)
not-exist-action: CREATE
To successfully integrate LDAP, the fields urls
, base-dn
, username
, password
, identity
, and email
must be correctly filled. If the email field name is unknown, leave it empty initially.
After starting the service, attempt to log in with an LDAP user.
Solution: The LDAP authentication code is located in dolphinscheduler-api/src/main/java/org/apache/dolphinscheduler/api/security/impl/ldap/LdapService.java
in the ldapLogin()
method
ctx = new InitialLdapContext(searchEnv, null);
SearchControls sc = new SearchControls();
sc.setReturningAttributes(new String[]{ldapEmailAttribute});
sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
EqualsFilter filter = new EqualsFilter(ldapUserIdentifyingAttribute, userId);
NamingEnumeration<SearchResult> results = ctx.search(ldapBaseDn, filter.toString(), sc);
if (results.hasMore()) {
// get the users DN (distinguishedName) from the result
SearchResult result = results.next();
NamingEnumeration<? extends Attribute> attrs = result.getAttributes().getAll();
while (attrs.hasMore()) {
// Open another connection to the LDAP server with the found DN and the password
searchEnv.put(Context.SECURITY_PRINCIPAL, result.getNameInNamespace());
searchEnv.put(Context.SECURITY_CREDENTIALS, userPwd);
try {
new InitialDirContext(searchEnv);
} catch (Exception e) {
logger.warn("invalid ldap credentials or ldap search error", e);
return null;
}
Attribute attr = attrs.next();
if (attr.getID().equals(ldapEmailAttribute)) {
return (String) attr.get();
}
}
}
The third line filters based on the specified email attribute field. Comment out this line temporarily:
// sc.setReturningAttributes(new String[]{ldapEmailAttribute});
Rerun the code; the tenth line will return all fields:
NamingEnumeration<? extends Attribute> attrs = result.getAttributes().getAll();
By printing or debugging, identify the correct email field and update the configuration file accordingly. Then, uncomment the code and restart the service to integrate LDAP login successfully.
7. Resource File Authorization Not Effective for Regular Users by Admin
After multiple tests, it is found that regular users can only see resource files owned by themselves. Authorization by the admin does not make the files visible.
Solution:
In the listAuthorizedResource
method of ResourcePermissionCheckServiceImpl.java
file located at dolphinscheduler-api/src/main/java/org/apache/dolphinscheduler/api/permission/
, modify the return collection to relationResources
:
@Override
public Set<Integer> listAuthorizedResource(int userId, Logger logger) {
List<Resource> relationResources;
if (userId == 0) {
relationResources = new ArrayList<>();
} else {
// query resource relation
List<Integer> resIds = resourceUserMapper.queryResourcesIdListByUserIdAndPerm(userId, 0);
relationResources = CollectionUtils.isEmpty(resIds) ? new ArrayList<>() : resourceMapper.queryResourceListById(resIds);
}
List<Resource> ownResourceList = resourceMapper.queryResourceListAuthored(userId, -1);
relationResources.addAll(ownResourceList);
return relationResources.stream().map(Resource::getId).collect(toSet()); // Fixed the issue that the resource file authorization was invalid
// return ownResourceList.stream().map(Resource::getId).collect(toSet());
}
Check the new version’s change log and find that this bug has been fixed in version 3.1.3: GitHub PR #13318
8. Kerberos Expiry Issues
Kerberos is configured with a ticket expiration time, causing the HDFS resources in the resource center to become inaccessible after a period. The best solution is to add logic for periodic credential renewal.
Solution:
Add the following method in CommonUtils.java
located at dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/utils/
:
/**
* Periodically update credentials
*/
private static void startCheckKeytabTgtAndReloginJob() {
// Periodically update credentials daily
Executors.newScheduledThreadPool(1).scheduleWithFixedDelay(() -> {
try {
UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
logger.warn("Check Kerberos Tgt And Relogin From Keytab Finish.");
} catch (IOException e) {
logger.error("Check Kerberos Tgt And Relogin From Keytab Error", e);
}
}, 0, 1, TimeUnit.DAYS);
logger.info("Start Check Keytab TGT And Relogin Job Success.");
}
Then call this method in the loadKerberosConf
method before returning true:
public static boolean loadKerberosConf(String javaSecurityKrb5Conf, String loginUserKeytabUsername,
String loginUserKeytabPath, Configuration configuration) throws IOException {
if (CommonUtils.getKerberosStartupState()) {
System.setProperty(Constants.JAVA_SECURITY_KRB5_CONF, StringUtils.defaultIfBlank(javaSecurityKrb5Conf,
PropertyUtils.getString(Constants.JAVA_SECURITY_KRB5_CONF_PATH)));
configuration.set(Constants.HADOOP_SECURITY_AUTHENTICATION, Constants.KERBEROS);
UserGroupInformation.setConfiguration(configuration);
UserGroupInformation.loginUserFromKeytab(
StringUtils.defaultIfBlank(loginUserKeytabUsername,
PropertyUtils.getString(Constants.LOGIN_USER_KEY_TAB_USERNAME)),
StringUtils.defaultIfBlank(loginUserKeytabPath,
PropertyUtils.getString(Constants.LOGIN_USER_KEY_TAB_PATH)));
startCheckKeytabTgtAndReloginJob(); // Call here
return true;
}
return false;
}
This article primarily records the issues encountered during the upgrade process and aims to help the community that facing similar challenges.