Associate Professor Hongyu Zhang

Honorary Associate Professor

School of Information and Physical Sciences (Computer Science and Software Engineering)

Intelligent software engineering

By mining a vast amount of software data, Associate Professor Hongyu Zhang is developing intelligent methods and tools that improve software quality and development productivity.

“Currently, software development is largely a manual, time-consuming, and error-prone process,” says Hongyu. “We can improve such a process by learning from software that was written before.”

“Over the years, a large number of software systems have been developed. These software systems are associated with a variety of data such as source code, bugs, logs, incident reports, metric data, etc. The availability of vast amounts of software data opens the opportunity for us to improve software quality and productivity.”

Together with his collaborators and students, Hongyu has proposed many methods based on data mining, machine learning (including deep learning), and information retrieval to extract knowledge from software data and solve software engineering problems. Some of this work is described below.

Intelligent programming

To help programmers write code, Hongyu has proposed many innovative methods that learn from large amounts of source code to support effective code search, code summarization, code generation, and code pattern mining. For example, he proposed one of the first deep learning based methods for source code search and API recommendation (FSE’16, ICSE’18), which can help programmers write new programs by searching and reusing existing code. He also proposed neural programming by example (AAAI’17), which targets the challenging problem of automatically generating a program from input/output examples using a deep neural network.

Intelligent quality prediction

The quality of software is important. Hongyu has proposed many machine learning based methods for predicting defect-prone software modules. He has also worked on cloud failure prediction, which predicts future failures of a computing node or a hard disk in a large-scale cloud system based on historical system metric and failure data (FSE’18). He also proposed DeepPerf (ICSE’19), which uses a deep feedforward neural network to predict the runtime performance of a highly configurable software system. It was the first successful application of a deep neural network to software performance prediction.

Intelligent fault detection and diagnosis

Software systems always contain faults (bugs). Hongyu proposed many innovative methods for log-based fault detection, crash-based fault localization, bug report analytics, and incident management. For example, he proposed BugLocator (ICSE’12), which automatically locates buggy source code files based on a bug report. He also works on data-driven methods for compiler testing, with the aim of improving the efficiency of compiler testing.

Making real impact

Apart from scholarly publications, Hongyu is also keen to see research make an impact on practice. While working at Microsoft, he worked closely with the Bing and Visual Studio teams on the Bing Developer Assistant (BDA) project. BDA is a Visual Studio extension that allows developers to search for reusable code snippets based on queries. The BDA tool received more than 450K downloads in 2016. Hongyu has also collaborated with Microsoft Research teams and published many innovative techniques, which were successfully deployed to real-world online service systems at Microsoft.

An independent 2019 Elsevier Bibliometric Assessment of Software Engineering Scholars ranks Hongyu among the world’s top 20 most prolific software engineering researchers of the past decade. He was also recognised in The Australian’s Top Researchers special edition (09/2020) as the leading researcher in the field of Software Systems.

Career Summary

Biography

My research is in the area of software engineering, in particular intelligent software engineering, software analytics, fault diagnosis, maintenance, and reuse. The main theme of my research is to improve software quality and productivity by mining and analyzing software data. I have published more than 200 research papers in international journals and conferences, including TSE, TOSEM, ICSE, FSE, ASE, ISSTA, POPL, AAAI, IJCAI, KDD, ICSME, ICDM, and USENIX ATC. I have received more than 8 ACM Distinguished Paper and Best Paper awards. I have also served as a program committee member/track chair for many software engineering conferences. I am an associate editor of ACM Computing Surveys and Automated Software Engineering. I am a Senior Member of IEEE, a Distinguished Member of ACM, a Distinguished Member of CCF, and a Fellow of Engineers Australia (FIEAust).

More information about me can be found at my personal webpage. I can always be reached at hongyujohn@gmail.com.

Research Area:

My research area is software engineering, in particular:

  • software analytics, mining software repositories, data-driven software engineering
  • intelligent software and service engineering
  • software testing, debugging, fault diagnosis
  • software maintenance and reuse

The main theme of my research is to improve software quality and productivity by utilizing knowledge mined from software data. Over the years, a software organization could accumulate a large amount of data including source code, bug reports, execution logs, changes, metrics, documents, and so on. Data mining, machine learning, and information retrieval techniques can be applied to extract knowledge from the software data and solve software engineering problems.

Recent Program Organizations:

  • Technical Briefings co-chair: The 45th International Conference on Software Engineering (ICSE 2023)
  • General co-chair: The 36th International Conference on Software Maintenance and Evolution (ICSME 2020)
  • Tool Demonstration co-chair: The International Symposium on Software Testing and Analysis (ISSTA 2019)
  • Program co-chair: The 18th IEEE International Conference on Software Quality, Reliability, and Security (QRS 2018)
  • Program co-chair: The 25th Asia-Pacific Software Engineering Conference (APSEC 2018)
  • Co-organizer: Dagstuhl Seminar 17502 on "Testing and Verification of Compilers", Dec 2017, Germany
  • Program co-chair: The 12th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE’16)
  • Steering Committee member: The International Conference on Predictive Models in Software Engineering (PROMISE), 2014-2018

Recent Program Committees:

  • The IEEE/ACM International Conference on Automated Software Engineering (ASE 2015, ASE 2018-2023)
  • The IEEE International Conference on Software Maintenance and Evolution (ICSME 2013-2019)
  • The Working Conference on Mining Software Repositories (MSR 2013-2017, MSR 2020-2022)
  • The International Conference on Software Engineering: ICSE 2021, ICSE 2023 (Technical Program Committee)
  • The International Conference on Software Analysis, Evolution and Reengineering (SANER 2015-2017, SANER 2018/2020/2021 (industry), SANER 2023)
  • The International Conference on Machine Learning (ICML 2020, ICML 2021, ICML 2022)
  • The International Conference on Learning Representations (ICLR 2020, ICLR 2021, ICLR 2022)
  • The AAAI Conference on Artificial Intelligence (AAAI 2021, AAAI 2022, AAAI 2023)
  • ACM SIGSOFT Symposium on the Foundations of Software Engineering: FSE 2022 (Technical Program Committee)
  • The International Symposium on Empirical Software Engineering and Measurement (ESEM 2016-2022)

Qualifications

  • Doctor of Philosophy, National University of Singapore

Keywords

  • Artificial Intelligence
  • Data Mining
  • Software Engineering

Languages

  • Mandarin (Mother)
  • English (Fluent)

Fields of Research

Code    Description                              Percentage
461201  Automated software engineering           70
461207  Software quality, processes and metrics  30

Professional Experience

UON Appointment

Title                Organisation / Department
Associate Professor  University of Newcastle, School of Electrical Engineering and Computing, Australia

Publications

For publications that are currently unpublished or in-press, details are shown in italics.


Chapter (2 outputs)

Year Citation Altmetrics Link
2016 Hou Z, Zhang H, Zhang H, Zhang D, 'Visual analytics for software engineering data', Perspectives on Data Science for Software Engineering 77-80 (2016)

Many data analysis techniques require substantial knowledge and skills and are typically performed by “data scientists”. Ordinary users may find it difficult to apply these techniques to quickly explore the data by themselves. We propose MetroEyes, a visual analytics tool for interactive data exploration. We have successfully transferred the main concepts and experiences of MetroEyes to Microsoft Power BI.

DOI 10.1016/B978-0-12-804206-9.00015-5
Citations Scopus - 2
2016 Lin Q, Lou JG, Zhang H, Zhang D, 'How to tame your online services', Perspectives on Data Science for Software Engineering 63-65 (2016)

Online service systems have become increasingly popular and important. Service incidents can lead to huge economic loss. We designed a set of incident management techniques based on the analysis of a huge amount of data collected at service runtime. Our tool is called Service Analysis Studio (SAS), which has been successfully applied to large-scale online services provided by Microsoft.

DOI 10.1016/B978-0-12-804206-9.00012-X
Citations Scopus - 2

Journal article (49 outputs)

Year Citation Altmetrics Link
2024 Wu D, Feng Y, Zhang H, Xu B, 'Automatic recognizing relevant fragments of APIs using API references', Automated Software Engineering, 31 (2024) [C1]

API tutorials are crucial resources as they often provide detailed explanations of how to utilize APIs. Typically, an API tutorial is segmented into a number of consecutive fragments. If a fragment explains API usage, we regard it as a relevant fragment of the API. Recognizing relevant fragments can aid developers in comprehending, learning, and using APIs. Recently, some studies have presented relevant fragments recognition approaches, which mainly focused on using API tutorials or Stack Overflow to train the recognition model. API references are also important API learning resources as they contain abundant API knowledge. Considering the similarity between API tutorials and API references (both provide API knowledge), we believe that using API knowledge from API references could help recognize relevant tutorial fragments of APIs effectively. However, it is non-trivial to leverage API references to build a supervised learning-based recognition model. Two major problems are the lack of labeled API references and the unavailability of engineered features of API references. We propose a supervised learning based approach named RRTR (which stands for Recognize Relevant Tutorial fragments using API References) to address the above problems. For the problem of lacking labeled API references, RRTR designs heuristic rules to automatically collect relevant and irrelevant API references for APIs. Regarding the unavailable engineered features issue, we adopt the pre-trained SBERT model (SBERT stands for Sentence-BERT) to automatically learn semantic features for API references. More specifically, we first automatically generate labeled <API, ARE> pairs (ARE stands for an API reference) via our heuristic rules of API references. We then use SBERT to automatically learn semantic features for the collected pairs and train a supervised learning based recognition model. Finally, we can recognize the relevant tutorial fragments of APIs based on the trained model.
To evaluate the effectiveness of RRTR, we collected Java and Android API reference datasets containing a total of 20,680 labeled <API, ARE> pairs. Experimental results demonstrate that RRTR outperforms state-of-the-art approaches in terms of F-Measure on two datasets. In addition, we conducted a user study to investigate the practicality of RRTR and the results further illustrate the effectiveness of RRTR in practice. The proposed RRTR approach can effectively recognize relevant fragments of APIs with API references by solving the problems of lacking labeled API references and engineered features of API references.

DOI 10.1007/s10515-023-00401-0
2023 Alharbi F, Luo S, Zhang H, Shaukat K, Yang G, Wheeler CA, Chen Z, 'A Brief Review of Acoustic and Vibration Signal-Based Fault Detection for Belt Conveyor Idlers Using Machine Learning Models', Sensors, 23 (2023) [C1]
DOI 10.3390/s23041902
Citations Scopus - 16, Web of Science - 6
Co-authors Zhiyong Chen, Craig Wheeler, Suhuai Luo
2023 Li Z, Zhang H, Jing XY, Xie J, Guo M, Ren J, 'DSSDPP: Data Selection and Sampling Based Domain Programming Predictor for Cross-Project Defect Prediction', IEEE Transactions on Software Engineering, 49 1941-1963 (2023) [C1]

Cross-project defect prediction (CPDP) refers to recognizing defective software modules in one project (i.e., target) using historical data collected from other projects (i.e., source), which can help developers find defects and prioritize their testing efforts. Unfortunately, there often exists large distribution difference between the source and target data. Most CPDP methods neglect to select the appropriate source data for a given target at the project level. More importantly, existing CPDP models are parametric methods, which usually require intensive parameter selection and tuning to achieve better prediction performance. This would hinder wide applicability of CPDP in practice. Moreover, most CPDP methods do not address the cross-project class imbalance problem. These limitations lead to suboptimal CPDP results. In this paper, we propose a novel data selection and sampling based domain programming predictor (DSSDPP) for CPDP, which addresses the above limitations. DSSDPP is a non-parametric CPDP method, which can perform knowledge transfer across projects without the need for parameter selection and tuning. By exploiting the structures of source and target data, DSSDPP can learn a discriminative transfer classifier for identifying defects of the target project. Extensive experiments on 22 projects from four datasets indicate that DSSDPP achieves better MCC and AUC results against a range of competing methods both in the single-source and multi-source scenarios. Since DSSDPP is easy, effective, extensible, and efficient, we suggest that future work can use it with the well-chosen source data to conduct CPDP especially for the projects with limited computational budget.

DOI 10.1109/TSE.2022.3204589
Citations Scopus - 2
2023 Zhang B, Zhang H, Le VH, Moscato P, Zhang A, 'Semi-supervised and unsupervised anomaly detection by mining numerical workflow relations from system logs', Automated Software Engineering, 30 (2023) [C1]

Large-scale software-intensive systems often generate logs for troubleshooting purpose. The system logs are semi-structured text messages that record the internal status of a system at runtime. In this paper, we propose ADR (Anomaly Detection by workflow Relations), which can mine numerical relations from logs and then utilize the discovered relations to detect system anomalies. Firstly the raw log entries are parsed into sequences of log events and transformed to an extended event-count-matrix. The relations among the matrix columns represent the relations among the system events in workflows. Next, ADR evaluates the matrix’s nullspace that corresponds to the linearly dependent relations of the columns. Anomalies can be detected by evaluating whether or not the logs violate the mined relations. We design two types of ADR: sADR (for semi-supervised learning) and uADR (for unsupervised learning). We have evaluated them on four public log datasets. The experimental results show that ADR can extract the workflow relations from log data, and is effective for log-based anomaly detection in both semi-supervised and unsupervised manners.

DOI 10.1007/s10515-022-00370-w
Citations Scopus - 4
Co-authors Pablo Moscato
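
The nullspace idea described in the ADR abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain event-count matrix (one row per normal log session, one column per event type) rather than the paper's extended matrix, it uses exact Gaussian elimination in place of a numerical solver, and the function names and sample counts are hypothetical.

```python
from fractions import Fraction


def nullspace(rows):
    """Return a basis of the nullspace of a matrix given as a list of rows.

    Uses exact Gauss-Jordan elimination over fractions, so no numerical
    tolerance is needed for this toy example.
    """
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0])
    pivots = []  # pivot column of each reduced row, in row order
    r = 0
    for c in range(n_cols):
        # Find a row at or below r with a non-zero entry in column c.
        p = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if p is None:
            continue  # column c is a free column
        m[r], m[p] = m[p], m[r]
        pivot_value = m[r][c]
        m[r] = [v / pivot_value for v in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    free_cols = [c for c in range(n_cols) if c not in pivots]
    basis = []
    for fc in free_cols:
        # Set one free variable to 1 and solve for the pivot variables.
        vec = [Fraction(0)] * n_cols
        vec[fc] = Fraction(1)
        for row_idx, pc in enumerate(pivots):
            vec[pc] = -m[row_idx][fc]
        basis.append(vec)
    return basis


def violates_relations(counts, basis):
    """True if the event-count vector breaks any mined linear relation."""
    return any(sum(c * b for c, b in zip(counts, vec)) != 0 for vec in basis)


# Hypothetical event counts per normal session: [open, close, read].
# In every normal session the number of opens equals the number of closes.
train = [
    [1, 1, 3],
    [2, 2, 1],
    [1, 1, 0],
    [3, 3, 5],
]
basis = nullspace(train)  # encodes the relation: close - open = 0
```

With this toy data, a new session with counts [2, 2, 7] satisfies the mined relation, while [2, 1, 4] (an open without a matching close) violates it and would be flagged. Real log data is noisy, so a practical detector would need a tolerance rather than an exact zero test.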
2023 Wang W, Chen J, Yang L, Zhang H, Wang Z, 'Understanding and predicting incident mitigation time', Information and Software Technology, 155 (2023) [C1]
DOI 10.1016/j.infsof.2022.107119
Citations Scopus - 2
2023 Wu D, Jing X-Y, Zhang H, Zhou Y, Xu B, 'Leveraging Stack Overflow to detect relevant tutorial fragments of APIs', Empirical Software Engineering, 28 (2023) [C1]
DOI 10.1007/s10664-022-10235-1
Citations Scopus - 4
2023 Wu D, Jing X-Y, Zhang H, Feng Y, Chen H, Zhou Y, Xu B, 'Retrieving API Knowledge from Tutorials and Stack Overflow Based on Natural Language Queries', ACM Transactions on Software Engineering and Methodology, 32 (2023) [C1]
DOI 10.1145/3565799
2023 Wang C, Yang Y, Gao C, Peng Y, Zhang H, Lyu MR, 'Prompt Tuning in Code Intelligence: An Experimental Evaluation', IEEE Transactions on Software Engineering, 49 4869-4885 (2023) [C1]

Pre-trained models have been shown effective in many code intelligence tasks, such as automatic code summarization and defect prediction. These models are pre-trained on large-scale unlabeled corpus and then fine-tuned in downstream tasks. However, as the inputs to pre-training and downstream tasks are in different forms, it is hard to fully explore the knowledge of pre-trained models. Besides, the performance of fine-tuning strongly relies on the amount of downstream task data, while in practice, the data scarcity scenarios are common. Recent studies in the natural language processing (NLP) field show that prompt tuning, a new paradigm for tuning, alleviates the above issues and achieves promising results in various NLP tasks. In prompt tuning, the prompts inserted during tuning provide task-specific knowledge, which is especially beneficial for tasks with relatively scarce data. In this article, we empirically evaluate the usage and effect of prompt tuning in code intelligence tasks. We conduct prompt tuning on popular pre-trained models CodeBERT and CodeT5 and experiment with four code intelligence tasks including defect prediction, code search, code summarization, and code translation. Our experimental results show that prompt tuning consistently outperforms fine-tuning in all four tasks. In addition, prompt tuning shows great potential in low-resource scenarios, e.g., improving the BLEU scores of fine-tuning by more than 26% on average for code summarization. Our results suggest that instead of fine-tuning, we could adapt prompt tuning for code intelligence tasks to achieve better performance, especially when lacking task-specific data. We also discuss the implications for adapting prompt tuning in code intelligence tasks.

DOI 10.1109/TSE.2023.3313881
2023 Gao Y, Zhang H, Lyu C, 'EnCoSum: enhanced semantic features for multi-scale multi-modal source code summarization', Empirical Software Engineering, 28 (2023) [C1]

Code summarization aims to generate concise natural language descriptions for a piece of code, which can help developers comprehend the source code. Analysis of current work shows that the extraction of syntactic and semantic features of source code is crucial for generating high-quality summaries. To provide a more comprehensive feature representation of source code from different perspectives, we propose an approach named EnCoSum, which enhances semantic features for the multi-scale multi-modal code summarization method. This method complements our previously proposed M2TS approach (multi-scale multi-modal approach based on Transformer for source code summarization), which uses the multi-scale method to capture Abstract Syntax Trees (ASTs) structural information more completely and accurately at multiple local and global levels. In addition, we devise a new cross-modal fusion method to fuse source code and AST features, which can highlight key features in each modality that help generate summaries. To obtain richer semantic information, we improve M2TS. First, we add data flow and control flow to ASTs, and added-edge ASTs, called Enhanced-ASTs (E-ASTs). In addition, we introduce method name sequences extracted in the source code, which exist more knowledge about critical tokens in the corresponding summaries and can help the model generate higher-quality summaries. We conduct extensive experiments on processed Java and Python datasets and evaluate our approach via the four most commonly used machine translation metrics. The experimental results demonstrate that EnCoSum is effective and outperforms current state-of-the-art methods. Further, we perform ablation experiments on each of the model’s key components, and the results show that they all contribute to the performance of EnCoSum.

DOI 10.1007/s10664-023-10384-x
2023 Zhang W, Guo S, Zhang H, Sui Y, Xue Y, Xu Y, 'Challenging Machine Learning-Based Clone Detectors via Semantic-Preserving Code Transformations', IEEE Transactions on Software Engineering, 49 3052-3070 (2023) [C1]

Software clone detection identifies similar or identical code snippets. It has been an active research topic that attracts extensive attention over the last two decades. In recent years, machine learning (ML) based detectors, especially deep learning-based ones, have demonstrated impressive capability on clone detection. It seems that this longstanding problem has already been tamed owing to the advances in ML techniques. In this work, we would like to challenge the robustness of the recent ML-based clone detectors through code semantic-preserving transformations. We first utilize fifteen simple code transformation operators combined with commonly-used heuristics (i.e., Random Search, Genetic Algorithm, and Markov Chain Monte Carlo) to perform equivalent program transformation. Furthermore, we propose a deep reinforcement learning-based sequence generation (DRLSG) strategy to effectively guide the search process of generating clones that could escape from the detection. We then evaluate the ML-based detectors with the pairs of original and generated clones. We realize our method in a framework named CloneGen (stands for Clone Generator). In evaluation, we challenge the three state-of-the-art ML-based detectors and four traditional detectors with the code clones after semantic-preserving transformations via the aid of CloneGen. Surprisingly, our experiments show that, despite the notable successes achieved by existing clone detectors, the ML models inside these detectors still cannot distinguish numerous clones produced by the code transformations in CloneGen. In addition, adversarial training of ML-based clone detectors using clones generated by CloneGen can improve their robustness and accuracy. Meanwhile, compared with the commonly-used heuristics, the DRLSG strategy has shown the best effectiveness in generating code clones to decrease the detection accuracy of the ML-based detectors.
Our investigation reveals an explicable but always ignored robustness issue of the latest ML-based detectors. Therefore, we call for more attention to the robustness of these new ML-based detectors.

DOI 10.1109/TSE.2023.3240118
Citations Scopus - 4
2023 Shi E, Wang Y, Du L, Zhang H, Han S, Zhang D, Sun H, 'CoCoAST: Representing Source Code via Hierarchical Splitting and Reconstruction of Abstract Syntax Trees', Empirical Software Engineering, 28 (2023) [C1]
DOI 10.1007/s10664-023-10378-9
Citations Scopus - 1
2022 Tao W, Wang Y, Shi E, Du L, Han S, Zhang H, et al., 'A large-scale empirical study of commit message generation: models, datasets and evaluation', Empirical Software Engineering, 27 (2022) [C1]
DOI 10.1007/s10664-022-10219-1
Citations Scopus - 4
2022 Qi B, Sun H, Yuan W, Zhang H, Meng X, 'DreamLoc: A Deep Relevance Matching-Based Framework for Bug Localization', IEEE Transactions on Reliability, 71 235-249 (2022) [C1]

To improve the software debugging efficiency, bug localization techniques have been developed to automatically locate buggy files based on bug reports. Traditional information retrieval-based bug localization cannot deal with the lexical mismatch, thus its performance is limited. In recent years, some deep learning models have been proposed to learn the semantics of bug reports and source files to bridge the lexical gap. However, their accuracy is still limited as building accurate semantic representations of bug reports and source files is very challenging. Recently, relevance matching was proposed to identify whether a document is relevant to a given query by considering both local matching and global matching. In this work, we propose a novel framework DreamLoc, which utilizes a relevance matching model to locate buggy files. Specifically, DreamLoc conducts the local matching by employing an attention-based mechanism to calculate the matching scores between bug report terms and code snippets. It also conducts the global matching by employing a gating mechanism to aggregate results of local matching and obtain the final matching score between a bug report and a source file. Since the local matching considers the relevance between each word and the global matching differentiates the importance of words, DreamLoc can effectively model the characteristics of bug reports and source files. Experimental results on five benchmark datasets show that DreamLoc outperforms five state-of-the-art models. For example, compared with DeepLoc, a recently proposed approach, the evaluation measures Accuracy@10, MAP, and MRR are improved by 6.4%, 7.4%, and 7.2%, respectively.

DOI 10.1109/TR.2021.3104728
Citations Scopus - 8, Web of Science - 4
2022 Dinh CT, Vu TT, Tran NH, Dao MN, Zhang H, 'A New Look and Convergence Rate of Federated Multitask Learning With Laplacian Regularization', IEEE Transactions on Neural Networks and Learning Systems, [C1]
DOI 10.1109/TNNLS.2022.3224252
Citations Scopus - 4, Web of Science - 6
2021 Chen J, Wang G, Hao D, Xiong Y, Zhang H, Zhang L, Xie B, 'Coverage Prediction for Accelerating Compiler Testing', IEEE Transactions on Software Engineering, 47 261-278 (2021) [C1]
DOI 10.1109/TSE.2018.2889771
Citations Scopus - 18, Web of Science - 8
2021 Gu W, Li Z, Gao C, Wang C, Zhang H, Xu Z, Lyu MR, 'CRaDLe: Deep code retrieval based on semantic Dependency Learning', Neural Networks, 141 385-394 (2021) [C1]

Code retrieval is a common practice for programmers to reuse existing code snippets in the open-source repositories. Given a user query (i.e., a natural language description), code retrieval aims at searching the most relevant ones from a set of code snippets. The main challenge of effective code retrieval lies in mitigating the semantic gap between natural language descriptions and code snippets. With the ever-increasing amount of available open-source code, recent studies resort to neural networks to learn the semantic matching relationships between the two sources. The statement-level dependency information, which highlights the dependency relations among the program statements during the execution, reflects the structural importance of one statement in the code, which is favorable for accurately capturing the code semantics but has never been explored for the code retrieval task. In this paper, we propose CRaDLe, a novel approach for Code Retrieval based on statement-level semantic Dependency Learning. Specifically, CRaDLe distills code representations through fusing both the dependency and semantic information at the statement level, and then learns a unified vector representation for each code and description pair for modeling the matching relationship. Comprehensive experiments and analysis on real-world datasets show that the proposed approach can accurately retrieve code snippets for a given query and significantly outperform the state-of-the-art approaches on the task.

DOI 10.1016/j.neunet.2021.04.019
Citations Scopus - 24, Web of Science - 13
2021 Lyu C, Wang R, Zhang H, Zhang H, Hu S, 'Embedding API dependency graph for neural code generation', Empirical Software Engineering, 26 (2021) [C1]

The problem of code generation from textual program descriptions has long been viewed as a grand challenge in software engineering. In recent years, many deep learning based approaches have been proposed, which can generate a sequence of code from a sequence of textual program description. However, the existing approaches ignore the global relationships among API methods, which are important for understanding the usage of APIs. In this paper, we propose to model the dependencies among API methods as an API dependency graph (ADG) and incorporate the graph embedding into a sequence-to-sequence (Seq2Seq) model. In addition to the existing encoder-decoder structure, a new module named “embedder” is introduced. In this way, the decoder can utilize both global structural dependencies and textual program description to predict the target code. We conduct extensive code generation experiments on three public datasets and in two programming languages (Python and Java). Our proposed approach, called ADG-Seq2Seq, yields significant improvements over existing state-of-the-art methods and maintains its performance as the length of the target code increases. Extensive ablation tests show that the proposed ADG embedding is effective and outperforms the baselines.

DOI 10.1007/s10664-021-09968-2
Citations Scopus - 10, Web of Science - 1
2021 Wu D, Jing XY, Zhang H, Li B, Xie Y, Xu B, 'Generating API tags for tutorial fragments from Stack Overflow', Empirical Software Engineering, 26 (2021) [C1]

API tutorials are important learning resources as they explain how to use certain APIs in a given programming context. An API tutorial can be split into a number of units. Consecutive units that describe the same topic are often called a tutorial fragment. We consider the API explained by a tutorial fragment as an API tag. Generating API tags for a tutorial fragment can help understand, navigate, and retrieve the fragment. Existing approaches often do not perform well on API tag generation due to high manual effort and low accuracy. Like API tutorials, Stack Overflow (SO) is also an important learning resource that provides the explanations of APIs. Thus, SO posts also contain API tags. Besides, API tags of SO posts are abundant and can be extracted easily. In this paper, we propose a novel approach ATTACK (stands for API Tag for Tutorial frAgments using Crowd Knowledge), which can automatically generate API tags for tutorial fragments from SO posts. ATTACK first constructs <Q&A pair, tagset> pairs by extracting API tags of SO posts. Then, it trains a deep neural network with the attention mechanism to learn the semantic relatedness between Q&A pairs and the associated API tags, taking into consideration both textual descriptions and code in a Q&A pair. Finally, the trained model is used to generate API tags for tutorial fragments. We evaluate ATTACK on public Java and Android datasets containing 43,132 <Q&A pair, tagset> pairs. Experimental results show that ATTACK is effective and outperforms the state-of-the-art approaches in terms of F-Measure. Our user study further confirms the effectiveness of ATTACK in generating API tags for tutorial fragments. We also apply ATTACK to document linking and the results confirm the usefulness of API tags generated by ATTACK.

DOI 10.1007/s10664-021-09962-8
Citations Scopus - 9Web of Science - 4
2020 Chen J, Patra J, Pradel M, Xiong Y, Zhang H, Hao D, Zhang L, 'A survey of compiler testing', ACM Computing Surveys, 53 (2020) [C1]
DOI 10.1145/3363562
Citations Scopus - 104Web of Science - 60
2020 Wu D, Jing X-Y, Zhang H, Kong X, Xie Y, Huang Z, 'Data-driven approach to application programming interface documentation mining: A review', WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY, 10 (2020) [C1]
DOI 10.1002/widm.1369
Citations Scopus - 9Web of Science - 5
2020 Zhang Z, Sun H, Zhang H, 'Developer recommendation for Topcoder through a meta-learning based policy model', Empirical Software Engineering, 25 859-889 (2020) [C1]
DOI 10.1007/s10664-019-09755-0
Citations Scopus - 16Web of Science - 9
2019 Li Z, Jing X-Y, Zhu X, Zhang H, Xu B, Ying S, 'Heterogeneous defect prediction with two-stage ensemble learning', AUTOMATED SOFTWARE ENGINEERING, 26 599-651 (2019) [C1]
DOI 10.1007/s10515-019-00259-1
Citations Scopus - 45Web of Science - 32
2019 Mirjalili SZ, Mirjalili S, Zhang H, Chalup S, Noman N, 'Improving the reliability of implicit averaging methods using new conditional operators for robust optimization', Swarm and Evolutionary Computation, 51 (2019) [C1]
DOI 10.1016/j.swevo.2019.100579
Citations Scopus - 4Web of Science - 4
Co-authors Nasimul Noman, Stephan Chalup
2019 Chen J, Hu W, Hao D, Xiong Y, Zhang H, Zhang L, 'Static duplicate bug-report identification for compilers', SCIENTIA SINICA Informationis, 49 1283-1298 (2019) [C1]
DOI 10.1360/N112019-00001
2019 Zhiqiang L, Xiao-Yuan J, Xiaoke Z, Zhang H, Baowen X, Shi Y, 'On the Multiple Sources and Privacy Preservation Issues for Heterogeneous Defect Prediction', IEEE Transactions on Software Engineering, 45 391-411 (2019) [C1]
DOI 10.1109/TSE.2017.2780222
Citations Scopus - 59Web of Science - 62
2019 Gu Y, Xuan J, Zhang H, Zhang L, Fan Q, Xie X, Qian T, 'Does the fault reside in a stack trace? Assisting crash localization by predicting crashing fault residence', Journal of Systems and Software, 148 88-104 (2019) [C1]
DOI 10.1016/j.jss.2018.11.004
Citations Scopus - 33Web of Science - 21
2018 Zhang H, Miranskyy A, Bener AB, 'Editorial: Special Section on Best Papers of PROMISE 2016', INFORMATION AND SOFTWARE TECHNOLOGY, 95 295-295 (2018)
DOI 10.1016/j.infsof.2017.12.014
Citations Web of Science - 4
2018 Wu R, Wen M, Cheung SC, Zhang H, 'ChangeLocator: locate crash-inducing changes based on crash reports', EMPIRICAL SOFTWARE ENGINEERING, 23 2866-2900 (2018) [C1]
DOI 10.1007/s10664-017-9567-4
Citations Scopus - 37Web of Science - 29
2017 Xuan J, Jiang H, Zhang H, Ren Z, 'Developer recommendation on bug commenting: a ranking approach for the developer crowd', Science China-Information Sciences, 60 072105-1-072105-18 (2017) [C1]
DOI 10.1007/s11432-015-0582-8
Citations Scopus - 21Web of Science - 13
2016 Xia X, Gong L, Le TDB, Lo D, Jiang L, Zhang H, 'Diversity maximization speedup for localizing faults in single-fault and multi-fault programs', Automated Software Engineering, 23 43-75 (2016) [C1]
DOI 10.1007/s10515-014-0165-z
Citations Scopus - 21Web of Science - 16
2015 Li M, Zhang H, Lo D, Lucia, 'Improving Software Quality and Productivity Leveraging Mining Techniques', ACM SIGSOFT Software Engineering Notes, 40 1-2 (2015)
DOI 10.1145/2693208.2693219
2014 Gong L, Zhang H, Seo H, Kim S, 'Locating Crashing Faults based on Crash Stack Traces.', CoRR, abs/1404.4100 (2014)
2013 Peters F, Menzies T, Gong L, Zhang H, 'Balancing Privacy and Utility in Cross-Company Defect Prediction', IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 39 1054-1068 (2013) [C1]
DOI 10.1109/TSE.2013.6
Citations Scopus - 118Web of Science - 79
2013 Concas G, Lunesu MI, Marchesi M, Zhang H, 'Simulation of software maintenance process, with and without a work-in-process limit', JOURNAL OF SOFTWARE-EVOLUTION AND PROCESS, 25 1225-1248 (2013)
DOI 10.1002/smr.1599
Citations Scopus - 21Web of Science - 15
2012 Li M, Zhang H, Wu R, Zhou Z-H, 'Sample-based software defect prediction with active and semi-supervised learning', AUTOMATED SOFTWARE ENGINEERING, 19 201-230 (2012) [C1]
DOI 10.1007/s10515-011-0092-1
Citations Scopus - 164Web of Science - 120
2011 Zhang H, Tan HBK, Zhang L, Lin X, Wang X, Zhang C, Mei H, 'Checking enforcement of integrity constraints in database applications based on code patterns', JOURNAL OF SYSTEMS AND SOFTWARE, 84 2253-2264 (2011) [C1]
DOI 10.1016/j.jss.2011.06.044
Citations Scopus - 20Web of Science - 11
2010 Canfora G, Concas G, Marchesi M, Tempero E, Zhang H, '2010 ICSE workshop on emerging trends in software metrics', ACM SIGSOFT Software Engineering Notes, 35 51-53 (2010)
DOI 10.1145/1838687.1838700
2010 Zhang H, Li Y-F, Tan HBK, 'Measuring design complexity of semantic web ontologies', JOURNAL OF SYSTEMS AND SOFTWARE, 83 803-814 (2010)
DOI 10.1016/j.jss.2009.11.735
Citations Scopus - 98Web of Science - 67
2010 Zhang H, Kim S, 'Monitoring Software Quality Evolution for Defects', IEEE SOFTWARE, 27 58-64 (2010)
DOI 10.1109/MS.2010.66
Citations Scopus - 27Web of Science - 18
2010 Concas G, Cantone G, Tempero E, Zhang H, 'New Generation of Software Metrics', Advances in Software Engineering, 2010 1-2 (2010)
DOI 10.1155/2010/913892
2009 Zhang H, 'Discovering power laws in computer programs', INFORMATION PROCESSING & MANAGEMENT, 45 477-483 (2009)
DOI 10.1016/j.ipm.2009.02.001
Citations Scopus - 15Web of Science - 10
2009 Tan HBK, Zhao Y, Zhang H, 'Conceptual Data Model-Based Software Size Estimation for Information Systems', ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY, 19 (2009)
DOI 10.1145/1571629.1571630
Citations Scopus - 26Web of Science - 24
2009 Zhang H, Tan HBK, Marchesi M, 'The Distribution of Program Sizes and Its Implications: An Eclipse Case Study', CoRR, abs/0905.2288 (2009)
2008 Zhang H, 'On the distribution of software faults', IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 34 301-302 (2008)
DOI 10.1109/TSE.2007.70771
Citations Scopus - 61Web of Science - 39
2007 Zhang H, Zhang X, 'Comments on "data mining static code attributes to learn defect predictors"', IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 33 635-636 (2007)
DOI 10.1109/TSE.2007.70706
Citations Scopus - 110Web of Science - 65
2007 Wang HH, Li YF, Sun J, Zhang H, Pan J, 'Verifying feature models using OWL', JOURNAL OF WEB SEMANTICS, 5 117-129 (2007)
DOI 10.1016/j.websem.2006.11.006
Citations Scopus - 97Web of Science - 67
2005 Zhang H, Jarzabek S, 'A Bayesian Network approach to rational architectural design', International Journal of Software Engineering and Knowledge Engineering, 15 695-717 (2005)

In software architecture design, we explore design alternatives and make decisions about adoption or rejection of a design from a web of complex and often uncertain information. Different architectural design decisions may lead to systems that satisfy the same set of functional requirements but differ in certain quality attributes. In this paper, we propose a Bayesian Network based approach to rational architectural design. Our Bayesian Network helps software architects record and make design decisions. We can perform both qualitative and quantitative analysis over the Bayesian Network to understand how the design decisions influence system quality attributes, and to reason about rational design decisions. We use the KWIC (Key Word In Context) example to illustrate the principles of our approach. © World Scientific Publishing Company.

DOI 10.1142/S0218194005002488
Citations Scopus - 10Web of Science - 4
2004 Zhang H, Jarzabek S, 'XVCL: A mechanism for handling variants in software product lines', Science of Computer Programming, 53 381-407 (2004)

Software reuse focused on product lines has emerged as one of the promising ways to increase software productivity and quality. XVCL (XML-based Variant Configuration Language) is a variability mechanism that we developed for handling variants in software product lines. We apply XVCL to develop product line assets (including the domain model, product line architecture and generic components) as a set of x-frames that are capable of accommodating both commonality and variability in a domain. Specific systems, members of a product line, can be constructed by adapting and composing x-frames. In this paper, we illustrate our approach using examples from our product line project on the Computer Aided Dispatch (CAD) domain. © 2004 Elsevier B.V. All rights reserved.

DOI 10.1016/j.scico.2003.04.007
Citations Scopus - 43Web of Science - 20
2003 Zhang H, Jarzabek S, Yang B, 'Quality prediction and assessment for product lines', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2681 681-695 (2003)

In recent years, software product lines have emerged as a promising approach to improve software development productivity in IT industry. In the product line approach, we identify both commonalities and variabilities in a domain, and build generic assets for an organization. Feature diagrams are often used to model common and variant product line requirements and can be considered part of the organizational assets. Despite their importance, quality attributes (or non-functional requirements, NFRs) such as performance and security have not been sufficiently addressed in product line development. A feature diagram alone does not tell us how to select a configuration of variants to achieve desired quality attributes of a product line member. There is a lack of an explicit model that can represent the impact of variants on quality attributes. In this paper, we propose a Bayesian Belief Network (BBN) based approach to quality prediction and assessment for a software product line. A BBN represents domain experts' knowledge and experiences accumulated from the development of similar projects. It helps us capture the impact of variants on quality attributes, and helps us predict and assess the quality of a product line member by performing quantitative analysis over it. For developing specific systems, members of a product line, we reuse the expertise captured by a BBN instead of working from scratch. We use examples from the Computer Aided Dispatch (CAD) product line project to illustrate our approach. © Springer-Verlag Berlin Heidelberg 2003.

DOI 10.1007/3-540-45017-3_45
Citations Scopus - 39Web of Science - 25

Conference (193 outputs)

Year Citation Altmetrics Link
2023 Le V-H, Zhang H, 'Log Parsing: How Far Can ChatGPT Go?', 2023 38TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE, LUXEMBOURG, Echternach (2023) [E1]
DOI 10.1109/ASE56229.2023.00206
Citations Scopus - 1
2023 Shi E, Wang Y, Zhang H, Du L, Han S, Zhang D, Sun H, 'Towards Efficient Fine-Tuning of Pre-trained Code Models: An Experimental Study and Beyond', PROCEEDINGS OF THE 32ND ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, ISSTA 2023, WA, Seattle (2023)
DOI 10.1145/3597926.3598036
2023 Zhao Q, Luo C, Cai S, Wu W, Lin J, Zhang H, Hu C, 'CAmpactor: A Novel and Effective Local Search Algorithm for Optimizing Pairwise Covering Arrays', ESEC/FSE 2023 - Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA (2023) [E1]
DOI 10.1145/3611643.3616284
Citations Scopus - 1
2023 Lin Q, Li T, Zhao P, Liu Y, Ma M, Zheng L, et al., 'EDITS: An Easy-to-difficult Training Strategy for Cloud Failure Prediction', ACM Web Conference 2023 - Companion of the World Wide Web Conference, WWW 2023, Austin, Texas (2023) [E1]
DOI 10.1145/3543873.3584630
Citations Scopus - 3Web of Science - 3
2023 Shi E, Wang Y, Gu W, Du L, Zhang H, Han S, et al., 'CoCoSoDa: Effective Contrastive Learning for Code Search', 2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ICSE, AUSTRALIA, Melbourne (2023) [E1]
DOI 10.1109/ICSE48619.2023.00185
Citations Scopus - 5
2023 Le V-H, Zhang H, 'Log Parsing with Prompt-based Few-shot Learning', 2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ICSE, AUSTRALIA, Melbourne (2023) [E1]
DOI 10.1109/ICSE48619.2023.00204
Citations Scopus - 2
2023 Li L, Zhang X, He S, Kang Y, Zhang H, Ma M, et al., 'CONAN: Diagnosing Batch Failures for Cloud Systems', 2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: SOFTWARE ENGINEERING IN PRACTICE, ICSE-SEIP, AUSTRALIA, Melbourne (2023) [E1]
DOI 10.1109/ICSE-SEIP58684.2023.00018
Citations Scopus - 2
2023 Xu Z, Zhou M, Zhao X, Chen Y, Cheng X, Zhang H, 'xASTNN: Improved Code Representations for Industrial Practice', ESEC/FSE 2023 - Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA (2023) [E1]
DOI 10.1145/3611643.3613869
2023 Hu F, Wang Y, Du L, Li X, Zhang H, Han S, Zhang D, 'Revisiting Code Search in a Two-Stage Paradigm', WSDM 2023 - Proceedings of the 16th ACM International Conference on Web Search and Data Mining, Singapore (2023) [E1]
DOI 10.1145/3539597.3570383
Citations Scopus - 6
2022 Song X, Yan J, Huang Y, Sun H, Zhang H, 'A Collaboration-Aware Approach to Profiling Developer Expertise with Cross-Community Data', 2022 IEEE 22ND INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY, QRS, PEOPLES R CHINA, Guangzhou (2022) [E1]
DOI 10.1109/QRS57517.2022.00043
2022 Liu Y, Zhang X, He S, Zhang H, Li L, Kang Y, et al., 'UniParser: A Unified Log Parser for Heterogeneous Log Data', WWW 2022: Proceedings of the ACM Web Conference 2022, Lyon, France (2022) [E1]
DOI 10.1145/3485447.3511993
Citations Scopus - 27Web of Science - 1
2022 Wang X, Wu Q, Zhang H, Lyu C, Jiang X, Zheng Z, et al., 'HELoC: Hierarchical Contrastive Learning of Source Code Representation', IEEE International Conference on Program Comprehension, Pittsburgh, PA (2022) [E1]
DOI 10.1145/3524610.3527896
Citations Scopus - 8
2022 Tang W, Wang Y, Zhang H, Han S, Luo P, Zhang D, 'LibDB: An Effective and Efficient Framework for Detecting Third-Party Libraries in Binaries', Proceedings: 2022 Mining Software Repositories Conference (MSR 2022), Pittsburgh, PA (2022) [E1]
DOI 10.1145/3524842.3528442
Citations Scopus - 10
2022 Gui Y, Wan Y, Zhang H, Huang H, Sui Y, Xu G, et al., 'Cross-Language Binary-Source Code Matching with Intermediate Representations', Proceedings: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2022), Honolulu, HI (2022) [E1]
DOI 10.1109/SANER53432.2022.00077
Citations Scopus - 8
2022 Chen Z, Liu J, Su Y, Zhang H, Ling X, Yang Y, Lyu MR, 'Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching', 2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), Pittsburgh, PA (2022) [E1]
DOI 10.1145/3510003.3510085
Citations Scopus - 8
2022 Chai Y, Zhang H, Shen B, Gu X, 'Cross-Domain Deep Code Search with Meta Learning', 2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), PA, Pittsburgh (2022) [E1]
Citations Scopus - 12Web of Science - 8
2022 Meng X, Wang X, Zhang H, Sun H, Liu X, 'Improving Fault Localization and Program Repair with Deep Semantic Features and Transferred Knowledge', 2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), PA, Pittsburgh (2022) [E1]
DOI 10.1145/3510003.3510147
Citations Scopus - 15
2022 Le V-H, Zhang H, 'Log-based Anomaly Detection with Deep Learning: How Far Are We?', 2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), Pittsburgh, PA (2022) [E1]
DOI 10.1145/3510003.3510155
Citations Scopus - 53Web of Science - 1
2022 Shi E, Wang Y, Du L, Chen J, Han S, Zhang H, et al., 'On the Evaluation of Neural Code Summarization', 2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), PA, Pittsburgh (2022) [E1]
DOI 10.1145/3510003.3510060
Citations Scopus - 24
2022 Gao Y, Li Z, Lin H, Zhang H, Wu M, Yang M, 'REFTY: Refinement Types for Valid Deep Learning Models', 2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), PA, Pittsburgh (2022) [E1]
DOI 10.1145/3510003.3510077
2022 Wan Y, Zhao W, Zhang H, Sui Y, Xu G, Jin H, 'What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code', 2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), PA, Pittsburgh (2022) [E1]
DOI 10.1145/3510003.3510050
Citations Scopus - 24Web of Science - 7
2022 Gu W, Wang Y, Du L, Zhang H, Han S, Zhang D, Lyu MR, 'Accelerating Code Search with Deep Hashing and Code Classification', PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), Dublin, IRELAND (2022) [E1]
Citations Scopus - 4
2022 Xie Y, Zhang H, Babar MA, 'LogGD: Detecting Anomalies from System Logs with Graph Neural Networks', 2022 IEEE 22ND INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY, QRS, PEOPLES R CHINA, Guangzhou (2022) [E1]
DOI 10.1109/QRS57517.2022.00039
Citations Scopus - 5
2022 Liu Y, Yang H, Zhao P, Ma M, Wen C, Zhang H, et al., 'Multi-task Hierarchical Classification for Disk Failure Prediction in Online Service Systems', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC (2022) [E1]
DOI 10.1145/3534678.3539176
Citations Scopus - 7Web of Science - 6
2022 Wan Y, He Y, Bi Z, Zhang J, Sui Y, Zhang H, et al., 'NATURALCC: An Open-Source Toolkit for Code Intelligence', 2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: COMPANION PROCEEDINGS (ICSE-COMPANION 2022), PA, Pittsburgh (2022) [E1]
DOI 10.1145/3510454.3516863
Citations Scopus - 5Web of Science - 1
2022 Ma M, Liu Y, Tong Y, Li H, Zhao P, Xu Y, et al., 'An empirical investigation of missing data handling in cloud node failure prediction', ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore (2022) [E1]
DOI 10.1145/3540250.3558946
Citations Scopus - 11Web of Science - 1
2022 Wang C, Yang Y, Gao C, Peng Y, Zhang H, Lyu MR, 'No more fine-tuning? an experimental evaluation of prompt tuning in code intelligence', ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore (2022) [E1]
DOI 10.1145/3540250.3549113
Citations Scopus - 36
2022 Wan Y, Zhang S, Zhang H, Sui Y, Xu G, Yao D, et al., 'You see what I want you to see: poisoning vulnerabilities in neural code search', ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore (2022) [E1]
DOI 10.1145/3540250.3549153
Citations Scopus - 7
2022 Zhang Z, Zhang H, Shen B, Gu X, 'Diet code is healthy: simplifying programs for pre-trained models of code', ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore (2022) [E1]
DOI 10.1145/3540250.3549094
Citations Scopus - 11Web of Science - 2
2022 Luo C, Zhao Q, Cai S, Zhang H, Hu C, 'SamplingCA: effective and efficient sampling-based pairwise testing for highly configurable software systems', ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore (2022) [E1]
DOI 10.1145/3540250.3549155
Citations Scopus - 3
2022 Wang X, Zhang X, Li L, He S, Zhang H, Liu Y, et al., 'SPINE: a scalable log parser with feedback guidance', ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore (2022) [E1]
DOI 10.1145/3540250.3549176
Citations Scopus - 11Web of Science - 6
2022 Li H, Miao C, Leung C, Huang Y, Huang Y, Zhang H, Wang Y, 'Exploring Representation-Level Augmentation for Code Search', Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates (2022) [E1]
Citations Scopus - 5
2022 Shi E, Wang Y, Tao W, Du L, Zhang H, Han S, et al., 'RACE: Retrieval-Augmented Commit Message Generation', Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates (2022) [E1]
Citations Scopus - 10
2022 Wang L, Zhao P, Du C, Luo C, Su M, Yang F, et al., 'NENYA: Cascade Reinforcement Learning for Cost-Aware Failure Mitigation at Microsoft 365', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, USA (2022) [E1]
DOI 10.1145/3534678.3539127
Citations Scopus - 1Web of Science - 1
2022 Wang Y, Wang J, Zhang H, Ming X, Shi L, Wang Q, 'Where is Your App Frustrating Users?', 2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), PA, Pittsburgh (2022) [E1]
DOI 10.1145/3510003.3510189
Citations Scopus - 6
2022 Qi B, Sun H, Gao X, Zhang H, 'Patching Weak Convolutional Neural Network Models through Modularization and Composition', ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, Michigan, USA (2022) [E1]
DOI 10.1145/3551349.3561153
Citations Scopus - 3
2021 Xie Y, Zhang H, Zhang B, Babar MA, Lu S, 'LogDP: Combining Dependency and Proximity for Log-Based Anomaly Detection', Service-Oriented Computing: 19th International Conference, ICSOC 2021, Virtual Event, November 22-25, 2021, Proceedings, Virtual (2021) [E1]
DOI 10.1007/978-3-030-91431-8_47
Citations Scopus - 3
2021 Luo C, Qiao B, Xing W, Chen X, Zhao P, Chao D, et al., 'Correlation-Aware Heuristic Search for Intelligent Virtual Machine Provisioning in Cloud Systems', Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence AAAI 2021, Virtual (2021) [E1]
Citations Scopus - 13Web of Science - 7
2021 Gao Y, Zhu Y, Zhang H, Lin H, Yang M, 'Resource-Guided Configuration Space Reduction for Deep Learning Models', 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (2021) [E1]
DOI 10.1109/icse43902.2021.00028
Citations Scopus - 7Web of Science - 2
2021 Luo C, Zhao P, Chen C, Qiao B, Du C, Zhang H, et al., 'PULNS: Positive-Unlabeled Learning with Effective Negative Sample Selector', Proceedings of the AAAI Conference on Artificial Intelligence, Virtual (2021) [E1]
Citations Scopus - 15Web of Science - 3
2021 Le VH, Zhang H, 'Log-based Anomaly Detection Without Log Parsing', Proceedings - 2021 36th IEEE/ACM International Conference on Automated Software Engineering, ASE 2021, Melbourne, Australia (2021) [E1]
DOI 10.1109/ASE51524.2021.9678773
Citations Scopus - 70Web of Science - 8
2021 Tao W, Wang Y, Shi E, Du L, Han S, Zhang H, et al., 'On the Evaluation of Commit Message Generation Models: An Experimental Study', Proceedings - 2021 IEEE International Conference on Software Maintenance and Evolution, ICSME 2021, Luxembourg (2021) [E1]
DOI 10.1109/ICSME52107.2021.00018
Citations Scopus - 17Web of Science - 1
2021 Zhang X, Du C, Li Y, Xu Y, Zhang H, Qin S, et al., 'HALO: Hierarchy-aware Fault Localization for Cloud Systems', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Virtual, Singapore (2021) [E1]
DOI 10.1145/3447548.3467190
Citations Scopus - 12Web of Science - 3
2021 Zhang X, Xu Y, Qin S, He S, Qiao B, Li Z, et al., 'Onion: Identifying incident-indicating logs for cloud systems', ESEC/FSE 2021 - Proceedings of the 29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece (2021) [E1]
DOI 10.1145/3468264.3473919
Citations Scopus - 16Web of Science - 3
2021 Qiao B, Yang F, Luo C, Wang Y, Li J, Lin Q, et al., 'Intelligent container reallocation at Microsoft 365', ESEC/FSE 2021 - Proceedings of the 29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece (2021) [E1]
DOI 10.1145/3468264.3473936
Citations Scopus - 4
2021 Luo C, Sun B, Qiao B, Chen J, Zhang H, Lin J, et al., 'LS-sampling: An effective local search based sampling approach for achieving high t-wise coverage', ESEC/FSE 2021 - Proceedings of the 29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece (2021) [E1]
DOI 10.1145/3468264.3468622
Citations Scopus - 8Web of Science - 3
2021 Dong H, Qin S, Xu Y, Qiao B, Zhou S, Yang X, et al., 'Effective low capacity status prediction for cloud systems', ESEC/FSE 2021 - Proceedings of the 29th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece (2021) [E1]
DOI 10.1145/3468264.3473917
Citations Scopus - 1
2021 Wu D, Jing XY, Zhang H, Zhou Y, Xu B, 'Leveraging Stack Overflow to Detect Relevant Tutorial Fragments of APIs', Proceedings - 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2021, Honolulu, HI (2021) [E1]
DOI 10.1109/SANER50967.2021.00020
Citations Scopus - 4Web of Science - 1
2021 Li L, Zhang X, Zhao X, Zhang H, Kang Y, Zhao P, et al., 'Fighting the Fog of War: Automated Incident Detection for Cloud Systems', PROCEEDINGS OF THE 2021 USENIX ANNUAL TECHNICAL CONFERENCE, ELECTR NETWORK (2021) [E1]
Citations Scopus - 16
2021 Gu X, Han YS, Kim S, Zhang H, 'Do bugs propagate? an empirical analysis of temporal correlations among software bugs', 35th European Conference on Object-Oriented Programming. Leibniz International Proceedings in Informatics, Aarhus, Denmark (2021) [E1]
DOI 10.4230/LIPIcs.ECOOP.2021.11
Citations Scopus - 1
2021 Luo C, Qiao B, Chen X, Zhao P, Yao R, Zhang H, et al., 'Intelligent Virtual Machine Provisioning in Cloud Computing', Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, Yokohama, Japan (2021) [E1]
DOI 10.24963/ijcai.2020/208
Citations Scopus - 22Web of Science - 8
2021 Luo C, Zhao P, Qiao B, Wu Y, Zhang H, Wu W, et al., 'NTAM: Neighborhood-temporal attention model for disk failure prediction in cloud platforms', The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021, Ljubljana, Slovenia (2021) [E1]
DOI 10.1145/3442381.3449867
Citations Scopus - 19Web of Science - 13
2021 Chen Z, Liu J, Su Y, Zhang H, Wen X, Ling X, et al., 'Graph-based Incident Aggregation for Large-Scale Online Service Systems', Proceedings - 2021 36th IEEE/ACM International Conference on Automated Software Engineering, ASE 2021, Melbourne, Australia (2021) [E1]
DOI 10.1109/ASE51524.2021.9678746
Citations Scopus - 8
2021 Wang W, Chen J, Yang L, Zhang H, Zhao P, Qiao B, et al., 'How Long Will it Take to Mitigate this Incident for Online Service Systems?', Proceedings - International Symposium on Software Reliability Engineering, ISSRE, Wuhan, China (2021) [E1]
DOI 10.1109/ISSRE52982.2021.00017
Citations Scopus - 6Web of Science - 2
2021 Wang Y, Li G, Wang Z, Kang Y, Zhou Y, Zhang H, et al., 'Fast Outage Analysis of Large-Scale Production Clouds with Service Correlation Mining', 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, ES (2021) [E1]
DOI 10.1109/icse43902.2021.00085
Citations Scopus - 15Web of Science - 3
2021 Luo C, Lin J, Cai S, Chen X, He B, Qiao B, et al., 'AutoCCAG: An Automated Approach to Constrained Covering Array Generation', 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, ES (2021) [E1]
DOI 10.1109/icse43902.2021.00030
Citations Scopus - 10Web of Science - 3
2021 Chen J, Xu N, Chen P, Zhang H, 'Efficient Compiler Autotuning via Bayesian Optimization', 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, ES (2021) [E1]
DOI 10.1109/icse43902.2021.00110
Citations Scopus - 22Web of Science - 8
2021 Shi E, Wang Y, Du L, Zhang H, Han S, Zhang D, Sun H, 'CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees', EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings, Online and Punta Cana, Dominican Republic (2021) [E1]
DOI 10.18653/v1/2021.emnlp-main.332
Citations Scopus - 29Web of Science - 14
2021 Kang Y, Wang Z, Zhang H, Chen J, You H, 'APIRecX: Cross-Library API Recommendation via Pre-Trained Language Model', EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings, Punta Cana, Dominican Republic (2021) [E1]
DOI 10.18653/v1/2021.emnlp-main.275
Citations Scopus - 12Web of Science - 5
2020 Mirjalili S, Zhang H, Mirjalili S, Chalup S, Noman N, 'A Novel U-Shaped Transfer Function for Binary Particle Swarm Optimisation', Soft Computing for Problem Solving 2019. Proceedings of SocProS 2019, Liverpool, UK (2020) [E1]
DOI 10.1007/978-981-15-3290-0_19
Citations Scopus - 43
Co-authors Stephan Chalup, Nasimul Noman
2020 Zhou J, Li F, Dong J, Zhang H, Hao D, 'Cost-Effective Testing of a Deep Learning Model through Input Reduction', 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), Coimbra, Portugal (2020) [E1]
DOI 10.1109/ISSRE5003.2020.00035
Citations Scopus - 11, Web of Science - 3
2020 Xu Y, Sui K, Yao R, Zhang H, Lin Q, Dang Y, et al., 'Improving service availability of cloud systems by predicting disk error', Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018 (2020)

High service availability is crucial for cloud systems. A typical cloud system uses a large number of physical hard disk drives. Disk errors are one of the most important reasons that lead to service unavailability. Disk error (such as sector error and latency error) can be seen as a form of gray failure, which are fairly subtle failures that are hard to be detected, even when applications are afflicted by them. In this paper, we propose to predict disk errors proactively before they cause more severe damage to the cloud system. The ability to predict faulty disks enables the live migration of existing virtual machines and allocation of new virtual machines to the healthy disks, therefore improving service availability. To build an accurate online prediction model, we utilize both disk-level sensor (SMART) data as well as system-level signals. We develop a cost-sensitive ranking-based machine learning model that can learn the characteristics of faulty disks in the past and rank the disks based on their error-proneness in the near future. We evaluate our approach using real-world data collected from a production cloud system. The results confirm that the proposed approach is effective and outperforms related methods. Furthermore, we have successfully applied the proposed approach to improve service availability of Microsoft Azure.

Citations Scopus - 86
2020 Zhang B, Zhang H, Moscato P, Zhang A, 'Anomaly Detection via Mining Numerical Workflow Relations from Logs', 2020 International Symposium on Reliable Distributed Systems (SRDS), online (2020) [E1]
DOI 10.1109/SRDS51746.2020.00027
Citations Scopus - 11, Web of Science - 8
Co-authors Pablo Moscato
2020 Shu Y, Sui Y, Zhang H, Xu G, 'Perf-AL: Performance Prediction for Configurable Software through Adversarial Learning', Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Online (2020) [E1]
DOI 10.1145/3382494.3410677
Citations Scopus - 5
2020 Zhang J, Wang X, Zhang H, Sun H, Pu Y, Liu X, 'Learning to handle exceptions', Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (2020) [E1]
DOI 10.1145/3324884.3416568
Citations Scopus - 14, Web of Science - 6
2020 Zhang R, Xiao W, Zhang H, Liu Y, Lin H, Yang M, 'An Empirical Study on Program Failures of Deep Learning Jobs', Proceedings of the 2020 ACM/IEEE 42nd International Conference on Software Engineering (ICSE), Seoul, South Korea (2020) [E1]
DOI 10.1145/3377811.3380362
Citations Scopus - 53, Web of Science - 25
2020 Zhang J, Wang X, Zhang H, Sun H, Liu X, 'Retrieval-Based Neural Source Code Summarization', Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, Seoul, South Korea (2020) [E1]
DOI 10.1145/3377811.3380383
Citations Scopus - 156, Web of Science - 56
2020 Chen Y, Yang X, Dong H, He X, Zhang H, Lin Q, et al., 'Identifying Linked Incidents in Large-Scale Online Service Systems', Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, online (2020) [E1]
DOI 10.1145/3368089.3409768
Citations Scopus - 25, Web of Science - 21
2020 Gao Y, Liu Y, Zhang H, Li Z, Zhu Y, Lin H, Yang M, 'Estimating GPU Memory Consumption of Deep Learning Models', Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, online (2020) [E1]
DOI 10.1145/3368089.3417050
Citations Scopus - 64, Web of Science - 27
2020 Chen Z, Kang Y, Li L, Zhang X, Zhang H, Xu H, et al., 'Towards Intelligent Incident Management: Why We Need It and How We Make It', Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, online (2020) [E1]
DOI 10.1145/3368089.3417055
Citations Scopus - 42, Web of Science - 23
2020 Jiang J, Lu W, Chen J, Lin Q, Zhao P, Kang Y, et al., 'How to Mitigate the Incident? An Effective Troubleshooting Guide Recommendation Technique for Online Service Systems', Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, online (2020) [E1]
DOI 10.1145/3368089.3417054
Citations Scopus - 24, Web of Science - 13
2020 Gu J, Luo C, Qin S, Qiao B, Lin Q, Zhang H, et al., 'Efficient Incident Identification from Multi-Dimensional Issue Reports via Meta-Heuristic Search', Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, online (2020) [E1]
DOI 10.1145/3368089.3409741
Citations Scopus - 14, Web of Science - 5
2020 Chen J, Zhang S, He X, Lin Q, Zhang H, Hao D, et al., 'How Incidental are the Incidents? Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems', ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, online (2020) [E1]
Citations Scopus - 28, Web of Science - 19
2019 Zhang X, Xu Y, Lin Q, Qiao B, Zhang H, Dang Y, et al., 'Robust Log-based Anomaly Detection on Unstable Log Data', Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Tallinn, Estonia (2019) [E1]
DOI 10.1145/3338906.3338931
Citations Scopus - 305, Web of Science - 142
2019 Chen J, He X, Lin Q, Zhang H, Hao D, Gao F, et al., 'Continuous Incident Triage for Large-Scale Online Service Systems', 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA (2019) [E1]
DOI 10.1109/ASE.2019.00042
Citations Scopus - 54, Web of Science - 37
2019 Gu X, Zhang H, Kim S, 'CodeKernel: A Graph Kernel Based Approach to the Selection of API Usage Examples', 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA (2019) [E1]
DOI 10.1109/ASE.2019.00061
Citations Scopus - 19, Web of Science - 14
2019 Chen J, Wang G, Hao D, Xiong Y, Zhang H, Zhang L, 'History-guided configuration diversification for compiler test-program generation', Proceedings of the 34th International Conference on Automated Software Engineering, San Diego, CA (2019) [E1]
DOI 10.1109/ASE.2019.00037
Citations Scopus - 35, Web of Science - 15
2019 Lin J, Cai S, Luo C, Lin Q, Zhang H, 'Towards More Efficient Meta-heuristic Algorithms for Combinatorial Test Generation', Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Tallinn, Estonia (2019) [E1]
DOI 10.1145/3338906.3338914
Citations Scopus - 14, Web of Science - 10
2019 Zhang X, Lin Q, Xu Y, Qin S, Zhang H, Qiao B, et al., 'Cross-dataset Time Series Anomaly Detection for Cloud Systems', Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, Renton, WA (2019) [E1]
Citations Scopus - 56, Web of Science - 26
2019 Chen J, He X, Lin Q, Xu Y, Zhang H, Hao D, et al., 'An Empirical Investigation of Incident Triage for Online Service Systems', 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Montreal, Canada (2019) [E1]
DOI 10.1109/ICSE-SEIP.2019.00020
Citations Scopus - 60, Web of Science - 33
2019 Luo C, Hoos HH, Cai S, Lin Q, Zhang H, Zhang D, 'Local Search with Efficient Automatic Configuration for Minimum Vertex Cover', Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, Macau, China (2019) [E1]
DOI 10.24963/ijcai.2019/180
Citations Scopus - 32, Web of Science - 19
2019 Zhang J, Wang X, Zhang H, Sun H, Wang K, Liu X, 'A Novel Neural Source Code Representation Based on Abstract Syntax Tree', Proceedings of the 41st International Conference on Software Engineering, Montreal, Canada (2019) [E1]
DOI 10.1109/ICSE.2019.00086
Citations Scopus - 417, Web of Science - 218
2019 Ha H, Zhang H, 'DeepPerf: performance prediction for configurable software with deep sparse neural network', Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, Canada (2019) [E1]
DOI 10.1109/ICSE.2019.00113
Citations Scopus - 62, Web of Science - 39
2019 Chen X, Qiao B, Zhang W, Wu W, Chintalapati M, Zhang D, et al., 'Neural feature search: A neural architecture for automated feature engineering', Proceedings - IEEE International Conference on Data Mining, ICDM, Beijing, China (2019) [E1]
DOI 10.1109/ICDM.2019.00017
Citations Scopus - 24, Web of Science - 8
2019 Zhang B, Zhang H, Chen J, Hao D, Moscato P, 'Automatic Discovery and Cleansing of Numerical Metamorphic Relations', 2019 IEEE International Conference on Software Maintenance and Evolution, ICSME 2019, Cleveland, OH (2019) [E1]
DOI 10.1109/ICSME.2019.00035
Citations Web of Science - 9
Co-authors Pablo Moscato
2019 Li C, Zhou M, Gu Z, Gu M, Zhang H, 'Ares: Inferring error specifications through static analysis', Proceedings - 2019 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, San Diego, CA (2019) [E1]
DOI 10.1109/ASE.2019.00130
Citations Scopus - 4, Web of Science - 3
2019 Ha H, Zhang H, 'Performance-Influence Model for Highly Configurable Software with Fourier Learning and Lasso Regression', Proceedings - 2019 IEEE International Conference on Software Maintenance and Evolution, ICSME 2019, Cleveland, OH (2019) [E1]
DOI 10.1109/ICSME.2019.00080
Citations Scopus - 20, Web of Science - 8
2019 Chen Y, Zhang H, Yang X, Lin Q, Zhang D, Dong H, et al., 'Outage Prediction and Diagnosis for Cloud Service Systems', The Web Conference. Proceedings of The World Wide Web Conference WWW 2019, San Francisco, CA (2019) [E1]
DOI 10.1145/3308558.3313501
Citations Scopus - 50, Web of Science - 32
2019 Zhang B, Zhang H, Chen J, Hao D, Moscato P, 'AutoMR: Automatic Discovery and Cleansing of Numerical Metamorphic Relations', 2019 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME 2019), Cleveland, OH (2019)
DOI 10.1109/ICSME.2019.00036
Citations Scopus - 20, Web of Science - 3
Co-authors Pablo Moscato
2018 Lin Q, Hsieh K, Dang Y, Zhang H, Sui K, Xu Y, et al., 'Predicting node failure in cloud service systems', ESEC/FSE 2018 - Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Lake Buena Vista, FL (2018) [E1]
DOI 10.1145/3236024.3236060
Citations Scopus - 66, Web of Science - 46
2018 He S, Lin Q, Lou J-G, Zhang H, Lyu MR, Zhang D, 'Identifying impactful service system problems via log analysis', ESEC/FSE 2018 - Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Lake Buena Vista, FL, USA (2018) [E1]
DOI 10.1145/3236024.3236083
Citations Scopus - 115, Web of Science - 67
2018 Jiang J, Xiong Y, Zhang H, Gao Q, Chen X, 'Shaping program repair space with existing patches and similar code', ISSTA 2018 - Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, Amsterdam, Netherlands (2018) [E1]
DOI 10.1145/3213846.3213871
Citations Scopus - 206, Web of Science - 100
2018 Tonelli R, Ducasse S, Fenu G, Bracciali A, Amaral V, Arcelli F, et al., 'Message from the chairs', 2018 IEEE 1st International Workshop on Blockchain Oriented Software Engineering, IWBOSE 2018 - Proceedings (2018)
DOI 10.1109/IWBOSE.2018.8327563
Citations Scopus - 1
2018 Abreu R, Zhang H, 'Message from the QRS 2018 program chairs', Proceedings - 2018 IEEE 18th International Conference on Software Quality, Reliability, and Security, QRS 2018 (2018)
DOI 10.1109/QRS.2018.00007
2018 Abreu R, Zhang H, 'Message from the QRS 2018 Program Chairs', Proceedings - 2018 IEEE 18th International Conference on Software Quality, Reliability, and Security Companion, QRS-C 2018 (2018)
DOI 10.1109/QRS-C.2018.00007
2018 Galster M, Zhang H, 'Message from the ASWEC 2018: Short research paper program committee chairs', Proceedings - 25th Australasian Software Engineering Conference, ASWEC 2018 (2018)
DOI 10.1109/ASWEC.2018.00007
2018 Washizaki H, Zhang H, 'Message from the APSEC 2018 Program Co-Chairs', Proceedings - Asia-Pacific Software Engineering Conference, APSEC (2018)
DOI 10.1109/APSEC.2018.00006
2018 Lin Q, Ke W, Lou JG, Zhang H, Sui K, Xu Y, et al., 'BigIN4: Instant, interactive insight identification for multi-dimensional big data', Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK (2018) [E1]
DOI 10.1145/3219819.3219867
Citations Scopus - 11, Web of Science - 5
2018 Xu Y, Sui K, Yao R, Zhang H, Lin Q, Dang Y, et al., 'Improving Service Availability of Cloud Systems by Predicting Disk Error', Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC 18), Boston, MA (2018) [E1]
Citations Scopus - 45, Web of Science - 62
2018 Barbar M, Sui Y, Zhang H, Chen S, Xue J, 'Live Path CFI Against Control Flow Hijacking Attacks', Information Security and Privacy: 23rd Australasian Conference, ACISP 2018, Wollongong, NSW (2018) [E1]
DOI 10.1007/978-3-319-93638-3_45
Citations Scopus - 1, Web of Science - 1
2018 Barbar M, Sui Y, Zhang H, Chen S, Xue J, 'Poster: Live Path Control Flow Integrity', PROCEEDINGS 2018 IEEE/ACM 40TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING - COMPANION (ICSE-COMPANION), Gothenburg, SWEDEN (2018)
DOI 10.1145/3183440.3195093
Citations Scopus - 1
2018 Wu R, Wen M, Cheung S-C, Zhang H, 'ChangeLocator: Locate Crash-Inducing Changes Based on Crash Reports', PROCEEDINGS 2018 IEEE/ACM 40TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), Gothenburg, SWEDEN (2018)
DOI 10.1145/3180155.3182516
2018 Gu X, Zhang H, Kim S, 'Deep code search', ICSE '18 Proceedings of the 40th International Conference on Software Engineering, Gothenburg, Sweden (2018) [E1]
DOI 10.1145/3180155.3180167
Citations Scopus - 404, Web of Science - 223
2017 Li Z, Jing X, Zhu X, Zhang H, 'Heterogeneous Defect Prediction Through Multiple Kernel Learning and Ensemble Learning', 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), Shanghai, China (2017) [E1]
DOI 10.1109/ICSME.2017.19
Citations Scopus - 51, Web of Science - 39
2017 Gu X, Zhang H, Zhang D, Kim S, 'DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning', Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, August 19-25, 2017, Melbourne, Australia (2017) [E1]
DOI 10.24963/ijcai.2017/514
Citations Scopus - 34, Web of Science - 31
2017 Chen J, Bai Y, Hao D, Xiong Y, Zhang H, Xie B, 'Learning to prioritize test programs for compiler testing', ICSE'17 Proceedings of the 39th International Conference on Software Engineering, Buenos Aires, Argentina (2017) [E1]
DOI 10.1109/ICSE.2017.70
Citations Scopus - 63, Web of Science - 39
2017 Shu C, Zhang H, 'Neural Programming by Example', Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA (2017) [E1]
Citations Scopus - 14, Web of Science - 10
2016 Chen J, Hu W, Hao D, Xiong Y, Zhang H, Zhang L, Xie B, 'An empirical comparison of compiler testing techniques', Proceedings of the 38th International Conference on Software Engineering, Austin, TX (2016) [E1]
DOI 10.1145/2884781.2884878
Citations Scopus - 75, Web of Science - 51
2016 Lin Q, Lou J-G, Zhang H, Zhang D, 'iDice: Problem Identification for Emerging Issues', 2016 IEEE/ACM 38TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), Austin, TX (2016) [E1]
DOI 10.1145/2884781.2884795
Citations Scopus - 54, Web of Science - 25
2016 Wu R, Xiao X, Cheung S-C, Zhang H, Zhang C, 'Casper: An Efficient Approach to Call Trace Collection', ACM SIGPLAN NOTICES, St Petersburg, FL (2016) [E1]
DOI 10.1145/2914770.2837619
Citations Scopus - 10, Web of Science - 7
2016 Zhou M, Cheng X, Guo X, Gu M, Zhang H, Song X, 'Improving Failure Detection by Automatically Generating Test Cases Near the Boundaries', PROCEEDINGS 2016 IEEE 40TH ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE WORKSHOPS, VOL 1, Atlanta, GA (2016) [E1]
DOI 10.1109/COMPSAC.2016.137
Citations Scopus - 3, Web of Science - 2
2016 Chen J, Bai Y, Hao D, Xiong Y, Zhang H, Zhang L, Xie B, 'Test Case Prioritization for Compilers: A Text-Vector Based Approach', 2016 9TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE TESTING, VERIFICATION AND VALIDATION (ICST), Chicago, IL (2016) [E1]
DOI 10.1109/ICST.2016.19
Citations Scopus - 48, Web of Science - 37
2016 Lin Q, Zhang H, Lou J-G, Zhang Y, Chen X, 'Log Clustering based Problem Identification for Online Service Systems', 2016 IEEE/ACM 38TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING COMPANION (ICSE-C), Austin, TX (2016) [E1]
DOI 10.1145/2889160.2889232
Citations Scopus - 296, Web of Science - 137
2016 Gu X, Zhang H, Zhang D, Kim S, 'Deep API Learning', FSE'16: PROCEEDINGS OF THE 2016 24TH ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON FOUNDATIONS OF SOFTWARE ENGINEERING, Seattle, WA (2016) [E1]
DOI 10.1145/2950290.2950334
Citations Scopus - 410, Web of Science - 269
2016 Zhang H, Jain A, Khandelwal G, Kaushik C, Ge S, Hu W, 'Bing Developer Assistant: Improving Developer Productivity by Recommending Sample Code', FSE'16: PROCEEDINGS OF THE 2016 24TH ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON FOUNDATIONS OF SOFTWARE ENGINEERING, Seattle, WA (2016) [E1]
DOI 10.1145/2950290.2983955
Citations Scopus - 43, Web of Science - 22
2015 Ding S, Tan HBK, Zhang H, 'ABOR: An Automatic Framework for Buffer Overflow Removal in C/C++ Programs', ENTERPRISE INFORMATION SYSTEMS, ICEIS 2014, Lisbon, PORTUGAL (2015) [E1]
DOI 10.1007/978-3-319-22348-3_12
2015 Zhu J, He P, Fu Q, Zhang H, Lyu MR, Zhang D, 'Learning to Log: Helping Developers Make Informed Logging Decisions', 2015 IEEE/ACM 37TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, VOL 1, Florence, ITALY (2015) [E1]
DOI 10.1109/ICSE.2015.60
Citations Scopus - 170, Web of Science - 97
2015 Zhou H, Lou J-G, Zhang H, Lin H, Lin H, Qin T, 'An Empirical Study on Quality Issues of Production Big Data Platform', 2015 IEEE/ACM 37TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, VOL 2, Florence, ITALY (2015) [E1]
DOI 10.1109/ICSE.2015.130
Citations Scopus - 34, Web of Science - 22
2015 Ding R, Zhou H, Lou JG, Zhang H, Lin Q, Fu Q, et al., 'Log2: A cost-aware logging mechanism for performance diagnosis', Proceedings of the 2015 USENIX Annual Technical Conference, USENIX ATC 2015 (2015) [E1]

Logging has been a common practice for monitoring and diagnosing performance issues. However, logging comes at a cost, especially for large-scale online service systems. First, the overhead incurred by intensive logging is non-negligible. Second, it is costly to diagnose a performance issue if there are a tremendous amount of redundant logs. Therefore, we believe that it is important to limit the overhead incurred by logging, without sacrificing the logging effectiveness. In this paper we propose Log2, a cost-aware logging mechanism. Given a "budget" (defined as the maximum volume of logs allowed to be output in a time interval), Log2 makes the "whether to log" decision through a two-phase filtering mechanism. In the first phase, a large number of irrelevant logs are discarded efficiently. In the second phase, useful logs are cached and output while complying with logging budget. In this way, Log2 keeps the useful logs and discards the less useful ones. We have implemented Log2 and evaluated it on an open source system as well as a real-world online service system from Microsoft. The experimental results show that Log2 can control logging overhead while preserving logging effectiveness.

Citations Scopus - 79
2015 Lv F, Zhang H, Lou J-G, Wang S, Zhang D, Zhao J, 'CodeHow: Effective Code Search based on API Understanding and Extended Boolean Model', 2015 30TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE), Lincoln, NE (2015) [E1]
DOI 10.1109/ASE.2015.42
Citations Scopus - 210, Web of Science - 109
2014 'Proceedings of the 9th International Workshop on Advanced Modularization Techniques, AOAsia 2014, Hong Kong, China, November 16, 2014', AOAsia@SIGSOFT FSE (2014)
2014 Liu K, Tan HBK, Zhang H, 'Mining key and referential constraints enforcement patterns.', SAC (2014)
2014 'Proceedings of the 5th International Workshop on Emerging Trends in Software Metrics, WETSoM 2014, Hyderabad, India, June 3, 2014', WETSoM (2014)
2014 Ding S, Tan HBK, Zhang H, 'Automatic Removal of Buffer Overflow Vulnerabilities in C/C++ Programs.', ICEIS (2) (2014)
2014 Wong C-P, Xiong Y, Zhang H, Hao D, Zhang L, Mei H, 'Boosting Bug-Report-Oriented Fault Localization with Segmentation and Stack-Trace Analysis', 2014 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME), Victoria, CANADA (2014) [E1]
DOI 10.1109/ICSME.2014.40
Citations Scopus - 163, Web of Science - 106
2014 Ding S, Zhang H, Tan HBK, 'Detecting Infeasible Branches Based on Code Patterns', 2014 SOFTWARE EVOLUTION WEEK - IEEE CONFERENCE ON SOFTWARE MAINTENANCE, REENGINEERING, AND REVERSE ENGINEERING (CSMR-WCRE), Antwerp, BELGIUM (2014) [E1]
Citations Web of Science - 5
2014 Counsell S, Marchesi M, Venkatasubramanyam R, Visaggio A, Zhang H, 'Message from the Chairs', 5th International Workshop on Emerging Trends in Software Metrics, WETSoM 2014 - Proceedings (2014)
2014 Wu R, Zhang H, Cheung SC, Kim S, 'Crashlocator: Locating crashing faults based on crash stacks', 2014 International Symposium on Software Testing and Analysis, ISSTA 2014 - Proceedings (2014) [E1]

Software crash is common. When a crash occurs, software developers can receive a report upon user permission. A crash report typically includes a call stack at the time of crash. An important step of debugging a crash is to identify faulty functions, which is often a tedious and labor-intensive task. In this paper, we propose CrashLocator, a method to locate faulty functions using the crash stack information in crash reports. It deduces possible crash traces (the failing execution traces that lead to crash) by expanding the crash stack with functions in static call graph. It then calculates the suspiciousness of each function in the approximate crash traces. The functions are then ranked by their suspiciousness scores and are recommended to developers for further investigation. We evaluate our approach using real-world Mozilla crash data. The results show that our approach is effective: We can locate 50.6%, 63.7% and 67.5% of crashing faults by examining top 1, 5 and 10 functions recommended by CrashLocator, respectively. Our approach outperforms the conventional stack-only methods significantly.

DOI 10.1145/2610384.2610386
Citations Scopus - 108
2014 Cao Y, Zhang H, Ding S, 'Symcrash: Selective recording for reproducing crashes', ASE 2014 - Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering (2014)

Software often crashes despite tremendous effort on software quality assurance. Once developers receive a crash report, they need to reproduce the crash in order to understand the problem and locate the fault. However, limited information from crash reports often makes crash reproduction difficult. Many "capture-and-replay" techniques have been proposed to automatically capture program execution data from the failing code, and help developers replay the crash scenarios based on the captured data. However, such techniques often suffer from heavy overhead and introduce privacy concerns. Recently, methods such as BugRedux were proposed to generate test input that leads to crash through symbolic execution. However, such methods have inherent limitations because they rely on conventional symbolic execution techniques. In this paper, we propose a dynamic symbolic execution method called SymCon, which addresses the limitation of conventional symbolic execution by selecting functions that are hard to be resolved by a constraint solver and using their concrete runtime values to replace the symbols. We then propose SymCrash, a selective recording approach that only instruments and monitors the hard-to-solve functions. SymCrash can generate test input for crashes through SymCon. We have applied our approach to successfully reproduce 13 failures of 6 real-world programs. Our results confirm that the proposed approach is suitable for reproducing crashes, in terms of effectiveness, overhead, and privacy. It also outperforms the related methods.

DOI 10.1145/2642937.2642993
Citations Scopus - 25
2014 Sun C, Zhang H, Lou JG, Zhang H, Wang Q, Zhang D, Khoo SC, 'Querying sequential software engineering data', Proceedings of the ACM SIGSOFT Symposium on the Foundations of Software Engineering (2014) [E1]

We propose a pattern-based approach to effectively and efficiently analyzing sequential software engineering (SE) data. Different from other types of SE data, sequential SE data preserves unique temporal properties, which cannot be easily analyzed without much programming effort. In order to facilitate the analysis of sequential SE data, we design a sequential pattern query language (SPQL), which specifies the temporal properties based on regular expressions, and is enhanced with variables and statements to store and manipulate matching states. We also propose a query engine to effectively process the SPQL queries. We have applied our approach to analyze two types of SE data, namely bug report history and source code change history. We experiment with 181,213 Eclipse bug reports and 323,989 code revisions of Android. SPQL enables us to explore interesting temporal properties underneath these sequential data with a few lines of query code and low matching overhead. The analysis results can help better understand a software process and identify process violations.

DOI 10.1145/2635868.2635902
Citations Scopus - 5, Web of Science - 4
2014 Hu H, Zhang H, Xuan J, Sun W, 'Effective bug triage based on historical bug-fix information', Proceedings - International Symposium on Software Reliability Engineering, ISSRE (2014) [E1]

For complex and popular software, project teams could receive a large number of bug reports. It is often tedious and costly to manually assign these bug reports to developers who have the expertise to fix the bugs. Many bug triage techniques have been proposed to automate this process. In this paper, we describe our study on applying conventional bug triage techniques to projects of different sizes. We find that the effectiveness of a bug triage technique largely depends on the size of a project team (measured in terms of the number of developers). The conventional bug triage methods become less effective when the number of developers increases. To further improve the effectiveness of bug triage for large projects, we propose a novel recommendation method called Bug Fixer, which recommends developers for a new bug report based on historical bug-fix information. Bug Fixer constructs a Developer-Component-Bug (DCB) network, which models the relationship between developers and source code components, as well as the relationship between the components and their associated bugs. A DCB network captures the knowledge of 'who fixed what, where'. For a new bug report, Bug Fixer uses a DCB network to recommend to triager a list of suitable developers who could fix this bug. We evaluate Bug Fixer on three large-scale open source projects and two smaller industrial projects. The experimental results show that the proposed method outperforms the existing methods for large projects and achieves comparable performance for small projects.

DOI 10.1109/ISSRE.2014.17
Citations Scopus - 84, Web of Science - 65
2014 Lim MH, Lou JG, Zhang H, Fu Q, Teoh ABJ, Lin Q, et al., 'Identifying Recurrent and Unknown Performance Issues', Proceedings - IEEE International Conference on Data Mining, ICDM (2014) [E1]

For a large-scale software system, especially an online service system, when a performance issue occurs, it is desirable to check whether this issue has occurred before. If there are past similar issues, a known remedy could be applied. Otherwise, a new troubleshooting process may have to be initiated. The symptom of a performance issue can be characterized by a set of metrics. Due to the sophisticated nature of software systems, manual diagnosis of performance issues based on metric data is typically expensive and laborious. In this paper, we propose a Hidden Markov Random Field (HMRF) based approach to automatic identification of recurrent and unknown performance issues. We formulate the problem of issue identification as a HMRF-based clustering problem. Our approach incorporates the learning of metric discretization thresholds and the optimization of issue clustering. Based on the learned thresholds and cluster centroids, we can achieve accurate identification of recurrent issues and unknown issues. Experimental evaluations on an open benchmark and a large-scale industrial production system show that our approach is effective and outperforms the related state-of-the-art approaches.

DOI 10.1109/ICDM.2014.96
Citations Scopus - 22, Web of Science - 17
2013 Hao D, Lan T, Zhang H, Guo C, Zhang L, 'Is This a Bug or an Obsolete Test?', ECOOP 2013 - OBJECT-ORIENTED PROGRAMMING, Montpellier, FRANCE (2013) [E1]
Citations Web of Science - 17
2013 Liu K, Tan HBK, Zhang H, 'Has This Bug Been Reported?', 2013 20TH WORKING CONFERENCE ON REVERSE ENGINEERING (WCRE), GERMANY, Univ Koblenz, Koblenz (2013) [E1]
Citations Scopus - 14, Web of Science - 10
2013 Zhang H, Gong L, Versteeg S, 'Predicting Bug-Fixing Time: An Empirical Study of Commercial Software Projects', PROCEEDINGS OF THE 35TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2013), San Francisco, CA (2013) [E1]
Citations Scopus - 138, Web of Science - 97
2013 Zhang H, Cheung SC, 'A cost-effectiveness criterion for applying software defect prediction models', 2013 9th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2013 - Proceedings (2013)

Ideally, software defect prediction models should help organize software quality assurance (SQA) resources and reduce cost of finding defects by allowing the modules most likely to contain defects to be inspected first. In this paper, we study the cost-effectiveness of applying defect prediction models in SQA and propose a basic cost-effectiveness criterion. The criterion implies that defect prediction models should be applied with caution. We also propose a new metric FN/(FN+TN) to measure the cost-effectiveness of a defect prediction model. Copyright 2013 ACM.

DOI 10.1145/2491411.2494581
Citations Scopus - 13
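
As an illustration of the FN/(FN+TN) metric named in the abstract above, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual confusion-matrix meanings (FN: defective modules predicted clean; TN: clean modules predicted clean), under which the ratio is the share of predicted-clean modules that are actually defective. The function names and sample labels are hypothetical.

```python
from typing import Sequence, Tuple


def fn_over_fn_plus_tn(fn: int, tn: int) -> float:
    """Return FN / (FN + TN), the defective share of modules predicted clean."""
    if fn < 0 or tn < 0:
        raise ValueError("counts must be non-negative")
    total = fn + tn
    if total == 0:
        raise ValueError("no module was predicted clean, so the ratio is undefined")
    return fn / total


def count_fn_tn(actual: Sequence[bool], predicted: Sequence[bool]) -> Tuple[int, int]:
    """Count false negatives and true negatives; True means 'defective'."""
    if len(actual) != len(predicted):
        raise ValueError("label sequences must have the same length")
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return fn, tn


# Hypothetical labels for six modules (illustrative only, not data from the paper).
actual = [True, False, False, True, False, False]
predicted = [True, False, False, False, False, True]
fn, tn = count_fn_tn(actual, predicted)
ratio = fn_over_fn_plus_tn(fn, tn)
```

A lower ratio means fewer defects are left behind in the modules that the model marks as safe to skip.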
2013 Gong J, Zhang H, 'BugMap: A topographic map of bugs', 2013 9th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2013 - Proceedings (2013)

A large and complex software system could contain a large number of bugs. It is desirable for developers to understand how these bugs are distributed across the system, so they could have a better overview of software quality. In this paper, we describe BugMap, a tool we developed for visualizing large-scale bug location information. Taking source code and bug data as input, BugMap can display bug localizations on a topographic map. By examining the topographic map, developers can understand how the components and files are affected by bugs. We apply this tool to visualize the distribution of Eclipse bugs across components/files. The results show that our tool is effective for understanding the overall quality status of a large-scale system and for identifying the problematic areas of the system. Copyright 2013 ACM.

DOI 10.1145/2491411.2494582
Citations Scopus - 9
2013 Hao D, Lan T, Zhang H, Guo C, Zhang L, 'Is this a bug or an obsolete test?', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2013)

In software evolution, developers typically need to identify whether the failure of a test is due to a bug in the source code under test or the obsoleteness of the test code when they execute a test suite. Only after finding the cause of a failure can developers determine whether to fix the bug or repair the obsolete test. Researchers have proposed several techniques to automate test repair. However, test-repair techniques typically assume that test failures are always due to obsolete tests. Thus, such techniques may not be applicable in real world software evolution when developers do not know whether the failure is due to a bug or an obsolete test. To know whether the cause of a test failure lies in the source code under test or in the test code, we view this problem as a classification problem and propose an automatic approach based on machine learning. Specifically, we target Java software using the JUnit testing framework and collect a set of features that may be related to failures of tests. Using this set of features, we adopt the Best-first Decision Tree Learning algorithm to train a classifier with some existing regression test failures as training instances. Then, we use the classifier to classify future failed tests. Furthermore, we evaluated our approach using two Java programs in three scenarios (within the same version, within different versions of a program, and between different programs), and found that our approach can effectively classify the causes of failed tests. © 2013 Springer-Verlag Berlin Heidelberg.

DOI 10.1007/978-3-642-39038-8_25
Citations Scopus - 19
2013 Wang J, Dang Y, Zhang H, Chen K, Xie T, Zhang D, 'Mining succinct and high-coverage API usage patterns from source code', IEEE International Working Conference on Mining Software Repositories (2013) [E1]

During software development, a developer often needs to discover specific usage patterns of Application Programming Interface (API) methods. However, these usage patterns are often not well documented. To help developers to get such usage patterns, there are approaches proposed to mine client code of the API methods. However, they lack metrics to measure the quality of the mined usage patterns, and the API usage patterns mined by the existing approaches tend to be many and redundant, posing significant barriers to practical adoption. To address these issues, in this paper, we propose two quality metrics (succinctness and coverage) for mined usage patterns, and further propose a novel approach called Usage Pattern Miner (UP-Miner) that mines succinct and high-coverage usage patterns of API methods from source code. We have evaluated our approach on a large-scale Microsoft codebase. The results show that our approach is effective and outperforms an existing representative approach MAPO. The user studies conducted with Microsoft developers confirm the usefulness of the proposed approach in practice. © 2013 IEEE.

DOI 10.1109/MSR.2013.6624045
Citations Scopus - 159, Web of Science - 113
2012 Zhou J, Zhang H, 'Learning to rank duplicate bug reports', ACM International Conference Proceeding Series (2012) [E1]

For a large and complex software system, the project team could receive a large number of bug reports. Some bug reports could be duplicates as they essentially report the same problem. It is often tedious and costly to manually check if a newly reported bug is a duplicate of an already reported bug. In this paper, we propose BugSim, a method that can automatically retrieve duplicate bug reports given a new bug report. BugSim is based on learning-to-rank concepts. We identify textual and statistical features of bug reports and propose a similarity function for bug reports based on the features. We then construct a training set by assembling pairs of duplicate and non-duplicate bug reports. We train the weights of features by applying the stochastic gradient descent algorithm over the training set. For a new bug report, we retrieve candidate duplicate reports using the trained model. We evaluate BugSim using more than 45,100 real bug reports of twelve Eclipse projects. The evaluation results show that the proposed method is effective. On average, the recall rate for the top 10 retrieved reports is 76.11%. Furthermore, BugSim outperforms the previous state-of-the-art methods that are implemented using SVM and BM25F ext. © 2012 ACM.

DOI 10.1145/2396761.2396869
Citations Scopus - 39
2012 'Proceedings of the 3rd International Workshop on Emerging Trends in Software Metrics, WETSoM 2012, Zurich, Switzerland, June 3, 2012', WETSoM (2012)
2012 Anderson DJ, Concas G, Lunesu MI, Marchesi M, Zhang H, 'A Comparative Study of Scrum and Kanban Approaches on a Real Case Study Using Simulation', AGILE PROCESSES IN SOFTWARE ENGINEERING AND EXTREME PROGRAMMING, XP 2012, Malmo, SWEDEN (2012) [E1]
Citations Web of Science - 18
2012 Zhou J, Zhang H, Lo D, 'Where Should the Bugs Be Fixed? More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports', 2012 34TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), Zurich, SWITZERLAND (2012) [E1]
Citations Scopus - 523, Web of Science - 334
2012 Dang Y, Wu R, Zhang H, Zhang D, Nobel P, 'ReBucket: A Method for Clustering Duplicate Crash Reports Based on Call Stack Similarity', 2012 34TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), Zurich, SWITZERLAND (2012) [E1]
Citations Scopus - 119, Web of Science - 74
2012 Gong L, Lo D, Jiang L, Zhang H, 'Diversity Maximization Speedup for Fault Localization', 2012 PROCEEDINGS OF THE 27TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE), Essen, GERMANY (2012) [E1]
Citations Scopus - 24, Web of Science - 17
2012 Tran MH, Colman A, Han J, Zhang H, 'Modeling and Verification of Context-aware Systems', 2012 19TH ASIA-PACIFIC SOFTWARE ENGINEERING CONFERENCE (APSEC), VOL 1, PEOPLES R CHINA, Hong Kong (2012) [E1]
DOI 10.1109/APSEC.2012.50
Citations Scopus - 7, Web of Science - 2
2012 Wang J, Zhang H, 'Predicting Defect Numbers Based on Defect State Transition Models', PROCEEDINGS OF THE ACM-IEEE INTERNATIONAL SYMPOSIUM ON EMPIRICAL SOFTWARE ENGINEERING AND MEASUREMENT (ESEM'12), Lund, SWEDEN (2012) [E1]
Citations Scopus - 19, Web of Science - 10
2012 Gong L, Lo D, Jiang L, Zhang H, 'Interactive Fault Localization Leveraging Simple User Feedback', 2012 28TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE (ICSM), ITALY, Riva del Garda (2012) [E1]
Citations Scopus - 36, Web of Science - 29
2012 Ding S, Tan HBK, Liu K, Chandramohan M, Zhang H, 'Detection of buffer overflow vulnerabilities in C/C++ with pattern based limited symbolic evaluation', Proceedings - International Computer Software and Applications Conference (2012)

Buffer overflow vulnerability is one of the major security threats for applications written in C/C++. Among the existing approaches for detecting buffer overflow vulnerability, flow-sensitive approaches offer higher precision, but they are limited by heavy overhead and the fact that many constraints are unsolvable. We propose a novel method to efficiently detect vulnerable buffer overflows in any given control flow graph through recognizing two patterns. The proposed approach first uses syntax analysis to filter away those branches that cannot possibly comply with any of the two patterns before applying a limited symbolic evaluation for a precise matching against the patterns. The proposed approach only needs to evaluate a limited set of selected branch predicates according to the patterns and avoids the need to deal with a large number of general branch predicates. This significantly improves the scalability while not sacrificing the detection precision. Our experiments demonstrate the scalability and efficiency of the proposed method, which demonstrates its applicability. © 2012 IEEE.

DOI 10.1109/COMPSACW.2012.103
Citations Scopus - 3
2012 Grieskamp W, Zhang H, 'Message from the QSIC 2012 Industry Track Chairs', Proceedings - International Conference on Quality Software (2012)
DOI 10.1109/QSIC.2012.50
2012 Concas G, Canfora G, Tempero E, Zhang H, 'Welcome to 3rd International Workshop on Emerging Trends in Software Metrics (WETSoM 2012)', 2012 3rd International Workshop on Emerging Trends in Software Metrics, WETSoM 2012 - Proceedings (2012)

Welcome to WETSoM2012, the 3rd International Workshop on Emerging Trends in Software Metrics. Since its start, WETSoM attracted a blend of academic and industrial researchers, creating a stimulating atmosphere to discuss the progresses of software metrics. A key motivation for this workshop is to help overcome the low impact that software metrics has on current software development. This is pursued by critically examining the evidence for the effectiveness of existing metrics and identifying new directions for metrics. Evidence for existing metrics includes how the metrics have been used in practice and studies showing their effectiveness. Identifying new directions includes use of new theories, such as complex network theory, on which to base metrics. We are pleased that this year WETSoM features 12 technical papers and an exciting keynote on mining developers' communication to assess software quality by Massimiliano di Penta. The program of WETSoM2012 is the result of hard work by many dedicated people; we especially thank the authors of submitted papers and the members of the program committee. Above all, the greatest richness of this workshop is its participants, who shape the discussion and point to new directions for software metrics research and practice. We hope you will have a great time and an unforgettable experience at WETSoM2012. © 2012 IEEE.

DOI 10.1109/WETSoM.2012.6226985
2012 Anderson DJ, Concas G, Lunesu MI, Marchesi M, Zhang H, 'A comparative study of scrum and kanban approaches on a real case study using simulation', Lecture Notes in Business Information Processing (2012)

We present the application of software process modeling and simulation using an agent-based approach to a real case study of software maintenance. The original process used PSP/TSP; it spent a large amount of time estimating maintenance requests in advance, and needed to be greatly improved. To this purpose, a Kanban system was successfully implemented, which proved able to substantially improve the process without giving up PSP/TSP. We customized the simulator and, using input data with the same characteristics as the real ones, we were able to obtain results very similar to those of the processes of the case study, in particular of the original process. We also simulated, using the same input data, the possible application of the Scrum process to the same data, showing results comparable to the Kanban process. © 2012 Springer-Verlag Berlin Heidelberg.

DOI 10.1007/978-3-642-30350-0_9
Citations Scopus - 26
2011 Li YF, Zhang H, 'Integrating software engineering data using semantic web technologies', Proceedings - International Conference on Software Engineering (2011)

A plethora of software engineering data have been produced by different organizations and tools over time. These data may come from different sources, and are often disparate and distributed. The integration of these data may open up the possibility of conducting systemic, holistic study of software projects in ways previously unexplored. Semantic Web technologies have been used successfully in a wide array of domains such as health care and life sciences as a platform for information integration and knowledge management. The success is largely due to the open and extensible nature of ontology languages as well as growing tool support. We believe that Semantic Web technologies represent an ideal platform for the integration of software engineering data in a semantic repository. By querying and analyzing such a repository, researchers and practitioners can better understand and control software engineering activities and processes. In this paper, we describe how we apply Semantic Web techniques to integrate object-oriented software engineering data from different sources. We also show how the integrated data can help us answer complex queries about large-scale software projects through a case study on the Eclipse system. © 2011 ACM.

DOI 10.1145/1985441.1985473
Citations Scopus - 9
2011 Wu R, Zhang H, Kim S, Cheung SC, 'ReLink: Recovering links between bugs and changes', SIGSOFT/FSE 2011 - Proceedings of the 19th ACM SIGSOFT Symposium on Foundations of Software Engineering (2011) [E1]

Software defect information, including links between bugs and committed changes, plays an important role in software maintenance such as measuring quality and predicting defects. Usually, the links are automatically mined from change logs and bug reports using heuristics such as searching for specific keywords and bug IDs in change logs. However, the accuracy of these heuristics depends on the quality of change logs. Bird et al. found that there are many missing links due to the absence of bug references in change logs. They also found that the missing links lead to biased defect information, and it affects defect prediction performance. We manually inspected the explicit links, which have explicit bug IDs in change logs, and observed that the links exhibit certain features. Based on our observation, we developed an automatic link recovery algorithm, ReLink, which automatically learns criteria of features from explicit links to recover missing links. We applied ReLink to three open source projects. ReLink reliably identified links with 89% precision and 78% recall on average, while the traditional heuristics alone achieve 91% precision and 64% recall. We also evaluated the impact of recovered links on software maintainability measurement and defect prediction, and found the results of ReLink yield significantly better accuracy than those of traditional heuristics. © 2011 ACM.

DOI 10.1145/2025113.2025120
Citations Scopus - 333
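
The traditional heuristic mentioned in the ReLink abstract above, searching change logs for keywords and bug IDs, can be sketched roughly as follows. This is a hedged illustration in Python rather than ReLink itself or any specific tool's rule set; the keyword list, the regular expression, and the sample log messages are all assumptions.

```python
import re
from typing import Dict, List

# Hypothetical pattern: a keyword ("bug", "issue", "fix", "fixes", "fixed"),
# an optional "#", then the numeric bug ID.
BUG_REF = re.compile(r"\b(?:bug|issue|fix(?:es|ed)?)\s*#?\s*(\d+)", re.IGNORECASE)


def extract_bug_ids(change_log: str) -> List[int]:
    """Return the bug IDs explicitly referenced in one change-log message."""
    return [int(match) for match in BUG_REF.findall(change_log)]


def link_changes_to_bugs(change_logs: Dict[str, str]) -> Dict[str, List[int]]:
    """Map each change ID to the bug IDs found in its log message."""
    return {change_id: extract_bug_ids(msg) for change_id, msg in change_logs.items()}


# Made-up change logs (illustrative only).
logs = {
    "c1": "Fixed bug #1234; see issue 567",
    "c2": "Refactor logging module",
}
links = link_changes_to_bugs(logs)
```

Change c2 carries no explicit reference, so a keyword search of this kind finds no link for it; this is the sort of missing link the abstract describes.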
2011 'Proceedings of the 2nd International Workshop on Emerging Trends in Software Metrics, WETSoM 2011, Waikiki, Honolulu, HI, USA, May 24, 2011', WETSoM (2011)
2011 Kim S, Zhang H, Wu R, Gong L, 'Dealing with Noise in Defect Prediction', 2011 33RD INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), Honolulu, HI (2011) [E1]
Citations Scopus - 279, Web of Science - 191
2011 Concas G, Di Penta M, Tempero E, Zhang H, 'Workshop on Emerging Trends in Software Metrics (WETSoM 2011)', 2011 33RD INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), Honolulu, HI (2011) [E3]
2011 'Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering' (2011)
DOI 10.1145/2025113
2011 Liu K, Tan HBK, Chen X, Zhang H, Padmanabhuni BM, 'Automated extraction of data lifecycle support from database applications', SEKE 2011 - Proceedings of the 23rd International Conference on Software Engineering and Knowledge Engineering (2011)

Database application is one of the most common types of systems. Grounded on the simple concept of data lifecycle (any data in a database is created from insertion, used via selection and modification, and terminated at deletion), this paper proposes a novel approach to reverse engineer the data lifecycle automatically from the source code of database applications. The extracted information can be used for the selection of open-source database applications for adaptation. It can also be used for maintenance and verification of database applications. A tool has been developed to implement the proposed approach for PHP-based database applications. Case studies have also been conducted to evaluate the use of the proposed approach.

Citations Scopus - 4
2011 Concas G, Tempero E, Zhang H, Di Penta M, 'Workshop on Emerging Trends in Software Metrics (WETSoM 2011)', Proceedings - International Conference on Software Engineering (2011)
2011 Jarzabek S, Pettersson U, Zhang H, 'University-industry collaboration journey towards product lines', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2011)

Product Lines for mission critical Command and Control systems was a starting point for a long lasting research collaboration between National University of Singapore (NUS) and ST Electronics (Info-Software Systems) Pte Ltd (STEE-InfoSoft). Collaboration was intensified by a joint research project, also involving University of Waterloo and Netron Inc. that led to development of reuse technology called XVCL. The contribution of this paper is twofold: First, we describe collaboration modes, factors that were critical to sustain collaboration, and benefits for university and industry gained over years. Among the main benefits, STEE-InfoSoft advanced its reuse practice by applying XVCL in several software Product Line projects, while NUS team received early feedback from STEE-InfoSoft which helped refine XVCL reuse methods and keep academic research in sync with industrial realities. Academic findings and industrial pilots have opened new unexpected research directions. Second, we draw lessons learned from many projects, to explain the general nature and significance of problems addressed with the XVCL approach. © 2011 Springer-Verlag.

DOI 10.1007/978-3-642-21347-2_17
Citations Scopus - 4
2010 'Proceedings of the 2010 ICSE Workshop on Emerging Trends in Software Metrics, WETSoM 2010, Cape Town, South Africa, May 4, 2010', WETSoM (2010)
2010 Zhang H, Jarzabek S, 'A Hybrid Approach to Feature-Oriented Programming in XVCL', SOFTWARE PRODUCT LINES: GOING BEYOND, SOUTH KOREA, Jeju Island (2010)
Citations Scopus - 4, Web of Science - 4
2010 Zhang H, Shi B, Zhang L, 'Automatic Checking of License Compliance', 2010 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE, Timisoara, ROMANIA (2010)
Citations Scopus - 6, Web of Science - 1
2010 Zhang H, Wu R, 'Sampling Program Quality', 2010 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE, Timisoara, ROMANIA (2010)
Citations Scopus - 7
2010 Zhang H, Nelson A, Menzies T, 'On the value of learning from defect dense components for software defect prediction', ACM International Conference Proceeding Series (2010)

BACKGROUND: Defect predictors learned from static code measures can isolate code modules with a higher than usual probability of defects. AIMS: To improve those learners by focusing on the defect-rich portions of the training sets. METHOD: Defect data CM1, KC1, MC1, PC1, PC3 was separated into components. A subset of the projects (selected at random) were set aside for testing. Training sets were generated for a NaiveBayes classifier in two ways. In the dense treatment, the components with higher than the median number of defective modules were used for training. In the standard treatment, modules from any component were used for training. Both samples were run against the test set and evaluated using recall, probability of false alarm, and precision. In addition, under sampling and over sampling were performed on the defect data. Each method was repeated in a 10-by-10 cross-validation experiment. RESULTS: Prediction models learned from defect dense components outperformed the standard method, under sampling, as well as over sampling. In statistical rankings based on recall, probability of false alarm, and precision, models learned from dense components won 4-5 times more often than any other method, and also lost the fewest times. CONCLUSIONS: Given training data where most of the defects exist in small numbers of components, better defect predictors can be trained from the defect dense components.

DOI 10.1145/1868328.1868350
Citations Scopus - 13
2010 Canfora G, Concas G, Marchesi M, Tempero E, Zhang H, 'Workshop on Emerging Trends in Software Metrics (WETSoM 2010)', Proceedings - International Conference on Software Engineering (2010)

The Workshop on Emerging Trends in Software Metrics aims at bringing together researchers and practitioners to discuss the progress of software metrics. The motivation for this workshop is the low impact that software metrics has on current software development. The goals of this workshop are to critically examine the evidence for the effectiveness of existing metrics and to identify new directions for development of software metrics. © 2010 ACM.

DOI 10.1145/1810295.1810428
2010 Canfora G, Concas G, Marchesi M, Tempero E, Zhang H, 'Proceedings - International Conference on Software Engineering: Foreword', Proceedings - International Conference on Software Engineering (2010)
2009 Liu L, Zhang H, Ma W, Shan Y, Xu J, Peng F, Burda T, 'Understanding Chinese Characteristics of Requirements Engineering', PROCEEDINGS OF THE 2009 17TH IEEE INTERNATIONAL REQUIREMENTS ENGINEERING CONFERENCE, Atlanta, GA (2009)
DOI 10.1109/RE.2009.14
Citations Scopus - 9, Web of Science - 4
2009 Jarzabek S, Xue Y, Zhang H, Lee Y, 'Avoiding Some Common Preprocessing Pitfalls with Feature Queries', APSEC 09: SIXTEENTH ASIA-PACIFIC SOFTWARE ENGINEERING CONFERENCE, PROCEEDINGS, MALAYSIA, Bat Ferringhi (2009)
DOI 10.1109/APSEC.2009.61
Citations Scopus - 1
2009 Zhang H, 'An Investigation of the Relationships between Lines of Code and Defects', 2009 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE, CONFERENCE PROCEEDINGS, Edmonton, CANADA (2009)
DOI 10.1109/ICSM.2009.5306304
Citations Scopus - 112, Web of Science - 75
2009 Jarzabek S, Zhang H, Lee Y, Xue Y, Shaikh N, 'Increasing Usability of Preprocessing for Feature Management in Product Lines with Queries', 2009 31ST INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, COMPANION VOLUME, Vancouver, CANADA (2009)
DOI 10.1109/ICSE-COMPANION.2009.5070985
Citations Scopus - 4, Web of Science - 1
2008 Zhang H, 'Exploring Regularity in Source Code: Software Science and Zipf's Law', FIFTEENTH WORKING CONFERENCE ON REVERSE ENGINEERING, PROCEEDINGS, BELGIUM, Antwerp (2008)
DOI 10.1109/WCRE.2008.37
Citations Scopus - 12, Web of Science - 9
2008 Zhang H, 'An initial study of the growth of Eclipse defects', Proceedings - International Conference on Software Engineering (2008)

We analyze the Eclipse defect data from June 2004 to November 2007, and find that the growth of the number of defects can be well modeled by polynomial functions. Furthermore, we can predict the number of future Eclipse defects based on the nature of defect growth. Copyright 2008 ACM.

DOI 10.1145/1370750.1370785
Citations Scopus - 10
2008 Zhang H, 'The scale-free nature of semantic web ontology', Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08 (2008)

Semantic web ontology languages, such as OWL, have been widely used for knowledge representation. Through empirical analysis of real-world ontologies we discover that, like many natural and social phenomena, the semantic web ontology is also "scale-free".

DOI 10.1145/1367497.1367649
Citations Scopus - 16
2007 Zhang H, Zhang X, Gu M, 'Predicting defective software components from code complexity measures', 13TH PACIFIC RIM INTERNATIONAL SYMPOSIUM ON DEPENDABLE COMPUTING, PROCEEDINGS, Melbourne, AUSTRALIA (2007)
DOI 10.1109/PRDC.2007.28
Citations Web of Science - 28
2007 Zhang H, Tan HBK, 'An empirical study of class sizes for large Java systems', 14TH ASIA-PACIFIC SOFTWARE ENGINEERING CONFERENCE, PROCEEDINGS, Nagoya, JAPAN (2007)
DOI 10.1109/ASPEC.2007.64
Citations Web of Science - 13
2007 Zhang H, Tan HBK, 'An empirical study of class sizes for large java systems', Proceedings - Asia-Pacific Software Engineering Conference, APSEC (2007)

We perform an empirical study of class sizes (in terms of Lines of Code) on a number of large Java software systems, and discover an interesting pattern - that many classes have only small sizes whereas a few classes have large size. We call this phenomenon the small class phenomenon. Further analysis shows that the class sizes follow the lognormal distribution. Having understood the distribution of class sizes, we then derive a general size estimation model, which reveals the relationship between the size of a large Java system and the number of classes the system has. In this paper, we also show that the adoption of object-orientation is a possible cause of the small class phenomenon. We believe our study reveals the regularity that emerges from large-scale object-oriented software construction, and hope our research can contribute to a deep understanding of computer programming. © 2007 IEEE.

DOI 10.1109/APSEC.2007.20
Citations Scopus - 22
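
The abstract above reports that class sizes follow a lognormal distribution. One simple way to estimate lognormal parameters from measured sizes is to take the mean and population standard deviation of the log-sizes (the maximum-likelihood estimates); the Python sketch below does that. It is an illustration under that assumption, not the paper's procedure, and the function names and sample sizes are made up.

```python
import math
from statistics import fmean, pstdev
from typing import Sequence, Tuple


def fit_lognormal(sizes: Sequence[float]) -> Tuple[float, float]:
    """Estimate (mu, sigma) of a lognormal distribution from positive sizes."""
    if len(sizes) < 2:
        raise ValueError("need at least two observations")
    if any(s <= 0 for s in sizes):
        raise ValueError("sizes must be positive to take logarithms")
    log_sizes = [math.log(s) for s in sizes]
    # Mean and population standard deviation of the logs are the MLEs.
    return fmean(log_sizes), pstdev(log_sizes)


def lognormal_median(mu: float) -> float:
    """The median of a lognormal distribution is exp(mu)."""
    return math.exp(mu)


# Hypothetical class sizes in lines of code (illustrative only).
class_sizes_loc = [12, 20, 35, 35, 60, 110, 480]
mu, sigma = fit_lognormal(class_sizes_loc)
typical_size = lognormal_median(mu)
```

With data shaped like this, a few large classes pull the arithmetic mean well above the fitted median, which matches the "many small, few large" pattern the abstract describes.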
2007 Zhang H, Zhang X, Gu M, 'Predicting defective software components from code complexity measures', Proceedings - 13th Pacific Rim International Symposium on Dependable Computing, PRDC 2007 (2007)

The ability to predict defective modules can help us allocate limited quality assurance resources effectively and efficiently. In this paper, we propose a complexity-based method for predicting defect-prone components. Our method takes three code-level complexity measures as input, namely Lines of Code, McCabe's Cyclomatic Complexity and Halstead's Volume, and classifies components as either defective or non-defective. We perform an extensive study of twelve classification models using the public NASA datasets. Cross-validation results show that our method can achieve good prediction accuracy. This study confirms that static code complexity measures can be useful indicators of component quality. © 2007 IEEE.

DOI 10.1109/PRDC.2007.56
Citations Scopus - 45, Web of Science - 11
2007 Peng D, Jarzabek S, Rajapakse DC, Zhang H, 'Reuse of database access layer components in JEE product lines: Limitations and a possible solution (Case Study)', 19th International Conference on Software Engineering and Knowledge Engineering, SEKE 2007 (2007)

We set up an experiment to evaluate JEE as a platform for product line development. While JEE provides many useful mechanisms for reuse of common services/components, still we found that systematic across-the-board reuse in application domain-specific areas was hard. The main difficulty was the lack of a mechanism to represent groups of similar components in a generic, adaptable form. Such similar components arise as the number of variant features of a product line grows, and we need to accommodate legal combinations of variant features in components of a product line architecture. Such uncontrolled growth of similar component versions hinders productivity of reuse-based development and raises maintenance costs. In the paper, we study the manifestation of this problem in the JEE database access layer. Interactive Development Environments such as NetBeans or JBuilder speed up the development process, but they do not address the source of the problem, which is the lack of mechanisms to design generic components capable of accommodating variant features in various combinations. We filled this gap with a "mixed strategy" solution based on generative programming technique of XVCL applied on top of JEE. In the paper, we highlight the nature of the problems we encountered and our solution. Copyright © (2007) by Knowledge Systems Institute (KSI).

Citations Scopus - 2
2006 Tan HBK, Zhao Y, Zhang H, 'Estimating LOC for information systems from their conceptual data models', Proceedings - International Conference on Software Engineering (2006)

Effort and cost estimation is crucial in software management. Estimation of software size plays a key role in the estimation. Line of Code (LOC) is still a commonly used software size measure. Despite the fact that software sizing is well recognized as an important problem for more than two decades, there is still much problem in existing methods. Conceptual data model is widely used in the requirements analysis for information systems. It is also not difficult to construct conceptual data models in the early stage of developing information systems. Much characteristic of an information system is actually reflected from its conceptual data model. We explore into the use of conceptual data model for estimating LOC. This paper proposes a novel method for estimating LOC for an information system from its conceptual data model through the use of multiple linear regression model. We have validated the method through collecting samples from both the industry and open-source systems. Copyright 2006 ACM.

DOI 10.1145/1134285.1134331
Citations Scopus - 12
2006 Jarzabek S, Zhang H, Shen RU, Lam VT, Zhenxin S, 'Analysis of meta-programs: An example', International Journal of Software Engineering and Knowledge Engineering (2006)

Meta-programs are generic, incomplete, adaptable programs that are instantiated at construction time to meet specific requirements. Templates and generative techniques are examples of meta-programming techniques. Understanding of meta-programs is more difficult than understanding of concrete, executable programs. Static and dynamic analysis methods have been applied to ease understanding of programs - can similar methods be used for meta-programs? In our projects, we build meta-programs with a meta-programming technique called XVCL. Meta-programs in XVCL are organized into a hierarchy of meta-components from which the XVCL processor generates concrete, executable programs that meet specific requirements. We developed an automated system that analyzes XVCL meta-programs, and presents developers with information that helps them work with meta-programs more effectively. Our system conducts both static and dynamic analysis of a meta-program. An integral part of our solution is a query language, FQL, in which we formulate questions about meta-program properties. An FQL query processor automatically answers a class of queries. The analysis method described in the paper is specific to XVCL. However, the principle of our approach can be applied to other meta-programming systems. We believe readers interested in metaprogramming in general will find some of the lessons from our experiment interesting and useful. © World Scientific Publishing Company.

DOI 10.1142/S0218194006002689
Citations Scopus - 1
2005 Sun J, Zhang H, Li YF, Wang H, 'Formal semantics and verification for feature modeling', Proceedings of the IEEE International Conference on Engineering of Complex Computer Systems, ICECCS (2005)

Research on features has received much attention in the domain engineering community. Feature modeling plays an important role in the design and implementation of complex software systems. However, the presentation and analysis of feature models are still largely informal. There is also an increasing need for methods and tools that can support automated feature model analysis. This paper presents a formal engineering approach to the specification and verification of feature models. A formal semantics for the feature modeling language is defined using first-order logic. It provides a precise and rigorous formal interpretation for the graphical notation. In addition, further validation of the semantics using the Z/EVES theorem prover is presented. Finally, we demonstrate that the consistency of a feature model and its configurations can be automatically verified by encoding the semantics into the Alloy Analyzer. A case study of the Key Word in Context (KWIC) index systems feature model is presented to illustrate the verification process. © 2005 IEEE.

Citations Scopus - 93, Web of Science - 38
2005 Zhang HY, Bradbury JS, Cordy JR, Dingel J, 'Implementation and verification of implicit-invocation systems using source transformation', Fifth IEEE International Workshop on Source Code Analysis and Manipulation, Proceedings, Budapest, Hungary (2005)
DOI 10.1109/SCAM.2005.15
Citations Web of Science - 3
2003 Zhang H, Jarzabek S, 'An XVCL approach to handling variants: A KWIC product line example', Proceedings - Asia-Pacific Software Engineering Conference, APSEC (2003)

We developed XVCL (XML-based Variant Configuration Language), a method and tool for product lines, to facilitate handling variants in reusable software assets (such as architecture, code components or UML models). XVCL is a newer version of Bassett's frames [1], a technology that has achieved substantial productivity improvements in large data processing product lines written in COBOL. Despite its simplicity, XVCL can effectively manage a wide range of product line variants from a compact base of meta-components, structured for effective reuse. We applied XVCL in two medium-size product line projects and a number of smaller case studies. In this paper, we communicate XVCL's capabilities to support product lines by means of a simple, but still interesting, example of the KWIC system introduced by Parnas in 1970's. We show how we can handle functional variants, variant design decisions and implementation-level variants in a generic KWIC system.

DOI 10.1109/APSEC.2003.1254364
Citations Scopus - 6, Web of Science - 3
2003 Jarzabek S, Ong WC, Zhang H, 'Handling variant requirements in domain modeling', Journal of Systems and Software (2003)

Domain models describe common and variant requirements for a family of similar systems. Although most of the notations, such as UML, are meant for modeling a single system, they can be extended to model variants. We have done that and applied such extended notations in our projects. We soon found that our models with variants were becoming overly complicated, undermining the major role of domain analysis which is understanding. One variant was often reflected in many models and any given model was affected by many variants. The number of possible variant combinations was growing rapidly and mutual dependencies among variants even further complicated the domain model. We realized that our purely descriptive domain model was only useful for small examples but it did not scale up. In this paper, we describe a modeling method and a Flexible Variant Configuration tool (FVC for short) that alleviate the above mentioned problems. In our approach, we start by modeling so-called domain defaults, i.e., requirements that characterize a typical system in a domain. Then, we describe variants as deltas in respect to domain defaults. The FVC interprets variants to produce customized domain model views for a system that meets specific requirements. We implemented the above concepts using commercial tools Netron Fusion and Rational Rose. In the paper, we illustrate our domain modeling method and tool with examples from the Facility Reservation System domain. © 2003 Elsevier Inc. All rights reserved.

DOI 10.1016/S0164-1212(03)00060-8
Citations Scopus - 12, Web of Science - 9
2003 Jarzabek S, Bassett P, Zhang H, Zhang W, 'XVCL: XML-based variant configuration language', Proceedings - International Conference on Software Engineering (2003)

XML-based Variant Configuration Language (XVCL) is a meta-programming technique and tool that provides effective reuse mechanisms. It includes a methodology and a tool, the XVCL processor. The methodology shows how to discover the structure of the solution for the application domain and for the types of variants one wants to address. The XVCL processor automates the routine yet error-prone program construction tasks, allowing to focus on what is novel about the problem domains, requiring creativity.

DOI 10.1109/icse.2003.1201298
Citations Scopus - 67, Web of Science - 29
2002 Swe SM, Zhang H, Jarzabek S, 'XVCL: A tutorial', ACM International Conference Proceeding Series (2002)

XVCL (XML-based Variant Configuration Language) is a general-purpose mark-up language for configuring variants in programs and other types of documents. We can apply XVCL to configure variants in a variety of software assets such as software architecture, program code, test cases, technical and user-level program documentation or requirement specifications. The principles of the XVCL have been thoroughly tested in practice. XVCL is based on the same concepts as the frame technology [1]. Frame technology has been extensively applied in industry to manage variants and evolve multi-million-line, COBOL-based, information systems. An independent analysis showed that frame technology has reduced large software project costs by over 84% and their times-to-market by 70%, when compared to industry norms [1, 2]. At the same time, we found that the principles of XVCL are not easy to communicate. In this paper, we describe a subset of XVCL. We trust this subset of XVCL is easy to understand and still effectively communicates essential XVCL concepts. To illustrate the XVCL method, we further describe an XVCL solution to handling variants in a Notepad system. Copyright 2002 ACM.

DOI 10.1145/568760.568821
Citations Scopus - 10
2001 Durrani TS, Leyman AR, 'Message from the chairmen', IEEE Workshop on Statistical Signal Processing Proceedings (2001)
2001 Wong TW, Jarzabek S, Swe SM, Shen R, Zhang H, 'XML implementation of frame processor', Proceedings of SSR'01 2001 Symposium on Software Reusability (2001)

A quantitative study has shown that frame technology [1] supported by Fusion toolset can lead to reduction in time-to-market (70%) and project costs (84%). Frame technology has been developed to handle large COBOL-based business software product families. We wished to investigate how the principle of frame approach can be applied to support product families in other application domains, in particular to build distributed component-based systems written in Object-Oriented languages. As Fusion is tightly coupled with COBOL, we implemented our own tools based on frame concepts using the XML technology. In our solution, a generic architecture for a product family is a hierarchy of XML documents. Each such document contains a reusable program fragment instrumented for change with XML tags. We use a tool built on top of XML parsing framework JAXP to process documents in order to produce a custom member of a product family. Our solution is cost-effective and extensible. In the paper, we describe our solution, illustrating its use with examples. We intend to make our solution available to public in order to encourage investigation of frame concepts in other application domains, implementation languages and platforms.

DOI 10.1145/375212.375285
Citations Scopus - 17
2001 Zhang H, Jarzabek S, Swe SM, 'XVCL approach to separating concerns in product family assets', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2001)

In this paper, we describe an XML-based language, called XVCL, for managing variants in component-based product families. Using XVCL, we can organize product family assets and instrument them to accommodate variants. A tool that interprets XVCL and provides semi-automatic support for asset customization is also introduced. In our projects, we applied XVCL to manage variants in UML domain models and in generic architectures for product families. We have achieved simple forms of separation of concerns (in both models and architectures) and we are investigating advanced forms in current work. We plan to compare XVCL to other emerging techniques that lead to separating of concerns in software models, documents, architectures and code.

DOI 10.1007/3-540-44800-4_4
Citations Scopus - 13
2001 Jarzabek S, Zhang H, 'XML-based method and tool for handling variant requirements in domain models', Proceedings of the IEEE International Conference on Requirements Engineering (2001)

A domain model describes common and variant requirements for a system family. UML notations used in requirements analysis and software modeling can be extended with "variation points" to cater for variant requirements. However, UML models for a large single system are already complicated enough. With variants - UML domain models soon become too complicated to be useful. The main reasons are the explosion of possible variant combinations, complex dependencies among variants and inability to trace variants from a domain model down to the requirements for a specific system, member of a family. We believe that the above mentioned problems cannot be solved at the domain model description level alone. In the paper, we propose a novel solution based on a tool that interprets and manipulates domain models to provide analysts with customized, simple domain views. We describe a variant configuration language that allows us to instrument domain models with variation points and record variant dependencies. An interpreter of this language produces customized views of a domain model, helping analysts understand and reuse software models. We describe the concept of our approach and its simple implementation based on XML and XMI technologies.

Citations Scopus - 36, Web of Science - 15

Grants and Funding

Summary

Number of grants 5
Total funding $710,646


2022: 1 grant / $287,435

Intelligent Incident Management for Software-Intensive Systems - $287,435

Funding body ARC (Australian Research Council)
Project Team Associate Professor Hongyu Zhang, Dr Huong Ha
Scheme Discovery Projects
Role Lead
Funding Start 2022
Funding Finish 2024
GNo G2100087
Type Of Funding C1200 - Aust Competitive - ARC
Category 1200
UON Y

2020: 2 grants / $287,989

Data-driven Approach to Resilient Online Service Systems - $264,489

Funding body ARC (Australian Research Council)
Project Team Associate Professor Hongyu Zhang, Professor Michael Lyu
Scheme Discovery Projects
Role Lead
Funding Start 2020
Funding Finish 2022
GNo G1900151
Type Of Funding C1200 - Aust Competitive - ARC
Category 1200
UON Y

Machine learning (ML), statistical methods and simulations for signal-sorting - $23,500

Funding body University of Melbourne
Project Team Professor Stephan Chalup, Associate Professor Hongyu Zhang, Mr Thomas Dowdell
Scheme AMSI Australian Postgraduate Research Internships
Role Investigator
Funding Start 2020
Funding Finish 2020
GNo G2001206
Type Of Funding Scheme excluded from IGS
Category EXCL
UON Y

2017: 2 grants / $135,222

Model Building based on Source Code for Problem Location - $71,372

Funding body Huawei Technologies Co.,Ltd.
Project Team Associate Professor Hongyu Zhang
Scheme Huawei Research Innovation Program (HIRP)
Role Lead
Funding Start 2017
Funding Finish 2017
GNo G1701312
Type Of Funding C3400 – International For Profit
Category 3400
UON Y

The Exploration of Auto-Code-Generation Technologies and Possible Applications - $63,850

Funding body Huawei Technologies Co.,Ltd.
Project Team Associate Professor Hongyu Zhang
Scheme Huawei Research Innovation Program (HIRP)
Role Lead
Funding Start 2017
Funding Finish 2017
GNo G1701333
Type Of Funding C3400 – International For Profit
Category 3400
UON Y

Research Supervision

Number of supervisions

Completed 7
Current 3

Current Supervision

Commenced | Level of Study | Research Title | Program | Supervisor Type
2022 | PhD | Intelligent Fault Detection for Belt Conveyor Idlers Using Machine Learning | PhD (Information Technology), College of Engineering, Science and Environment, The University of Newcastle | Co-Supervisor
2020 | PhD | Node Failure Prediction and Localisation for Cloud Service Systems | PhD (Software Engineering), College of Engineering, Science and Environment, The University of Newcastle | Principal Supervisor
2020 | PhD | Robust Optimization of Dynamic Steel Production Scheduling Processes | PhD (Computer Science), College of Engineering, Science and Environment, The University of Newcastle | Co-Supervisor

Past Supervision

Year | Level of Study | Research Title | Program | Supervisor Type
2022 | PhD | Exploring Factors that Influence the Acceptance of Clinical Decision Support Systems in Saudi Arabia | PhD (Information Systems), College of Engineering, Science and Environment, The University of Newcastle | Co-Supervisor
2022 | PhD | Mining Numerical Invariants for Improving Software Reliability | PhD (Computer Science), College of Engineering, Science and Environment, The University of Newcastle | Principal Supervisor
2021 | PhD | A Framework for Functional Feature and Crosscutting Concern Modelling in Software Product Lines | PhD (Software Engineering), College of Engineering, Science and Environment, The University of Newcastle | Co-Supervisor
2013 | Masters | Spectrum-based Fault Localization | Computer Science, Tsinghua University | Sole Supervisor
2013 | Masters | Analysis and Prediction of Software Team's Bug Fixing Ability | Computer Science, Tsinghua University | Sole Supervisor
2012 | Masters | Methods and Tools for Software Defect Prediction | Computer Science, Tsinghua University | Sole Supervisor
2012 | Masters | Techniques for Duplicate Bug Report Detection and Bug Localization | Computer Science, Tsinghua University | Sole Supervisor

Research Collaborations

The map represents a researcher's co-authorship with collaborators across the globe. It displays the number of publications against a country where at least one co-author is based. Data is sourced from the University of Newcastle research publication management system (NURO) and may not fully represent the author's complete body of work.

Country Count of Publications
China 187
Australia 139
United States 84
Singapore 38
Hong Kong 26

News

News • 1 Oct 2020

Our researchers recognised in The Australian’s Research 2020 magazine

The Australian's Research 2020 magazine paid tribute to several University of Newcastle researchers for their track record of excellence and contribution to their fields.

Associate Professor Hongyu Zhang

Position

Honorary Associate Professor
School of Information and Physical Sciences
College of Engineering, Science and Environment

Focus area

Computer Science and Software Engineering

Contact Details

Email hongyu.zhang@newcastle.edu.au
Phone (02) 4921 7790

Office

Room ES233
Building ES.
Location Callaghan
University Drive
Callaghan, NSW 2308
Australia