1 Department of Biostatistics, School of Public Health, Fudan University, Shanghai 200032, China - 2 Fudan University Library, Fudan University, Shanghai 200032, China - 3 Northshore University HealthSystem, Evanston, IL 60201, USA - 4 University of Chicago Medicine and Biological Sciences, Chicago, IL 60637, USA - 5 Xuhui Central Hospital, Shanghai 200032, China - 6 Fudan University Library, Fudan University, Shanghai 200032, China - 7 Eye and ENT Hospital of Fudan University, Shanghai 200032, China - 8 Department of Biostatistics, School of Public Health, Fudan University, Shanghai 200032, China


Objective: Type 2 diabetes (T2D) is a complex disease caused by a combination of genetic and environmental factors. Although many loci, including genes and single nucleotide polymorphisms (SNPs), have been identified as risk variants for T2D, they explain only approximately 10% of its heritability. In the current study, we propose a data processing and analysis procedure to evaluate more accurately the association between copy number variations (CNVs) and the pathogenesis of T2D.

Methods: Data came from the Wellcome Trust Case Control Consortium (WTCCC) genome-wide CNV database. Individual CNVs were identified with the SW-ARRAY and CBS algorithms and genotyped with a global threshold method. After CNV calling, CNVs that overlapped across samples were split into smaller but more accurate CNV segments (CNVSegs). LASSO-based logistic regression models with 10-fold cross-validation were then run 100 times to examine the association of CNVSegs with T2D. The area under the curve (AUC) of every model was summarized as a preliminary check of the models' classification ability.
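The splitting of overlapping CNVs into CNVSegs can be illustrated with a minimal sketch. The abstract does not state the exact splitting rule, so the version below rests on an assumption: CNVs on a single chromosome are cut at every start and end coordinate, and each resulting piece is kept if at least one CNV covers it. The function name `split_into_segments` and the half-open coordinate convention are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a CNV is a half-open interval [start, end)
# in base pairs on one chromosome.
Interval = Tuple[int, int]


def split_into_segments(cnvs: List[Interval]) -> List[Interval]:
    """Cut overlapping CNVs at every breakpoint and keep the covered pieces.

    Assumed rule (not taken from the paper): a piece between two
    consecutive breakpoints becomes a segment if any CNV spans it.
    """
    # Collect every distinct start and end coordinate.
    breakpoints = sorted({pos for start, end in cnvs for pos in (start, end)})
    segments = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        # Keep the piece only if at least one CNV fully covers it.
        if any(start <= left and right <= end for start, end in cnvs):
            segments.append((left, right))
    return segments


# Two overlapping CNVs and one isolated CNV.
cnvs = [(100, 500), (300, 800), (1000, 1200)]
segments = split_into_segments(cnvs)
print(segments)
```

In this toy input the two overlapping CNVs yield three segments (the left-only part, the shared part, and the right-only part), while the isolated CNV is kept whole and the uncovered gap between 800 and 1000 is dropped.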

Results: After quality control, 1,813 T2D cases and 2,777 controls were included in the study. A total of 65,163 CNVs were identified, of which 25,512 were found in the T2D group and 39,651 in the healthy control group. Pre-processing the raw CNV data produced 22,279 CNVSegs. Fitting 1,000 LASSO logistic regression models identified 26 CNVSegs as T2D-associated according to pre-defined criteria (Frequency > 85% and Length ≥ 50 bp). Twenty-seven protein-coding genes overlapped these CNVSegs, of which 11 were relevant to T2D, obesity, or metabolic syndrome according to published evidence. The average AUC across all models was 0.611, with a maximum of 0.683.
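The pre-defined criteria can be sketched as a simple filter over the fitted models. This is an illustrative reading, not the authors' code: it assumes that "Frequency" means the fraction of fitted models in which a CNVSeg received a non-zero LASSO coefficient, and that the set of selected CNVSegs per model is already available. The function name `stable_segments`, the segment identifiers, and the toy numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def stable_segments(
    selected_per_model: List[Set[str]],
    seg_length: Dict[str, int],
    min_freq: float = 0.85,
    min_length: int = 50,
) -> List[str]:
    """Return segments with selection frequency > min_freq and length >= min_length.

    selected_per_model: one set of selected segment IDs per fitted model
    (assumed here to be the segments with non-zero coefficients).
    seg_length: length of each segment in base pairs.
    """
    n_models = len(selected_per_model)
    # Count in how many models each segment was selected.
    counts = Counter(seg for chosen in selected_per_model for seg in chosen)
    return sorted(
        seg
        for seg, count in counts.items()
        if count / n_models > min_freq and seg_length[seg] >= min_length
    )


# Toy input with 10 models instead of the 1,000 used in the study.
selected_per_model = (
    [{"segA", "segB", "segC"}] * 8 + [{"segA", "segC"}] + [{"segC"}]
)
seg_length = {"segA": 1200, "segB": 400, "segC": 30}
kept = stable_segments(selected_per_model, seg_length)
print(kept)
```

Here `segA` is selected in 9 of 10 models and is long enough, so it passes. `segB` is selected in only 8 of 10 models, and `segC`, although selected in every model, is shorter than 50 bp, so both are filtered out.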

Conclusions: Our study used LASSO logistic regression models to explore T2D-associated CNVSegs across the whole genome, with the aim of a more complete understanding of the genetic mechanisms of T2D. Further studies are needed to verify the influence of these susceptibility loci on the pathogenesis or progression of T2D in different populations.


Keywords: Type 2 diabetes, copy number variation, genome-wide association, LASSO