The DNA barcoding project on German Diptera: An appreciative and critical analysis with four suggestions for improving the development and reliability of DNA-based identification

The progress in constructing a DNA barcode library for German Diptera as published by Morinière et al. (2019, Mol. Ecol. Resour. 19: 900–928) is appraised from a dipterists’ perspective. The coverage of the diversity of German Diptera in terms of barcode index numbers (BINs) and identified barcodes is analysed and visualized in simple diagrams. The influence of the project setup, methodology and/or systematic effects on the emerging numbers and trends is elucidated and extensively discussed. In addition, the documentation on the species identification methods in the database is assessed. Based on this evaluation, four ways for improving the future development, utility and reliability of this DNA database and similar projects in general are identified: (1) Sample the collections of experts. This results in a greater and more reliable coverage within a limited time frame, as opposed to random collecting and relying on a posteriori identification. (2) Give priority to medically, agriculturally or ecologically important families. Addressing these gaps will meet the most pressing needs of the community and serve as a good advertisement for the usefulness and wide applicability of the DNA barcode library. (3) Allocate resources to recruiting established experts as opposed to trainees. The fact that half of the recovered BINs remained unidentified mostly results from the insufficient involvement of experts (and expert time). (4) Appropriately document the morphological identifications by experts in the database. This will make it possible to assess the reliability of DNA-based identifications and to prioritize conflicting identifications within a BIN accordingly.


INTRODUCTION
Morinière et al. (2019) published an extensive DNA barcode library for German Diptera with results from two major barcoding projects, the "Barcoding Fauna Bavarica" project (Hendrich et al., 2010) and the "German Barcode of Life" project (Geiger et al., 2016). The published data set includes 40,753 records and roughly 5,200 BINs (barcode index numbers) of 2,453 named species and 2,700 "dark taxa", i.e. BINs that are unidentified. In their paper, the authors propose their DNA barcode library as an "intermediate taxonomic system" that will provide a foundation for subsequent taxonomic and biodiversity studies, but they also address problematic issues, such as dark taxa and the "taxonomic impediment".
Undoubtedly DNA-based identification, i.e. the identification of specimens by comparing their DNA barcode to other, already identified DNA barcodes, is becoming an important and powerful tool for applied projects, such as assessing and comparing biodiversity patterns (Ratnasingham & Hebert, 2013; Morinière et al., 2019). Some methodological shortcomings, such as BIN sharing etc., are inherent in the system (Meier & Zhang, 2009; Ratnasingham & Hebert, 2013; Morinière et al., 2019) and are likely to prevent achieving an identification level of 100%, even if
Notes: 1 Schumann et al. (1999); 2 number of recovered barcodes; 3 number of recovered barcodes divided by number of species reported in Germany; 4 number of identified species divided by number of recovered barcodes; 5 number of identified species divided by number of species reported in Germany; 6 refers to the quantities established in Fig. 1.
coded / species (%)", where the values for Canacidae, Ulidiidae and Diastatidae are well above 100% (450%, 225% and 133% respectively), while Tethinidae, Otitidae and Campichoetidae seem to be lacking. For Braulidae, the column "Total number of taxa / with barcode" lists "2" although there is a "0" in the column "BINs" and Braulidae are lacking from the spreadsheet with the individual records. The family assignments are basically irrelevant for the present analysis, but they must be treated consistently in order to obtain correct results. The data for the respective families were therefore revised in Table 1 of this paper to be consistent with the spreadsheet with the individual records, reducing the number of families to 111.
Table 1 contains the original columns "Family", "Species reported in Germany" and "BINs". The number of "species reported in Germany" is consistent with Schumann et al. (1999), adding up to a total of 9,213. For the sake of consistency, all calculations are here based on that source and number and not on the 9,544 known species of German Diptera, which are cited in the text of the source study based on the Checklist of German Diptera and its three supplements (Schumann et al., 1999; Schumann, 2003, 2005, 2010). "Ratio barcoded / species" was entered into the column "BIN Ratio" as a decimal number instead of a percentage, for reasons explained below. The remaining columns of the original table were replaced by new ones, due to the different focus of this study. The number of "Identified species" was calculated by subtracting "Unnamed / with barcode" from "Total number of taxa / with barcode" in the original table. "BIN Identification Ratio" was calculated by dividing "Identified species" by "BINs". "Identification Ratio" was calculated by dividing "Identified species" by "Species reported in Germany".
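The three derived columns can be reproduced with a few lines of Python. The following is only a minimal sketch, not the procedure actually used (the analysis was done in Excel). The class and function names are hypothetical. The input figures are those quoted elsewhere in this paper for all families combined, for Cecidomyiidae and for Ephydridae.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Counts for one family, or for all families combined (hypothetical helper)."""
    species_reported: int    # "Species reported in Germany"
    bins: int                # "BINs"
    identified_species: int  # "Identified species"

    @property
    def bin_ratio(self) -> float:
        # BINs divided by species reported in Germany
        return self.bins / self.species_reported

    @property
    def bin_identification_ratio(self) -> float:
        # Identified species divided by BINs; undefined if no BINs were recovered
        if self.bins == 0:
            return float("nan")
        return self.identified_species / self.bins

    @property
    def identification_ratio(self) -> float:
        # Identified species divided by species reported in Germany
        return self.identified_species / self.species_reported


def identified_species(total_taxa_with_barcode: int, unnamed_with_barcode: int) -> int:
    """'Total number of taxa / with barcode' minus 'Unnamed / with barcode'."""
    return total_taxa_with_barcode - unnamed_with_barcode


# Figures quoted in this paper
gross = Counts(species_reported=9213, bins=5207, identified_species=2462)
cecidomyiidae = Counts(species_reported=836, bins=927, identified_species=44)
ephydridae_identified = identified_species(132, 16)

for label, c in (("all families", gross), ("Cecidomyiidae", cecidomyiidae)):
    print(label,
          round(c.bin_ratio, 2),
          round(c.bin_identification_ratio, 2),
          round(c.identification_ratio, 2))
print("Ephydridae identified species:", ephydridae_identified)
```

Keeping the ratios as decimal numbers rather than percentages mirrors the choice made for Table 1.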
In Table 2 the achieved coverage of the known diversity of German Diptera by identified barcodes was determined for those families with an Identification Ratio greater than 0.50 by individually cross checking the species listed in the spreadsheet with the individual records (Morinière et al., 2019: sup-0002-appendixS1) with those reported for Germany. At the same time the percentage of species of Diptera new to the German fauna was noted.
The data were analysed, graphs created and regression analyses done using the respective Excel functions. The file contains more families with 1-10 species than families with 11-100 species, and families larger than that are even less common. Family size is therefore mapped on a logarithmic scale in Figs 3a-c to achieve a better resolution. Logarithmic trend lines were added for all families (solid lines) and for families with more than 10 species (dotted lines), and the respective regression analyses were calculated based on log family size. To characterize the relations between the relevant quantities of data (Fig. 1) some common symbols from set theory (⊂ = subset; ∩ = intersection, i.e. overlap; \ = relative complement, i.e. objects that belong to A and not to B) are used.
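For readers who prefer a scripted workflow, a regression on log family size can be sketched as follows. This is an illustration under stated assumptions, not the Excel procedure itself: the function name `log_trend` is hypothetical, the family sizes and ratios are made-up example values, and the fit assumes the form y = a·ln(x) + b. Significance testing (the p-values) is left out.

```python
import math


def log_trend(sizes, ratios):
    """Least-squares fit of ratio = a * ln(size) + b.

    Returns (a, b, r_squared). Requires at least two distinct family sizes
    and ratios that are not all identical.
    """
    xs = [math.log(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ratios) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ratios)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ratios))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical (family size, ratio) pairs, for illustration only
families = [(1, 0.0), (1, 1.0), (4, 0.5), (12, 0.4), (60, 0.7), (150, 0.3), (800, 1.1)]

# Trend across all families (solid line in the diagrams)
all_fit = log_trend([s for s, _ in families], [r for _, r in families])

# Trend for families with more than 10 species only (dotted line)
larger = [(s, r) for s, r in families if s > 10]
larger_fit = log_trend([s for s, _ in larger], [r for _, r in larger])

print("all families:", all_fit)
print("more than 10 species:", larger_fit)
```

The base of the logarithm changes the slope but not R², so the choice between natural and decimal logarithms does not affect the strength of the correlation.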
The published Excel spreadsheet (Morinière et al., 2019: sup-0002-appendixS1) contains 40,753 individual records. Of these, 25,910 (64%) have an entry in the "species" column. These records were evaluated regarding the information provided in the columns "identification method" and "identifier". A small portion of the entries in the "species" column are not Linnaean names but codes such as "Limnophyes sp. 2SW". Because of the huge size of the data set it was not feasible to eliminate these individually and they were treated like other identified records.
Notes: 1 refers to the quantities established in Fig. 1. 2 E' = number of named species in the published spreadsheet with individual records (Morinière et al., 2019: sup-0002-appendixS1). For some families this slightly diverges from the number of identified species, E, presented in Table 1.

RESULTS
Fig. 1 illustrates the relevant quantities (= sets) and their general relations for all German species of Diptera. Across the individual families, the absolute magnitudes and intersections of these sets vary greatly. A represents the set of all species occurring in Germany. B is a subset of A (B ⊂ A) that represents the species that already have a taxonomic name. The exact magnitude of A and B is unknown. C is a subset of B (C ⊂ B) that represents the named species that have been reported to occur in Germany (2nd column in Table 1). In the source study the total for all families is 9,213. D represents the set of BINs recorded in the source study (3rd column in Table 1). The total for all families is 5,207. As the study is predominantly based on material collected in Germany, D can be considered a subset of A. D is not a subset of B or C, because it can, and likely will, include species not yet described or not previously recorded in Germany (D \ C). Moreover, BINs do not translate directly into species, because some BINs comprise more than one species and, conversely, some species occur in more than one BIN (Morinière et al., 2019). This is reflected in the divergence between the number of "BINs" and the "Total number of taxa / with barcode" in Table 1 of Morinière et al. (2019). In fact, the incongruence of BINs and species is even greater than that, because mathematically the effects of BIN sharing and BIN splitting cancel each other out. The incongruence is hard to appraise, has a minor effect on the subsequent calculations and is therefore disregarded here. The intersections between D and B (D ∩ B) and between D and C (D ∩ C) are unknown.
E is a subset of D (E ⊂ D) that represents the taxonomically identified species (4th column in Table 1). It is calculated by subtracting the number "unnamed / with barcode" from "total number of taxa / with barcode" in Table 1 of Morinière et al. (2019). The total for all families is 2,462. E is also a subset of B (E ⊂ B), because only named species can be taxonomically identified. E is not a subset of C, however, because it can contain named species not previously reported for Germany (E \ C). Determining the intersection between E and C (E ∩ C) requires the cross checking of the individual species recorded in the source study with those reported for Germany. This was done only for some individual families (see below). D / C relates the number of recovered BINs to the number of species reported to occur in Germany (BIN Ratio). The gross BIN Ratio is 0.57 (5th column in Table 1). Because D is not a subset of B or C, the ratio D / C neither indicates the fraction of species reported in Germany for which BINs were established, nor the respective fraction of all species occurring in Germany. It merely indicates that roughly half as many BINs were recovered as there were species reported in Germany before the source study.
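A toy example may help to see why D / C is not a coverage fraction. The sketch below uses Python sets with invented single-letter labels. As a simplifying assumption, each BIN is treated as exactly one taxon, which ignores BIN sharing and BIN splitting.

```python
# Invented labels; each BIN is treated as one taxon (a simplification)
C = {"a", "b", "c", "d", "e", "f"}   # named species reported for Germany
D = {"a", "b", "c", "x", "y", "z"}   # taxa recovered as BINs in the study
E = {"a", "b", "x"}                  # recovered taxa identified to species

bin_ratio = len(D) / len(C)              # D / C
share_of_C_recovered = len(D & C) / len(C)  # (D ∩ C) / C, unknown in the real data
new_records = E - C                      # E \ C: identified, not yet reported
dark_taxa = D - E                        # D \ E: recovered but unidentified
coverage = len(E & C) / len(C)           # (E ∩ C) / C

print("BIN Ratio:", bin_ratio)
print("share of reported species recovered:", share_of_C_recovered)
print("E is a subset of D:", E <= D)
print("E is a subset of C:", E <= C)
print("new records:", sorted(new_records))
print("dark taxa:", sorted(dark_taxa))
print("coverage by identified barcodes:", coverage)
```

In this constructed case the BIN Ratio is 1.0 although only half of the reported species were recovered, which is why the ratio is read here as "as many BINs as reported species" rather than as a percentage of the fauna.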
The respective values are therefore given not in percent, but as a ratio. The average BIN Ratio across the families (0.42 ± 0.33 sd; median = 0.38) is smaller than the gross BIN Ratio, partially because for a large number of very small families the source study did not include any specimens (Fig. 2a). In Fig. 3a this is illustrated by the large number of data points in the bottom left. Conversely, the remainder of the very small families has a very high BIN Ratio of 1.00. Naturally, for families with only one species occurring in Germany, the BIN Ratio can only be either 0.00 or 1.00. The absence of many small families is also reflected by the trend line across all families declining towards the left (R² = 0.08, p < 0.01). If only those families with more than 10 species are analysed, the correlation is insignificant (R² = 0.00, p = 0.77). In this range the BIN Ratio varies widely and apparently independently of the size of the families. A BIN Ratio larger than 1.00 is a strong indicator of species occurring, but not yet reported, in Germany. After the above revisions of Table 1 the only remaining families with BIN Ratios exceeding 1.00 are Cecidomyiidae, Milichiidae and Trichoceridae (discussed below).
E / D relates the number of BINs that were identified to species to the number of all recovered BINs (BIN Identification Ratio). Roughly half of the recovered BINs were identified to species (0.47, 6th column in Table 1). The single BIN found for Acartophthalmidae combines two identified species ("BIN sharing"), resulting in an extraordinarily high BIN Identification Ratio of 2.00 for that family. To avoid distortion, this outlier was omitted from the following calculations and the diagrams. Still, the average BIN Identification Ratio across the families (0.70 ± 0.26 sd; median = 0.70) is considerably larger than the gross BIN Identification Ratio. This is partially due to the large number of small families with comparatively high values, whereas there are only a few, albeit speciose, families that have low BIN Identification Ratios (Figs 2b and 3b). The decline in the trend line in Fig. 3b towards the right (R² = 0.28, p < 0.01) remains significant when the trend line is calculated only for those families with more than 10 species (R² = 0.06, p = 0.05), suggesting that species in large families are less likely to be identified.
Fig. 1. Relevant quantities (= sets) of dipteran taxa in Germany. The relative magnitude of the areas C (described dipteran species known to occur in Germany), D (BINs recorded in the source study) and E (species identified in the source study) corresponds to the respective species numbers (9,213, 5,207 and 2,462). The true magnitude of A (all dipteran species occurring in Germany) and B (described dipteran species occurring in Germany) and the intersections of D and E with B and C are unknown.
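The gap between the gross ratio and the per-family average follows from how the two are computed: the gross value divides the column totals, while the average gives every family the same weight regardless of its size. The sketch below illustrates this with invented counts (three small, fully identified families and one large, poorly identified one). The numbers are hypothetical and are not taken from Table 1.

```python
from statistics import mean

# Hypothetical (BINs, identified species) per family, for illustration only
families = [(1, 1), (2, 2), (3, 3), (900, 45)]

# Gross ratio: total identified species divided by total BINs
gross_ratio = sum(e for _, e in families) / sum(d for d, _ in families)

# Average ratio: mean of the per-family ratios, each family weighted equally
average_ratio = mean(e / d for d, e in families)

print("gross:", round(gross_ratio, 2))
print("average across families:", round(average_ratio, 2))
```

A few speciose families with low values therefore pull the gross ratio down much more strongly than they pull down the average.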
D \ E is the set of BINs that remained unidentified ("unnamed with barcode", "dark taxa" in Morinière et al., 2019). It comprises unknown taxa that have not yet been formally described and given a Linnaean name (outside of B), as well as known taxa (inside B) that were not identified for other reasons.
E / C relates the number of BINs that were identified to species to the number of species known to occur in Germany (Identification Ratio). The gross Identification Ratio is 0.27 (7th column in Table 1). The average across the families is 0.29 (± 0.27 sd; median = 0.23). Because most of the BINs for the small families were identified (see above), their Identification Ratio is mostly identical to their BIN Ratio (left third of Figs 3a and 3c). For many more speciose families, however, only a fraction of the BINs were identified, resulting in a downward shift of the respective data points on the right side of Fig. 3c in comparison to Fig. 3a. The regression analysis shows no correlation between the Identification Ratio and the size of the family (R² = 0.00, p = 0.88) and the respective trend lines in Fig. 3c are almost level. But the data do not follow a Gaussian distribution. Instead, some outliers with particularly high Identification Ratios (discussed below) are balanced by the bulk of values being in the 0.00-0.30 range (Fig. 2c).
E ∩ C / C is the achieved coverage of the species of Diptera known to occur in Germany by identified barcodes. This was assessed for the 24 families with a high Identification Ratio (> 0.50) by individually cross checking the named species recorded in the source study with those reported for Germany (Schumann et al., 1999; Table 2, Fig. 4). Complete coverage was achieved for six small families, for which one to three species are known to occur in Germany, and these species were found and successfully barcoded. In most of the families with more than four species the actual coverage is considerably smaller than the Identification Ratio. The gap is somewhat reduced, however, when the species newly recorded for Germany in the source study itself are added to both sides of the fraction. This is particularly evident for Anthomyiidae and Sciaridae, for which comparatively high percentages of new records (E \ C) were found (32% and 37% of the identified species).
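The effect of adding the new records to both sides of the fraction can be checked with the figures this paper reports for Tabanidae (42 identified species, 3 of them new, 58 previously reported) and Ephydridae (116, 12 and 177). The following is a minimal sketch; the function name and the returned tuple layout are hypothetical.

```python
def coverage_figures(identified: int, new_records: int, reported: int):
    """Return (Identification Ratio, coverage, coverage incl. new records).

    identified   size of E
    new_records  size of E \\ C
    reported     size of C
    """
    overlap = identified - new_records                  # size of E ∩ C
    identification_ratio = identified / reported        # E / C
    coverage = overlap / reported                       # (E ∩ C) / C
    # New records added to numerator and denominator alike
    coverage_updated = identified / (reported + new_records)
    return identification_ratio, coverage, coverage_updated


tabanidae = coverage_figures(identified=42, new_records=3, reported=58)
ephydridae = coverage_figures(identified=116, new_records=12, reported=177)

for name, (ratio, cov, cov_new) in (("Tabanidae", tabanidae), ("Ephydridae", ephydridae)):
    print(name, round(ratio, 2), f"{cov:.0%}", f"{cov_new:.0%}")
```

In both families the coverage without the new records lies below the Identification Ratio, and the updated coverage falls between the two.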

Exemplary families
Because the values vary greatly across the families, the discussed averages are only partially informative for evaluating the progress of the project and the resulting implications for future strategies. For a more detailed analysis, the data for some exemplary families are identified by coloured lines in Figs 3a-c, and their characteristics summarized in Table 3. All other families can be located in the diagrams by their respective coordinates in Table 1.
In [...] Table 1). This translates into 42 identified species. As 58 species had been previously reported in Germany, the resulting Identification Ratio is 0.72, the highest among the speciose families. Cross checking with the Checklist of German Diptera (Schumann et al., 1999) reveals that only three of the identified species (7%) are not listed there. After adding the three new records, the German Tabanidae fauna now includes 61 known species of which 69% are represented by barcodes (Fig. 4). Notably, all but one of the 42 identified species (98%) are represented by one or more specimens identified and almost always also collected by W. Schacht, a dedicated dipterist, excellent collector and renowned expert on Tabanidae (Kotrba, 2011). The vouchers drawn from his collection are documented by photographs in the BOLD database. In contrast to the general methodology of the source study (specimens up to five years old, stored in 96% EtOH before DNA extraction), all of these specimens were collected more than 20 years ago and, as evident from the photographs on the BOLD internet site, were pinned, i.e. dried. From almost all of them a full 658 bp sequence was retrieved.
For Ephydridae the spreadsheet with the individual records contains 548 records, of which 488 (89%) are identified to species (Morinière et al., 2019: sup-0002-appendixS1). The number of taxa with a barcode is listed as 132, of which 16 are not identified (Morinière et al., 2019: Table 1). This translates into 116 identified species. As 177 species had been previously reported in Germany, the resulting Identification Ratio is 0.66, the second best among the more speciose families. Cross checking with the Checklist of German Diptera (Schumann et al., 1999) shows that only 12 of the identified species (10%) are not listed there. After adding the new finds, the German fauna of Ephydridae now includes 189 known species of which 61% are represented by barcodes (Fig. 4). All but four of the 116 identified species (97%) are represented by one or more specimens collected and identified by J.-H. Stuke, a very successful contemporary German dipterist, collector and expert on Ephydridae (as well as Conopidae, Carnidae and several other acalyptrate families). Unfortunately, the preservation of the specimens sampled is generally not documented in the data set and could not be assessed from the photographs on the BOLD internet site. According to personal communications from J.-H. Stuke and D. Doczkal, the respective sequences were obtained from pinned material.
For Syrphidae the spreadsheet with the individual records contains 1,911 records, of which 1,381 (72%) are [...] Table 1). This translates into 273 identified species. As 440 species had been previously reported in Germany, the Identification Ratio is 0.62, the third best among the speciose families. For this family, the vertical part of the respective line in Fig. 3c leads upwards, indicating that the number of found taxa with barcodes is considerably (23%) greater than the number of established BINs and therefore several BINs include more than one taxon (Morinière et al., 2019). Out of the 273 species identified, 106 (39%) are represented by one or more specimens identified by W. Schacht (see above). The vast majority of these were collected more than 20 years ago by various collectors, and for almost all of these a sequence longer than 550 bp was recovered. A large part of the remaining specimens were collected and identified by D. Doczkal, co-author of the source study and expert on Syrphidae.
A common factor for the three flagship families is that experts for the respective families were intensely involved in the collecting and identifying of the relevant material.
Three families of medical importance, i.e. Culicidae, Simuliidae and Ceratopogonidae, are indicated by the colour red in Figs 3a-c. The respective BIN Ratios are comparatively low, especially for Culicidae, indicating that only part of the diversity was sampled and/or included in the source study (Fig. 3a). Only for Culicidae was a high BIN Identification Ratio achieved (Fig. 3b). As a result, the Identification Ratios are very low for all three families (0.15, 0.18 and 0.09). The situation is somewhat similar for three families, which may be considered to be collector favourites, i.e., Asilidae, Conopidae and Stratiomyidae, indicated by the colour blue. Similar to Syrphidae (above), these families are very well studied and described for Germany, and several experts on these families are currently present in this country. Accordingly, high BIN Identification Ratios were achieved. But the Identification Ratios are still very low (0.15, 0.17 and 0.24), mainly due to the low input of material as is evident from the low BIN Ratios. A common factor for the medically important and collector favourite families is that comparatively little material was included in the source study, and this appears to be the main reason for the low Identification Ratios.
Black lines indicate some families that are characterized by large vertical portions of the respective lines in Fig. 3c. An extreme example is Cecidomyiidae, the most speciose of all families of Diptera in Germany. 836 species of Cecidomyiidae have been reported in this country (Table 1) and many more apparently occur: 927 BINs were recovered, resulting in a very high BIN Ratio of 1.11. At the same time, Cecidomyiidae have the lowest BIN Identification Ratio (0.05) and, accordingly, the lowest Identification Ratio (0.05) of the more speciose families. Phoridae and Chironomidae are other examples of very speciose families with comparatively high BIN Ratios, low BIN Identification Ratios and thus also low Identification Ratios, while among the smaller families the Trichoceridae, Milichiidae and Carnidae stand out with similar characteristics. A common factor for these families is that high BIN Ratios indicate a comparatively good sampling of the diversity (but see below), but only few of the records could be identified, leaving a large residual of dark taxa.

Identifiers
The entries in the "identifier" column of the spreadsheet with the individual records (Morinière et al., 2019: sup-0002-appendixS1) are summarized in Table 4. Out of the 34 listed identifiers, 15 are members of the German society of dipterists "AK Diptera" and/or personally known to the author as experts on Diptera (just called "dipterists" below). The remaining identifiers cannot be classified as dipterists, but all or most of them are barcoding experts or technicians. Fig. 6 visualizes the individual contributions of the identifiers in terms of identified specimens (Fig. 6a) and identified species (Fig. 6b), showing that the work load was very unevenly distributed among a small number of main contributors. More than half of the specimens were identified by three non-dipterists, and among these, a single person provided the vast majority of the identifications. In the dipterist fraction, eight experts provided substantial numbers of identifications, and among these, two were responsible for the bulk of the work. Because many species were independently identified by more than one of the identifiers, the sets of identified species intersect. It would be very time consuming to correct for this. Even without this correction, the species identified by dipterists add up to only 2,013 (Table 4). This implies that at least 449 out of the 2,462 species identified were not identified by a dipterist. Among the dipterists, one identified more species (1,192) than all others combined.
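The reasoning behind the figure of "at least 449" can be made explicit. Because the identifiers' species sets overlap, the sum of their individual counts can only overestimate the number of distinct species they identified, so subtracting that sum from the total yields a lower bound. The sketch below illustrates this with invented species sets and then applies the same arithmetic to the two totals given above. All names in the code are hypothetical.

```python
def min_not_identified_by_group(total_identified: int, summed_group_counts: int) -> int:
    """Lower bound for species identified by nobody in the group.

    summed_group_counts is the sum of the per-person species counts. It is
    an upper bound for the number of distinct species identified by the group.
    """
    distinct_upper_bound = min(summed_group_counts, total_identified)
    return total_identified - distinct_upper_bound


# Invented example: two dipterists whose species sets overlap
dipterist_1 = {"sp1", "sp2", "sp3"}
dipterist_2 = {"sp3", "sp4"}
all_identified = {"sp1", "sp2", "sp3", "sp4", "sp5", "sp6"}

summed = len(dipterist_1) + len(dipterist_2)        # counts sp3 twice
distinct = len(dipterist_1 | dipterist_2)           # true number of species
toy_bound = min_not_identified_by_group(len(all_identified), summed)
toy_actual = len(all_identified - (dipterist_1 | dipterist_2))

# Totals from this paper: 2,462 identified species, 2,013 summed for dipterists
paper_bound = min_not_identified_by_group(2462, 2013)

print("toy example: bound", toy_bound, "actual", toy_actual)
print("source study: at least", paper_bound)
```

In the invented example the bound is 1 while the actual number is 2, which shows that the true figure can exceed the bound whenever the sets overlap.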

Constructing a DNA barcode library for German Diptera: Progress
The evaluated study met its goal to provide a DNA barcode library for 5,200 BINs of Diptera. Without question, this is impressive, especially given the limited time frame and largely uniform collecting methods. Likely one of the most interesting figures for the community would be the achieved coverage of the diversity of all Diptera in Germany in terms of identified barcodes (E / A). This cannot be assessed, however, because the true extent of the diversity of German Diptera is unknown. The coverage of the known part of the diversity (E ∩ C / C) was assessed for all families with Identification Ratios larger than 0.50 by cross checking the list of found species with the species known to occur in Germany (Fig. 4). Complete coverage was achieved for six very small families. Among the larger families a very respectable coverage was achieved for the "flagship" families Tabanidae (69%), Ephydridae (61%) and Syrphidae (53%). Among the medium sized families with up to 20 species, the Diastatidae (& Campichoetidae), Lonchopteridae, Rhinophoridae, Canacidae (& Tethinidae) and Piophilidae have a relatively high coverage of nearly 60%.
Table 4. Identifiers listed in Morinière et al. (2019: sup-0002-appendixS1) with their respective contribution in terms of identified species and specimens.
For the majority of families, which have Identification Ratios below 0.50, the coverage of the known diversity was not assessed. The Identification Ratio, i.e. the ratio between the number of identified BINs and the number of species known to occur in Germany (E / C), generally overestimates the achieved coverage (Fig. 4), but it is much easier to assess, and may still be used as a rough indicator of the achieved coverage. Across all families this ratio is 0.29, with the majority of values less than average (Fig. 2c). Regrettably, the Identification Ratio is also low for the medically important families, such as Culicidae, Simuliidae and Ceratopogonidae, and for some collector favourites, such as Asilidae, Conopidae and Stratiomyidae, mainly due to low BIN Ratios showing that only a small fraction of the diversity was sampled (Figs 3a and 3c). The coverage provided by the entire BOLD database is probably higher, but this was not investigated here.
Unlike the quantity of identified BINs, i.e. species, the quality of the identifications is not as easily appraised. If indeed only 472 of all records were identified by traditional morphological methods (Fig. 5), then this would also apply to at most 472 of the 2,462 identified BINs. Very likely, the proportion of morphological identifications is higher, but this is not evident from the published file and remains a matter of speculation. From the number of species identified by dipterists (Table 4) it follows that at least 449 species were not identified by a dipterist, but by a barcoding expert or technician. Some unknown quantity between 449 and 1,981 species must have been exclusively identified using DNA-based identification. Identifying specimens for the establishment of a DNA barcode library based on a DNA barcode library seems circuitous (Kotrba, 2019). Basically, the responsibility for the correct identification of these specimens is delegated and it is not immediately clear to whom (see below).
Extending the assessment of coverage to include the BINs that remained unidentified results in a ratio of roughly 1 : 2. This neither indicates that the source study "covers ~ 55% of the known Diptera fauna from Germany" (Morinière et al., 2019: 3), nor that it covers "half of the German Diptera fauna" (Morinière et al., 2019: 16), but merely that roughly half as many BINs were recovered as there were Diptera species reported in Germany before the source study (see above).
The BIN Ratio varies greatly across the families (Fig. 3a). This could be due to an uneven representation of the families in the original samples. Borkent et al. (2018) report that Malaise trap catches at Zurquí, Costa Rica, include only about half of the diversity of Diptera collected using a wider range of methods, and Karlsson et al. (2020) report that Malaise traps are comparatively inefficient at catching large, active insect fliers with good vision. The variation could also be a result of uneven processing of the individual families, favouring some and/or disregarding others. This may apply to the Scathophagidae, whose total absence in the source study comes as a surprise. There are 57 species of this family listed in the German checklist and at least Scathophaga stercoraria is very common. Moreover, the BIN Ratio is a function of the previous knowledge of the diversity of Diptera in Germany. In well-studied families, where C in Fig. 1 also includes the rare species and thus approximates A, it is hard to achieve a high BIN Ratio. At the same time, in such cases the BIN Ratio is a good indicator of the degree to which the actual diversity of Diptera in Germany was covered by the source study. The collector favourite families, Asilidae, Conopidae and Stratiomyidae, exemplify this correlation. Syrphidae, which also undoubtedly qualify as collector favourites, have a higher BIN Ratio, possibly due to the substantial incorporation of specimens from the collections of experts. Conversely, it should be easy to achieve a high BIN Ratio for poorly studied families, where C is much smaller than A. In particular, a BIN Ratio larger than 1.00, as recorded for Cecidomyiidae, Milichiidae and Trichoceridae, is a strong indicator of species occurring, but not yet reported, in Germany (Morinière et al., 2019) and possibly even undescribed. The same high BIN Ratio may thus indicate a good coverage of a well-studied family, as well as a moderate coverage of a very poorly studied family. Apart from the obvious underrepresentation of the smallest families with only up to 10 species, the size of the family has no significant effect on the BIN Ratio.
Roughly half of the recovered BINs were identified to species. The significant negative correlation of the BIN Identification Ratio with the size of the families (Fig. 3b) suggests that species belonging to large families are less likely to be identified. Naturally, species of very small families are easier to identify, because the respective identification keys will be short. Conversely, some very speciose families such as Cecidomyiidae, Phoridae and Chironomidae are notoriously challenging, even for experts. The respective literature is vast and the relevant characters tiny, making identifications more difficult. Nevertheless, quite a number of specialists have dedicated their working lives to these taxa (e.g. R. Gagne, N. Dorchin and M. Jaschhof for Cecidomyiidae). As exemplified by the "flagship" families Ephydridae, Syrphidae and Tabanidae, a high BIN Identification Ratio can also be achieved in speciose families, if qualified experts are involved.
The other half of the recovered BINs remained unidentified. These are classified as dark taxa, comprising records of unknown taxa that have not yet been formally described and given a Linnaean name, as well as known taxa that could not be identified, because they are "extremely difficult" to identify (Staatliche Naturwissenschaftlichen Sammlungen Bayerns, 2019). The fact that about 7,000 further named species are known to occur in Germany in addition to the 2,462 identified in the source study suggests that at least part of the remaining roughly 2,700 dark taxa will ultimately turn out to belong to known species (intersection of hatched area D \ E with C in Fig. 1). Likewise, at the family level, the fact that 792 further named Cecidomyiidae species are known to occur in Germany in addition to the 44 identified in the source study suggests that a good part of the 882 unnamed BINs in this family will ultimately turn out to be known and identifiable species (as soon as a Cecidomyiidae expert is recruited and dedicates the needed time to their identification). Among the small families, Carnidae particularly stand out with 100% of dark taxa. Eleven Carnidae species are reported in Germany, but out of the seven BINs recovered for this family not a single one was identified. At least some of these BINs will likely turn out to belong to known German species. Of course, many new species and/or new records for the German fauna are also likely to be discovered in the process. For the 24 families, in which the found species were cross checked with the list of species known to occur in Germany (Table 2), the overall percentage of new records for Germany is 18%. The highest individual values are 34% (Sciaridae) and 32% (Anthomyiidae).
There are many possible reasons for specimens of known species to remain unidentified. "Extremely difficult to identify" may arguably apply, but difficulty, like beauty, lies in the eye of the beholder. The vast majority, if not all, of the more than 9,000 species listed in the checklist of German Diptera (Schumann et al., 1999; Schumann, 2003, 2005, 2010) were identified based on classical morphological characters, proving that this is feasible. This situation is different, e.g., for fungi, where a growing proportion of species are known only from sequence data and cannot be linked to any physical specimen or resolved taxonomic name; they are referred to as "dark taxa" or "dark matter fungi" (Ryberg & Nilsson, 2018). Sometimes "difficult" is extended to include options such as "requiring expertise", "time consuming", or simply "tedious". Page (2016) highlights taxonomic capacity as the main limiting factor for BIN identification. This concerns not only a shortage of taxonomic experts as such, but, maybe even more importantly, a shortage of expert time available and recruited for this kind of study. Clearly, therefore, it is important to reveal what efforts were made to identify the records, especially if the ratio of dark taxa is to be utilized to assess deficiencies in taxonomic and faunistic knowledge.
Given the well over 9,000 species in 111 dipteran families known to occur in Germany, and more than 40,000 records to be identified within a very limited time frame, the present analysis indicates that far too few dipterists were involved and that the workload was very unevenly distributed among them. Specifically, only eight dipterists contributed any substantial number of identifications, and a single one accomplished the vast bulk of these (Table 4, Fig. 6a and b). For comparison, Karlsson et al. (2020) report that more than 130 experts were actively recruited for the Swedish Malaise Trap Project, which had a roughly comparable scope and time frame, and that many of them provided some number of identifications, which totalled over 4,000 species. Borkent et al. (2018) report the involvement of 59 dipterists in a study of all Diptera collected in one year in a patch of cloud forest in Costa Rica.

Constructing a DNA barcode library: Ways of improving the development, utility and reliability
Based on the above analysis it is possible to identify some ways of improving the future development, utility and reliability of this DNA database.

Sample the collections of experts
A high Identification Ratio can only be achieved by combining a high BIN Ratio with a high BIN Identification Ratio. The high Identification Ratios achieved for Tabanidae, Ephydridae and Syrphidae suggest that sampling the collections of experts results in a higher coverage within a reasonable time frame, as opposed to random collecting and relying on a posteriori identification. Good results with complete or nearly complete barcodes were obtained even from comparatively old dry (pinned) material, and these barcodes are linked to very reliably identified vouchers. Similarly, good results were achieved by Hausmann et al. (2016) and Dey et al. (2019) for Geometridae (Lepidoptera). Where possible, the barcoding of type material or specimens verified by comparison with type material constitutes the ideal approach.
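The opening statement of this paragraph can be written out as a product. This is only a sketch: it assumes that the BIN Ratio relates the recovered BINs to the species known from Germany, that the BIN Identification Ratio relates the identified BINs to the recovered BINs, and that the Identification Ratio relates the identified species to the known species. These definitions and all symbols below are assumptions introduced here for illustration.

```latex
\[
\underbrace{\frac{N_{\mathrm{id}}}{N_{\mathrm{known}}}}_{\text{Identification Ratio}}
\;\approx\;
\underbrace{\frac{N_{\mathrm{BIN}}}{N_{\mathrm{known}}}}_{\text{BIN Ratio}}
\times
\underbrace{\frac{N_{\mathrm{BIN,id}}}{N_{\mathrm{BIN}}}}_{\text{BIN Identification Ratio}}
\]
```

The relation is approximate because identified BINs and identified species need not correspond one-to-one. Under these assumptions, a low value in either factor caps the product, which is why neither extensive collecting alone nor expert identification of a small sample alone can yield a high Identification Ratio.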

Prioritize important taxa
It is unfortunate that some of the most important families of Diptera are still very poorly covered. Naturally, achieving the same high coverage for all dipteran families occurring in Germany within a limited time frame cannot be expected. Giving priority to medically, agriculturally or ecologically important families and to families favoured by collectors will meet the most pressing needs of the community and serve as a good advertisement for the usefulness and wide applicability of the DNA barcode library. At the same time, prioritizing important families is likely to increase the rate of progress, as experts and material are more likely to be readily available. Moreover, these taxa are well studied, well documented and sufficiently covered in the literature. Achieving near complete coverage for some important families this way might be preferable to a random 30% coverage of the entire diversity.

Allocate resources to recruiting experts
In their response to Kotrba (2019), Chimeno et al. (2019) state that "even authorities may fail (nobody is perfect), although this is more improbable by the expert rather than by the beginner". This is true and constitutes a good reason to preferentially rely on experts.
Of course, dipterology has not been spared the general restrictions of the "taxonomic impediment", i.e. the deplorable shortage of both professional and amateur experts, which is the logical outcome of today's failure to educate, employ, fund and generally support such scientists. Still, a considerable number of dipterists active in Germany and worldwide have recently been involved in comparable studies (see above), and can be contacted through a number of platforms such as AK Diptera (https://www.akdiptera.de/), NADS (http://www.nadsdiptera.org/), or "the new Diptera site" (http://diptera.myspecies.info/). Morinière et al. (2019) explicitly "invite the global community of dipteran taxonomists to improve identifications for the many 'dark taxa' encountered in our study by identifying these vouchers using reverse taxonomic approaches." Those taxa with the most pressing deficits can be identified in Figs 3a-c, and any contribution in that respect is strongly encouraged here for the common good. Surely, progress could be boosted by allotting money for remunerating such contributions.
It is also of great importance to educate new generations of experts. But it is not a promising strategy for the progress of the present project, i.e. establishing a DNA barcode library, to mix these tasks. Heavily relying on the contributions of trainees will negatively affect both the rate of progress and the reliability of the results.

Appropriately document identifiers and identification methods
The value of any DNA barcode library for identifying species is first and foremost dependent on the reliability of the stored identifications (Kotrba, 2019). If there is no conflicting taxonomy within a BIN, the taxonomic assignment is unanimous, but its reliability still depends on the expertise of the identifiers and the accuracy of the utilized literature and keys. Appropriate data about the identifying expert establish who bears the scientific responsibility. Additional annotations regarding his or her field of expertise could further help in appraising the reliability of the identification. Documentation of the methodology is important, firstly, because only records identified by methods other than comparison with sequences already in the system add to the availability of taxonomic identification from that system. Secondly, for morphological identifications, documentation of the literature and/or keys used further helps in the appraisal of the reliability. Collins (2011) states that "... thoroughly demonstrating the characters used to identify your specimens will make the whole system more transparent and reliable". More specifically, Meier (2017) establishes that species identifications in biological publications should be treated as a 'Result' and the literature that was used for that identification should be properly cited. This should also apply to species identifications presented in the form of a published DNA barcode library.
Matters are more complicated if there is conflicting taxonomy within a BIN. Ratnasingham & Hebert (2013) suggest that "Users will encounter discordant taxonomic assignments, especially among unpublished records, but majority rule is a useful way to gauge the validity of a particular identification. For example, if most specimens are assigned to one species and these identifications derive from several taxonomists, this assignment is more likely to be correct than any 'outlier' identifications." This suggestion leads down a dark and dangerous path! On principle, "more likely correct" is not an acceptable degree of reliability in science. Moreover, the suggested approach depends on the user being able to recognize which records were identified by taxonomists, or even experts on the relevant taxon or clade, and which were not. This precondition is hardly ever met. The majority rule approach will be deceptive if, e.g., a single 'outlier' correctly identified by an expert is outnumbered by numerous records identified by laymen, e.g. in the context of trainee programs, or, even worse, by DNA-based identification based on the very same BIN. The magnitude of this problem is constantly growing due to the uploading of thousands of sequences with incomplete documentation and/or with questionable or preliminary identifications.
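The scenario described above can be illustrated with a minimal sketch. The record structure, the field names (`identifier`, `method`, `is_expert`), the method labels and the example records are all hypothetical and introduced here for illustration only; they do not reflect the schema of any actual barcode database.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One barcode record within a single BIN (hypothetical structure)."""
    species: str      # name assigned to the voucher
    identifier: str   # who made the identification
    method: str       # "morphology" or "sequence" (hypothetical labels)
    is_expert: bool   # hypothetical flag: identifier is a specialist for the taxon


def majority_rule(records):
    """Return the most frequently assigned name, ignoring who assigned it."""
    counts = Counter(r.species for r in records)
    return counts.most_common(1)[0][0]


def expert_morphology_names(records):
    """Return only names backed by an expert's morphological identification."""
    return {r.species for r in records if r.is_expert and r.method == "morphology"}


# A BIN in which one expert identification is outnumbered by trainee
# identifications and by a name copied from the BIN itself.
records = [
    Record("Species A", "trainee 1", "morphology", False),
    Record("Species A", "trainee 2", "morphology", False),
    Record("Species A", "pipeline", "sequence", False),
    Record("Species B", "specialist", "morphology", True),
]

majority = majority_rule(records)
anchored = expert_morphology_names(records)
print(majority, anchored)
```

In this constructed case the majority rule returns the name given by the three non-expert records, whereas filtering by documented identifier and method returns only the expert's name. The second function can only work if identifier and method are recorded for every voucher.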
According to Chimeno et al. (2019) "the truly crucial point is that the result of identification can be checked and thus can be falsified. By establishing voucher specimens..." But if it is unknown which identifications need to be checked and which are reasonably reliable, then all identifications may have to be considered doubtful. When in doubt, it will be more straightforward to directly identify the specimens at hand than to get hold of the questionable vouchers and reidentify those. It might be naïve to assume the global scientific community will be able, let alone willing, to correct already uploaded misidentifications on a large scale a posteriori. Therefore "ultimately, the best prevention lies with collaboration, and working through identification uncertainties between labs before data are uploaded as reference specimens" (Collins, 2011).

CONCLUSION
Although based on the scrutiny of a specific, regionally and taxonomically restricted project, the resulting insights very likely apply to any comparable project. The progress in the construction and the subsequent success of a DNA barcode library for the purpose of DNA-based species identification is critically dependent on, and limited by, the expertise needed and available for the reliable identification of the vouchers. If the project is aiming for largely complete coverage rather than a partial outcome with a huge residual of "dark taxa", then the appropriate experts need to be actively recruited and involved from the beginning. Including already identified material from the collections of experts may help speed up the progress considerably. Conversely, no number of trainees and no amount of collecting can make up for the lack of experts.
The efforts made to identify the vouchers need to be appropriately documented together with the results, especially the identity of the identifier and the methodology utilized. It must be clearly evident which identifications are based on expert morphological identification as opposed to, e.g., mere comparison with other sequences already in the system. Such documentation will greatly enhance the value of the database as a dependable identification tool. For example, in cases of conflicting identifications within a BIN, this will help to decide which of the records are more reliably identified. The majority rule approach for BINs containing conflicting identifications is strongly discouraged. Instead, every resulting DNA-based species identification should be retraceable to at least one reliable DNA barcode from a specimen that was identified by an expert taxonomist using traditional morphological methods.