v5.1.1 05-06-2026 - New command AssemblyStats to calculate statistics of genome assemblies - ReadsAligner. The input format option -f can now receive the value 2 for mapping of reads in BAM format. - MultisampleVariantsDetector. Implemented multithreading. New option -t to specify the number of threads. - SingleSampleVariantsDetector. Implemented multithreading. New option -t to specify the number of threads. - SIH. New option -ob to output a bam file with assignments of aligned reads to haplotypes. - TransposonsFinder. The option -o is now a prefix to output multiple files. - TransposonsFinder. New option -n to activate the deNovo identification of TEs the current version implements a new algorithm to identify conserved LTRs. - TransposonsFinder. New options to control the kmer length for the de-novo algorithm (-nk), the kmer length for the search algorithm (-k) and the window length for the search algorithm (-w). - Diversity statistics. Included Fisher exact test for departure from HWE. v5.1.0 29-05-2025 - FastqFileFilter: New option -s to provide a file with read ids to select. - GenomesAligner: New option -sy to skip the synteny analysis. - TransposonsFinder: New option -l to limit the length of genome sequences analyzed together. - ReadsAligner and Assembler: New algorithm for chaining of kmer hits. v5.0.1 26-02-2025 - New command FastqFileFilter to filter raw reads by length and average quality. - New command TransposonStats to calculate statistics of transposable elements annotated with the TransposonsFinder command. - Assembler: Added option -q to select reads based on a minimum average quality. - Assembler: Added option -mpl to select reads based on a minimum average quality. v5.0.0 26-01-2024 - New command AssemblyReferenceSorter to sort de-novo genome assemblies according to a reference genome. - New command CircularSequencesProcessor to remove redundancies and define the start of sequences representing circular molecules in a genome assembly. - New command HierarchicalClustering. Replaces NeighborJoining - Assembler: Added option -ecr to determine the rounds of alignment based error correction. - Assembler: Added option -wid to define the weight given to small indels in the calculation of edge costs. - RelativeAlleleCountsCalculator: New option -of to generate a complete report of allele read depths for all sites having more than one allele called. - GenomesAligner: New output file for visualization of synteny blocks with Synvisio v4.3.2 19-07-2023 - VCFConverter: Added option -genepop to export to the GenePop file format - GenomesAligner: Added option -r to select one genome as reference and sort and orient contigs of the other genomes according to the reference - GenomesAligner: Fixed clustering error for families with large numbers of paralogs - TransposonsFinder: Added strand information to the output v4.3.1 10-01-2023 - KmersExtractor: The option -t now specifies the number of threads. New option -text to indicate that the input is free text. - GenomeAssemblyMask: Fixed error that duplicated some sequences. - Assembler. Improved memory management for ultralong reads. v4.3.0 16-12-2022 - New function TransposonsFinder to identify transposable elements in assembled genomes - New function GenomeAssemblyMask to mask with lowercase or with N characters a set of regions in a genome assembly - GenomesAligner: Added option -t to set the number of threads - CDNACatalogAligner: Added option -t to set the number of threads v4.2.1 28-06-2022 - Assembler: New options -cml and -cmof to set the maximum contig length and a catalog of start sequencs for circularization - Assembler: Switched the default consensus algorithm to Polishing - SingleSampleVariantsDetector: Added flag -runLongReadSVs to run a new algorithm for detection of structural variants from long reads - GenomesAligner: Added option -yh and -yd to control the minimum number of units and the maximum distance between units in an output synteny block. - GenomesAligner: Changed default values for the k-mer length and percentage of k-mers. v4.2.0 07-01-2022 - GenomesAligner: Improved quality of orthogroup identification. New options -d and -i to recieve a large group of genomes and transcriptomes. New option -f to determine the frequency to label an orthogroup as soft core. - Assembler: Improved the algorithm for diploid species. Added the option -m to specify the minimum read length. Removed the option -uid - CDNACatalogAligner: Improved quality of orthogroup identification. New option -y to determine if the input files are CDNA or protein sequences v4.1.0 12-02-2021 - New command Assembler to perform de-novo assembly of long reads. - New command SIH to perform single individual haplotyping of diploid samples. - IndividualGenomeBuilder: Added option ploidy to set the ploidy of the reconstructed genome. - TillingPoolsIndividualGenotyper: Added option -m to determine the maximum number of pools allowed to call an individual variant - TillingPoolsIndividualGenotyper: Added option -b to report only biallelic variants v4.0.3 01-11-2020 - New command TillingPopulationSimulator to perform simulations of TILLING experiments. New command VCFRelativeCoordinatesTranslator to translate coordinates for variants generated with the command DeNovoGBS. - Improved model of variants discovery and genotyping for polyploids and pooled samples. - KmersExtractor: Now the input sequences are assumed to be DNA sequences. Added option -t to allow processing of free text (non-DNA). - TranscriptomeFilter: New option -ioe to consider only exons to intersect input regions with transcripts. - DeNovoGBS: New parameter to control the maximum numnber of reads to keep in RAM during the sorting step. This greatly improved the amount of RAM needed by this process. v4.0.2 17-08-2020 - Support to CRAM files - New command TillingPoolsIndividualGenotyper for individual assignment of variants discovered in tilling experiments - New command CDNACatalogAligner to identifiy ortholog relationships and build ortholog groups from cDNA transcriptome catalogs. - GenomesAligner and CDNACatalogAligner. Implemented Markov clustering to cluster genes from homology relationships. - DeNovoGBS: Improved analysis for paired-end reads. Added options -ignore5 and -ignore3 to avoid false positive variants at alignment ends - SingleSampleVariantsDetector: Improved variants detection for polyploids and pools - TranscriptomeFilter: Added filters by gene id - TranscriptomeFilter: Fixed error when the output was a list of regions - SingleReadsSimulator: Added parameter to set the minimum read length to fix error. Adjusted default values to the current length distribution of HiFi PacBio reads. v4.0.1 03-05-2020 - New command DeNovoGBS for de-novo analysis of Genotype-by-Sequencing (GBS) data - ReadsAligner. Implemented algorithm based on minimizers to add support to long reads. Added sample id and platform fields (options -s and -p) - ReadsAligner. Added multithread support (option -t) and option -m to specify the maximum number of alignments to report for each read - ReadsAligner. Option -r now receives directly the reference genome. Alternatively, an FM-index for short read alignment can be provided with the option -d - VCFConverter. Added conversion to the input needed by FineStructure v4.0.0 18-02-2020 - Major changes across command names and arguments. Standarized options -i and -o to indicate input and output file or directory paths. See the README.txt file for new usage instructions - General. Fasta and gff3 files can now be gzip compressed - New command ReadsFileErrorsCorrector to perform k-mer based error correction on files with raw reads - New commands GenomeIndexer and ReadsAligner to perform alignment of short reads to a reference genome. This is the first release of our FM-Index based reads aligner. - New command TranscriptomeFilter to filter transcriptomes in gff3 format. Filtering options previously included in the TranscriptomeAnalyzer command were moved to this command. - New command SingleReadsSimulator to simulate single reads from a genome in fasta format. - VCFConverter: New option -printGWASPoly to convert a VCF file to the format received by GWASPoly. - KmersExtractor: Changed the default behavior. Now it counts kmers on both strands by default. Deprecated option -b and new option -f to count kmers only on the forward strand. - KmersCounter: Options -b and -fasta were replaced by options -s and -f to make them consistent with the new command ReadsFileErrorsCorrector - VCFFilter. Replaced option -minI with option -m to make it consistent with SummaryStats. - VCFFilter. Replaced option -minC with option -minRD to improve the description of the effect of this filter. - VCFAlleleSharingStats: Now the option -n is used to indicate that introns should be included in the calculation. - VCFImpute: Now the overlap between windows is set with the option -v - VCFIntrogressionAnalysis: Now the overlap between windows is set with the option -v. - VCFIntrogressionAnalysis: Now the mismatch score is set with the option -t - VCFIntrogressionAnalysis: Now the option -r is used to print a filtered VCF file with the variants used to run the analysis - SingleIndividualSimulator: Now the indel mutation rate is set with the option -n instead of -i - SingleIndividualSimulator: Now the indel mutation rate is set with the option -n instead of -i v3.3.2 15-07-2019 - New command TranscriptomeAnalyzer to evaluate genome annotations in gff format, calculate statistics on genes and transcripts and extract cdna, cds and protein sequences - GenomesAligner: New option to control the genes displayed in the D3 visualization based on number of connections - MultisampleVariantsDetector: Improved efficiency for large numbers of samples - VariantsDetector: Fixed errors when the reference sequence is lowercase and in some cases when reads contain N characters - Demultiplex: Option -t to trim contaminating sequences now accepts more than one sequence v3.3.1 14-03-2019 - GenomesAligner: New options -k an -p to set the k-mer size and the minimum percentage of common k-mers to identify ortholog relationships - New command VCFGoldStandardComparator to perform benchmark comparisons between a gold standard phased VCF and a test VCF - New command VCFVariantsDensityCalculator to calculate density of variants within non-overlapping windows across the genome - FindVariants: Improved realignment, clustering and genotype calling algorithms for small indels and STRs v3.3.0 05-10-2018 - New command GenomesAligner to perform pairwise alignments of complete genomes - New command VCFIndividualGenomeBuilder to create a fasta file representing a single individual from a reference genome and homozygous alternative calls within a VCF file. This is useful to perform the polishing step of a genome assembly project - FindVariants: New option to calculate a fisher exact test for strand bias between the reference and the alternative allele - ConvertVCF: Fixed error with sorting of parent genotypes in the conversion to the JoinMap format. Also inconsistent genotype calls in lmxll and nnxnp SNPs are now reported in the log and masked as unknown genotype calls. v3.2.0 04-07-2018 - Added the new command MultisampleVariantsDetector to perform integrated variant detection over a collection of multiple samples. This is now the recommended method to perform variants detection on genotype-by-sequencing (GBS), RAD sequencing, whole exome sequencing (WES), RNA-seq and low coverage whole genome sequencing (WGS) data. - FindVariants: Deprecated options minAltCoverage, maxAltCoverage and genotypeAll v3.1.2 14-04-2018 - Added new command to simulate individuals from reference genomes - FindVariants: Improved the realignment around small indels and STRs to improve quality of detection and genotyping of these variants - FindVariants: The default behavior now only runs the SNV and small indel caller. Replaced options -noRep, -noRD and -noRP with -runRep -runRD and -runRP. - FindVariants: The default behavior now considers only unique alignments defined by the mapping quality threshold (option -minMQ). New option -p to process also non-unique primary alignments - FindVariants: Improved counting of unique and non unique alignments in repetitive regions predicted by the clustering algorithm based on non-unique alignments - VCF IO: Genotype likelihoods can now be read and written from the GL field The genotype quality can be a real number to allow reading this field from Freebayes VCF files v3.1.1 30-01-2018 - RelativeAlleleCounts: Statistics are now calculated not only for the entire genome but also by chromosome - The name of the command Deconvolute is now Demultiplex. Added option "-a" to allow dual barcoding of paired-end reads. In this mode, the barcode map must have at least 5 columns, being column 3 and 4 the R1 and R2 barcodes and the 5th column the sample ID - QualStats: Deprecated option "-ignoreXS". Added option "-minMQ" to set the minimum mapping quality to consider an alignment unique - FindVariants: Deprecated option "-ignoreXS". Added option "-minMQ" to set the minimum mapping quality to consider an alignment unique - CoverageStats: Added option "-minMQ" to set the minimum mapping quality to consider an alignment unique - CompareVCF: Indel events in overlapping but not exact positions can now be matched if their alleles are consistent. This improves the accuracy of the comparison. To achieve this improvement, this module now requires the complete reference sequence and not just the sequence names - Annotate: Added annotations of variation in splicing donor and acceptor sites and splice regions. Loss of start codons previously annotated as missense are now properly annotated. Loss of stop codons previously annotated as nonsense are now properly annotated. Annotation terms are now compliant with sequence ontology terms - Annotate: Improved compliance reading gene annotations in gff3 format. In particular the phase field is now properly handled - SummaryStats: Improved statistics with separated counts for different variant categories and annotation types - SummaryStats: Fixed error calculating numbers of minor and unique alleles per sample - ImputeVCF: Fixed error for VCF files including information of allele copy number in the genotype entries v3.1.0 26-07-2017 - The VCF I/O now interprets correctly phasing information - ImputeVCF: The imputation module now can impute genotypes and perform statistical phasing on heterozygus individuals. Parents can now be phased heterozygous individuals - New command called VCFDistanceMatrixCalculator to calculate IBS distance matrices from a VCF - New command NeighborJoining to execute the neighbor joining clustering algorithm from distance matrixes v3.0.2 15-01-2017 - New command called RelativeAlleleCounts to calculate distributions of relative allele counts from BAM files and infer ploidy - New command called KmerCounter to perform counting of kmers in either fasta or fastq files - The VCF file writer was improved to print zeros in the fields BSDP and ADP when counts are not available. This improves compliance with the vcf validator of EBI - DiversityStats: Added reference allele and frequency of the reference allele for each population v3.0.1 08-08-2016 - New command called IntrogressionAnalysis which makes a window-based analysis to identify haplotypes common within one population and introgressed in samples of a different population. See the README.txt for usage details. - FindVariants: Solved concurrency issue that made the read pair (RP) analysis fail on the Multi variants detector option in the graphical interface. - FindVariants: Improved selection of possible alleles for indel or complex calls to improve speed on high coverage datasets - FindVariants: Fixed error loading the NUF field having the number of uniquely aligned fragments for the repeats saved in the GFF format for structural variants. The field was saved as double but it was loaded as an integer. The field is now saved as an integer but now can be loaded as a double for backwards compatibility. - Annotate: Fixed error that made the process fail with entries without alternative alleles. Now these sites are just written back to the output VCF as they, by definition, do not store any variation to annotate. v3.0.0 07-07-2016 - The parent package name for the complete aplication was changed from "net.sf.ngstools" to "ngsep". Several improvements in the object model were performed to improve memory management and in some cases reduce running times. The classes from picard for management of SAM/BAM files were updated to the latest version, now called htslib. Dependency to these classes was reduced to facilitate future updates. - Added the command "AlleleSharingStats" to calculate allele sharing diversity statistics on populations of inbred samples - Input VCF files over the whole application can now be gzip compressed. - FilterVCF: Added filters of minimum and maximum observed heterozygosity. - ConvertVCF: Conversion to treemix consumes much less memory and only outputs counts for biallelic variants - ConvertVCF: Added header with number of samples and number of alleles to the DARWin format, as specified in v6.0 - DiversityStats: Added informative headers - Fixed error that produced comma instead of dot for decimal separator depending on the language configurations - The "Clip" command is no longer available v2.1.5 26-01-2016 - Deconvolute: This function can now receive a tab-delimited file describing locations of several fastq files (See README.txt for details). This is useful to deconvolute more than one lane in one command and to merge automatically reads from samples sequenced more than one time. The flowcell, lane and fastq input files are now optional and are ignored if a file descriptor for multilane decolnvolution is provided. - Deconvolute: Input fastq files can now be gzip compressed. - Deconvolute: Lanes sequenced in paired-end mode can now be deconvoluted. In this mode, two files should be provided and two files are generated for each sample to preserve the read pair information. - Deconvolute: The barcode index file now must have a header line - FindVariants: Added read pair analysis within duplications predicted by read depth to predict if the duplication is in tandem or is interspersed. Added info fields to the output GFF file to indicate for each CNV if is a deletion, a tandem duplication or an interspersed (trans) duplication and to provide the number of reads supporting these predictions (See README.txt for details) - FindVariants: Fixed error in detection of inversions v2.1.4 19-10-2015 - FindVariants: Users can now add a file with short tandem repeat regions. These regions will now be called as one single region instead of individual pileups - FindVariants: Improved algorithm for indel realignment - FindVariants: Added a TYPE attribute to the variants different than biallelic SNVs - FindVariants: Information on copy number variation within the VCF file has been moved out of the GT field and now is stored in the format field ACN (allele copy number). The field LPL was also removed. - In the VCF format field, the fields with read counts per allele, previously called AC and AAC now are called ADP and BSDP respectively. - ConvertVCF: In the converter to Plink, changed genotype format from 1,2,3,4 to ACGT v2.1.3 14-07-2015 - Output fastq files after deconvolution are now gzip compressed. Added option "-u" to keep them uncompressed (Mainly for windows). v2.1.2 30-04-2015 - The distribution jar file now has version number - Added the options -maxPCTOverlapCNVs, -maxLenDeletion, -minSVQuality and -sizeSRSeed to control the behavior of the read pair analysis. SEE README.txt for details of each parameter - ConvertVCF: In the conversion to Plink, chromosomes in the .map file have now numerical indexes - ConvertVCF: Added option to convert to rrBLUP - Command line help is now available for every command using the --help option - FindVariants: Implemented the algorithm EWT to call CNVs using read depth information. Added the option -algCNV to decide which read depth algorithms should be called - Added breakpoint resolution to the read-pair analysis to find large indels based on realignment of reads likely to span the breakpoint site - Fixed error with the header in the conversion to DARWin v2.1.1 13-02-15 - ImputeVCF: Fixed error when a SNP has only heterozygous genotypes - ImputeVCF: Genotypes of unrelated inbred varieties can now be imputed - ImputeVCF: Removed options -m and -e for percentages of missing data and heterozygous. Now these parameters are estimated from the data. Added option -k to specify the number of haplotype clusters (or the number of parents for breeding populations) - DiversityStats: Added support to standard input v2.1.0 22-12-14 - Added new command to compare two VCF files. - Added new command to impute missing genotypes in inbred multiparental populations. - Added new command to identify significant differences between the read depth distributions of two samples. We implemented the algorithm called CNVSeq (See http://www.biomedcentral.com/1471-2105/10/80) - VCFConverter can now convert to Darwin, Joinmap and TreeMix - Added the options -u and -d to the functional annotator to control the offsets before and after each gene to classify a variant as upstream or downstream - QualStats now can receive multiple bam files. The read length is now calculated from the alignments so it does not need to be provided. Added the option "-ignoreXS" to ignore this field in the bam file as an indicator for a read with multiple alignments (Same behavior as that within the variants detector) - VCFFilter now allows to select more than one annotation with the -a parameter. Annotations should be comma-separated - Added option -ignoreProperPairFlag to the variants detector to ignore this flag while running the read pair algorithm - The options -noStructural and -noNewCNV have been replaced with -noRP and -noRD respectively to reflect better that they are used to turn down the read pair and read depth analysis but both produce structural variants. The option -noNewCNV still exists but has a new meaning. By default, the read depth analysis genotypes input CNVs and repeats and then tries to find new CNVs. Using the option -noNewCNV the variants detector still builds the read depth distribution and genotypes input CNVs but it does not try to find new CNVs - Eliminated output file .cnv with CNVs and repeats in the variants detector. The option -knownCNVs has been replaced with -knownSVs and the format of the input file for this option is now GFF. - MergeVCF now does not try to merge variants in the same location with different alleles. A warning is logged everytime this situation is identified and two variants are reported in the output VCF. This eliminates entries with inconsistent information between the variant alleles and the calls. v2.0.7 22-09-14 - Added module to calculate diversity statistics per site including a Chi-square test for deviation from HWE and F-statistics. - Included total alignment counts for each length in the quality statistics to deal with reads having variable lengths - Fixed error in the VCFFilter processing VCF files not generated by NGSEP - Added an option to the variants detector to ignore the XS field as an indicator for a read with multiple alignments. The XS field was used by default in previous versions because bowtie2 outputs this filed only for reads with more than one alignment. However bwa-mem and bwa-sw output this field for every aligment, which produced errors in the algorithms to identify CNVs, repeats and structural variation. v 2.0.6 04-08-2014 - Added conversion to PHASE for statistical phasing - Added a new command to produce summary statistics based on a VCF file - Added the options -saf and -fs to the FilterVCF command to select or filter samples with ids loaded from a text file - Changed the default behavior of FilterVCF regarding quality and fixed alleles. Now the program by default will not perform any filtering. Because the option -v loss meaning under this behavior, it was replaced by the options -fi, -fir and -fia to filter respectively all the invariant sites, invariant sites having only the reference allele, and invariant sites having only an alternative allele - Added an optional header with sample id and ploidy to the VCF file. This header is later used by the FilterVCF command to recalculate properly the number of samples with CNVs while filtering samples - Added a format field to the VCF files indicating for each genotype the local ploidy of the region surrounding the variant v 2.0.5 27-01-2014 - Improved speed loading vcf files - Added conversion to MapDisto - Variant ids in VCF files are now properly loaded - MergeVariants now keeps in the output vcf file SNVs within indels v 2.0.4. 07-10-2013 - Changed default min coverage in the filter to avoid filtering variants without coverage information - Added parameter in the filter to select variants within a set of regions - Conversion to matrix and to hapmap now consumes much less memory - Fixed error writing vcf files when allele count information is not available - Fixed error loading SNVs when the reference allele is N (e.g. VCF files produced by samtools) - Fixed error loading genotypes from a VCF file in which the number of fields per genotype does not match the number of fields in the genotype format column - Fixed error loading genotypes from a VCF file in which the genotype quality or the coverage is not available. - VCFMerge now keeps sites in which all called genotypes are homozygous reference